Apache NiFi for HPC System Administrators: Deep Dive Reference

1. Overview

This deep-dive reference covers advanced Apache NiFi topics for HPC system administrators: clustering architecture, integration with HPC schedulers (Slurm) and storage systems (BeeGFS, Ceph), containerized deployments (Docker, Kubernetes), flow version control (Registry and Git), security hardening, and comprehensive monitoring strategies.

This guide assumes familiarity with basic NiFi concepts (FlowFile, Processor, Connection, Expression Language) covered in the [[apache-nifi-hpc-sysadmin-beginner-guide|Beginner Guide]]. It is designed as a reference you can return to as you design and maintain production HPC dataflow pipelines.

2. Prerequisites

Before working through this guide, you should have:

A working standalone NiFi 2.x installation (from the Beginner Guide)
Comfortable with Linux system administration (users, groups, file permissions, systemd)
Access to HPC infrastructure or simulators:
- Slurm cluster (or ability to run slurmrestd standalone)
- BeeGFS filesystem (or equivalent POSIX-compliant parallel filesystem)
- Ceph cluster with S3 RadosGW or access to AWS S3
Docker and Docker Compose installed (for clustered testing)
Kubernetes cluster (for K8s deployment section; minikube or a test cluster acceptable)
Git repository for flow version control (GitHub, GitLab, Bitbucket, or local)
Text editor or IDE for configuration files and scripts

3. Key Concepts

3.1 NiFi Core Abstractions

Expression Language (EL)

NiFi Expression Language enables dynamic processor configuration using FlowFile attributes. Syntax: ${attribute_name} or ${attribute_name:function()}.

Common patterns for HPC:

# Extract job ID from Slurm output
${slurm.output:substringAfter('job')
                    :substringBefore(':')}

# Construct S3 key with date partitioning
jobs/${job.id}/${now():format('yyyy/MM/dd')}/${filename}

# Conditional routing by job status
${job.status:equals('COMPLETED')}

# Test file size threshold
${fileSize:gt(1073741824)}  # > 1 GB

# Dynamic SFTP host from attribute
sftp://${sftp.remote.host}:${sftp.remote.port}/${path}${filename}

Function categories:

String: toUpper(), toLower(), trim(), substringBefore(), substringAfter(), length(), substring(), replace(), split()
Numeric: lt(), gt(), le(), ge(), equals(), plus(), minus()
Boolean: matches(), contains(), startsWith(), endsWith()
Date: now(), format(), toDate(), fromDate()
Type conversion: toNumber(), toBoolean()

3.2 Cluster Architecture

A NiFi cluster consists of:

ZooKeeper: Distributed coordination service for cluster state (leader election, configuration sync).
Multiple NiFi nodes: Each node can process FlowFiles independently.
Cluster Coordinator: Elected by ZooKeeper; manages cluster-wide state and flow propagation.
Primary Node: A separately elected role; only processes that require single-point coordination (ListFile, ListSFTP, ListS3) run here.

Key behaviors:

Flow synchronization: The Cluster Coordinator syncs the flow definition to all nodes.
Queue replication: Connection queues are replicated across nodes for fault tolerance.
Load balancing: ProcessGroups can distribute load across available nodes.
Backpressure: If a node's queue fills (by data size or FlowFile count), upstream nodes pause.

3.3 Data Provenance Architecture

NiFi maintains a complete audit trail in the provenance repository:

Every FlowFile creation, read, write, route, fork, join, drop event is recorded.
Records include: processor ID, timestamp, FlowFile UUID, attribute snapshots, lineage.
Stored on disk in configurable location (default: ./provenance_repository/).
Queryable via the Data Provenance UI or REST API.
Replay feature: Resend a FlowFile from any point in its history to re-run processing.

4. Step-by-Step Instructions

4.1 Installing NiFi as a Systemd Service

For production deployments, run NiFi as a managed systemd service for automatic restart and clean shutdown:

# Install as a service
/opt/nifi/bin/nifi.sh install

# Start the service
systemctl start nifi

# Enable auto-start on boot
systemctl enable nifi

# Check status
systemctl status nifi

# View logs
journalctl -u nifi -f

Expected output:

● nifi.service - Apache NiFi
   Loaded: loaded (/opt/nifi/bin/nifi > loaded to systemd)
   Active: active (running) since Thu 2026-04-10 12:00:00 UTC

4.2 Configuring NiFi for Production

Edit /opt/nifi/conf/nifi.properties:

# Web UI security (HTTPS required for cluster)
nifi.web.https.host=nifi1.hpc.example.com
nifi.web.https.port=8443

# TLS keystore/truststore (generated by tls-toolkit.sh)
nifi.security.keystore=./conf/keystore.p12
nifi.security.keystoreType=PKCS12
nifi.security.keystorePasswd=<generated-password>
nifi.security.truststore=./conf/truststore.p12
nifi.security.truststoreType=PKCS12
nifi.security.trusstorePasswd=<generated-password>

# Authentication
nifi.security.user.login.identity.provider=single-user-provider

# Encryption of sensitive properties
nifi.sensitive.props.key=<12+ character passphrase>
nifi.sensitive.props.algorithm=NIFI_PBKDF2_AES_GCM_256

# Repository paths (can point to fast/large storage)
nifi.content.repository.directory.default=/data/nifi/content_repository
nifi.flowfile.repository.directory=/data/nifi/flowfile_repository
nifi.provenance.repository.directory.default=/data/nifi/provenance_repository
nifi.database.repository.directory=./database_repository

# Provenance storage limits
nifi.provenance.repository.max.storage.size=50 GB
nifi.provenance.repository.max.storage.time=30 days
nifi.provenance.repository.rollover.time=10 mins

# JVM memory settings (in bootstrap.conf)
# java.arg.2=-Xms4g
# java.arg.3=-Xmx16g

# Cluster mode (set per-node)
nifi.cluster.is.node=false  # Set true on each node
nifi.cluster.node.address=nifi1.hpc.example.com
nifi.cluster.node.protocol.port=11443
nifi.zookeeper.connect.string=zk1.hpc.example.com:2181,zk2.hpc.example.com:2181,zk3.hpc.example.com:2181
nifi.cluster.flow.election.max.candidates=3
nifi.cluster.protocol.is.secure=true

4.3 Generating TLS Certificates for Cluster

Use NiFi's TLS Toolkit to generate proper certificates:

# Generate certificates for all nodes
/opt/nifi/bin/tls-toolkit.sh standalone \
  -n "nifi1.hpc.example.com,nifi2.hpc.example.com,nifi3.hpc.example.com" \
  -C "CN=admin,OU=NiFi" \
  -o ./nifi-certs

# Output: nifi-certs/nifi1.hpc.example.com/, nifi2.hpc.example.com/, etc.
# Each contains: nifi.properties, keystore.p12, truststore.p12

For each node:

Copy the node's directory contents to /opt/nifi/conf/
Update nifi.properties with the node address and ZK quorum
Set proper permissions: chmod 600 conf/keystore.p12 conf/truststore.p12

4.4 Building a Slurm Integration Pipeline

Complete workflow: Monitor input → Submit Slurm job → Poll status → Process output → Archive results

Components:

Input Monitor — ListFile watching /scratch/jobs/input/
Job Submission — ExecuteProcess running sbatch
Job ID Extraction — ExtractText parsing sbatch output
Status Polling — ExecuteStreamCommand running squeue on a loop
Routing — RouteOnAttribute branching on job status (RUNNING/COMPLETED/FAILED)
Output Processing — ListFile monitoring job output directory
Archive — PutS3Object uploading to Ceph
Notification — PutSlack or PutEmail on completion

Configuration example:

[ListFile: /scratch/jobs/input/]
  Input Directory: /scratch/jobs/input/
  File Filter: .*\.job$
  Minimum File Age: 30 sec
  --> success
[UpdateAttribute]
  job.submitted.time = ${now():format('yyyy-MM-dd HH:mm:ss')}
  job.input.file = ${absolute.path}
  --> [ExecuteProcess: sbatch]
  Command: /usr/bin/sbatch
  Arguments: --job-name=${filename:substringBefore('.')} --time=02:00:00 \
             --export=INPUT_FILE=${job.input.file} \
             /opt/scripts/process.sh
  --> success
[ExtractText]
  Property: slurm.job.id
  Value (regex): Submitted batch job (\d+)
  --> success
[Wait]
  Wait Duration: 30 sec
  --> [ExecuteStreamCommand: squeue polling loop]
  Command: /bin/bash
  Arguments: -c "squeue --job=${slurm.job.id} --format='%T' --noheader"
  --> success
[ExtractText]
  Property: slurm.status
  Value: (RUNNING|PENDING|COMPLETED|FAILED)
  --> 
[RouteOnAttribute]
  "completed": ${slurm.status:equals('COMPLETED')}
  "failed": ${slurm.status:equals('FAILED')}
  "pending": ${slurm.status:matches('RUNNING|PENDING')}

  completed --> [ListFile: /scratch/jobs/${slurm.job.id}/output/]
                --> [FetchFile]
                --> [UpdateAttribute]
                  s3.key = jobs/${slurm.job.id}/${now():format('yyyy/MM/dd')}/${filename}
                  --> [PutS3Object]
                  Bucket: job-results
                  Endpoint: http://ceph-rgw.hpc.example.com:7480
                  --> success
                --> [PutSlack]
                Message: Job ${slurm.job.id} completed!

  failed --> [PutSlack]
             Message: Job ${slurm.job.id} FAILED

  pending --> [RouteOnAttribute: decision point, feeds back to squeue loop]

4.5 BeeGFS Integration Pipeline

BeeGFS is a POSIX-compliant parallel filesystem. No special integration needed — use standard file processors:

# Monitor BeeGFS for new data
ListFile:
  Input Directory: /beegfs/scratch/experiments/
  File Filter: .*\.h5$       # HDF5 files
  Minimum File Age: 60 sec   # Files still being written
  Recurse Subdirectories: true

# Read file content
FetchFile:
  File to Fetch: ${absolute.path}
  Completion Strategy: Move to Directory
  Move Destination: /beegfs/scratch/processing/

# Transform path for S3 upload
UpdateAttribute:
  s3.bucket: hpc-data
  s3.key: raw/${now():format('yyyy/MM/dd')}/${filename}

# Upload to Ceph
PutS3Object:
  Bucket: ${s3.bucket}
  Object Key: ${s3.key}
  Endpoint Override URL: http://ceph-rgw.hpc.example.com:7480
  AWS Credentials Provider: CephCredentials

Performance considerations:

Set Minimum File Age to prevent reading files still being written (typically 30-60 seconds)
For large files (>1 GB), ensure NiFi heap is sized appropriately (content repository holds file in memory temporarily)
Consider keeping NiFi content repository on fast local NVMe, separate from monitored BeeGFS paths
Use FetchFile Move to Directory completion strategy to atomically move processed files, preventing re-ingestion

4.6 Ceph S3 Integration with AWSCredentialsProviderControllerService

Setup Controller Service:

Right-click canvas → Configure → Controller Services tab
Add: AWSCredentialsProviderControllerService

Configure:

Access Key ID: <ceph-rgw-access-key>
Secret Access Key: <ceph-rgw-secret-key>

Click lightning bolt to enable

Processor configuration:

For PutS3Object:

Bucket: hpc-results
Region: us-east-1   # Ceph uses us-east-1 as default
Endpoint Override URL: http://ceph-rgw.hpc.example.com:7480
AWS Credentials Provider Service: AWSCredentialsProviderControllerService
Use Path Style Access: true    # REQUIRED for Ceph
Multipart Upload Threshold: 5 GB
Multipart Part Size: 100 MB
Content Type: ${mime.type}

For ListS3:

Bucket: hpc-raw-data
Prefix: incoming/2026/
Endpoint Override URL: http://ceph-rgw.hpc.example.com:7480
AWS Credentials Provider Service: AWSCredentialsProviderControllerService

For FetchS3Object:

Bucket: ${s3.bucket}
Object Key: ${s3.key}
Endpoint Override URL: http://ceph-rgw.hpc.example.com:7480
AWS Credentials Provider Service: AWSCredentialsProviderControllerService

Complete BeeGFS → Slurm → Ceph pipeline:

[ListFile: /beegfs/raw-data/] (Primary Node, 5 min)
  --> [FetchFile: Move to /beegfs/in-processing/]
  --> [UpdateAttribute]
      s3.key = raw/${now():format('yyyy/MM/dd')}/${filename}
      s3.bucket = hpc-input
      slurm.input_s3_key = ${s3.key}
  --> [PutS3Object: Endpoint=http://ceph-rgw:7480]
  --> [UpdateAttribute]
      slurm.script = /opt/scripts/process.sh
  --> [ExecuteProcess: sbatch]
      sbatch --export=INPUT_S3_KEY=${slurm.input_s3_key} \
             ${slurm.script}
  --> [ExtractText: slurm.job.id = Submitted batch job (\d+)]
  --> [Wait + squeue poll loop]
  --> [RouteOnAttribute: completed/failed]
  --> [PutSlack: "Pipeline complete: ${slurm.job.id}"]

4.7 Docker Compose Multi-Node Cluster

File: docker-compose.yml

version: '3.8'

services:
  zookeeper:
    image: bitnami/zookeeper:3.8
    hostname: zookeeper
    container_name: zookeeper
    ports:
      - "2181:2181"
    environment:
      ALLOW_ANONYMOUS_LOGIN: "yes"
      ZOO_MY_ID: "1"
    volumes:
      - zookeeper_data:/bitnami/zookeeper

  nifi:
    image: apache/nifi:2.8.0
    ports:
      - "8080-8082:8080"   # Maps host ports 8080-8082 to container 8080
    environment:
      NIFI_WEB_HTTP_PORT: "8080"
      NIFI_CLUSTER_IS_NODE: "true"
      NIFI_CLUSTER_NODE_PROTOCOL_PORT: "8082"
      NIFI_ZK_CONNECT_STRING: "zookeeper:2181"
      NIFI_ELECTION_MAX_WAIT: "1 min"
      NIFI_ELECTION_MAX_CANDIDATES: "3"
      SINGLE_USER_CREDENTIALS_USERNAME: "admin"
      SINGLE_USER_CREDENTIALS_PASSWORD: "adminpassword123"
    depends_on:
      - zookeeper
    volumes:
      - nifi_content:/opt/nifi/nifi-current/content_repository
      - nifi_db:/opt/nifi/nifi-current/database_repository
      - nifi_flowfile:/opt/nifi/nifi-current/flowfile_repository
      - nifi_provenance:/opt/nifi/nifi-current/provenance_repository

volumes:
  zookeeper_data:
  nifi_content:
  nifi_db:
  nifi_flowfile:
  nifi_provenance:

Operations:

# Start with 3 nodes
docker compose up --scale nifi=3 -d

# View logs
docker compose logs -f nifi

# Scale to 5 nodes
docker compose up --scale nifi=5 -d

# Stop everything
docker compose down

# Access UI (each node on different port)
https://localhost:8080/nifi   # Node 1
https://localhost:8081/nifi   # Node 2
https://localhost:8082/nifi   # Node 3

Cluster notes:

ZooKeeper elects a Cluster Coordinator automatically.
The Primary Node is separately elected — schedule listing processors to run "on Primary Node only."
Scaling down loses in-flight data on removed nodes; drain queues first.

4.8 NiFi on Kubernetes with Helm Chart

Add Apache Helm repository:

helm repo add apache https://charts.apache.org
helm repo update

Basic values.yaml:

replicaCount: 3

image:
  repository: apache/nifi
  tag: "2.8.0"

service:
  type: ClusterIP
  port: 8080

persistence:
  enabled: true
  storageClass: "fast-ssd"   # Must exist in your cluster
  size: 50Gi

zookeeper:
  enabled: true
  replicaCount: 3

properties:
  sensitive:
    key: "your-12-char-minimum-passphrase"

Install:

helm install nifi apache/nifi -f values.yaml -n nifi --create-namespace

# Watch pods come up
kubectl get pods -n nifi -w

# Port-forward for testing
kubectl port-forward -n nifi svc/nifi 8080:8080

# Access at http://localhost:8080/nifi

4.9 NiFi Registry and Git-Based Flow Version Control

NiFi 2.x Git-based Registry (recommended):

1. In UI: Hamburger menu → Controller Settings → Registry Clients
2. Add: GitHubFlowRegistryClient (or GitLabFlowRegistryClient, etc.)
3. Configure:
   Repository URL: https://github.com/yourorg/nifi-flows.git
   Branch: main
   Access Token: (GitHub personal access token with repo scope)
4. Right-click Process Group → Version → Start Version Control
5. Select Registry Client and bucket (directory in repo)
6. On each change, right-click → Version → Commit Changes

Each version is stored as JSON in the git repo, enabling:

Full audit trail
Rollback to previous versions
CI/CD pipeline integration
Collaborative flow development

4.10 Security Hardening

LDAP Authentication

Edit /opt/nifi/conf/login-identity-providers.xml:

<provider>
  <identifier>ldap-provider</identifier>
  <class>org.apache.nifi.ldap.LdapProvider</class>
  <property name="Authentication Strategy">SIMPLE</property>
  <property name="Manager DN">cn=admin,dc=example,dc=com</property>
  <property name="Manager Password">password</property>
  <property name="TLS - Keystore">./conf/keystore.p12</property>
  <property name="TLS - Truststore">./conf/truststore.p12</property>
  <property name="Referral Strategy">FOLLOW</property>
  <property name="Connect Timeout">10 secs</property>
  <property name="Read Timeout">10 secs</property>
  <property name="Url">ldap://ldap.example.com:389/dc=example,dc=com</property>
  <property name="User Search Filter">(uid={0})</property>
  <property name="Identity Strategy">USE_USERNAME</property>
</provider>

Then update nifi.properties:

nifi.security.user.login.identity.provider=ldap-provider

Authorization: Role-Based Access Control

Edit /opt/nifi/conf/authorizers.xml to define users and permissions:

<authorizer>
  <identifier>managed-authorizer</identifier>
  <class>org.apache.nifi.authorization.StandardManagedAuthorizer</class>
  <property name="Access Control Strategy">with-external-authorization-provider</property>
  <property name="Authorizations File">./conf/authorizations.xml</property>
  <property name="Users File">./conf/users.xml</property>
  <property name="Initial Admin Identity">admin</property>
</authorizer>

In UI, assign users to roles (Viewer, Editor, Admin) and set processor/canvas permissions.

Securing Sensitive Properties

All sensitive properties (passwords, API keys, S3 credentials) in processor configs are encrypted at rest using the nifi.sensitive.props.key. Set this before first run:

# In nifi.properties
nifi.sensitive.props.key=MySecure12CharKey!
nifi.sensitive.props.algorithm=NIFI_PBKDF2_AES_GCM_256

View an encrypted property:

# Encrypted in flow.xml.gz
cat conf/flow.xml.gz | gunzip | grep -A2 "sensitivePropertyValue"
# Output: enc{N4...} (encrypted)

4.11 Advanced Monitoring

Prometheus Metrics Scraping

NiFi 2.x exposes Prometheus metrics at: /nifi-api/flow/metrics/prometheus

Prometheus scrape config:

global:
  scrape_interval: 30s

scrape_configs:
  - job_name: 'nifi'
    scheme: https
    tls_config:
      insecure_skip_verify: true   # For self-signed; use ca_file in production
    basic_auth:
      username: 'admin'
      password: 'adminpassword123'
    static_configs:
      - targets: ['nifi1.hpc.example.com:8443']
    metrics_path: '/nifi-api/flow/metrics/prometheus'

Grafana dashboard:

Import dashboard ID 12314 (NiFi Prometheus Reporting Task Dashboard)
Shows real-time flow rates, queue sizes, processor latencies, cluster status

Data Provenance Queries

REST API examples:

# Search provenance events
curl -k -u admin:password \
  'https://nifi1.hpc.example.com:8443/nifi-api/provenance/search' \
  -X POST -d '{
    "request": {
      "searchTerms": {
        "ProcessorID": "<processor-uuid>",
        "EventType": "SEND"
      },
      "startDate": "2026-04-10T00:00:00Z",
      "endDate": "2026-04-10T23:59:59Z"
    }
  }'

# Get lineage for a FlowFile
curl -k -u admin:password \
  'https://nifi1.hpc.example.com:8443/nifi-api/provenance/<event-id>/lineage'

# Replay a FlowFile from a specific point
curl -k -u admin:password \
  'https://nifi1.hpc.example.com:8443/nifi-api/provenance/<event-id>/replay' \
  -X POST

Bulletins and Warnings

View in UI: Hamburger menu → Bulletin Board

Configure per-processor:

Right-click processor → Configure → Settings
Set Bulletin Level: DEBUG / INFO / WARN / ERROR
Only messages at or above this level appear

5. Practical Examples

Example 1 — HPC Job Completion Pipeline

Monitor a Slurm job output directory, validate results, archive to Ceph, and notify team.

[ListFile: /scratch/jobs/*/output/]
  (Primary Node, 60 sec)
  File Filter: .*\.result$
  Minimum File Age: 30 sec
  --> success
[FetchFile: Move to /scratch/jobs/*/output-processed/]
  --> success
[ExecuteStreamCommand: validate results]
  Command: /bin/bash
  Arguments: -c "python3 /opt/scripts/validate.py ${absolute.path}"
  --> success
[RouteOnContent: check validation output]
  Property "valid": ^PASS
  Property "invalid": ^FAIL
  --> valid
[UpdateAttribute]
  s3.bucket: job-results
  s3.key: ${path:substringAfter('jobs/'):substringBefore('/output')}/${filename}
  --> [PutS3Object]
      Bucket: ${s3.bucket}
      Object Key: ${s3.key}
      Endpoint: http://ceph-rgw:7480
      --> success
[PutSlack: "Result ${s3.key} validated and archived"]

  --> invalid
[PutSlack: "Result validation FAILED: ${absolute.path}"]
[PutEmail: alert@hpc.example.com with attachment]

Example 2 — Multi-Tier Data Archival

Route files to different storage based on size and age.

[ListFile: /beegfs/archive-queue/]
  Minimum File Age: 1 day
  --> [FetchFile]
  --> [RouteOnAttribute]
      "small": ${fileSize:lt(1073741824)}          # < 1 GB
      "large": ${fileSize:ge(1073741824)}          # >= 1 GB
      --> small
[UpdateAttribute]
  s3.bucket: archive-cold
  storage.tier: cold-archive
  --> [PutS3Object: Ceph STANDARD storage class]

  --> large
[UpdateAttribute]
  s3.bucket: archive-warm
  storage.tier: warm-archive
  --> [PutS3Object: Ceph INTELLIGENT_TIERING storage class]

Example 3 — Load Balancing Across Cluster Nodes

Distribute processing across all nodes (not just Primary Node).

[ListFile: /beegfs/input/ (Primary Node only)]
  --> success
[FetchFile]
  --> success (processed on ANY node)
[ExecuteStreamCommand: heavy computation]
  (will load-balance across cluster)
  --> success
[PutS3Object: upload results]

6. Hands-On Exercises

Exercise 1 — Cluster Setup and Monitoring

Deploy 3-node NiFi cluster using Docker Compose
Verify all nodes register with ZooKeeper
View cluster status in UI: Hamburger → Summary
Scale up to 5 nodes, observe queue redistribution
Kill one node's container and observe failover

Exercise 2 — Provenance Replay

Create a simple flow: ListFile → ReplaceText → PutFile
Introduce an error in ReplaceText (bad regex)
Fix the error
Use Data Provenance → Replay to reprocess failed FlowFiles without re-ingesting source files

Exercise 3 — Slurm Job Integration

Set up a Slurm test environment (or use slurmctld in Docker)
Build complete pipeline: monitor input → sbatch → squeue poll → process output → Ceph archive
Submit a real job, watch NiFi track its lifecycle
Test failure handling (kill job, verify alert routing)

Exercise 4 — Git-Based Flow Version Control

Create a GitHub repository: nifi-flows
In NiFi UI, set up GitHub Flow Registry Client
Create a Process Group, start version control
Make changes, commit (creates JSON in git)
Verify git history: git log --oneline nifi/flows.json
Rollback to previous version in UI

Exercise 5 — Security Hardening

Generate TLS certificates using tls-toolkit.sh for 3 nodes
Configure cluster with HTTPS and mutual TLS
Set up LDAP authentication (use OpenLDAP in Docker for testing)
Create users and assign roles (Viewer, Editor, Admin)
Verify role-based access control (viewer can't edit, admin can)

7. Troubleshooting

Cluster coordination fails, nodes won't elect coordinator

Symptom: NiFi logs show "Unable to connect to ZooKeeper" or coordinator never elected.

Diagnosis:

Check ZooKeeper is running: echo ruok | nc localhost 2181
Verify network connectivity between nodes and ZK: telnet zk-host 2181
Check nifi.properties zookeeper.connect.string on all nodes (must be identical)

Fix: Ensure ZooKeeper quorum is correct and all nodes can reach it. Restart NiFi after fixing connection.

Primary Node never elected

Symptom: Listing processors (ListFile, ListSFTP, ListS3) show "Node does not have the Primary role."

Diagnosis:

Check ZooKeeper is healthy
Verify cluster has stable quorum (odd number of nodes recommended)
Check logs for coordination errors

Fix: Restart coordinator node (or entire cluster if persistent).

Files disappear from BeeGFS but don't appear in next processor

Symptom: ListFile deletes source, but FetchFile fails with "File not found."

Diagnosis:

Completion Strategy on ListFile is "Delete" (not recommended)
File is deleted by external process between ListFile and FetchFile

Fix:

Set ListFile Keep Source File: true for testing
Use FetchFile Completion Strategy: Move to Directory instead

High memory usage, NiFi crashes

Symptom: NiFi process uses > 10 GB RAM or is killed by OOM killer.

Diagnosis:

Check System Diagnostics: Hamburger → System Diagnostics
Look at content repository disk space

Fix:

Increase JVM heap: Edit /opt/nifi/conf/bootstrap.conf, raise java.arg.3 (Xmx)
Reduce queue back-pressure thresholds to prevent queues from growing unbounded
Enable disk-based queue overflow (advanced config)
Delete old provenance data: rm -rf ./provenance_repository/ then restart

Ceph S3 upload fails with "Signature does not match"

Symptom: PutS3Object returns 403 Forbidden with signature error.

Diagnosis:

Access key or secret key is wrong
Endpoint URL format is incorrect
Region mismatch

Fix:

Verify credentials: test with aws s3 ls s3://bucket --endpoint-url http://ceph-rgw:7480
Check Use Path Style Access is enabled (required for Ceph)
Verify Endpoint Override URL format: http://ceph-rgw.hpc.example.com:7480 (no trailing slash)

Slurm integration stops updating job status

Symptom: ExecuteStreamCommand runs but squeue output stops updating.

Diagnosis:

Job has finished and aged off squeue output
SSH connection to Slurm node is stale

Fix:

Use Slurm REST API (slurmrestd) instead of CLI (more reliable)
Add timeout to ExecuteStreamCommand and add retry logic

8. References

Official NiFi Documentation

Processor Documentation

Integration Guides

HPC-Specific Integration

Monitoring and Security

Community Resources

[[apache-nifi-hpc-sysadmin-beginner-guide|Apache NiFi HPC Sysadmin Beginner Guide]]
[[kubernetes-beginner-guide]]
[[docker-beginner-guide]]
[[linux-permissions-deep-dive]]
[[slurm-administrator-guide]]

Summary

Key takeaways:

Clustering with ZooKeeper enables fault tolerance, automatic failover, and load distribution across nodes.
Primary Node election ensures listing processors (ListFile, ListS3, ListSFTP) run once per cycle, preventing duplicate processing.
Data Provenance provides complete audit trail and replay capabilities for debugging and compliance.
Slurm integration automates HPC job submission, status monitoring, and result processing through ExecuteProcess and InvokeHTTP.
BeeGFS/Ceph integration leverages NiFi's file processors (no special connectors needed) for multi-tier data archival and movement.
Kubernetes deployment via Helm or NiFiKop operator enables cloud-native HPC dataflow automation.
Git-based flow version control (NiFi 2.x) enables collaborative flow development and CI/CD integration.
Security hardening includes TLS, LDAP/Kerberos authentication, RBAC, and encrypted sensitive properties.
Monitoring via Prometheus/Grafana and provenance queries enables real-time visibility into flow health and data lineage.

Next steps:

Deploy a production 3-node NiFi cluster with external ZooKeeper on your HPC infrastructure.
Build a complete BeeGFS → Slurm → Ceph → Notification pipeline tailored to your workflows.
Integrate NiFi monitoring with your existing Prometheus/Grafana stack.
Set up Git-based flow version control for collaborative development across your team.
Evaluate Kubernetes deployment if your HPC center is containerization-ready.
Document runbooks for common operational tasks (scaling, backup/restore, certificate renewal).

1. Overview​

2. Prerequisites​

3. Key Concepts​

3.1 NiFi Core Abstractions​

Expression Language (EL)​

3.2 Cluster Architecture​

3.3 Data Provenance Architecture​

4. Step-by-Step Instructions​

4.1 Installing NiFi as a Systemd Service​

4.2 Configuring NiFi for Production​

4.3 Generating TLS Certificates for Cluster​

4.4 Building a Slurm Integration Pipeline​

4.5 BeeGFS Integration Pipeline​

4.6 Ceph S3 Integration with AWSCredentialsProviderControllerService​

4.7 Docker Compose Multi-Node Cluster​

4.8 NiFi on Kubernetes with Helm Chart​

4.9 NiFi Registry and Git-Based Flow Version Control​

4.10 Security Hardening​

LDAP Authentication​

Authorization: Role-Based Access Control​

Securing Sensitive Properties​

4.11 Advanced Monitoring​

Prometheus Metrics Scraping​

Data Provenance Queries​

Bulletins and Warnings​

5. Practical Examples​

Example 1 — HPC Job Completion Pipeline​

Example 2 — Multi-Tier Data Archival​

Example 3 — Load Balancing Across Cluster Nodes​

6. Hands-On Exercises​

Exercise 1 — Cluster Setup and Monitoring​

Exercise 2 — Provenance Replay​

Exercise 3 — Slurm Job Integration​

Exercise 4 — Git-Based Flow Version Control​

Exercise 5 — Security Hardening​

7. Troubleshooting​

Cluster coordination fails, nodes won't elect coordinator​

Primary Node never elected​

Files disappear from BeeGFS but don't appear in next processor​

High memory usage, NiFi crashes​

Ceph S3 upload fails with "Signature does not match"​

Slurm integration stops updating job status​

8. References​

Official NiFi Documentation​

Processor Documentation​

Integration Guides​

HPC-Specific Integration​

Monitoring and Security​

Community Resources​

9. Related Tutorials​

Summary​

1. Overview

2. Prerequisites

3. Key Concepts

3.1 NiFi Core Abstractions

Expression Language (EL)

3.2 Cluster Architecture

3.3 Data Provenance Architecture

4. Step-by-Step Instructions

4.1 Installing NiFi as a Systemd Service

4.2 Configuring NiFi for Production

4.3 Generating TLS Certificates for Cluster

4.4 Building a Slurm Integration Pipeline

4.5 BeeGFS Integration Pipeline

4.6 Ceph S3 Integration with AWSCredentialsProviderControllerService

4.7 Docker Compose Multi-Node Cluster

4.8 NiFi on Kubernetes with Helm Chart

4.9 NiFi Registry and Git-Based Flow Version Control

4.10 Security Hardening

LDAP Authentication

Authorization: Role-Based Access Control

Securing Sensitive Properties

4.11 Advanced Monitoring

Prometheus Metrics Scraping

Data Provenance Queries

Bulletins and Warnings

5. Practical Examples

Example 1 — HPC Job Completion Pipeline

Example 2 — Multi-Tier Data Archival

Example 3 — Load Balancing Across Cluster Nodes

6. Hands-On Exercises

Exercise 1 — Cluster Setup and Monitoring

Exercise 2 — Provenance Replay

Exercise 3 — Slurm Job Integration

Exercise 4 — Git-Based Flow Version Control

Exercise 5 — Security Hardening

7. Troubleshooting

Cluster coordination fails, nodes won't elect coordinator

Primary Node never elected

Files disappear from BeeGFS but don't appear in next processor

High memory usage, NiFi crashes

Ceph S3 upload fails with "Signature does not match"

Slurm integration stops updating job status

8. References

Official NiFi Documentation

Processor Documentation

Integration Guides

HPC-Specific Integration

Monitoring and Security

Community Resources

9. Related Tutorials

Summary