Apache NiFi for HPC System Administrators: Deep Dive Reference
1. Overview
This deep-dive reference covers advanced Apache NiFi topics for HPC system administrators: clustering architecture, integration with HPC schedulers (Slurm) and storage systems (BeeGFS, Ceph), containerized deployments (Docker, Kubernetes), flow version control (Registry and Git), security hardening, and comprehensive monitoring strategies.
This guide assumes familiarity with basic NiFi concepts (FlowFile, Processor, Connection, Expression Language) covered in the [[apache-nifi-hpc-sysadmin-beginner-guide|Beginner Guide]]. It is designed as a reference you can return to as you design and maintain production HPC dataflow pipelines.
2. Prerequisites
Before working through this guide, you should have:
- A working standalone NiFi 2.x installation (from the Beginner Guide)
- Comfortable with Linux system administration (users, groups, file permissions, systemd)
- Access to HPC infrastructure or simulators:
- Slurm cluster (or ability to run
slurmrestdstandalone) - BeeGFS filesystem (or equivalent POSIX-compliant parallel filesystem)
- Ceph cluster with S3 RadosGW or access to AWS S3
- Slurm cluster (or ability to run
- Docker and Docker Compose installed (for clustered testing)
- Kubernetes cluster (for K8s deployment section; minikube or a test cluster acceptable)
- Git repository for flow version control (GitHub, GitLab, Bitbucket, or local)
- Text editor or IDE for configuration files and scripts
3. Key Concepts
3.1 NiFi Core Abstractions
Expression Language (EL)
NiFi Expression Language enables dynamic processor configuration using FlowFile attributes. Syntax: ${attribute_name} or ${attribute_name:function()}.
Common patterns for HPC:
# Extract job ID from Slurm output
${slurm.output:substringAfter('job')
:substringBefore(':')}
# Construct S3 key with date partitioning
jobs/${job.id}/${now():format('yyyy/MM/dd')}/${filename}
# Conditional routing by job status
${job.status:equals('COMPLETED')}
# Test file size threshold
${fileSize:gt(1073741824)} # > 1 GB
# Dynamic SFTP host from attribute
sftp://${sftp.remote.host}:${sftp.remote.port}/${path}${filename}
Function categories:
- String:
toUpper(),toLower(),trim(),substringBefore(),substringAfter(),length(),substring(),replace(),split() - Numeric:
lt(),gt(),le(),ge(),equals(),plus(),minus() - Boolean:
matches(),contains(),startsWith(),endsWith() - Date:
now(),format(),toDate(),fromDate() - Type conversion:
toNumber(),toBoolean()
3.2 Cluster Architecture
A NiFi cluster consists of:
- ZooKeeper: Distributed coordination service for cluster state (leader election, configuration sync).
- Multiple NiFi nodes: Each node can process FlowFiles independently.
- Cluster Coordinator: Elected by ZooKeeper; manages cluster-wide state and flow propagation.
- Primary Node: A separately elected role; only processes that require single-point coordination (ListFile, ListSFTP, ListS3) run here.
Key behaviors:
- Flow synchronization: The Cluster Coordinator syncs the flow definition to all nodes.
- Queue replication: Connection queues are replicated across nodes for fault tolerance.
- Load balancing: ProcessGroups can distribute load across available nodes.
- Backpressure: If a node's queue fills (by data size or FlowFile count), upstream nodes pause.
3.3 Data Provenance Architecture
NiFi maintains a complete audit trail in the provenance repository:
- Every FlowFile creation, read, write, route, fork, join, drop event is recorded.
- Records include: processor ID, timestamp, FlowFile UUID, attribute snapshots, lineage.
- Stored on disk in configurable location (default:
./provenance_repository/). - Queryable via the Data Provenance UI or REST API.
- Replay feature: Resend a FlowFile from any point in its history to re-run processing.
4. Step-by-Step Instructions
4.1 Installing NiFi as a Systemd Service
For production deployments, run NiFi as a managed systemd service for automatic restart and clean shutdown:
# Install as a service
/opt/nifi/bin/nifi.sh install
# Start the service
systemctl start nifi
# Enable auto-start on boot
systemctl enable nifi
# Check status
systemctl status nifi
# View logs
journalctl -u nifi -f
Expected output:
● nifi.service - Apache NiFi
Loaded: loaded (/opt/nifi/bin/nifi > loaded to systemd)
Active: active (running) since Thu 2026-04-10 12:00:00 UTC
4.2 Configuring NiFi for Production
Edit /opt/nifi/conf/nifi.properties:
# Web UI security (HTTPS required for cluster)
nifi.web.https.host=nifi1.hpc.example.com
nifi.web.https.port=8443
# TLS keystore/truststore (generated by tls-toolkit.sh)
nifi.security.keystore=./conf/keystore.p12
nifi.security.keystoreType=PKCS12
nifi.security.keystorePasswd=<generated-password>
nifi.security.truststore=./conf/truststore.p12
nifi.security.truststoreType=PKCS12
nifi.security.trusstorePasswd=<generated-password>
# Authentication
nifi.security.user.login.identity.provider=single-user-provider
# Encryption of sensitive properties
nifi.sensitive.props.key=<12+ character passphrase>
nifi.sensitive.props.algorithm=NIFI_PBKDF2_AES_GCM_256
# Repository paths (can point to fast/large storage)
nifi.content.repository.directory.default=/data/nifi/content_repository
nifi.flowfile.repository.directory=/data/nifi/flowfile_repository
nifi.provenance.repository.directory.default=/data/nifi/provenance_repository
nifi.database.repository.directory=./database_repository
# Provenance storage limits
nifi.provenance.repository.max.storage.size=50 GB
nifi.provenance.repository.max.storage.time=30 days
nifi.provenance.repository.rollover.time=10 mins
# JVM memory settings (in bootstrap.conf)
# java.arg.2=-Xms4g
# java.arg.3=-Xmx16g
# Cluster mode (set per-node)
nifi.cluster.is.node=false # Set true on each node
nifi.cluster.node.address=nifi1.hpc.example.com
nifi.cluster.node.protocol.port=11443
nifi.zookeeper.connect.string=zk1.hpc.example.com:2181,zk2.hpc.example.com:2181,zk3.hpc.example.com:2181
nifi.cluster.flow.election.max.candidates=3
nifi.cluster.protocol.is.secure=true
4.3 Generating TLS Certificates for Cluster
Use NiFi's TLS Toolkit to generate proper certificates:
# Generate certificates for all nodes
/opt/nifi/bin/tls-toolkit.sh standalone \
-n "nifi1.hpc.example.com,nifi2.hpc.example.com,nifi3.hpc.example.com" \
-C "CN=admin,OU=NiFi" \
-o ./nifi-certs
# Output: nifi-certs/nifi1.hpc.example.com/, nifi2.hpc.example.com/, etc.
# Each contains: nifi.properties, keystore.p12, truststore.p12
For each node:
- Copy the node's directory contents to
/opt/nifi/conf/ - Update nifi.properties with the node address and ZK quorum
- Set proper permissions:
chmod 600 conf/keystore.p12 conf/truststore.p12
4.4 Building a Slurm Integration Pipeline
Complete workflow: Monitor input → Submit Slurm job → Poll status → Process output → Archive results
Components:
- Input Monitor —
ListFilewatching/scratch/jobs/input/ - Job Submission —
ExecuteProcessrunningsbatch - Job ID Extraction —
ExtractTextparsing sbatch output - Status Polling —
ExecuteStreamCommandrunningsqueueon a loop - Routing —
RouteOnAttributebranching on job status (RUNNING/COMPLETED/FAILED) - Output Processing —
ListFilemonitoring job output directory - Archive —
PutS3Objectuploading to Ceph - Notification —
PutSlackorPutEmailon completion
Configuration example:
[ListFile: /scratch/jobs/input/]
Input Directory: /scratch/jobs/input/
File Filter: .*\.job$
Minimum File Age: 30 sec
--> success
[UpdateAttribute]
job.submitted.time = ${now():format('yyyy-MM-dd HH:mm:ss')}
job.input.file = ${absolute.path}
--> [ExecuteProcess: sbatch]
Command: /usr/bin/sbatch
Arguments: --job-name=${filename:substringBefore('.')} --time=02:00:00 \
--export=INPUT_FILE=${job.input.file} \
/opt/scripts/process.sh
--> success
[ExtractText]
Property: slurm.job.id
Value (regex): Submitted batch job (\d+)
--> success
[Wait]
Wait Duration: 30 sec
--> [ExecuteStreamCommand: squeue polling loop]
Command: /bin/bash
Arguments: -c "squeue --job=${slurm.job.id} --format='%T' --noheader"
--> success
[ExtractText]
Property: slurm.status
Value: (RUNNING|PENDING|COMPLETED|FAILED)
-->
[RouteOnAttribute]
"completed": ${slurm.status:equals('COMPLETED')}
"failed": ${slurm.status:equals('FAILED')}
"pending": ${slurm.status:matches('RUNNING|PENDING')}
completed --> [ListFile: /scratch/jobs/${slurm.job.id}/output/]
--> [FetchFile]
--> [UpdateAttribute]
s3.key = jobs/${slurm.job.id}/${now():format('yyyy/MM/dd')}/${filename}
--> [PutS3Object]
Bucket: job-results
Endpoint: http://ceph-rgw.hpc.example.com:7480
--> success
--> [PutSlack]
Message: Job ${slurm.job.id} completed!
failed --> [PutSlack]
Message: Job ${slurm.job.id} FAILED
pending --> [RouteOnAttribute: decision point, feeds back to squeue loop]
4.5 BeeGFS Integration Pipeline
BeeGFS is a POSIX-compliant parallel filesystem. No special integration needed — use standard file processors:
# Monitor BeeGFS for new data
ListFile:
Input Directory: /beegfs/scratch/experiments/
File Filter: .*\.h5$ # HDF5 files
Minimum File Age: 60 sec # Files still being written
Recurse Subdirectories: true
# Read file content
FetchFile:
File to Fetch: ${absolute.path}
Completion Strategy: Move to Directory
Move Destination: /beegfs/scratch/processing/
# Transform path for S3 upload
UpdateAttribute:
s3.bucket: hpc-data
s3.key: raw/${now():format('yyyy/MM/dd')}/${filename}
# Upload to Ceph
PutS3Object:
Bucket: ${s3.bucket}
Object Key: ${s3.key}
Endpoint Override URL: http://ceph-rgw.hpc.example.com:7480
AWS Credentials Provider: CephCredentials
Performance considerations:
- Set
Minimum File Ageto prevent reading files still being written (typically 30-60 seconds) - For large files (>1 GB), ensure NiFi heap is sized appropriately (content repository holds file in memory temporarily)
- Consider keeping NiFi content repository on fast local NVMe, separate from monitored BeeGFS paths
- Use FetchFile Move to Directory completion strategy to atomically move processed files, preventing re-ingestion
4.6 Ceph S3 Integration with AWSCredentialsProviderControllerService
Setup Controller Service:
- Right-click canvas → Configure → Controller Services tab
- Add:
AWSCredentialsProviderControllerService - Configure:
Access Key ID: <ceph-rgw-access-key>Secret Access Key: <ceph-rgw-secret-key>
- Click lightning bolt to enable
Processor configuration:
For PutS3Object:
Bucket: hpc-results
Region: us-east-1 # Ceph uses us-east-1 as default
Endpoint Override URL: http://ceph-rgw.hpc.example.com:7480
AWS Credentials Provider Service: AWSCredentialsProviderControllerService
Use Path Style Access: true # REQUIRED for Ceph
Multipart Upload Threshold: 5 GB
Multipart Part Size: 100 MB
Content Type: ${mime.type}
For ListS3:
Bucket: hpc-raw-data
Prefix: incoming/2026/
Endpoint Override URL: http://ceph-rgw.hpc.example.com:7480
AWS Credentials Provider Service: AWSCredentialsProviderControllerService
For FetchS3Object:
Bucket: ${s3.bucket}
Object Key: ${s3.key}
Endpoint Override URL: http://ceph-rgw.hpc.example.com:7480
AWS Credentials Provider Service: AWSCredentialsProviderControllerService
Complete BeeGFS → Slurm → Ceph pipeline:
[ListFile: /beegfs/raw-data/] (Primary Node, 5 min)
--> [FetchFile: Move to /beegfs/in-processing/]
--> [UpdateAttribute]
s3.key = raw/${now():format('yyyy/MM/dd')}/${filename}
s3.bucket = hpc-input
slurm.input_s3_key = ${s3.key}
--> [PutS3Object: Endpoint=http://ceph-rgw:7480]
--> [UpdateAttribute]
slurm.script = /opt/scripts/process.sh
--> [ExecuteProcess: sbatch]
sbatch --export=INPUT_S3_KEY=${slurm.input_s3_key} \
${slurm.script}
--> [ExtractText: slurm.job.id = Submitted batch job (\d+)]
--> [Wait + squeue poll loop]
--> [RouteOnAttribute: completed/failed]
--> [PutSlack: "Pipeline complete: ${slurm.job.id}"]
4.7 Docker Compose Multi-Node Cluster
File: docker-compose.yml
version: '3.8'
services:
zookeeper:
image: bitnami/zookeeper:3.8
hostname: zookeeper
container_name: zookeeper
ports:
- "2181:2181"
environment:
ALLOW_ANONYMOUS_LOGIN: "yes"
ZOO_MY_ID: "1"
volumes:
- zookeeper_data:/bitnami/zookeeper
nifi:
image: apache/nifi:2.8.0
ports:
- "8080-8082:8080" # Maps host ports 8080-8082 to container 8080
environment:
NIFI_WEB_HTTP_PORT: "8080"
NIFI_CLUSTER_IS_NODE: "true"
NIFI_CLUSTER_NODE_PROTOCOL_PORT: "8082"
NIFI_ZK_CONNECT_STRING: "zookeeper:2181"
NIFI_ELECTION_MAX_WAIT: "1 min"
NIFI_ELECTION_MAX_CANDIDATES: "3"
SINGLE_USER_CREDENTIALS_USERNAME: "admin"
SINGLE_USER_CREDENTIALS_PASSWORD: "adminpassword123"
depends_on:
- zookeeper
volumes:
- nifi_content:/opt/nifi/nifi-current/content_repository
- nifi_db:/opt/nifi/nifi-current/database_repository
- nifi_flowfile:/opt/nifi/nifi-current/flowfile_repository
- nifi_provenance:/opt/nifi/nifi-current/provenance_repository
volumes:
zookeeper_data:
nifi_content:
nifi_db:
nifi_flowfile:
nifi_provenance:
Operations:
# Start with 3 nodes
docker compose up --scale nifi=3 -d
# View logs
docker compose logs -f nifi
# Scale to 5 nodes
docker compose up --scale nifi=5 -d
# Stop everything
docker compose down
# Access UI (each node on different port)
https://localhost:8080/nifi # Node 1
https://localhost:8081/nifi # Node 2
https://localhost:8082/nifi # Node 3
Cluster notes:
- ZooKeeper elects a Cluster Coordinator automatically.
- The Primary Node is separately elected — schedule listing processors to run "on Primary Node only."
- Scaling down loses in-flight data on removed nodes; drain queues first.
4.8 NiFi on Kubernetes with Helm Chart
Add Apache Helm repository:
helm repo add apache https://charts.apache.org
helm repo update
Basic values.yaml:
replicaCount: 3
image:
repository: apache/nifi
tag: "2.8.0"
service:
type: ClusterIP
port: 8080
persistence:
enabled: true
storageClass: "fast-ssd" # Must exist in your cluster
size: 50Gi
zookeeper:
enabled: true
replicaCount: 3
properties:
sensitive:
key: "your-12-char-minimum-passphrase"
Install:
helm install nifi apache/nifi -f values.yaml -n nifi --create-namespace
# Watch pods come up
kubectl get pods -n nifi -w
# Port-forward for testing
kubectl port-forward -n nifi svc/nifi 8080:8080
# Access at http://localhost:8080/nifi
4.9 NiFi Registry and Git-Based Flow Version Control
NiFi 2.x Git-based Registry (recommended):
1. In UI: Hamburger menu → Controller Settings → Registry Clients
2. Add: GitHubFlowRegistryClient (or GitLabFlowRegistryClient, etc.)
3. Configure:
Repository URL: https://github.com/yourorg/nifi-flows.git
Branch: main
Access Token: (GitHub personal access token with repo scope)
4. Right-click Process Group → Version → Start Version Control
5. Select Registry Client and bucket (directory in repo)
6. On each change, right-click → Version → Commit Changes
Each version is stored as JSON in the git repo, enabling:
- Full audit trail
- Rollback to previous versions
- CI/CD pipeline integration
- Collaborative flow development
4.10 Security Hardening
LDAP Authentication
Edit /opt/nifi/conf/login-identity-providers.xml:
<provider>
<identifier>ldap-provider</identifier>
<class>org.apache.nifi.ldap.LdapProvider</class>
<property name="Authentication Strategy">SIMPLE</property>
<property name="Manager DN">cn=admin,dc=example,dc=com</property>
<property name="Manager Password">password</property>
<property name="TLS - Keystore">./conf/keystore.p12</property>
<property name="TLS - Truststore">./conf/truststore.p12</property>
<property name="Referral Strategy">FOLLOW</property>
<property name="Connect Timeout">10 secs</property>
<property name="Read Timeout">10 secs</property>
<property name="Url">ldap://ldap.example.com:389/dc=example,dc=com</property>
<property name="User Search Filter">(uid={0})</property>
<property name="Identity Strategy">USE_USERNAME</property>
</provider>
Then update nifi.properties:
nifi.security.user.login.identity.provider=ldap-provider
Authorization: Role-Based Access Control
Edit /opt/nifi/conf/authorizers.xml to define users and permissions:
<authorizer>
<identifier>managed-authorizer</identifier>
<class>org.apache.nifi.authorization.StandardManagedAuthorizer</class>
<property name="Access Control Strategy">with-external-authorization-provider</property>
<property name="Authorizations File">./conf/authorizations.xml</property>
<property name="Users File">./conf/users.xml</property>
<property name="Initial Admin Identity">admin</property>
</authorizer>
In UI, assign users to roles (Viewer, Editor, Admin) and set processor/canvas permissions.
Securing Sensitive Properties
All sensitive properties (passwords, API keys, S3 credentials) in processor configs are encrypted at rest using the nifi.sensitive.props.key. Set this before first run:
# In nifi.properties
nifi.sensitive.props.key=MySecure12CharKey!
nifi.sensitive.props.algorithm=NIFI_PBKDF2_AES_GCM_256
View an encrypted property:
# Encrypted in flow.xml.gz
cat conf/flow.xml.gz | gunzip | grep -A2 "sensitivePropertyValue"
# Output: enc{N4...} (encrypted)
4.11 Advanced Monitoring
Prometheus Metrics Scraping
NiFi 2.x exposes Prometheus metrics at: /nifi-api/flow/metrics/prometheus
Prometheus scrape config:
global:
scrape_interval: 30s
scrape_configs:
- job_name: 'nifi'
scheme: https
tls_config:
insecure_skip_verify: true # For self-signed; use ca_file in production
basic_auth:
username: 'admin'
password: 'adminpassword123'
static_configs:
- targets: ['nifi1.hpc.example.com:8443']
metrics_path: '/nifi-api/flow/metrics/prometheus'
Grafana dashboard:
- Import dashboard ID 12314 (NiFi Prometheus Reporting Task Dashboard)
- Shows real-time flow rates, queue sizes, processor latencies, cluster status
Data Provenance Queries
REST API examples:
# Search provenance events
curl -k -u admin:password \
'https://nifi1.hpc.example.com:8443/nifi-api/provenance/search' \
-X POST -d '{
"request": {
"searchTerms": {
"ProcessorID": "<processor-uuid>",
"EventType": "SEND"
},
"startDate": "2026-04-10T00:00:00Z",
"endDate": "2026-04-10T23:59:59Z"
}
}'
# Get lineage for a FlowFile
curl -k -u admin:password \
'https://nifi1.hpc.example.com:8443/nifi-api/provenance/<event-id>/lineage'
# Replay a FlowFile from a specific point
curl -k -u admin:password \
'https://nifi1.hpc.example.com:8443/nifi-api/provenance/<event-id>/replay' \
-X POST
Bulletins and Warnings
View in UI: Hamburger menu → Bulletin Board
Configure per-processor:
- Right-click processor → Configure → Settings
- Set Bulletin Level: DEBUG / INFO / WARN / ERROR
- Only messages at or above this level appear
5. Practical Examples
Example 1 — HPC Job Completion Pipeline
Monitor a Slurm job output directory, validate results, archive to Ceph, and notify team.
[ListFile: /scratch/jobs/*/output/]
(Primary Node, 60 sec)
File Filter: .*\.result$
Minimum File Age: 30 sec
--> success
[FetchFile: Move to /scratch/jobs/*/output-processed/]
--> success
[ExecuteStreamCommand: validate results]
Command: /bin/bash
Arguments: -c "python3 /opt/scripts/validate.py ${absolute.path}"
--> success
[RouteOnContent: check validation output]
Property "valid": ^PASS
Property "invalid": ^FAIL
--> valid
[UpdateAttribute]
s3.bucket: job-results
s3.key: ${path:substringAfter('jobs/'):substringBefore('/output')}/${filename}
--> [PutS3Object]
Bucket: ${s3.bucket}
Object Key: ${s3.key}
Endpoint: http://ceph-rgw:7480
--> success
[PutSlack: "Result ${s3.key} validated and archived"]
--> invalid
[PutSlack: "Result validation FAILED: ${absolute.path}"]
[PutEmail: alert@hpc.example.com with attachment]
Example 2 — Multi-Tier Data Archival
Route files to different storage based on size and age.
[ListFile: /beegfs/archive-queue/]
Minimum File Age: 1 day
--> [FetchFile]
--> [RouteOnAttribute]
"small": ${fileSize:lt(1073741824)} # < 1 GB
"large": ${fileSize:ge(1073741824)} # >= 1 GB
--> small
[UpdateAttribute]
s3.bucket: archive-cold
storage.tier: cold-archive
--> [PutS3Object: Ceph STANDARD storage class]
--> large
[UpdateAttribute]
s3.bucket: archive-warm
storage.tier: warm-archive
--> [PutS3Object: Ceph INTELLIGENT_TIERING storage class]
Example 3 — Load Balancing Across Cluster Nodes
Distribute processing across all nodes (not just Primary Node).
[ListFile: /beegfs/input/ (Primary Node only)]
--> success
[FetchFile]
--> success (processed on ANY node)
[ExecuteStreamCommand: heavy computation]
(will load-balance across cluster)
--> success
[PutS3Object: upload results]
6. Hands-On Exercises
Exercise 1 — Cluster Setup and Monitoring
- Deploy 3-node NiFi cluster using Docker Compose
- Verify all nodes register with ZooKeeper
- View cluster status in UI: Hamburger → Summary
- Scale up to 5 nodes, observe queue redistribution
- Kill one node's container and observe failover
Exercise 2 — Provenance Replay
- Create a simple flow: ListFile → ReplaceText → PutFile
- Introduce an error in ReplaceText (bad regex)
- Fix the error
- Use Data Provenance → Replay to reprocess failed FlowFiles without re-ingesting source files
Exercise 3 — Slurm Job Integration
- Set up a Slurm test environment (or use
slurmctldin Docker) - Build complete pipeline: monitor input → sbatch → squeue poll → process output → Ceph archive
- Submit a real job, watch NiFi track its lifecycle
- Test failure handling (kill job, verify alert routing)
Exercise 4 — Git-Based Flow Version Control
- Create a GitHub repository:
nifi-flows - In NiFi UI, set up GitHub Flow Registry Client
- Create a Process Group, start version control
- Make changes, commit (creates JSON in git)
- Verify git history:
git log --oneline nifi/flows.json - Rollback to previous version in UI
Exercise 5 — Security Hardening
- Generate TLS certificates using tls-toolkit.sh for 3 nodes
- Configure cluster with HTTPS and mutual TLS
- Set up LDAP authentication (use OpenLDAP in Docker for testing)
- Create users and assign roles (Viewer, Editor, Admin)
- Verify role-based access control (viewer can't edit, admin can)
7. Troubleshooting
Cluster coordination fails, nodes won't elect coordinator
Symptom: NiFi logs show "Unable to connect to ZooKeeper" or coordinator never elected.
Diagnosis:
- Check ZooKeeper is running:
echo ruok | nc localhost 2181 - Verify network connectivity between nodes and ZK:
telnet zk-host 2181 - Check nifi.properties
zookeeper.connect.stringon all nodes (must be identical)
Fix: Ensure ZooKeeper quorum is correct and all nodes can reach it. Restart NiFi after fixing connection.
Primary Node never elected
Symptom: Listing processors (ListFile, ListSFTP, ListS3) show "Node does not have the Primary role."
Diagnosis:
- Check ZooKeeper is healthy
- Verify cluster has stable quorum (odd number of nodes recommended)
- Check logs for coordination errors
Fix: Restart coordinator node (or entire cluster if persistent).
Files disappear from BeeGFS but don't appear in next processor
Symptom: ListFile deletes source, but FetchFile fails with "File not found."
Diagnosis:
Completion Strategyon ListFile is "Delete" (not recommended)- File is deleted by external process between ListFile and FetchFile
Fix:
- Set ListFile
Keep Source File: truefor testing - Use FetchFile
Completion Strategy: Move to Directoryinstead
High memory usage, NiFi crashes
Symptom: NiFi process uses > 10 GB RAM or is killed by OOM killer.
Diagnosis:
- Check System Diagnostics: Hamburger → System Diagnostics
- Look at content repository disk space
Fix:
- Increase JVM heap: Edit
/opt/nifi/conf/bootstrap.conf, raisejava.arg.3(Xmx) - Reduce queue back-pressure thresholds to prevent queues from growing unbounded
- Enable disk-based queue overflow (advanced config)
- Delete old provenance data:
rm -rf ./provenance_repository/then restart
Ceph S3 upload fails with "Signature does not match"
Symptom: PutS3Object returns 403 Forbidden with signature error.
Diagnosis:
- Access key or secret key is wrong
- Endpoint URL format is incorrect
- Region mismatch
Fix:
- Verify credentials: test with
aws s3 ls s3://bucket --endpoint-url http://ceph-rgw:7480 - Check
Use Path Style Accessis enabled (required for Ceph) - Verify
Endpoint Override URLformat:http://ceph-rgw.hpc.example.com:7480(no trailing slash)
Slurm integration stops updating job status
Symptom: ExecuteStreamCommand runs but squeue output stops updating.
Diagnosis:
- Job has finished and aged off squeue output
- SSH connection to Slurm node is stale
Fix:
- Use Slurm REST API (
slurmrestd) instead of CLI (more reliable) - Add timeout to ExecuteStreamCommand and add retry logic
8. References
Official NiFi Documentation
- Apache NiFi System Administrator's Guide
- Apache NiFi Expression Language Guide
- Apache NiFi User Guide
- NiFi Cluster Architecture
Processor Documentation
- ListFile Processor
- FetchFile Processor
- ListS3 Processor
- PutS3Object Processor
- ExecuteProcess Processor
- ExecuteStreamCommand Processor
- InvokeHTTP Processor
- RouteOnAttribute Processor
- JoltTransformJSON Processor
Integration Guides
- NiFiKop Kubernetes Operator
- Stackable NiFi Operator
- Apache NiFi Helm Chart
- NiFi Docker Community
- NiFi.rocks Docker Compose Clustering
HPC-Specific Integration
Monitoring and Security
Community Resources
9. Related Tutorials
- [[apache-nifi-hpc-sysadmin-beginner-guide|Apache NiFi HPC Sysadmin Beginner Guide]]
- [[kubernetes-beginner-guide]]
- [[docker-beginner-guide]]
- [[linux-permissions-deep-dive]]
- [[slurm-administrator-guide]]
Summary
Key takeaways:
- Clustering with ZooKeeper enables fault tolerance, automatic failover, and load distribution across nodes.
- Primary Node election ensures listing processors (ListFile, ListS3, ListSFTP) run once per cycle, preventing duplicate processing.
- Data Provenance provides complete audit trail and replay capabilities for debugging and compliance.
- Slurm integration automates HPC job submission, status monitoring, and result processing through ExecuteProcess and InvokeHTTP.
- BeeGFS/Ceph integration leverages NiFi's file processors (no special connectors needed) for multi-tier data archival and movement.
- Kubernetes deployment via Helm or NiFiKop operator enables cloud-native HPC dataflow automation.
- Git-based flow version control (NiFi 2.x) enables collaborative flow development and CI/CD integration.
- Security hardening includes TLS, LDAP/Kerberos authentication, RBAC, and encrypted sensitive properties.
- Monitoring via Prometheus/Grafana and provenance queries enables real-time visibility into flow health and data lineage.
Next steps:
- Deploy a production 3-node NiFi cluster with external ZooKeeper on your HPC infrastructure.
- Build a complete BeeGFS → Slurm → Ceph → Notification pipeline tailored to your workflows.
- Integrate NiFi monitoring with your existing Prometheus/Grafana stack.
- Set up Git-based flow version control for collaborative development across your team.
- Evaluate Kubernetes deployment if your HPC center is containerization-ready.
- Document runbooks for common operational tasks (scaling, backup/restore, certificate renewal).