Apache NiFi: Deep Dive Reference

1. Overview

This reference guide covers the complete Apache NiFi ecosystem at an advanced level: cluster architecture and ZooKeeper coordination, performance tuning and bottleneck analysis, security frameworks and encryption strategies, the Expression Language for dynamic configuration, advanced processor patterns for HPC and multi-cluster workflows, Git-based flow versioning, Docker and Kubernetes deployment, monitoring and observability infrastructure, and production-grade operational best practices.

This guide assumes you are familiar with basic NiFi concepts (FlowFiles, Processors, Connections, Process Groups) covered in the [[apache-nifi-beginner-guide|Apache NiFi Beginner Guide]]. It is designed as a reference you can return to as your workflows grow in complexity and production requirements demand deeper understanding.

2. Prerequisites

Before starting this deep dive, you should have:

Hands-on experience with standalone NiFi — you've built and debugged flows in the UI
Familiarity with basic concepts — FlowFiles, Processors, Connections, Controller Services, Data Provenance
Understanding of the Expression Language — at least the basics of ${attribute:function} syntax
Java 21+ — required for NiFi 2.x
A Linux/UNIX environment — commands in this guide assume a Bash shell
Docker and docker-compose (optional but recommended for cluster simulation)
Knowledge of distributed systems — basic understanding of coordination, consensus, and failure modes
Familiarity with monitoring tools — Prometheus, HTTP clients, log aggregation concepts

3. Key Concepts

3.1 Cluster Architecture and Coordination

NiFi cluster mode runs multiple NiFi nodes orchestrated by ZooKeeper (ZK). One node is elected the Primary Node (runs single-node-only processors like ListFile), and another is the Cluster Coordinator (manages cluster state, handles join/leave events).

Flow Distribution:

The Cluster Coordinator holds the authoritative flow definition
When you modify the flow via any node's UI, the coordinator broadcasts the change to all nodes
All nodes execute the flow, but Primary Node processors run only on the Primary Node

FlowFile Repository:

Each node maintains its own FlowFile repository (pointer to content and attributes)
When a node fails, its in-flight FlowFiles are recovered from the repository on restart
Connection backpressure prevents queue explosion: if a connection reaches its limit (e.g., 10,000 FlowFiles or 1 GB), upstream processors pause

Content Repository:

Content is stored on local fast storage (NVMe preferred)
Multiple processors reference the same content via copy-on-write pointers
Claim management prevents orphaned data after FlowFile termination

Provenance Repository:

Provenance is written asynchronously to disk and indexed by Lucene
Query performance degrades with size — rolling indices (hourly, daily) help
Set nifi.provenance.repository.retention.* to manage retention

3.2 ZooKeeper and Leader Election

NiFi uses ZooKeeper 3.9+ for distributed coordination:

Node registration — each NiFi node registers itself as an ephemeral node in ZK when it starts
Primary Node election — nodes compete for the Primary Node role; the winner gets an exclusive lock
Quorum management — if you lose more than half of your ZK ensemble, the cluster becomes read-only (unavailable)
Session timeouts — if a node loses ZK connection for longer than zookeeper.session.timeout, it is forcibly removed from the cluster

Typical ZK setup for HA:

3 ZK nodes for production (2-node ensemble is vulnerable; 1-node is not HA)
ZK cluster must have odd node counts for quorum
Separate ZK machines from NiFi machines (shared hardware defeats HA purpose)

3.3 Primary Node vs. All-Nodes Processors

Some processors are marked Primary Node Only (default isolation level):

Primary Node Only:

ListFile, ListSFTP, ListS3, ListDatabaseTable — produce one FlowFile per source item
Running these on all nodes would cause duplicate processing
They block in-flight FlowFiles on non-primary nodes until Primary Node role is acquired

All Nodes:

FetchFile, FetchS3Object, PutFile, PutS3Object — process inbound FlowFiles
Run on all nodes for horizontal scalability
Each node pulls from its local queue and processes independently

You can override execution mode on a per-processor basis via the Execution property in Configure → Scheduling, but understand the implications.

3.4 Expression Language Internals

NiFi's Expression Language (EL) evaluates at runtime per FlowFile. This enables dynamic configuration without flow restarts.

Subject-function composition:

${attribute-name:function-name:function-name:...}

Evaluation context:

The subject (left of first colon) is resolved first: attribute lookup, system property, or environment variable
Each function is applied left-to-right to the result
If any function fails (e.g., regex mismatch), the result is null and the connection may route to unmatched (RouteOnAttribute)

Common pitfalls:

Precedence: EL operators have strict precedence; use parentheses
Null handling: Null propagates through most functions; ${attr:isEmpty()} returns false for null, not true
Type coercion: Numbers are coerced to strings; comparison ${fileSize:gt(1000)} compares as strings lexicographically, not numerically — use gt(1000) syntax
Attribute names: Must be alphanumeric + underscores; nested attributes require custom processors

Advanced techniques:

# Conditional: if "env" attribute equals "prod", use secret key, else use dev key
${env:equals('prod'):ifTrue(#{prod.secret.key}):ifFalse(#{dev.secret.key})}

# Dynamic routing based on file size tiers
${fileSize:ge(1073741824):ifTrue('large'):ifFalse(${fileSize:ge(1048576):ifTrue('medium'):ifFalse('small')})}

# Extract job ID from filename pattern: "job-12345-data.csv"
${filename:substringAfter('job-'):substringBefore('-')}

3.5 Controller Service and Property Encryption

Controller Services are singleton resources instantiated once per NiFi instance (even in cluster, only one copy).

Common gotchas:

Services marked Sharing Profile: Restricted can only be referenced by processors in the same scope (process group)
Services marked Sharing Profile: Shared are global
Enabling a service in use by running processors requires those processors to be stopped
Changing a service's properties triggers a brief pause in dependent processors

Property Encryption:

All properties containing passwords, API keys, or other sensitive data are encrypted with AES-GCM-256 using the key in nifi.sensitive.props.key
This key is never transmitted — encrypted values are stored in flow.json.gz
Changing the encryption key will lose all encrypted properties — do not do this in production

3.6 Back-Pressure and Queue Management

Each connection has a configured back-pressure strategy:

Back-Pressure Object Threshold:

If queue depth (number of FlowFiles) exceeds this, upstream processor pauses
Default: no limit (dangerous for unlimited sources)
Recommendation: set to 10,000–100,000 depending on file size

Back-Pressure Data Size Threshold:

If queue size (total bytes) exceeds this, upstream processor pauses
Default: no limit
Recommendation: set to 1–5 GB for bounded memory usage

Overflow action:

Fail — drop FlowFile with error
Roll over — use secondary storage (slower but safer for large queues)

Monitoring back-pressure:

Check Summary table (top-right menu) — look for non-zero "out" values on connections
View Status History on a connection to see queue depth over time
If a connection consistently hits back-pressure limits, that processor is a bottleneck

4. Step-by-Step Instructions

4.1 Setting Up a 3-Node Cluster with Docker Compose

This setup uses Docker to simulate a real 3-node cluster for development and testing.

Prerequisites:

Docker 24+
docker-compose v2+
At least 6 GB RAM available (2 GB per NiFi node + ZK overhead)

Create docker-compose.yml:

version: "3.8"

services:
  zookeeper:
    image: bitnami/zookeeper:3.9
    container_name: nifi-zk
    environment:
      - ALLOW_ANONYMOUS_LOGIN=yes
    ports:
      - "2181:2181"
    healthcheck:
      test: ["CMD", "nc", "-z", "localhost", "2181"]
      interval: 5s
      timeout: 3s
      retries: 3

  nifi-1:
    image: apache/nifi:2.8.0
    container_name: nifi-1
    depends_on:
      zookeeper:
        condition: service_healthy
    environment:
      # Cluster configuration
      - NIFI_CLUSTER_IS_NODE=true
      - NIFI_ZK_CONNECT_STRING=zookeeper:2181
      - NIFI_CLUSTER_ADDRESS=nifi-1
      - NIFI_CLUSTER_NODE_ADDRESS=nifi-1
      - NIFI_CLUSTER_NODE_PROTOCOL_PORT=8080
      # Security (single-user auth)
      - SINGLE_USER_CREDENTIALS_USERNAME=admin
      - SINGLE_USER_CREDENTIALS_PASSWORD=admin123456
      # JVM tuning
      - NIFI_JVM_HEAP_INIT=2g
      - NIFI_JVM_HEAP_MAX=2g
    ports:
      - "8443:8443"
      - "8080:8080"
    volumes:
      - nifi_1_content:/opt/nifi/nifi-current/content_repository
      - nifi_1_flowfile:/opt/nifi/nifi-current/flowfile_repository
      - nifi_1_provenance:/opt/nifi/nifi-current/provenance_repository
    healthcheck:
      test: ["CMD", "curl", "-f", "-k", "https://localhost:8443/nifi-api/system-diagnostics"]
      interval: 10s
      timeout: 5s
      retries: 5

  nifi-2:
    image: apache/nifi:2.8.0
    container_name: nifi-2
    depends_on:
      zookeeper:
        condition: service_healthy
    environment:
      - NIFI_CLUSTER_IS_NODE=true
      - NIFI_ZK_CONNECT_STRING=zookeeper:2181
      - NIFI_CLUSTER_ADDRESS=nifi-2
      - NIFI_CLUSTER_NODE_ADDRESS=nifi-2
      - NIFI_CLUSTER_NODE_PROTOCOL_PORT=8080
      - SINGLE_USER_CREDENTIALS_USERNAME=admin
      - SINGLE_USER_CREDENTIALS_PASSWORD=admin123456
      - NIFI_JVM_HEAP_INIT=2g
      - NIFI_JVM_HEAP_MAX=2g
    ports:
      - "8444:8443"
      - "8081:8080"
    volumes:
      - nifi_2_content:/opt/nifi/nifi-current/content_repository
      - nifi_2_flowfile:/opt/nifi/nifi-current/flowfile_repository
      - nifi_2_provenance:/opt/nifi/nifi-current/provenance_repository
    healthcheck:
      test: ["CMD", "curl", "-f", "-k", "https://localhost:8443/nifi-api/system-diagnostics"]
      interval: 10s
      timeout: 5s
      retries: 5

  nifi-3:
    image: apache/nifi:2.8.0
    container_name: nifi-3
    depends_on:
      zookeeper:
        condition: service_healthy
    environment:
      - NIFI_CLUSTER_IS_NODE=true
      - NIFI_ZK_CONNECT_STRING=zookeeper:2181
      - NIFI_CLUSTER_ADDRESS=nifi-3
      - NIFI_CLUSTER_NODE_ADDRESS=nifi-3
      - NIFI_CLUSTER_NODE_PROTOCOL_PORT=8080
      - SINGLE_USER_CREDENTIALS_USERNAME=admin
      - SINGLE_USER_CREDENTIALS_PASSWORD=admin123456
      - NIFI_JVM_HEAP_INIT=2g
      - NIFI_JVM_HEAP_MAX=2g
    ports:
      - "8445:8443"
      - "8082:8080"
    volumes:
      - nifi_3_content:/opt/nifi/nifi-current/content_repository
      - nifi_3_flowfile:/opt/nifi/nifi-current/flowfile_repository
      - nifi_3_provenance:/opt/nifi/nifi-current/provenance_repository
    healthcheck:
      test: ["CMD", "curl", "-f", "-k", "https://localhost:8443/nifi-api/system-diagnostics"]
      interval: 10s
      timeout: 5s
      retries: 5

volumes:
  nifi_1_content:
  nifi_1_flowfile:
  nifi_1_provenance:
  nifi_2_content:
  nifi_2_flowfile:
  nifi_2_provenance:
  nifi_3_content:
  nifi_3_flowfile:
  nifi_3_provenance:

Start the cluster:

docker compose up -d

Wait for health checks to pass:

docker compose ps

Expected output (all services should show healthy after 30–60 seconds):

CONTAINER ID   IMAGE              STATUS          PORTS
...            nifi-1             healthy        ...
...            nifi-2             healthy        ...
...            nifi-3             healthy        ...

Access the cluster:

Open https://localhost:8443/nifi (nifi-1), https://localhost:8444/nifi (nifi-2), or https://localhost:8445/nifi (nifi-3). All should show the same flow. Log in with admin/admin123456.

Verify cluster membership:

Top-right menu → Cluster
You should see three nodes listed
One will be marked Primary Node, one is Cluster Coordinator

Monitor ZooKeeper activity:

docker compose logs -f zookeeper | grep -E "follower|leader|quorum"

Tear down:

docker compose down -v

4.2 Git-Based Flow Versioning with Flow Registry

NiFi 2.x has native Git integration (NiFi Registry, the separate service, is deprecated).

Prerequisites:

A Git repository (GitHub, GitLab, Gitea, or self-hosted)
SSH key pair for Git access (or PAT token if using HTTPS)

Step 1: Create a Git Repository

On GitHub (or your Git provider), create a new repository named nifi-flows:

git init nifi-flows
cd nifi-flows
echo "# NiFi Flow Repository" > README.md
git add README.md
git commit -m "Initial commit"
git remote add origin git@github.com:your-org/nifi-flows.git
git push -u origin main

Step 2: Configure SSH Key Controller Service

In NiFi UI:

Top-right menu → Controller Settings → Controller Services tab
Click + to add a new service
Choose StandardPrivateKeyCredentialsProvider
Configure:
- Username: git (or your GitHub username)
- Private Key Path: /home/nifi/.ssh/id_ed25519 (path to your SSH private key)
- Known Hosts Path: /home/nifi/.ssh/known_hosts (populate by running ssh-keyscan github.com >> known_hosts)
Enable the service (lightning bolt icon)

Step 3: Create a Flow Registry Client

Top-right menu → Controller Settings → Registry Clients tab
Click +
Choose GitFlowRegistryClient
Configure:
- Name: GitHub
- Flow Storage Directory: /opt/nifi-flows (NiFi will initialize this as a Git repo)
- Remote Clone Repository: git@github.com:your-org/nifi-flows.git
- Branch: main
- Remote Access Credentials: Select your PrivateKeyCredentialsProvider service

Step 4: Version a Process Group

Create a Process Group on your canvas (or use an existing one)
Right-click the Process Group → Version
Click Start Version Control
Select registry client: GitHub
Fill in:
- Bucket: default (maps to a subdirectory in the repo)
- Flow Name: my-hpc-flow
- Flow Description: Describe your flow
Click Save

NiFi will:

Create /opt/nifi-flows/default/my-hpc-flow.json (the flow definition)
Push it to the Git repo
Track changes from now on

Step 5: Commit Changes

After modifying your flow:

Right-click the Process Group → Version → Commit Local Changes
Add a commit message
Click Commit

NiFi pushes the new version to Git with full history.

Step 6: Review and Deploy Across Environments

On another NiFi instance (test, prod):

Create a Flow Registry Client pointing to the same Git repo
Right-click canvas → Version → Import from Registry
Select your flow and version
Click Import

This enables environment promotion: develop on dev, commit, then import the same version to production.

4.3 Tuning NiFi for High Throughput

Scenario: You have 10 million small files to process. Default settings will be slow.

JVM Tuning (conf/bootstrap.conf):

# For 10 million files, you need significant heap
java.arg.2=-Xms16g
java.arg.3=-Xmx16g

# Tune GC for low latency
java.arg.4=-XX:+UseG1GC
java.arg.5=-XX:MaxGCPauseMillis=50
java.arg.6=-XX:+ParallelRefProcEnabled

Repository Tuning (conf/nifi.properties):

# Point repositories to NVMe for speed
nifi.flowfile.repository.directory=/mnt/nvme/flowfile_repository
nifi.content.repository.directory.default=/mnt/nvme/content_repository
nifi.provenance.repository.directory.default=/mnt/nvme/provenance_repository

# Increase write buffer for content repo
nifi.content.claim.max.appendable.size=10MB

# Compress provenance to save disk I/O
nifi.provenance.repository.compress.on.rollover=true
nifi.provenance.repository.rollover.time=15 mins

# Disable full-text indexing of provenance if not needed
nifi.provenance.repository.index.shard.size=500MB

Processor Tuning:

For ListFile (the bottleneck):

Concurrent Tasks:  8 (higher parallelism per node)
Yield Duration:    0 secs (no backoff when directory is empty)
Batch Size:        100 (emit 100 FlowFiles per execution)

For FetchFile/PutFile (I/O heavy):

Concurrent Tasks:  16 (I/O bound; leverage parallelism)
Yield Duration:    100 ms (small backoff when no work)

Connection Back-Pressure:

Back-Pressure Object Threshold:  100,000 FlowFiles
Back-Pressure Data Size Threshold: 5 GB

Allows ListFile to stay ahead without growing unbounded.

Cluster Scaling:

3-node cluster → each FetchFile processor runs on all 3 nodes → 3× throughput
ListFile (Primary Node only) → runs on 1 node; scale by increasing Concurrent Tasks
Layout: Keep ListFile output queued small; push work downstream

4.4 Implementing Sensitive Properties and Secrets Management

Scenario: You have Ceph credentials, API keys, and database passwords to manage securely.

Step 1: Set the Sensitive Properties Key

Before storing any credentials, set nifi.sensitive.props.key in conf/nifi.properties:

nifi.sensitive.props.key=generated-32-character-random-key-here

Generate a secure key:

openssl rand -hex 16  # outputs 32 hex characters

Important: This key is used for AES-GCM-256 encryption. Changing it will lock you out of all stored credentials. Back it up securely.

Step 2: Create a Parameter Context for Secrets

Top-right menu → Parameter Contexts → +
Name it HPC Secrets
Add parameters:
- ceph.access.key (sensitive) — your Ceph RadosGW access key
- ceph.secret.key (sensitive) — your Ceph RadosGW secret key
- slurm.rest.api.token (sensitive) — if using slurmrestd
Click Create

Step 3: Reference in Processors

In any processor property, instead of hardcoding credentials, use:

#{ceph.access.key}
#{ceph.secret.key}

These are resolved at runtime and the actual values never appear in logs or the flow definition.

Step 4: Secure Parameter Context Access

Parameter Contexts can be assigned at the Process Group level:

Right-click Process Group → Configure
Go to Parameter Context tab
Select HPC Secrets

Only processors in that group (and children) can reference those parameters.

Step 5: Using External Secret Stores (Advanced)

For production, consider integrating with HashiCorp Vault:

Write a custom Controller Service that pulls secrets from Vault at runtime
Instead of storing in Parameter Contexts, call Vault's API
Credentials never touch NiFi's disk; pulled on-demand and cached in memory

Implementation sketch:

// Pseudo-code for VaultCredentialsProviderService
@ControllerService
public class VaultCredentialsProvider {
  private VaultClient client;
  
  public String getSecret(String key) {
    return client.read("secret/data/nifi/" + key).getValue();
  }
}

4.5 Advanced Processor Patterns for HPC Workflows

Pattern 1: Trigger External Compute Jobs and Wait for Completion

Flow structure:

ListDatasets → FetchDataset → TriggerJobViaHttp → PollJobStatus (looped) → ArchiveResults

TriggerJobViaHttp (InvokeHTTP):

HTTP Method:  POST
Remote URL:   http://slurmrestd:6820/slurm/v0.0.42/job/submit
Body:         {"job": {"script": "#!/bin/bash...", ...}}
Headers:      Content-Type: application/json

Extract job ID from response using EvaluateJsonPath:

slurm.job.id:  $.job_id

PollJobStatus (ExecuteStreamCommand in a loop):

Command:           /bin/bash
Command Arguments: -c;curl -s http://slurmrestd:6820/slurm/v0.0.42/job/${slurm.job.id}

Extract state with EvaluateJsonPath:

slurm.job.state:  $.job[0].job_state

RouteOnAttribute:

Route	Condition
`complete`	`${slurm.job.state:equals('COMPLETED')}`
`failed`	`${slurm.job.state:in('FAILED','CANCELLED','TIMEOUT')}`
`polling`	`${slurm.job.state:in('PENDING','RUNNING')}`

Connect polling back to a Wait processor (60s delay) to create a polling loop.

Pattern 2: Multi-Cluster Data Staging

Flow structure:

ListS3(ceph) → FetchS3Object → StageToNVMe → InvokeHTTP(notify-remote-cluster) → DeleteS3Object

Use parameter contexts to switch between clusters:

Cluster A: ceph.endpoint=http://ceph-a:7480, stage.path=/nvme/ceph-ingest
Cluster B: ceph.endpoint=http://ceph-b:7480, stage.path=/nvme/ceph-ingest

Assign the appropriate context to the Process Group — same flow works for both clusters.

Pattern 3: Error Recovery with Dead Letter Routing

Instead of terminating failures, route them to a Dead Letter Process Group:

MainFlow (success) → ProcessData
MainFlow (failure) → DeadLetterGroup

In DeadLetterGroup:

Input → LogError → StoreToArchive → NotifyAdmins

Use Data Provenance to replay from the dead letter queue once the issue is fixed:

View Data Provenance in DeadLetterGroup
Select a failed FlowFile
Right-click → Replay → choose the processor to replay from

5. Practical Examples

Example 1 — Building a Three-Layer HPC Pipeline with Parameter Contexts

Scenario: Archive simulation outputs from multiple clusters to Ceph, trigger post-processing jobs, and notify users.

Architecture:

Layer 1 (Source): ListFile (per-cluster)
Layer 2 (Archive): FetchFile → UpdateAttribute → PutS3Object
Layer 3 (Compute): TriggerJob → PollStatus → ArchiveResults
Layer 4 (Notify):  PutSlack + PutEmail

Parameter Context: ArchiveConfig

cluster.name = cluster-a
source.path = /beegfs/scratch/outputs
s3.bucket = hpc-archive
s3.endpoint = http://ceph-rgw:7480

UpdateAttribute adds metadata:

archive.time = ${now():format('yyyy-MM-dd HH:mm:ss')}
archive.cluster = ${cluster.name}
archive.path = ${s3.bucket}/${archive.cluster}/${now():format('yyyy/MM/dd')}/${filename}

PutS3Object uses these:

Object Key: ${archive.path}
Endpoint Override URL: ${s3.endpoint}

Switching to Cluster B: only change the Parameter Context values. The flow stays the same.

Example 2 — Handling Failures and Retries

Scenario: A network transfer occasionally fails. Retry up to 3 times before alerting.

Flow:

FetchFile → PutSFTP → Router
  success → Archive
  failure → UpdateRetryCount → CheckRetries
    (if < 3) → Wait(30s) → FetchFile (loop)
    (if >= 3) → AlertAdmin

UpdateRetryCount:

retry.count = ${retry.count:isEmpty():ifTrue(1):ifFalse(${retry.count:toNumber():plus(1)})}

(If no retry.count attribute, start at 1; otherwise increment)

CheckRetries (RouteOnAttribute):

retry_again = ${retry.count:toNumber():lt(3)}
max_retries = ${retry.count:toNumber():ge(3)}

Wait processor:

Release After: 30 secs

When the FlowFile comes back through FetchFile, it has retry.count = 2 and will retry.

Example 3 — Dynamic Batch Processing

Scenario: Accumulate files until you have at least 10 or 5 minutes have passed, then process as a batch.

Flow:

ListFile → UpdateAttribute(batch.id) → MergeContent → ExecuteScript(process-batch.sh)

MergeContent:

Merge Strategy: Bin Packing
Minimum Number of Files: 10
Maximum Bin Age: 5 mins
Correlation Attribute Name: batch.id

This accumulates up to 10 files with the same batch.id before merging them into a single FlowFile (a TAR or ZIP archive of all inputs).

ExecuteScript unpacks the archive, processes it, and outputs a result FlowFile.

6. Hands-On Exercises

Exercise 1 — Cluster Failover Testing

Start the 3-node Docker cluster (see Section 4.1)
Create a simple flow: ListFile → LogAttribute
View the Cluster page (top-right menu → Cluster)
Note which node is Primary
Stop the Primary Node: docker compose stop nifi-1 (or whichever is Primary)
Watch the logs: docker compose logs -f — look for "Primary Node election"
After 30–60 seconds, check Cluster page again — a new Primary should be elected
Drop files into the watch directory — they should still be processed by the remaining nodes
Restart the failed node: docker compose up -d nifi-1
Confirm it rejoins the cluster (Cluster page shows 3 nodes again)

Exercise 2 — Expression Language Mastery

Create a flow with UpdateAttribute that:
- Takes a filename like dataset-2026-04-10-001.csv
- Extracts the date: extract.date = ${filename:substringAfter('-'):substringBefore('-'):regex('(\\d{4}-\\d{2}-\\d{2})', '$1')}
- Extracts the sequence number: extract.seq = ${filename:substringAfterLast('-'):substringBefore('.')}
- Builds a routing key: route.key = ${extract.date}/${extract.seq}
Log these attributes and verify they match your expectations
Use the routing key in a RouteOnAttribute processor to branch to different directories

Exercise 3 — Performance Analysis with Status History

Build a flow with ListFile → FetchFile → PutFile
Drop 1000 files into the input directory
Start the flow and watch the throughput
Right-click the ListFile → FetchFile connection → View Status History
Graph: bytes in, bytes out, FlowFiles in, FlowFiles out over time
Increase ListFile's Concurrent Tasks from 1 to 4 to 8
Observe the performance impact on the status history graph
Find the optimal setting (CPU saturation vs. throughput)

Exercise 4 — Dead Letter Routing and Recovery

Create a flow that intentionally fails some FlowFiles (e.g., RouteOnAttribute with a bad path)
Route failures to a DeadLetterGroup (separate Process Group)
In DeadLetterGroup, log the failed FlowFile but don't terminate it
Using Data Provenance, find a failed FlowFile
Right-click → Replay → replay from the original processor
Verify the replayed FlowFile goes through the flow again
Fix the underlying issue (e.g., correct the path)
Verify the replayed FlowFile now succeeds

7. Troubleshooting

ZooKeeper Quorum Loss

Symptom: Cluster goes read-only; you see "Insufficient connected nodes to form quorum" in logs. Cause: More than half of ZK nodes are down. Fix:

Bring ZK nodes back online immediately
Alternatively, restart all NiFi nodes after ZK is healthy (they will rejoin)
Prevent this by monitoring ZK separately and alerting on node failures

Primary Node Stuck in Election

Symptom: No Primary Node elected; processors with Primary Node isolation don't run. Cause: ZK connectivity issue or network partition between nodes Fix:

Verify ZK is running and all NiFi nodes can reach it
Check firewall/network — NiFi nodes need to communicate on port 11443 (cluster protocol port)
Check logs for ZK session timeout errors

Content Repository Growing Unbounded

Symptom: Disk space on content repository partition fills up Cause: FlowFiles are being created but not cleaned up; dead FlowFiles aren't being claimed Fix:

Check for hanging flows (processes that never terminate FlowFiles)
Monitor queue depths — if any connection has millions of FlowFiles, investigate
Run a provenance query for orphaned FlowFiles — these should be rare
Increase provenance retention cleanup frequency in nifi.properties

Slow Flow Registry Pushes

Symptom: Committing a flow to Git takes 30+ seconds Cause: Large flow JSON files; slow Git push over network Fix:

Break large flows into smaller Process Groups (versioned separately)
Use local Git server instead of GitHub for lower latency
Configure Git's SSH connection to use compression: add to ~/.ssh/config:
```
Host github.com
  Compression yes
  CompressionLevel 6
```

Expression Language Returns Unexpected Null

Symptom: EL evaluates to null; processor routes to unmatched Cause: Attribute doesn't exist, function failed, or null propagated through chain Fix:

Use LogAttribute to dump all attributes and verify the one you're referencing exists
Use ${attr:isEmpty()} instead of ${attr:equals('')} to handle null
Use try-catch style: ${attr:trim():isEmpty():ifTrue('EMPTY'):ifFalse(${attr})}

Cluster Coordinator Change Causes Flow Disruption

Symptom: Processors pause briefly when Coordinator election occurs Cause: During Coordinator election, the cluster is briefly unavailable Fix:

This is normal and unavoidable; design for brief pauses
Use connection back-pressure to handle temporary queue buildup
Deploy ZK separately from NiFi to stabilize Coordinator role

8. References

Resource	URL
Official NiFi Administration Guide	https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html
NiFi Cluster Setup Guide	https://nifi.apache.org/docs/nifi-docs/html/setup-wizard.html
Expression Language Reference	https://nifi.apache.org/docs/nifi-docs/html/expression-language-guide.html
Processor Documentation	https://nifi.apache.org/components/
NiFiKop Kubernetes Operator	https://konpyutaika.github.io/nifikop/
Apache ZooKeeper Documentation	https://zookeeper.apache.org/doc/current/
Slurm REST API	https://slurm.schedmd.com/rest.html
Prometheus NiFi Exporter	https://github.com/prometheus-community/nifi_exporter

9. Summary

Key takeaways:

Cluster mode multiplies throughput by running on multiple nodes, coordinated by ZooKeeper and managed by a Primary Node and Coordinator
FlowFiles and content are durable — repositories on fast storage guarantee no data loss even during node failures
Expression Language is powerful and dynamic — use it to build flexible, parameterizable flows that work across environments
Back-pressure prevents resource exhaustion — tune queue limits based on file size and target throughput
Git-based versioning enables environment promotion — develop, commit, then deploy the same flow to production
Security is layered — sensitive properties encrypted at rest, parameter contexts restrict credential scope, capabilities limit processor privileges
Monitoring and observability — Prometheus metrics, status history graphs, and data provenance are essential for production operations
Patterns matter — understand retry, aggregation, routing, and external job integration patterns to solve real-world problems

Next steps:

Deploy to [[kubernetes-deep-dive|Kubernetes]] for cloud-native operations and auto-scaling
Integrate with [[apache-nifi-hpc-sysadmin-deep-dive]] for HPC-specific workflows
Explore custom processors and Controller Services for domain-specific logic
Set up automated monitoring and alerting with Prometheus + Grafana
Review your organization's data governance policies and implement audit trails to satisfy compliance

[[apache-nifi-beginner-guide]]
[[apache-nifi-hpc-sysadmin-deep-dive]]
[[kubernetes-deep-dive]]

1. Overview​

2. Prerequisites​

3. Key Concepts​

3.1 Cluster Architecture and Coordination​

3.2 ZooKeeper and Leader Election​

3.3 Primary Node vs. All-Nodes Processors​

3.4 Expression Language Internals​

3.5 Controller Service and Property Encryption​

3.6 Back-Pressure and Queue Management​

4. Step-by-Step Instructions​

4.1 Setting Up a 3-Node Cluster with Docker Compose​

4.2 Git-Based Flow Versioning with Flow Registry​

4.3 Tuning NiFi for High Throughput​

4.4 Implementing Sensitive Properties and Secrets Management​

4.5 Advanced Processor Patterns for HPC Workflows​

5. Practical Examples​

Example 1 — Building a Three-Layer HPC Pipeline with Parameter Contexts​

Example 2 — Handling Failures and Retries​

Example 3 — Dynamic Batch Processing​

6. Hands-On Exercises​

Exercise 1 — Cluster Failover Testing​

Exercise 2 — Expression Language Mastery​

Exercise 3 — Performance Analysis with Status History​

Exercise 4 — Dead Letter Routing and Recovery​

7. Troubleshooting​

ZooKeeper Quorum Loss​

Primary Node Stuck in Election​

Content Repository Growing Unbounded​

Slow Flow Registry Pushes​

Expression Language Returns Unexpected Null​

Cluster Coordinator Change Causes Flow Disruption​

8. References​

9. Summary​

Related Tutorials​