
Why fail2ban Is Security Theater

From Pulsed Media Wiki


Pulsed Media has operated dedicated servers and seedbox infrastructure since 2010 — thousands of servers over 15+ years. Zero run fail2ban. Zero SSH brute force compromises. This page explains why, with real data from production systems.

This article is part of a series on common sysadmin habits that waste time and create problems. See also: Why Blocking ICMP Echo Requests Breaks More Than It Fixes.

What fail2ban does

fail2ban watches auth.log for failed login attempts. When an IP exceeds a threshold (default: 5 failures), it adds a firewall rule to block that IP for a set duration (default: 10 minutes).

It is a Python daemon that runs continuously, requires configuration files (jail.local, filter.d/, action.d/), consumes 30–50 MB of RAM, and parses attacker-controlled log content.
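The moving parts above can be seen in a minimal jail configuration. This is an illustrative sketch of the stock sshd jail, not a recommendation; the directive names are real fail2ban options, and the values shown are the defaults described above:

```ini
# /etc/fail2ban/jail.local — illustrative sketch of the default sshd jail
[sshd]
enabled  = true
port     = ssh
logpath  = /var/log/auth.log   # attacker-controlled input, parsed line by line
maxretry = 5                   # failures before a ban
findtime = 600                 # window (seconds) in which failures are counted
bantime  = 600                 # default ban duration: 10 minutes
```

Every one of these directives is a knob that can be set wrong, and the daemon reading them is a process that can die.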

The standard checklist

Every "SSH hardening" guide repeats the same steps:

  1. Install fail2ban
  2. Disable password authentication
  3. Change the SSH port
  4. Block ICMP echo requests

Four steps. Fifteen minutes. Four new failure modes. None of them address a real threat.

We call these "security vibes" — measures that produce the feeling of security without the substance.

Why fail2ban does not help

Botnets rotate IPs faster than fail2ban bans them

Modern SSH brute force comes from botnets with tens of thousands of source IPs. Each IP tries 2–3 passwords, then moves on. With a threshold of 5 failures, most botnet IPs never trigger a ban.

One week of auth.log data from a production mail relay:

  • 68,327 failed SSH attempts
  • 1,319 unique source IPs
  • 462 IPs (35%) made 4 or fewer attempts — below any reasonable fail2ban threshold

fail2ban catches the top offenders (one IP tried 5,676 times). The distributed long tail flows through untouched.
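The long-tail effect is easy to demonstrate. The sketch below uses hypothetical per-IP counts (documentation-range addresses, not the mail relay's actual logs) shaped like the observed pattern: one loud scanner plus a distributed botnet making two tries per node:

```python
from collections import Counter

# Hypothetical failure counts mimicking the observed distribution:
# one top offender plus 200 botnet nodes that try twice and move on.
attempts = Counter({"203.0.113.9": 5676})       # the loud scanner
for i in range(200):
    attempts[f"198.51.100.{i}"] = 2             # botnet nodes: 2 tries each

MAXRETRY = 5  # fail2ban's default threshold
banned  = [ip for ip, n in attempts.items() if n >= MAXRETRY]
evaders = [ip for ip, n in attempts.items() if n < MAXRETRY]

print(f"banned: {len(banned)}, never banned: {len(evaders)}")
# The single loud IP gets banned; all 200 botnet IPs sail through.
```

Under this model fail2ban bans exactly one source and misses the other two hundred, which is the shape of the real data above.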

fail2ban has its own overhead

For every IP it bans, fail2ban writes log entries: 5 "Found" lines (one per threshold attempt), a "Ban" line, and later an "Unban" line. Seven entries per ban cycle, roughly 750 bytes.

From real data (1,093 bans per week):

  Source                     Entries/week   Size/week
  "Found" detection lines    5,465          ~655 KB
  Ban/Unban lines            2,186          ~196 KB
  Total fail2ban.log         ~7,651         ~850 KB/week

The log overhead is modest. But fail2ban also generates syslog entries for each firewall action, and its Python process sits in RAM on a server where every megabyte matters.
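The table's figures follow directly from the per-cycle counts. A quick arithmetic sketch, using the numbers above:

```python
bans_per_week = 1093

found_lines = 5 * bans_per_week   # one "Found" line per threshold attempt
ban_unban   = 2 * bans_per_week   # one "Ban" plus one "Unban" per cycle
total_lines = found_lines + ban_unban

bytes_per_cycle = 750             # rough combined size of the seven entries
total_bytes     = bans_per_week * bytes_per_cycle

print(total_lines, round(total_bytes / 1024))  # 7651 lines, roughly 800 KB/week
```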

Firewall rule churn

Each ban inserts a firewall rule. Each unban removes it. With the default bantime of 600 seconds, only 10–25 rules exist at any moment — not a performance problem on its own.

But the churn is constant: 1,093 insertions and 1,093 deletions per week. With legacy iptables, each insertion or deletion replaces the entire ruleset atomically, and every rule in the chain adds to the per-packet evaluation cost. On a mail relay handling thousands of concurrent SMTP connections, this constant churn adds measurable CPU overhead at ban/unban boundaries.

When someone extends bantime to 24 hours (a common "improvement"), rules accumulate — 1,093 bans per week with 24-hour expiry means ~156 concurrent active rules at any moment (one day's worth of bans). Still manageable, but each rule adds to the per-packet evaluation cost.
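The concurrent-rule counts follow from a steady-state estimate: arrival rate times residence time (Little's law). A sketch, assuming bans arrive uniformly — real attack traffic is bursty, which is why the observed 10–25 concurrent rules at the default bantime exceeds the uniform-model estimate:

```python
SECONDS_PER_WEEK = 7 * 24 * 3600
bans_per_week = 1093
rate = bans_per_week / SECONDS_PER_WEEK          # bans per second

def concurrent_rules(bantime_s: float) -> float:
    """Steady-state rule count: arrival rate x ban duration (Little's law)."""
    return rate * bantime_s

print(round(concurrent_rules(600), 1))     # default 10-min bantime: ~1 rule (uniform model)
print(round(concurrent_rules(24 * 3600)))  # 24-hour bantime: ~156 rules
```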

It is a service that can fail

fail2ban has had CVEs. It parses attacker-controlled input (auth.log content), which is a risky surface by design. When fail2ban crashes, protection drops to zero — and without separate monitoring for the fail2ban process itself, nobody notices.

It bans your own infrastructure

We discovered this firsthand in March 2026. A backup system needed restricted SSH access from a management server. The first connection attempt: "Connection refused."

fail2ban had banned the IP of our own production server. A failed key negotiation during setup triggered the threshold, and fail2ban blocked the IP — no distinction between a botnet and our own infrastructure.

Fifteen minutes of operator time went into diagnosing "Connection refused" on a system that had just been configured. The firewall rules were clean. SSH was listening. Everything looked correct from inside the server. The ban was invisible without checking fail2ban specifically.

This is not an edge case. This is the failure mode fail2ban is designed to produce. It blocks IPs without context — yours, your monitoring system's, your customer's. A pattern matcher reading auth.log cannot tell the difference.

The one-line fix

The actual problem fail2ban claims to solve is log growth. SSH brute force generates auth.log entries. On a small disk, those entries fill the disk. When the disk fills, other services die.

The fix is one line in your logrotate configuration:

maxsize 50M

auth.log is capped at 50 MB per rotation. With rotate 4, the maximum total is approximately 200 MB. The disk cannot fill from SSH logs. The problem is gone. No daemon, no configuration files, no Python dependency, no CVE surface, no firewall rules, no RAM, no failure modes, no risk of banning your own servers.

logrotate ships with every Linux system. It runs from cron as a stateless process. Nothing to maintain, nothing to monitor, nothing to break.
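In context, a full stanza might look like the following. This is a sketch, not PM's exact configuration; `maxsize`, `rotate`, and the other directives are standard logrotate options, and the path assumes a Debian-style layout (on Debian, auth.log rotation normally lives in the rsyslog logrotate config — adjust accordingly rather than duplicating it):

```
# /etc/logrotate.d/auth — illustrative sketch; adapt paths and counts
/var/log/auth.log {
    maxsize 50M        # rotate as soon as the file exceeds 50 MB, schedule or not
    rotate 4           # keep four old copies: ~200 MB worst case
    daily
    missingok
    notifempty
    compress
    delaycompress      # leave the newest rotation uncompressed
}
```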

  Scenario                Auth.log growth/week      Complexity                     Failure modes
  No fail2ban             ~20 MB                    None                           Log fills disk eventually
  With fail2ban           ~2 MB                     High (daemon, config, rules)   Service crash, bans legitimate IPs, CVEs
  logrotate maxsize 50M   Capped (50 MB per file)   None                           None

The brute force math

When you suggest removing fail2ban, the first question is always: "But what about the brute force itself?"

The assumption behind that question: SSH brute force is a real threat that requires active countermeasures. Check the assumption with arithmetic.

A 12-character alphanumeric password has a keyspace of 62 characters (a–z, A–Z, 0–9). Total combinations: 62^12 ≈ 3.22 × 10^21.

OpenSSH enforces a delay of approximately 3 seconds per authentication attempt through PAM's pam_faildelay, TCP handshake overhead, and key exchange time. The default MaxStartups setting (10:30:100) limits concurrent unauthenticated connections — but even granting an attacker 1,000 parallel connections:

3.06 × 10^11 years.

Three hundred billion years. The universe is 13.8 billion years old.
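The arithmetic checks out. A quick verification using the figures above (62^12 keyspace, 3 seconds per attempt, 1,000 parallel connections):

```python
keyspace = 62 ** 12                      # 12-char alphanumeric: ~3.22e21 combinations
seconds_per_attempt = 3                  # OpenSSH's effective per-attempt delay
parallel = 1000                          # generous concurrency for the attacker

total_seconds = keyspace * seconds_per_attempt / parallel
years = total_seconds / (365.25 * 24 * 3600)

print(f"{keyspace:.2e} combinations -> {years:.2e} years")  # ~3.07e11 years
```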

This is not a threat. It is background noise. The auth.log entries it generates are the only actual problem — and logrotate maxsize handles those permanently.

SSH keys are even stronger (256-bit ed25519 keys have a keyspace that makes the password calculation look quaint), but the point is that even passwords are not breakable through brute force against OpenSSH's built-in rate limiting.

Credential stuffing is a different problem

The math above applies to random brute force against a strong, unique password. If you chose "Summer2024!" for SSH and it was leaked in a database breach, no keyspace mathematics will save you. Credential stuffing — trying known username/password pairs from breached databases — is a real threat.

The defense is not fail2ban. fail2ban might block an attacker after 5 attempts, leaving them 4 more tries with your actual leaked password. The defense is a strong, unique password that does not appear in any breach database. fail2ban cannot help with strong passwords (unnecessary) and cannot reliably help with weak ones (too late).

What about key-only SSH?

"Disable password authentication" is the second item on every hardening checklist. It has locked operators out of their own servers.

An SSH key is a very long password. A 12-character random password provides the same practical security against brute force as a 4096-bit RSA key — both are impossible to guess through OpenSSH's rate-limited authentication.

Key-only SSH adds value when you have many users and want to avoid password distribution, when you need automated machine-to-machine authentication, or when you want to prevent password reuse.

Key-only SSH costs you when you lose your key and cannot access the server, when a new team member needs emergency access, or when you are working from a device without your key.

For a small team managing their own infrastructure, key-only SSH is a tradeoff. Evaluate it for your situation rather than applying it because a checklist told you to.

A real cascade failure

This is not theoretical. Here is what happened to our mail relay infrastructure:

  1. SSH brute force generated auth.log entries on a 10 GB server
  2. Logrotate tried to compress auth.log — compression failed because the disk was too tight for the temporary file
  3. After the failed compression, logrotate could not complete its rotation cycle
  4. auth.log grew unbounded to 8.6 GB on a 10 GB disk
  5. Disk filled to 100% — the mail server (exim4) could not write to its spool
  6. exim4 died silently with no alert
  7. DNS round-robin hid the dead backend — the remaining relays absorbed load
  8. When the last surviving relay was disabled for a separate issue, all outbound email stopped

The cascade ran undetected for years. Logrotate handled auth.log normally for a long time — the server was commissioned around 2016. At some point, disk usage from other services tightened enough that logrotate's compression step failed (a 0-byte .gz file is the forensic signature), and from that moment auth.log grew unbounded. The failure was invisible because the operator's own email kept working through a surviving relay on a different destination domain.

The fix was not fail2ban. fail2ban would have reduced auth.log growth from ~20 MB/week to ~2 MB/week — but the disk was already too tight for logrotate to compress existing logs. The cascade started at step 2 (compression failure), not step 1 (log growth). And once rotation breaks, even 2 MB/week grows unbounded on a 10 GB disk. It just takes longer to reach the same result.

maxsize 50M prevents the cascade at step 2: it rotates by size, not just by schedule, so compression never races against available disk space.

The cargo cult

Richard Feynman described cargo cult science in his 1974 Caltech commencement address. During World War II, populations in Melanesia observed American military bases receiving supply drops. After the war, they built bamboo control towers, carved wooden headphones, lit fires along dirt runways, and marched in formation. The rituals were precise. But no planes landed.

The form was perfect. The substance was absent.

fail2ban reproduces the visible form of security — a daemon running, IPs being blocked, logs showing "Ban" entries. It looks like protection. The metrics confirm activity. But SSH brute force against a strong credential was never a threat. The planes were never going to land.

Installing fail2ban because "that's what you do on a server" follows the same pattern as building bamboo headphones because "that's what the Americans wore in the tower." Both adopt the form of a successful system without understanding why it worked.

The mechanism that secures SSH is mathematics. 62^12 combinations at 3 seconds each. No daemon improves on arithmetic.

What actually secures SSH

After 16 years of operating thousands of servers:

  1. Strong passwords (12+ characters, random) — not breakable through OpenSSH rate limiting
  2. logrotate maxsize — caps log growth permanently, zero complexity
  3. Patching — keep OpenSSH updated for actual vulnerability fixes
  4. Network architecture — restrict SSH access to known IPs via firewall when the server does not need public SSH

Four things. Three of which are already the default on a well-maintained server.
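Item 4 — restricting SSH to known source IPs — needs no daemon either. A sketch using nftables (the 192.0.2.0/24 range is a documentation placeholder; substitute your management network):

```
# nftables fragment — illustrative; 192.0.2.0/24 is a placeholder range
table inet filter {
    chain input {
        type filter hook input priority 0; policy accept;
        tcp dport 22 ip saddr != 192.0.2.0/24 drop   # SSH only from the management range
    }
}
```

Unlike a fail2ban jail, this is static, stateless, and cannot crash, lag behind an attack, or ban your own infrastructure by mistake.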

Before installing any security tool, ask one question: what specific threat does this counter, and is that threat real in my environment?

SSH brute force against a 12-character random credential is not a real threat. Every tool designed to counter it — fail2ban, port knocking, changed SSH ports, ICMP echo blocking — is solving a problem that does not exist. And each one has configuration complexity, failure modes, and operational surprises (like banning your own infrastructure at the worst moment).

At Pulsed Media

On every Pulsed Media server, the SSH security model is:

  • OpenSSH with default settings and strong passwords
  • logrotate with maxsize directive
  • Regular patching

No fail2ban. No changed ports. No ICMP blocking. Thousands of servers over 16 years. Zero SSH compromises from brute force.

PM manages its infrastructure with PMSS (open source at https://github.com/MagnaCapax/PMSS), running from its own datacenters in Finland. The complexity budget goes into things that matter — faster storage, better network, reliable service — not into security theater that creates more problems than it solves.

Every hour spent configuring fail2ban, debugging why it banned your own monitoring server, and maintaining its jail files is an hour not spent on patching, monitoring, or improving the actual service. The math on that tradeoff is as simple as the brute force math: zero benefit, non-zero cost.

PM's seedbox plans include SSH access on all tiers, including the permanent free tier.

See also