Bash is the default language of Linux administration. It's everywhere, it's fast to write, and it runs without installing anything. But there's a point -- and every sysadmin eventually crosses it -- where a Bash script becomes harder to maintain than the problem it was solving. Variables that look like other variables. Error handling that requires manual return-code checks at every step. String parsing that gets weird the moment a filename has a space in it. That's when Python steps in.

Python has been pre-installed on virtually every major Linux distribution for over a decade. Ubuntu 24.04 LTS ships Python 3.12; RHEL 9 ships Python 3.9; Debian 13 "Trixie" ships Python 3.13. It's not a tool you have to justify bringing in -- it's already there. What makes it uniquely suited to systems work is the combination of a readable, explicit syntax with a standard library that was clearly written by people who had to manage real servers.

This guide is not an introduction to Python syntax. It assumes you already know how to write a function and handle an exception. Instead, it focuses on the libraries and patterns that are directly useful for Linux system administration: process management, file operations, system monitoring, user management, log parsing, and remote execution. It also covers the patterns that separate scripts you trust from scripts you merely hope work -- how to design proper CLIs, write automation that's safe to re-run, and test privileged code without touching a live system. The goal is to give you working, production-quality patterns you can adapt immediately.

Why Python Over Bash for Admin Work

The question comes up constantly: "Why not just use Bash?" The answer isn't that Bash is bad -- it's that Bash has a narrow optimum. For short pipelines, one-liners, and quick glue between system tools, Bash is genuinely the right tool. For anything that needs data structures, error propagation, testability, or reuse, Python is substantially better.

"Python's strength as a system administration language comes from the fact that it treats automation as a programming problem rather than a shell problem -- you get real data structures, real exception handling, and real testing." -- Noah Gift and Jeremy Jones, Python for Unix and Linux System Administration (O'Reilly, 2008)

That observation is just as accurate in 2026 as it was when it was written. Here is what the failure mode looks like in practice: a Bash provisioning script runs fine for six months. Then someone passes it a username with a hyphen in it. The variable expansion behaves unexpectedly, a directory gets created with the wrong name, a service account is half-provisioned, and the next script in the chain fails with an error that points to a symptom three steps downstream from the actual cause. Nobody added error handling at each step because that would have doubled the script's length. Nobody wrote a test because there's no clean way to test shell scripts without running them. Python does not eliminate this class of problem by magic -- but it gives you the tools to handle it explicitly rather than finding out through a production failure.

The practical differences that matter most for systems work are: Python gives you actual dictionaries, lists, and objects to work with rather than streams of text; exceptions propagate automatically instead of requiring manual $? checks; and the code you write today is something you can read and trust a year later.

One important clarification before going further: Python is not a replacement for Ansible, Puppet, Chef, or Terraform for large-scale configuration management. Those tools exist because configuration state at scale is a hard problem and they solve it well. Python fills the space between "a five-line Bash script" and "a full Ansible role" -- automation that is too complex for shell but too specific or lightweight to warrant a full CM tool.

The Standard Library Worth Knowing

Before reaching for any third-party package, the Python standard library covers a surprising amount of sysadmin territory. The modules below are the ones that make the difference between Python scripts that behave correctly and ones that fail in the specific, inconvenient ways that Bash scripts fail -- at the character boundary, the encoding edge case, the file that doesn't exist yet. Three modules in particular are worth understanding deeply.

Note

Always use #!/usr/bin/env python3 as your shebang line -- not #!/usr/bin/python3. The env form resolves to whichever python3 is first on the current PATH, which means it picks up a virtual environment's interpreter correctly. A hardcoded path will bypass the venv and use the system Python instead, with the wrong set of installed packages. The code examples in this guide target Python 3.9+ for broad compatibility across RHEL 9, Ubuntu 22.04, and Debian 12. Features used from newer versions (like tomllib in 3.11) are noted explicitly where they appear.

os and pathlib

The os module gives you access to operating system interfaces: environment variables, file permissions, process information, and directory traversal. For most path operations in modern Python (3.4+), pathlib is the cleaner choice -- it represents paths as objects rather than strings, which eliminates an entire class of string-concatenation bugs.

file_ops.py
import os
import time
from pathlib import Path

# pathlib: build paths safely, no string concatenation
log_dir = Path("/var/log/myapp")
log_dir.mkdir(parents=True, exist_ok=True)

# Remove log files older than 30 days
cutoff = time.time() - (30 * 86400)

for log_file in log_dir.glob("*.log"):
    if log_file.stat().st_mtime < cutoff:
        log_file.unlink()
        print(f"Removed: {log_file}")

# os: set permissions explicitly
config_file = Path("/etc/myapp/secrets.conf")
os.chmod(config_file, 0o600)  # owner read/write only

# os: check effective UID (are we root?)
if os.geteuid() != 0:
    raise PermissionError("This script must run as root.")

The 0o600 notation is octal -- the same permission bits you'd pass to chmod on the command line. Using it directly in Python avoids the overhead of spawning a subprocess just to set permissions on a file.

subprocess

The subprocess module is how Python runs external commands. It replaces the older os.system() and os.popen() interfaces, both of which had significant security and reliability problems. The modern pattern is subprocess.run() with a list of arguments -- never a shell string unless you genuinely need shell expansion, and even then, you should know exactly why.

Caution

Never use shell=True with user-supplied input. This is a command injection vulnerability formally catalogued as CWE-78: Improper Neutralization of Special Elements used in an OS Command. Always use a list of arguments: subprocess.run(['useradd', username]), not subprocess.run(f'useradd {username}', shell=True). The list form bypasses the shell entirely -- no metacharacter interpretation, no command chaining, no injection vector.

subprocess_patterns.py
import subprocess

# Run a command, raise an exception if it fails
result = subprocess.run(
    ['systemctl', 'is-active', 'nginx'],
    capture_output=True,
    text=True,
    check=False  # we handle the return code ourselves
)

if result.returncode == 0:
    print("nginx is active")
else:
    print(f"nginx status: {result.stdout.strip()}")

# check=True raises CalledProcessError on non-zero exit
try:
    subprocess.run(
        ['useradd', '--system', '--no-create-home', 'appuser'],
        check=True
    )
except subprocess.CalledProcessError as e:
    print(f"useradd failed with exit code {e.returncode}")

# Capture output for parsing
df_output = subprocess.run(
    ['df', '-h', '/'],
    capture_output=True,
    text=True,
    check=True
)
print(df_output.stdout)

The capture_output=True flag (available since Python 3.7) is shorthand for stdout=PIPE, stderr=PIPE. The text=True flag decodes bytes to string using the system encoding. Together they give you a clean string you can work with directly, without manual .decode() calls.

One parameter that is easy to overlook but critical in production: timeout. By default, subprocess.run() waits indefinitely for the child process to finish. In automation scripts running under systemd timers, a hung subprocess will block the entire timer unit until it times out at the unit level -- which may be much later than you want. Pass timeout=30 (or whatever is appropriate) to any subprocess call that touches a network resource or a potentially blocked device, and catch subprocess.TimeoutExpired explicitly.
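A minimal sketch of that pattern -- run_with_timeout is an illustrative wrapper name, not part of the subprocess API:

```python
import subprocess

def run_with_timeout(cmd: list[str], timeout: float = 30) -> str:
    """Run a command, turning a hang into a clear, catchable failure."""
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, check=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        # subprocess.run() kills and reaps the child before raising,
        # so the hung process does not linger
        raise RuntimeError(f"timed out after {timeout}s: {' '.join(cmd)}")
    return result.stdout

print(run_with_timeout(["echo", "ok"], timeout=5).strip())
```

Under a systemd timer, the RuntimeError propagates, the script exits non-zero, and the failure is visible in the journal immediately instead of whenever the unit-level timeout fires.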

shutil

The shutil module handles higher-level file operations that os doesn't cover: copying files with metadata, moving entire directory trees, and creating compressed archives. For backup scripts in particular, it eliminates the need to shell out to cp -r or tar.

backup.py
import shutil
import logging
from pathlib import Path
from datetime import datetime

logger = logging.getLogger(__name__)

def backup_config_dir(source: str, backup_root: str) -> Path:
    src = Path(source)
    if not src.exists():
        raise FileNotFoundError(f"Source not found: {src}")

    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    dest = Path(backup_root) / f"{src.name}_{timestamp}"

    # Copy entire directory tree, preserving metadata (timestamps, permissions)
    shutil.copytree(src, dest)
    logger.info("Backup created: %s", dest)
    return dest

def archive_logs(log_dir: str, archive_dir: str) -> str:
    """
    Create a .tar.gz archive of log_dir's contents.
    root_dir causes make_archive to chdir into log_dir first, so
    archive paths are relative (no leading /var/log/...) inside the tar.
    """
    timestamp = datetime.now().strftime("%Y%m%d")
    archive_path = Path(archive_dir) / f"logs_{timestamp}"
    shutil.make_archive(
        str(archive_path),
        'gztar',         # produces archive_path.tar.gz
        root_dir=log_dir  # archive contains contents of log_dir, not log_dir itself
    )
    return f"{archive_path}.tar.gz"

configparser, tomllib, and reading structured config

Sysadmin scripts constantly read configuration -- and the wrong way to do it is to parse config files with string splitting or regex. The standard library covers the three formats you encounter on Linux systems most often.

INI-style files (like smb.conf, pip.conf, and many legacy daemons' configs) are handled by configparser. It reads key-value sections, supports fallbacks and defaults, and handles comment stripping automatically.

read_config.py
import configparser

def load_app_config(path: str) -> configparser.ConfigParser:
    cfg = configparser.ConfigParser(interpolation=None)
    if not cfg.read(path):
        raise FileNotFoundError(f"Config file not found: {path}")
    return cfg

# Safely read a value with a fallback
cfg = load_app_config("/etc/myapp/myapp.ini")
timeout = cfg.getint("database", "timeout", fallback=30)
host = cfg.get("database", "host", fallback="localhost")

TOML files are increasingly common for modern tooling (pyproject.toml, various Rust and Go ecosystem configs that have migrated onto Linux servers). Python 3.11+ ships tomllib in the standard library. For older Python versions, install tomli as a drop-in.

read_toml.py
import sys
if sys.version_info >= (3, 11):
    import tomllib
else:
    import tomli as tomllib  # pip install tomli

def load_toml(path: str) -> dict:
    with open(path, "rb") as f:  # tomllib requires binary mode
        return tomllib.load(f)

config = load_toml("/etc/myapp/config.toml")
retention_days = config.get("logging", {}).get("retention_days", 30)

YAML is not in the standard library, but it appears constantly in infrastructure tooling (Ansible, Kubernetes, Docker Compose). If you must read YAML, use yaml.safe_load() from PyYAML -- never yaml.load() with an unsafe Loader (or, in old PyYAML versions, with no Loader argument at all). Unsafe loading can execute arbitrary Python via YAML tags and is a well-documented deserialization vulnerability.

Caution

Never use yaml.load(data) without an explicit Loader. The safe alternative is yaml.safe_load(data), which restricts deserialization to standard Python types and prevents execution of arbitrary code embedded in malicious YAML. This is especially important for scripts that read YAML supplied by external systems or user input.

System Monitoring with psutil

The standard library gets you far, but it stops short of one of the most common sysadmin needs: interrogating the live state of a running system. You can read /proc directly, or spawn ps and parse its text output, but both approaches are fragile in different ways. psutil (process and system utilities) solves this cleanly.

psutil is the single most useful third-party library for Linux sysadmin work. According to its documentation, it is a cross-platform library for retrieving information on running processes and system utilization -- CPU, memory, disks, network, and sensors -- in Python. It implements the functionality of command-line tools like ps, top, df, netstat, free, and uptime, but returns real Python objects instead of text.

Note

psutil (current version: 7.2.2, released January 28, 2026, authored by Giampaolo Rodolà) is among the top 100 most-downloaded Python packages on PyPI and supports CPython 3.6+ and PyPy. It is used in production by notable projects including TensorFlow, PyTorch, Home Assistant, Ansible, and Celery. Install it with pip install psutil, or from the distribution repositories -- the package is named python3-psutil on Debian, Ubuntu, RHEL, and Fedora alike.

The API covers five main areas: CPU, memory, disks, networking, and processes. Here is a practical monitoring function that checks all five and returns a structured report:

monitor.py
import psutil
from datetime import datetime

def system_snapshot() -> dict:
    """Return a structured snapshot of current system health."""
    cpu = psutil.cpu_percent(interval=1)
    mem = psutil.virtual_memory()
    swap = psutil.swap_memory()
    disk = psutil.disk_usage("/")
    net = psutil.net_io_counters()
    load = psutil.getloadavg()  # 1, 5, 15 minute averages

    return {
        "timestamp": datetime.now().isoformat(),
        "cpu_percent": cpu,
        "load_avg": {"1m": load[0], "5m": load[1], "15m": load[2]},
        "memory": {
            "total_gb": round(mem.total / 1e9, 2),
            "used_percent": mem.percent,
            "available_gb": round(mem.available / 1e9, 2),
        },
        "swap_percent": swap.percent,
        "disk_root": {
            "total_gb": round(disk.total / 1e9, 2),
            "used_percent": disk.percent,
            "free_gb": round(disk.free / 1e9, 2),
        },
        "net_bytes_sent": net.bytes_sent,
        "net_bytes_recv": net.bytes_recv,
    }

def find_top_processes(n: int = 5) -> list:
    """Return the top N processes by CPU usage."""
    # First pass: seed the cpu_percent counters (returns 0.0 for all on first call)
    for proc in psutil.process_iter(['pid', 'cpu_percent']):
        pass
    # Brief sleep, then second pass for accurate readings
    import time
    time.sleep(0.1)
    procs = []
    for proc in psutil.process_iter(['pid', 'name', 'cpu_percent', 'memory_percent']):
        try:
            procs.append(proc.info)
        except psutil.NoSuchProcess:
            pass
    return sorted(procs, key=lambda p: p['cpu_percent'], reverse=True)[:n]

A critical detail: psutil.NoSuchProcess is not an edge case -- it is the normal state of affairs. Processes appear and disappear between the time you list them and the time you query their attributes. Any code iterating over processes must catch this exception or it will fail in production at an unpredictable moment.

The two-pass pattern in find_top_processes deserves explanation. psutil.cpu_percent() measures CPU usage as a delta between two calls -- the first call to any process's cpu_percent attribute always returns 0.0 because there is no prior measurement to compare against. The first loop seeds all the counters; after a brief sleep, the second loop reads values that reflect actual CPU activity over the interval. For a quick snapshot this is acceptable. For precision monitoring, use a longer sleep or call system_snapshot() (which uses psutil.cpu_percent(interval=1)) instead.

Pro Tip

Call psutil.cpu_percent(interval=1) with a non-zero interval rather than zero. With interval=0, the first call always returns 0.0 because it compares against the last call, which hasn't happened yet. A small interval forces a blocking measurement and gives you an accurate reading.

User and Permission Management

User management is where many sysadmins instinctively reach for subprocess and call useradd directly. That's the right choice for creating or modifying users -- those operations require invoking system commands. But reading user and group information is a different story. Python's pwd and grp modules provide direct access to the user and group databases without spawning subprocesses. For read operations -- checking if a user exists, listing group members, resolving UID to username -- they are faster and more reliable than parsing the output of getent or id.

users.py
import pwd
import grp
import os
import subprocess
import logging
from pathlib import Path

logger = logging.getLogger(__name__)

def user_exists(username: str) -> bool:
    try:
        pwd.getpwnam(username)
        return True
    except KeyError:
        return False

def group_members(groupname: str) -> list[str]:
    """Return the supplementary members of a group.

    Note: gr_mem lists supplementary members only. Users whose primary
    group (the GID field of their passwd entry) is this group will not
    appear here.
    """
    try:
        return list(grp.getgrnam(groupname).gr_mem)
    except KeyError:
        raise ValueError(f"Group not found: {groupname}") from None

def create_service_account(username: str, home_dir: str = None):
    """Create a locked, no-login system account for a service."""
    if user_exists(username):
        logger.info("User '%s' already exists, skipping.", username)
        return

    cmd = ['useradd', '--system', '--shell', '/usr/sbin/nologin']
    if home_dir:
        cmd += ['--home-dir', home_dir, '--create-home']
    else:
        cmd += ['--no-create-home']
    cmd.append(username)

    subprocess.run(cmd, check=True)
    logger.info("Created system account: %s", username)

def get_uid_gid(username: str) -> tuple[int, int]:
    entry = pwd.getpwnam(username)
    return entry.pw_uid, entry.pw_gid

def fix_ownership(path: str, username: str):
    """Recursively set ownership of a path to a given user."""
    uid, gid = get_uid_gid(username)
    target = Path(path)
    # follow_symlinks=False changes the link itself, not its target, so a
    # symlink planted inside the tree cannot redirect the chown elsewhere
    os.chown(target, uid, gid, follow_symlinks=False)
    for item in target.rglob('*'):
        os.chown(item, uid, gid, follow_symlinks=False)

The pwd.getpwnam() call queries the system's Name Service Switch (NSS) layer, which means it works correctly with LDAP, NIS, and other directory services -- not just local /etc/passwd entries. This is an important distinction: parsing /etc/passwd directly, which many older scripts do, will fail to find directory users. Similarly, grp.getgrnam() queries the group database through the same NSS layer -- group_members() above works correctly whether groups are defined locally, in LDAP, or in Active Directory. As Christian Heimes noted in the Python-Dev mailing list discussion on module deprecation: "The pwd and grp modules use proper libc APIs that are internally backed by NSS... they automatically work with any configured user and group provider, even LDAP, IdM or Active Directory." These APIs have been standardized since POSIX.1-2001.
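The lookup works in the numeric direction too -- pwd.getpwuid() and grp.getgrgid() resolve IDs back to names through the same NSS layer. A quick sketch using the current process's own IDs:

```python
import os
import pwd
import grp

# Resolve the current process's UID/GID back to names via NSS
uid = os.getuid()
record = pwd.getpwuid(uid)            # raises KeyError if the UID has no entry
try:
    primary_group = grp.getgrgid(record.pw_gid).gr_name
except KeyError:
    primary_group = str(record.pw_gid)  # a GID with no group entry (rare)
print(f"{record.pw_name}:{primary_group} (uid={uid}, home={record.pw_dir})")
```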

Log Parsing and Analysis

Logs are the primary evidence you have when something goes wrong. Python's text processing is significantly more capable than awk/sed for anything beyond simple line-by-line filtering, particularly when you need to correlate events across time, extract structured data, or count occurrences across large files.

log_analysis.py
import re
from pathlib import Path
from collections import Counter
from datetime import datetime

# Parse nginx access log for top IPs and 5xx errors
# Matches combined log format: IP - - [timestamp] "METHOD path HTTP" status size
NGINX_LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" '
    r'(?P<status>\d{3}) (?P<size>\d+)'
)
LOG_TIME_FMT = "%d/%b/%Y:%H:%M:%S %z"

def analyze_access_log(log_path: str, since: datetime = None) -> dict:
    """
    Parse an nginx access log. Optionally filter to entries after `since`.
    Returns top IPs, status distribution, and list of 5xx errors.
    """
    path = Path(log_path)
    ip_counter = Counter()
    status_counter = Counter()
    errors_5xx = []

    with path.open('r', errors='replace') as f:
        for line in f:
            m = NGINX_LOG_RE.match(line)
            if not m:
                continue
            if since:
                try:
                    entry_time = datetime.strptime(m.group('time'), LOG_TIME_FMT)
                    if entry_time < since:
                        continue
                except ValueError:
                    pass  # unparseable timestamp: include the line
            ip = m.group('ip')
            status = m.group('status')
            ip_counter[ip] += 1
            status_counter[status] += 1
            if status.startswith('5'):
                errors_5xx.append({
                    'ip': ip,
                    'time': m.group('time'),
                    'path': m.group('path'),
                    'status': status
                })

    return {
        'top_ips': ip_counter.most_common(10),
        'status_distribution': dict(status_counter),
        'server_errors': errors_5xx,
        'total_5xx': len(errors_5xx)
    }

Three things worth noting here. First, opening log files via Path.open() with errors='replace' prevents encoding errors from crashing your script when log entries contain malformed UTF-8 -- a real occurrence with web traffic. Second, using named groups in your regex ((?P<ip>...)) makes the code self-documenting and keeps the parsing logic readable when you come back to it six months later. Third, the optional since parameter demonstrates the advantage Python has over awk for time-based filtering: once the timestamp is parsed into a real datetime object, comparisons are exact and timezone-aware, with no string manipulation required.

Warning

For log files that are actively being written, Python's standard file iteration is not tail-aware -- it reads whatever is on disk at the time you open the file. If you need to follow a log in real time, look at the watchdog library or use subprocess to call journalctl -f for systemd journal logs specifically.
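If pulling in a dependency is overkill, a polling follow loop can be sketched in a few lines of stdlib Python. This is a simplification -- unlike tail -F, it does not detect log rotation or truncation:

```python
import time
from pathlib import Path

def follow(path: str, poll_interval: float = 0.5):
    """Yield lines appended to a file after we open it, like `tail -f`."""
    with Path(path).open('r', errors='replace') as f:
        f.seek(0, 2)  # start at the current end of the file
        while True:
            line = f.readline()
            if line:
                yield line.rstrip('\n')
            else:
                time.sleep(poll_interval)  # nothing new yet; poll again
```

Typical use would be feeding each yielded line into a parser such as the regex above: for line in follow('/var/log/nginx/access.log'): ...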

Remote Execution: Paramiko and Fabric

Everything covered so far runs locally. But the real leverage in Python-based automation comes when you apply the same patterns -- structured output, explicit error handling, idempotent operations -- across a fleet. One server with a disk monitoring script is useful. The same script, run from a single control point against fifty servers with results collected and compared, is how you catch the host that's quietly filling up while everything else looks fine.

Most sysadmin work does not happen on a single machine. As soon as you have more than a handful of servers, you need a way to run commands remotely and collect their output programmatically. Python has two primary tools for this: Paramiko and Fabric.

Paramiko (current version: 4.0.0) is, as its official documentation describes it, a pure-Python (3.6+) implementation of the SSHv2 protocol providing both client and server functionality. It is the underlying SSH engine. Paramiko's own documentation is explicit on this point: it recommends using Fabric for common client use-cases like running remote shell commands or transferring files, reserving direct Paramiko use for users who need advanced or low-level primitives.

Fabric (current version: 3.2.2) is the high-level layer. Fabric's documentation describes it as a library designed to execute shell commands remotely over SSH, yielding useful Python objects in return. It builds on top of Invoke (for local subprocess handling) and Paramiko, extending both to provide a clean API for fleet-level operations.

remote_ops.py
from fabric import Connection, SerialGroup

# Single host: check disk usage
def check_disk(host: str, user: str = 'ubuntu') -> str:
    with Connection(host, user=user) as c:
        result = c.run('df -h /', hide=True)
        return result.stdout.strip()

# Fleet operation: run a command on multiple hosts
def fleet_uptime(hosts: list[str]) -> dict:
    results = {}
    group = SerialGroup(*hosts)
    group_results = group.run('uptime', hide=True)
    for conn, result in group_results.items():
        results[conn.host] = result.stdout.strip()
    return results

# Upload a file and reload a service
def deploy_config(host: str, local_path: str, remote_path: str):
    with Connection(host) as c:
        c.put(local_path, remote=remote_path)
        c.sudo('systemctl reload nginx', hide=True)
        print(f"Config deployed and nginx reloaded on {host}")

Fabric's SerialGroup runs commands on hosts sequentially. If you need parallel execution, ThreadingGroup is the alternative -- it runs the same command on all hosts concurrently and returns when all have finished or failed. For large fleets, parallelism is essential, but it makes failure handling more complex, so start with serial and optimize only when the timing matters.

paramiko_direct.py
import paramiko

# Direct Paramiko for cases needing low-level control
def run_command_with_key(host: str, username: str, key_path: str, command: str) -> tuple[str, str, int]:
    client = paramiko.SSHClient()
    client.load_system_host_keys()
    client.set_missing_host_key_policy(paramiko.RejectPolicy())  # never AutoAdd in production

    try:
        client.connect(
            hostname=host,
            username=username,
            key_filename=key_path,
            timeout=10
        )
        stdin, stdout, stderr = client.exec_command(command)
        # recv_exit_status() blocks until the command finishes.
        # Without this, stdout.read() may return partial output.
        exit_code = stdout.channel.recv_exit_status()
        out = stdout.read().decode().strip()
        err = stderr.read().decode().strip()
        return out, err, exit_code
    finally:
        client.close()

Caution

Never use paramiko.AutoAddPolicy() in production scripts. It silently accepts any host key, which defeats SSH's host verification and opens you to man-in-the-middle attacks. Use RejectPolicy() and pre-populate your known_hosts file. If you're bootstrapping new hosts, verify the host key fingerprint through an out-of-band channel first.

One non-obvious Paramiko behavior worth knowing: exec_command() returns immediately -- it does not block until the remote command finishes. Calling stdout.read() before the command has written all its output produces partial results. The correct pattern is to call stdout.channel.recv_exit_status() first, which blocks until the channel closes and the remote process exits. Only then read stdout and stderr. The updated function signature above also returns the exit code, making success checking explicit rather than implicit.

Scheduling and Automation

Python scripts become genuinely powerful when you combine them with the system's scheduling infrastructure. There are two main approaches: using systemd timers to run scripts on a schedule (the modern approach, covered extensively in the systemd guide on this site), or using Python's own scheduling capabilities for in-process recurring tasks.

For scripts that need to run periodically and be managed like services, systemd timers are the right answer. Create a .service unit that runs your Python script and a .timer unit that triggers it. You get full integration with journald for logging, automatic restart on failure, and Persistent=true for catch-up runs after downtime.

/etc/systemd/system/disk-monitor.service
[Unit]
Description=Disk usage monitor script
After=network.target

[Service]
Type=oneshot
User=monitor
ExecStart=/usr/bin/python3 /opt/scripts/disk_monitor.py
StandardOutput=journal
StandardError=journal
SyslogIdentifier=disk-monitor

/etc/systemd/system/disk-monitor.timer
[Unit]
Description=Run disk monitor every 15 minutes

[Timer]
OnCalendar=*:0/15
Persistent=true

[Install]
WantedBy=timers.target

For in-process scheduling -- when you want a single long-running daemon to execute different tasks at different intervals -- the schedule library (pip install schedule) is a lightweight and readable option. But be aware that it runs on a single thread, so a slow or blocking task will delay everything else. For anything with real concurrency requirements, Python's asyncio with aiocron (pip install aiocron), or separate systemd timer units for each task, are more appropriate.
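To make the single-thread caveat concrete, here is a stdlib-only sketch of the pattern that schedule implements internally -- Job, run_due, and main_loop are illustrative names, not part of any library:

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Job:
    interval: float                  # seconds between runs
    func: Callable[[], None]
    next_run: float = field(default=float("-inf"))  # -inf == due immediately

def run_due(jobs: List[Job]) -> int:
    """Run every job whose time has come; return how many ran."""
    ran = 0
    for job in jobs:
        if time.monotonic() >= job.next_run:
            job.func()               # a slow job here delays all the others
            job.next_run = time.monotonic() + job.interval
            ran += 1
    return ran

def main_loop(jobs: List[Job]) -> None:
    while True:
        run_due(jobs)
        time.sleep(1)
```

Because run_due executes each due job inline, one blocking func pushes back every other job's start time -- exactly the failure mode described above.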

Structured Output and Reporting

Scripts that run silently and only speak up when something is wrong are far more maintainable than chatty scripts. Python's logging module gives you structured, leveled output that integrates cleanly with systemd journal -- no print statements scattered through your code.

script_template.py
import logging
import sys
import json
from datetime import datetime

# Configure logging to stdout (systemd captures this via journal)
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s %(levelname)s %(name)s: %(message)s',
    datefmt='%Y-%m-%dT%H:%M:%S',
    stream=sys.stdout
)
logger = logging.getLogger('disk-monitor')

def main():
    import psutil
    threshold = 85.0
    results = []

    for part in psutil.disk_partitions(all=False):
        try:
            usage = psutil.disk_usage(part.mountpoint)
        except PermissionError:
            continue
        results.append({
            'mountpoint': part.mountpoint,
            'percent': usage.percent,
            'free_gb': round(usage.free / 1e9, 2)
        })
        if usage.percent >= threshold:
            logger.warning(
                "High disk usage on %s: %.1f%% used (%.2f GB free)",
                part.mountpoint, usage.percent, usage.free / 1e9
            )

    # Emit JSON summary for downstream consumption
    logger.info("disk_check_complete result=%s", json.dumps(results))

if __name__ == '__main__':
    main()

Using logger.warning(...) instead of print() means that when this script runs under systemd, the severity level is preserved in the journal. You can then query high-severity events from all your monitoring scripts with a single journalctl -p warning.

Security Considerations for Admin Scripts

Python scripts that run with elevated privileges are attack surface. A few principles that reduce risk significantly:

Avoid storing credentials in scripts. Use environment variables loaded at runtime, or Python's keyring library for secrets that need to persist. For service accounts, use SSH keys with appropriately scoped permissions rather than passwords. If you're passing credentials to external APIs, the python-dotenv library for loading from .env files is a reasonable pattern for non-production environments.
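A minimal sketch of the environment-variable pattern -- get_required_env and the MYAPP_API_TOKEN name are illustrative, not from any library:

```python
import os
import sys

def get_required_env(name: str) -> str:
    """Fetch a credential from the environment, failing fast and loudly."""
    value = os.environ.get(name, "").strip()
    if not value:
        sys.exit(f"error: required environment variable {name} is not set")
    return value

# The variable would typically be supplied by the systemd unit
# via Environment= or, better, EnvironmentFile= with 0600 permissions:
# api_token = get_required_env("MYAPP_API_TOKEN")
```

Failing at startup with a named variable beats discovering a missing credential halfway through a provisioning run.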

Validate and sanitize all inputs. Any value that comes from outside the script -- a command-line argument, an environment variable, data read from a file -- must be treated as untrusted. Use argparse with type validators for CLI arguments. Never interpolate external data directly into shell commands.

Run scripts with the minimum privilege required. If a monitoring script only needs to read /proc and call psutil, it should not run as root. Create dedicated service accounts with narrow permissions. The systemd User= directive in service units makes this easy to enforce consistently.

Use virtual environments for script dependencies. A system-wide pip install can break system tools that depend on specific library versions. For each script project that has third-party dependencies, create an isolated virtual environment:

$ python3 -m venv /opt/scripts/monitor-env && /opt/scripts/monitor-env/bin/pip install psutil paramiko fabric

Then reference the venv interpreter in your systemd ExecStart: /opt/scripts/monitor-env/bin/python3 /opt/scripts/monitor.py. This keeps your script's dependencies isolated from the system Python environment entirely.
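Put together, a minimal oneshot unit tying these pieces -- the venv interpreter, a dedicated service account, and the exit-code contract -- might look like this (the unit name, paths, and the monitor account are illustrative, not prescribed):

```ini
# /etc/systemd/system/disk-monitor.service  (illustrative)
[Unit]
Description=Disk usage monitor

[Service]
Type=oneshot
# Minimum privilege: a dedicated account, not root
User=monitor
# The venv interpreter keeps dependencies off the system Python
ExecStart=/opt/scripts/monitor-env/bin/python3 /opt/scripts/monitor.py
# Treat "threshold exceeded" (exit 1) as a reportable, non-crash result
SuccessExitStatus=1
```

Pair it with a matching .timer unit to run it on a schedule instead of cron.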

CLI Design with argparse

The security section mentioned validating command-line arguments -- but the design of a script's CLI deserves its own attention. A script that takes no arguments and has its configuration baked in is fine for personal use. A script you're deploying across a team or scheduling via systemd needs a proper interface: named flags, sensible defaults, a help message that actually helps, and exit codes that automation can act on.

Python's argparse module is the standard library answer to this. It handles argument parsing, type coercion, required vs optional flags, and auto-generated help text with no external dependencies.

disk_report.py
import argparse
import json
import sys
import psutil

def parse_args():
    parser = argparse.ArgumentParser(
        description='Report disk usage and alert on high utilization.',
        formatter_class=argparse.ArgumentDefaultsHelpFormatter
    )
    parser.add_argument(
        '--threshold',
        type=float,
        default=85.0,
        help='Alert threshold as a percentage (0-100)'
    )
    parser.add_argument(
        '--mountpoints',
        nargs='+',
        default=None,
        metavar='PATH',
        help='Mountpoints to check. Defaults to all physical partitions.'
    )
    parser.add_argument(
        '--json',
        action='store_true',
        help='Output results as JSON'
    )
    return parser.parse_args()

def main():
    args = parse_args()

    if not (0 <= args.threshold <= 100):
        print("error: --threshold must be between 0 and 100", file=sys.stderr)
        sys.exit(2)

    partitions = psutil.disk_partitions(all=False)
    if args.mountpoints:
        partitions = [p for p in partitions if p.mountpoint in args.mountpoints]

    results = []
    alert = False
    for part in partitions:
        try:
            usage = psutil.disk_usage(part.mountpoint)
        except (PermissionError, psutil.AccessDenied):
            continue
        entry = {'mountpoint': part.mountpoint, 'percent': usage.percent}
        results.append(entry)
        if usage.percent >= args.threshold:
            alert = True

    if args.json:
        print(json.dumps(results, indent=2))
    else:
        for r in results:
            flag = " <-- ALERT" if r['percent'] >= args.threshold else ""
            print(f"{r['mountpoint']:30s} {r['percent']:5.1f}%{flag}")

    # Exit 1 if any partition exceeded the threshold
    sys.exit(1 if alert else 0)

if __name__ == '__main__':
    main()

A few things this example demonstrates that are worth making habits. formatter_class=argparse.ArgumentDefaultsHelpFormatter automatically adds the default value to each argument's help text -- which saves the person reading the help message from having to dig through the code. The explicit sys.exit() calls with specific exit codes (0 for clean, 1 for threshold exceeded, 2 for invalid input) mean that the script participates correctly in shell conditionals and systemd's SuccessExitStatus= directive. And separating the --json output path from the human-readable path means the same script can serve both a human running it interactively and a monitoring pipeline consuming its output.

Pro Tip

Resist the urge to use sys.exit() mid-function when an error occurs. Centralize your exit points in main() and let functions raise exceptions that bubble up. This makes the script far easier to test -- you can call the functions directly without triggering a process exit.
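A minimal sketch of that structure -- ProvisioningError and check_threshold are illustrative names, not part of any library:

```python
import logging

logger = logging.getLogger(__name__)

class ProvisioningError(Exception):
    """Raised by worker functions; only main() turns it into an exit code."""

def check_threshold(value: float) -> float:
    # Workers raise instead of calling sys.exit(), so tests can call
    # them directly and assert on the exception.
    if not 0 <= value <= 100:
        raise ProvisioningError(f"threshold out of range: {value}")
    return value

def main() -> int:
    # The single place where errors become exit codes.
    try:
        check_threshold(85.0)
        return 0
    except ProvisioningError as e:
        logger.error("%s", e)
        return 2

# In a script, the entry point stays one line:
#     if __name__ == '__main__':
#         import sys
#         sys.exit(main())
```

Having main() return an int rather than calling sys.exit() itself takes the pattern one step further: even main() becomes directly testable.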

It's worth pausing to name what three standard library modules -- argparse, logging, and sys -- accomplish together. argparse gives your script a self-describing interface with defaults and type validation. logging routes diagnostic output through severity levels that systemd journal preserves and you can query later. sys.exit() with meaningful codes makes the script a proper participant in shell pipelines and service managers. Together, they constitute a professional script contract: the script accepts structured input, produces structured output, communicates errors through the right channels, and exits in a way that automation can act on. None of these individually is impressive. Combined, they're the difference between a script that someone else can use and one that only works when its author is watching.

Exception Handling Strategy

Error handling scattered through code examples is easy to miss as a topic in its own right. It deserves direct attention, because how you handle exceptions in sysadmin scripts has real consequences -- not just for correctness, but for diagnosability when something goes wrong at midnight.

The governing principle is: catch specifically, fail loudly, log with context. A bare except: clause that swallows any error and continues is almost always wrong. It hides bugs, masks partial failures, and produces scripts where a silent wrong answer is harder to diagnose than an explicit crash.

exception_patterns.py
import subprocess
import logging

logger = logging.getLogger(__name__)

# Bad: swallows all exceptions, hides the real failure
def restart_service_bad(name: str):
    try:
        subprocess.run(['systemctl', 'restart', name], check=True)
    except:
        pass  # silently continues even if systemctl fails

# Good: catches what you expect, re-raises or logs with context
def restart_service(name: str) -> bool:
    try:
        subprocess.run(
            ['systemctl', 'restart', name],
            check=True,
            capture_output=True,
            text=True,
            timeout=30
        )
        logger.info("Restarted service: %s", name)
        return True
    except subprocess.CalledProcessError as e:
        logger.error("systemctl restart %s failed (exit %d): %s", name, e.returncode, e.stderr.strip())
        return False
    except subprocess.TimeoutExpired:
        logger.error("systemctl restart %s timed out", name)
        return False
    except FileNotFoundError:
        logger.error("systemctl not found -- is this a systemd host?")
        raise  # environment problem: re-raise, don't hide it

Three habits stand out here. First, include the stderr output in your error logs when a subprocess fails -- that's where the actual failure message lives, and without it you're logging that a thing failed without logging why. Second, distinguish between expected failures (a service that isn't installed, a file that doesn't exist yet) and environmental failures (systemctl not found, filesystem not mounted). Expected failures are worth handling gracefully; environmental failures are usually worth re-raising or exiting hard, because continuing is likely to make things worse.

Third, be careful with partial failures in loops. When iterating over a list of hosts or partitions and one fails, the right behavior depends on context. In a monitoring script, you usually want to log the failure, continue the loop, and exit with a non-zero code at the end. In a provisioning script, a failure on one host may mean you should stop immediately rather than continue onto the next. Neither is universally correct -- the point is to make the choice explicitly rather than having it fall out of which exception got caught.

Warning

When using logger.error() to record an exception, prefer logger.exception() if you want the full traceback captured in the log. logger.exception() logs at ERROR level just like logger.error(), but automatically appends the current exception's traceback -- note that it must be called from inside an exception handler. That traceback is exactly what you want when diagnosing failures from journalctl output after the fact.
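For example (read_config is an illustrative helper, not a library function):

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def read_config(path: str) -> str:
    try:
        with open(path) as f:
            return f.read()
    except OSError:
        # logger.exception() logs at ERROR level *and* appends the
        # active traceback; it only works inside an except block.
        logger.exception("Could not read config file %s", path)
        raise  # let the caller decide whether this is fatal
```

Logging and re-raising, as here, is the usual combination at module boundaries: the log entry carries the context, and the exception still propagates to whoever can actually handle it.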

Writing Idempotent Scripts

Here is a situation that happens more than it should: a deployment script runs, hits a timeout partway through, and leaves a host in an unknown intermediate state. The on-call engineer needs to know whether it's safe to run the script again. Nobody knows for certain. Running it risks making the state worse; not running it leaves the host broken. The correct answer should always be "yes, run it again" -- and that requires writing for idempotency from the start, not retrofitting it after an incident.

A script is idempotent if running it multiple times produces the same result as running it once. That property is what turns the scenario above from a judgment call into a non-event: re-running becomes the default, safe recovery action.

The good news is that most of the standard library already pushes you in this direction. mkdir(exist_ok=True) instead of checking if a directory exists first. shutil.copy2() which overwrites the destination if it already exists. The user_exists() check before calling useradd shown in the user management section. These are idempotency patterns, even when they're not labeled as such.
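A small sketch of this style using only pathlib -- ensure_deploy_layout is an illustrative name, and the directory layout is invented for the example:

```python
from pathlib import Path

def ensure_deploy_layout(root: str) -> None:
    """Idempotent directory and file setup: safe to run repeatedly."""
    base = Path(root)
    # exist_ok=True: no error if the tree is already in place
    (base / "releases").mkdir(parents=True, exist_ok=True)
    (base / "logs").mkdir(parents=True, exist_ok=True)
    # touch(exist_ok=True) creates the file if missing and leaves an
    # existing file's contents untouched
    (base / "logs" / "deploy.log").touch(exist_ok=True)
```

Note the absence of any "if exists" branches: the stdlib calls carry the idempotency themselves, which keeps the function short and free of check-then-act race conditions.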

The more interesting challenge is handling state that isn't just file presence. Here is a pattern for managing a line in a configuration file -- adding it if missing, leaving it alone if already present, never duplicating it:

idempotent_config.py
from pathlib import Path
from typing import Optional
import shutil
import logging
from datetime import datetime

logger = logging.getLogger(__name__)

def ensure_line_in_file(filepath: str, line: str, comment: Optional[str] = None) -> bool:
    """
    Ensure a line is present in a file. Returns True if the file was modified.
    Creates the file if it doesn't exist.
    """
    path = Path(filepath)
    target = line.strip()

    # Read existing content (or start empty)
    existing = path.read_text() if path.exists() else ""

    # Idempotency check: already present, nothing to do
    if target in (l.strip() for l in existing.splitlines()):
        return False

    # Back up before modifying
    if path.exists():
        ts = datetime.now().strftime("%Y%m%d_%H%M%S")
        shutil.copy2(path, f"{filepath}.bak.{ts}")

    # Append the line (with optional comment header)
    with path.open('a') as f:
        if not existing.endswith('\n') and existing:
            f.write('\n')  # ensure we start on a new line
        if comment:
            f.write(f"# {comment}\n")
        f.write(f"{target}\n")

    return True

# Usage: safe to run on a host that already has this entry
modified = ensure_line_in_file(
    "/etc/security/limits.conf",
    "appuser soft nofile 65536",
    comment="Set by provisioning script"
)
logger.info("limits.conf %s", "updated" if modified else "already up to date")

The function backs up the file before touching it, which matters when you're modifying system configuration files. The backup filename includes a timestamp, so repeated runs don't overwrite each other's backups. And the return value gives the caller a clean signal about whether anything actually changed -- useful for deciding whether downstream steps (like reloading a service) are necessary.

Note

The same principle applies to service management. Before calling systemctl enable or systemctl start, check the current state with subprocess.run(['systemctl', 'is-enabled', service], capture_output=True) and act only if the state is not already what you want. Scripts that call systemctl unconditionally produce noisy output and can mask real errors in your logs.
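One way to sketch that check -- with the command runner passed in as a parameter so the decision logic stays testable without a systemd host (ensure_service_enabled is an illustrative name):

```python
import subprocess
from typing import Callable

def ensure_service_enabled(
    name: str,
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> bool:
    """Enable a unit only if it is not already enabled.

    Returns True if a change was made, False if the unit was already
    in the desired state. The runner is injectable for testing.
    """
    state = run(
        ['systemctl', 'is-enabled', name],
        capture_output=True, text=True
    )
    if state.stdout.strip() == 'enabled':
        return False  # already in the desired state: do nothing
    run(['systemctl', 'enable', name], check=True)
    return True
```

The boolean return value follows the same convention as ensure_line_in_file earlier: callers can use it to decide whether downstream steps, such as a daemon-reload, are actually needed.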

Testing Admin Scripts

The article introduction promised production-quality patterns. Testing is the part that makes that claim defensible. A Python script that runs as root and modifies system configuration needs to be verified -- but obviously you cannot run it against a live system during development and call that testing.

The standard approach is to write functions that do one thing and accept their dependencies as parameters rather than reaching for the filesystem or subprocess directly. Then test those functions with mocked dependencies using Python's unittest.mock module, which is part of the standard library.

test_users.py
import unittest
from unittest.mock import patch, MagicMock
from users import user_exists, create_service_account, group_members

class TestUserExists(unittest.TestCase):

    @patch('users.pwd.getpwnam')
    def test_returns_true_when_user_found(self, mock_getpwnam):
        mock_getpwnam.return_value = MagicMock()
        self.assertTrue(user_exists('appuser'))
        mock_getpwnam.assert_called_once_with('appuser')

    @patch('users.pwd.getpwnam', side_effect=KeyError)
    def test_returns_false_when_user_missing(self, mock_getpwnam):
        self.assertFalse(user_exists('nonexistent'))

class TestGroupMembers(unittest.TestCase):

    @patch('users.grp.getgrnam')
    def test_returns_member_list(self, mock_getgrnam):
        mock_getgrnam.return_value = MagicMock(gr_mem=['alice', 'bob'])
        self.assertEqual(group_members('sudo'), ['alice', 'bob'])

    @patch('users.grp.getgrnam', side_effect=KeyError)
    def test_raises_on_missing_group(self, mock_getgrnam):
        with self.assertRaises(ValueError):
            group_members('nonexistent')

class TestCreateServiceAccount(unittest.TestCase):

    @patch('users.user_exists', return_value=True)
    @patch('users.subprocess.run')
    def test_skips_useradd_if_user_exists(self, mock_run, mock_exists):
        create_service_account('appuser')
        mock_run.assert_not_called()  # idempotency: no subprocess call

    @patch('users.user_exists', return_value=False)
    @patch('users.subprocess.run')
    def test_calls_useradd_with_correct_args(self, mock_run, mock_exists):
        create_service_account('appuser')
        cmd = mock_run.call_args.args[0]  # Python 3.8+ call_args.args
        self.assertIn('useradd', cmd)
        self.assertIn('--system', cmd)
        self.assertIn('appuser', cmd)

if __name__ == '__main__':
    unittest.main()

The @patch decorator intercepts the named attribute at test time and replaces it with a MagicMock. The key insight is the patching target: you patch 'users.pwd.getpwnam', not 'pwd.getpwnam', because you want to intercept it in the module where it is used, not in the module where it is defined. Getting this wrong is the most common reason patch appears to have no effect. Also note the modern call_args.args[0] syntax for inspecting what was passed to a mock -- the older call_args[0][0] form still works but the named-attribute form (available since Python 3.8) is considerably clearer to read.

For filesystem tests, use tmp_path if you adopt pytest (which is worth doing -- pytest's test discovery, fixtures, and assertion introspection are substantially better than the standard unittest runner). tmp_path gives each test a fresh, isolated temporary directory that is cleaned up automatically after the test finishes. You can run real file operations against it without touching the actual filesystem.

test_idempotent_config.py
from idempotent_config import ensure_line_in_file

def test_adds_missing_line(tmp_path):
    cfg = tmp_path / "limits.conf"
    cfg.write_text("# existing content\n")

    modified = ensure_line_in_file(str(cfg), "appuser soft nofile 65536")

    assert modified is True
    assert "appuser soft nofile 65536" in cfg.read_text()

def test_does_not_duplicate_existing_line(tmp_path):
    cfg = tmp_path / "limits.conf"
    cfg.write_text("appuser soft nofile 65536\n")

    modified = ensure_line_in_file(str(cfg), "appuser soft nofile 65536")
    content = cfg.read_text()

    assert modified is False
    assert content.count("appuser soft nofile 65536") == 1

def test_creates_file_if_missing(tmp_path):
    cfg = tmp_path / "newfile.conf"

    modified = ensure_line_in_file(str(cfg), "some_setting = yes")

    assert modified is True
    assert cfg.exists()

def test_creates_backup_before_modifying(tmp_path):
    cfg = tmp_path / "limits.conf"
    cfg.write_text("# original\n")

    ensure_line_in_file(str(cfg), "new_setting = 1")

    backups = list(tmp_path.glob("limits.conf.bak.*"))
    assert len(backups) == 1
    assert backups[0].read_text() == "# original\n"

def test_writes_comment_header(tmp_path):
    cfg = tmp_path / "limits.conf"
    cfg.write_text("")

    ensure_line_in_file(str(cfg), "fs.file-max = 100000", comment="tuning")
    content = cfg.read_text()

    assert "# tuning" in content
    assert content.index("# tuning") < content.index("fs.file-max")

The five tests above cover the core contract of ensure_line_in_file: adds a missing entry, refuses to duplicate an existing one, creates the file if it doesn't exist, creates a timestamped backup before any modification, and writes the optional comment header in the correct position. Together they document the function's behavior precisely enough that anyone reading the test file knows exactly what the function is supposed to do -- without running it. That is what good tests provide beyond catching bugs: they are executable specifications.

There is a design principle hiding inside the testing requirement, and it's worth naming explicitly: if a function is hard to test without root or without a live system, the problem is not the testing -- it's the function's design. Functions that mix logic with privilege are hard to test and hard to reason about. The solution is to extract the logic that can be tested cleanly and leave the privileged operations as thin wrappers around it. A function that checks whether a configuration entry already exists has no reason to also write to the filesystem. Separate those concerns, and you can test both independently. The testable part gets a proper test. The privileged wrapper gets a targeted integration test or a manual check. This pattern -- logic separate from side effects -- is as applicable to sysadmin scripts as it is to any other software.
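Applied to the earlier ensure_line_in_file example, the split might look like this -- a sketch, not a drop-in replacement, with needs_line and apply_line as illustrative names:

```python
from pathlib import Path

def needs_line(content: str, line: str) -> bool:
    """Pure decision logic: no I/O, no privileges, trivially testable."""
    return line.strip() not in (ln.strip() for ln in content.splitlines())

def apply_line(filepath: str, line: str) -> bool:
    """Thin privileged wrapper: all the I/O, none of the decisions."""
    path = Path(filepath)
    content = path.read_text() if path.exists() else ""
    if not needs_line(content, line):
        return False
    with path.open('a') as f:
        f.write(line.rstrip('\n') + '\n')
    return True
```

needs_line can now be tested exhaustively with plain strings -- trailing whitespace, empty files, near-duplicate lines -- while apply_line needs only a couple of tmp_path tests to confirm the wiring.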

Pro Tip

Add a Makefile or a justfile at the root of your scripts repository with a test target: python -m pytest tests/ -v. Running tests becomes a single command, which means it actually gets run. Pair it with a pre-commit hook and you catch regressions before they reach production.

Putting It Together

Python earns its place in Linux system administration not by replacing shell tools but by giving you the ability to write automation you can trust. Bash handles the quick, composable tasks. Python handles the logic-heavy, data-rich automation where you need real error handling, testable code, and structured output -- and where the cost of a script behaving unexpectedly at 2am is not just inconvenient but potentially damaging.

The library stack to internalize is straightforward: pathlib and os for filesystem operations, subprocess with argument lists for calling system commands, shutil for higher-level file management, psutil for system monitoring, pwd and grp for user database access, and Fabric with Paramiko for remote execution. These cover the large majority of what sysadmins actually need to automate.

But the libraries are the smaller part of what makes a script production-ready. The larger part is the discipline: validate your inputs, handle your exceptions explicitly, log at appropriate severity levels, run with minimum required privilege, isolate dependencies. Design CLIs with argparse so your scripts have proper interfaces. Write for idempotency so re-runs are always safe. Separate logic from privilege so your code can be tested without a live system. These practices do not require more time -- they require different habits. Python makes them easier to develop than shell scripting does, which is ultimately why it belongs in every sysadmin's toolkit.

Sources and Further Reading