Ansible roles are the fundamental building blocks of maintainable automation. A well-structured role is portable across projects, testable in isolation, and transparent enough that a new team member can understand it without reading a novel of comments. A poorly structured one becomes the thing nobody wants to touch -- the role that only works on that one server, with that one inventory file, run by that one person who left the company six months ago.
This article covers the patterns that separate roles built for a single playbook from roles built for an organization. We'll walk through Molecule testing from scratch, dissect Ansible's notoriously complex variable precedence system, and examine handler strategies that prevent the subtle bugs that surface at 2 AM during a production deployment.
Anatomy of a Scalable Role
Every Ansible role follows a standard directory structure, historically generated by ansible-galaxy init and now by ansible-creator, the recommended scaffolding tool. But understanding which directories actually matter -- and how to use them correctly -- is what separates a role that works from one that scales. According to the official Ansible documentation, roles automatically load related vars, files, tasks, handlers, and other artifacts based on a known file structure.
```
# Standard Ansible role layout
roles/nginx/
├── defaults/
│   └── main.yml            # Variables users SHOULD override
├── vars/
│   └── main.yml            # Variables users should NOT override
├── tasks/
│   ├── main.yml            # Entry point -- delegates to subtasks
│   ├── install.yml
│   ├── configure.yml
│   └── service.yml
├── handlers/
│   └── main.yml            # Event-driven tasks (restarts, reloads)
├── templates/
│   └── nginx.conf.j2       # Jinja2 templates
├── files/                  # Static files to copy
├── meta/
│   ├── main.yml            # Role metadata + dependencies
│   └── argument_specs.yml  # Input validation (Ansible 2.11+)
├── molecule/               # Test scenarios
│   └── default/
└── README.md
```
The critical distinction here is between defaults/ and vars/. Variables in defaults/main.yml sit at the absolute bottom of Ansible's precedence hierarchy -- they exist specifically to be overridden by inventory variables, group_vars, host_vars, or playbook-level declarations. Variables in vars/main.yml sit much higher in the precedence stack and are meant for internal role constants that consumers should not change. Getting this wrong is the single most common source of "why isn't my variable taking effect?" bugs in Ansible.
Prefix every variable in your role with the role name. A role called nginx should define nginx_worker_processes, not worker_processes. This prevents collisions when multiple roles are composed in a single playbook -- a problem the Red Hat Communities of Practice documentation explicitly warns about.
A good defaults/main.yml reads like an API contract. It documents every knob the consumer can turn, with sensible defaults that work out of the box.
```yaml
# roles/nginx/defaults/main.yml
# All variables are prefixed with the role name to prevent
# collisions in multi-role playbooks.

# Package management
nginx_package_name: "nginx"
nginx_package_state: "present"

# Service configuration
nginx_service_name: "nginx"
nginx_service_enabled: true
nginx_service_state: "started"

# Core nginx.conf tunables
nginx_worker_processes: "auto"
nginx_worker_connections: 1024
nginx_keepalive_timeout: 65
nginx_multi_accept: true

# Log paths
nginx_access_log: "/var/log/nginx/access.log"
nginx_error_log: "/var/log/nginx/error.log"

# Virtual hosts
nginx_vhosts: []

# SSL defaults
nginx_ssl_protocols: "TLSv1.2 TLSv1.3"
nginx_ssl_ciphers: "HIGH:!aNULL:!MD5"
```
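The vars/main.yml counterpart should hold only internal constants. A minimal sketch -- the specific variables here are illustrative, not part of the article's running example:

```yaml
# roles/nginx/vars/main.yml
# Internal constants -- consumers should NOT override these.
# Their high precedence is intentional: they are implementation details.
__nginx_conf_validate_cmd: "nginx -t -c %s"
__nginx_pid_file: "/run/nginx.pid"
```

The double-underscore prefix signals "internal, not part of the role's API" -- a convention covered in more detail later in this article.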
Task Decomposition
The tasks/main.yml file should be a dispatcher, not a monolith. Split tasks into logical subtask files and include them in order. This improves readability, makes it easier to conditionally skip entire phases, and simplifies debugging when things go wrong.
```yaml
# roles/nginx/tasks/main.yml
# Entry point -- orchestrates subtask includes
---
- name: Include OS-specific variables
  ansible.builtin.include_vars: "{{ ansible_os_family | lower }}.yml"

- name: Install nginx packages
  ansible.builtin.include_tasks: install.yml

- name: Configure nginx
  ansible.builtin.include_tasks: configure.yml

- name: Configure virtual hosts
  ansible.builtin.include_tasks: vhosts.yml
  when: nginx_vhosts | length > 0

- name: Ensure nginx service state
  ansible.builtin.include_tasks: service.yml
```
Notice the include_vars call at the top. This pattern loads OS-family-specific variables (package names, config paths, service names) from files like vars/debian.yml and vars/redhat.yml, making the role portable across distributions without cluttering every task with when: ansible_os_family == 'Debian' conditionals.
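When you need a fallback chain -- distribution-specific file first, then the broader OS family -- a first_found lookup is a common variant. A sketch, assuming the role ships a vars/ file for each supported platform (the nginx_vars_files variable name is illustrative):

```yaml
- name: Include the most specific OS variables available
  ansible.builtin.include_vars: "{{ lookup('first_found', nginx_vars_files) }}"
  vars:
    nginx_vars_files:
      files:
        # Most specific match wins: debian12.yml > debian.yml > debian (family).yml
        - "{{ ansible_distribution | lower }}{{ ansible_distribution_major_version }}.yml"
        - "{{ ansible_distribution | lower }}.yml"
        - "{{ ansible_os_family | lower }}.yml"
      paths:
        - "{{ role_path }}/vars"
```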
Validate Inputs with argument_specs
Starting with Ansible 2.11, roles can define an argument_specs section in meta/main.yml (or a standalone meta/argument_specs.yml file) that formally documents and validates every variable the role accepts. This gives your role a machine-readable API contract: Ansible will reject playbook runs that pass invalid types, miss required variables, or supply values outside the allowed choices -- before a single task executes.
```yaml
# meta/argument_specs.yml
# Validates role inputs before tasks execute
---
argument_specs:
  main:
    short_description: Install and configure nginx
    description:
      - Installs nginx, manages configuration, and configures virtual hosts.
    options:
      nginx_worker_processes:
        type: str
        default: "auto"
        description: Number of worker processes or "auto"
      nginx_worker_connections:
        type: int
        default: 1024
        description: Max simultaneous connections per worker
      nginx_ssl_protocols:
        type: str
        default: "TLSv1.2 TLSv1.3"
        description: Allowed TLS protocol versions
      nginx_vhosts:
        type: list
        elements: dict
        default: []
        description: List of virtual host configurations
      nginx_service_enabled:
        type: bool
        default: true
        description: Whether nginx starts on boot
```
When teams adopt argument_specs, they gain two things at once: automatic input validation at runtime and auto-generated documentation via ansible-doc -t role nginx (the role must be on your configured role path or installed within a collection for ansible-doc to discover it). For roles shared across an organization, this eliminates the "read the README to figure out what variables exist" problem and catches misconfiguration before it reaches a target host.
Variable Precedence: The 22 Layers of Pain
Ansible's variable precedence system is one of its most confusing aspects, and misunderstanding it accounts for a disproportionate share of debugging time. The official documentation lists the full precedence order from least to greatest priority. Understanding the critical layers -- and where role variables fall -- is essential for writing roles that behave predictably when composed.
Here is the precedence order from lowest (most easily overridden) to highest (overrides all others):
1. Command line values (e.g., -u my_user -- these are not variables, but they do override defaults)
2. Role defaults (defaults/main.yml) -- lowest-priority variables
3. Inventory file or script group vars
4. Inventory group_vars/all
5. Playbook group_vars/all
6. Inventory group_vars/*
7. Playbook group_vars/*
8. Inventory file or script host vars
9. Inventory host_vars/*
10. Playbook host_vars/*
11. Host facts / cached set_facts
12. Play vars
13. Play vars_prompt
14. Play vars_files
15. Role vars (vars/main.yml)
16. Block vars (only for tasks within the block)
17. Task vars (only for the specific task)
18. include_vars
19. set_facts / registered vars
20. Role params (when passed via the roles: keyword)
21. include params
22. Extra vars (-e on the command line) -- highest priority
The gap between defaults/main.yml (level 2) and vars/main.yml (level 15) is enormous. Variables in vars/ override inventory variables, group_vars, host_vars, and play vars. If you put user-configurable values in vars/main.yml, consumers cannot override them through normal inventory mechanisms -- they'd need include_vars, set_fact, or --extra-vars to win.
The practical rule is straightforward: put everything the user should be able to change in defaults/main.yml. Put platform constants and internal implementation details in vars/main.yml. The Ansible documentation itself advises this approach, noting that variables in the defaults folder are designed to be easy to override while those in the vars directory are meant for values that should remain consistent. One additional nuance: variables created with set_fact using the cacheable: true option have high precedence during the current play, but when loaded from the fact cache in a subsequent play, they revert to the same precedence level as host facts -- a subtle distinction that can cause confusion in multi-play playbooks.
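A quick way to see the rule in action is a deliberately conflicting setup -- a hypothetical sketch, not part of the nginx role itself:

```yaml
# group_vars/webservers.yml -- the consumer's intended override
nginx_worker_connections: 4096

# roles/nginx/defaults/main.yml -- level 2, loses to group_vars (level 6): correct
nginx_worker_connections: 1024

# roles/nginx/vars/main.yml -- level 15, silently beats group_vars: the bug
nginx_worker_connections: 2048
```

With all three defined, the role runs with 2048, and the consumer's 4096 is ignored with no warning -- exactly the "why isn't my variable taking effect?" failure mode described above.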
The include_vars Trap
One of the subtlest precedence issues involves include_vars. When you use include_vars inside a role (for example, to load OS-specific variables), the loaded variables take precedence level 18 -- above role vars, block vars, and task vars. However, set_fact and registered variables sit one level higher at 19, so they can override include_vars values. The real danger is that include_vars silently overrides inventory variables, group_vars, host_vars, and play vars -- meaning a user who set a value in their inventory may find it unexpectedly ignored.
```yaml
# vars/debian.yml -- loaded via include_vars in tasks/main.yml
# WARNING: these will override play vars and role vars!
nginx_package_name: "nginx-full"   # overrides any play-level definition
nginx_conf_dir: "/etc/nginx"

# SAFE: only define platform-specific implementation details here,
# never user-facing defaults. Use unique variable names that
# don't collide with defaults/main.yml.
__nginx_os_package: "nginx-full"   # double-underscore = internal
__nginx_conf_path: "/etc/nginx"
```
The convention of double-underscore prefixing internal variables (like __nginx_os_package) was popularized by Jeff Geerling's roles on Ansible Galaxy and is now recognized as a community best practice -- the Red Hat Automation Good Practices guide explicitly cites his roles as prior art for this convention. This naming pattern signals that a variable is an internal implementation detail, not part of the role's public API, and prevents accidental collisions with user-defined variables. Note that Ansible itself treats underscore-prefixed variables identically to any other variable -- this is purely a community convention, not a language feature.
Debugging Precedence Issues
When a variable isn't behaving as expected, Ansible's verbose mode is your first tool. Running your playbook with increasing levels of -v flags reveals where each variable is coming from.
```shell
# Level 1: task-level output
$ ansible-playbook site.yml -v

# Level 3: shows variable origins and connection details
$ ansible-playbook site.yml -vvv

# Nuclear option: print a variable and its type mid-play
$ ansible webservers -m debug -a "var=hostvars[inventory_hostname]['nginx_worker_processes']"

# Force a specific value to override everything
$ ansible-playbook site.yml --extra-vars "nginx_worker_processes=4"
```
Instead of worrying about variable precedence, we encourage you to think about how easily or how often you want to override a variable when deciding where to set it. -- Ansible Official Documentation
Testing Roles with Molecule
Molecule is the official testing framework for Ansible content, maintained as a Red Hat-backed project. It provisions ephemeral infrastructure (typically Docker containers), applies your role, verifies the result, checks for idempotency, and tears everything down. According to the Molecule project documentation, it is designed to support testing with multiple instances, operating systems, distributions, virtualization providers, and test scenarios.
The framework runs a well-defined test sequence. When you execute molecule test, the default sequence is:
- dependency -- install Galaxy requirements
- cleanup -- run cleanup playbook (if defined)
- destroy -- ensure no leftover infrastructure from previous runs
- syntax -- run ansible-playbook --syntax-check
- create -- provision test instances via the configured driver
- prepare -- run a preparatory playbook on fresh instances
- converge -- run the role against the instances
- idempotence -- run the role again, fail if anything reports changed
- side_effect -- optional playbook for external interactions
- verify -- run state-checking tests
- cleanup -- final cleanup
- destroy -- tear down all infrastructure
This sequence is fully customizable via the test_sequence key in molecule.yml. Teams often define shorter sequences for development (skipping idempotence and side_effect) and reserve the full sequence for CI pipelines.
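A trimmed development sequence might look like this in molecule.yml (the exact ordering is up to you; this is one plausible arrangement):

```yaml
# molecule/default/molecule.yml (fragment)
scenario:
  test_sequence:
    - destroy     # start from a clean slate
    - create
    - converge
    - verify
    - destroy
```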
Setting Up Molecule
Install Molecule alongside the Docker driver plugin. Podman is also supported if you're working in environments where Docker's daemon model isn't acceptable.
# Install Molecule with Docker support $ pip install molecule molecule-plugins[docker] # Verify installation $ molecule --version # Install linting tools $ pip install ansible-lint yamllint # Initialize Molecule in an existing role $ cd roles/nginx $ molecule init scenario -d docker
This generates the molecule/default/ directory containing three critical files: molecule.yml (the configuration), converge.yml (the playbook that applies your role), and verify.yml (the test assertions).
Multi-Platform molecule.yml
Testing on a single OS is a recipe for surprises. A properly configured molecule.yml tests across the distributions your role claims to support. Jeff Geerling maintains a set of Docker images with Ansible pre-installed and systemd enabled (the geerlingguy/docker-*-ansible images), which are the de facto standard for Molecule testing.
```yaml
# molecule/default/molecule.yml
---
dependency:
  name: galaxy
  options:
    requirements-file: requirements.yml
driver:
  name: docker
platforms:
  - name: nginx-ubuntu2204
    image: geerlingguy/docker-ubuntu2204-ansible:latest
    pre_build_image: true
    privileged: true
    volumes:
      - /sys/fs/cgroup:/sys/fs/cgroup:rw
    cgroupns_mode: host
    command: /usr/sbin/init
  - name: nginx-debian12
    image: geerlingguy/docker-debian12-ansible:latest
    pre_build_image: true
    privileged: true
    volumes:
      - /sys/fs/cgroup:/sys/fs/cgroup:rw
    cgroupns_mode: host
    command: /usr/sbin/init
  - name: nginx-rocky9
    image: geerlingguy/docker-rockylinux9-ansible:latest
    pre_build_image: true
    privileged: true
    volumes:
      - /sys/fs/cgroup:/sys/fs/cgroup:rw
    cgroupns_mode: host
    command: /usr/sbin/init
provisioner:
  name: ansible
  playbooks:
    converge: converge.yml
    verify: verify.yml
verifier:
  name: ansible
```
The privileged: true and /sys/fs/cgroup volume mount are required for systemd to function inside Docker containers. Without them, any task that manages services via systemctl will fail. If you're testing roles that don't interact with systemd, you can omit these and use lighter-weight images.
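For a role that only lays down configuration files, a plain image with Python available is enough. A minimal sketch -- the image choice is an assumption, any image with Python works:

```yaml
# molecule.yml platforms entry for a role with no service management
platforms:
  - name: nginx-config-only
    image: python:3.12-slim    # has the Python interpreter Ansible needs
    pre_build_image: true
    command: sleep infinity    # keep the container alive between plays
```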
Writing the Converge Playbook
The converge playbook is what Molecule runs against your test instances. Keep it minimal -- it should invoke the role and nothing else, unless you need to simulate a realistic integration context.
```yaml
# molecule/default/converge.yml
---
- name: Converge
  hosts: all
  become: true
  vars:
    nginx_vhosts:
      - server_name: "test.local"
        root: "/var/www/test"
        listen: "80"
  roles:
    - role: "{{ lookup('env', 'MOLECULE_PROJECT_DIRECTORY') | basename }}"
```
Writing Verify Tests
The verify playbook runs after converge and asserts that the system is in the desired state. Molecule uses Ansible itself as the default verifier since version 3, replacing the earlier TestInfra default. The ansible.builtin.assert module is your primary tool here.
```yaml
# molecule/default/verify.yml
---
- name: Verify nginx role
  hosts: all
  become: true
  gather_facts: true
  tasks:
    - name: Gather package facts
      ansible.builtin.package_facts:
        manager: auto

    - name: Assert nginx is installed
      ansible.builtin.assert:
        that:
          - "'nginx' in ansible_facts.packages or 'nginx-full' in ansible_facts.packages"
        fail_msg: "nginx package is not installed"

    - name: Gather service facts
      ansible.builtin.service_facts:

    - name: Verify nginx service state
      ansible.builtin.assert:
        that:
          - "ansible_facts.services['nginx.service'].state == 'running'"
          - "ansible_facts.services['nginx.service'].status == 'enabled'"
        fail_msg: "nginx service is not running or not enabled"

    - name: Verify nginx configuration syntax
      ansible.builtin.command: nginx -t
      changed_when: false

    - name: Check nginx is listening on port 80
      ansible.builtin.wait_for:
        port: 80
        timeout: 10

    - name: Verify virtual host config exists
      ansible.builtin.stat:
        path: "/etc/nginx/sites-enabled/test.local.conf"
      register: vhost_config
      when: ansible_os_family == 'Debian'

    - name: Assert vhost config is present
      ansible.builtin.assert:
        that: vhost_config.stat.exists
      when: ansible_os_family == 'Debian'
```
The Idempotence Check
The idempotence phase is where Molecule runs your converge playbook a second time and fails if any task reports changed. This is arguably the single most valuable test Molecule performs. An idempotent role guarantees that running it twice produces the same state -- meaning your role won't introduce drift or trigger unnecessary service restarts on every execution.
Tasks using ansible.builtin.command or ansible.builtin.shell always report changed by default. You must add changed_when: false (for read-only commands) or a meaningful changed_when expression to prevent false positives in the idempotence check. For tasks that are inherently non-idempotent (such as one-time provisioning steps), add tags: ["molecule-notest"] -- Molecule automatically skips tasks with this tag during the idempotence phase.
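In practice the two fixes look like this (the bootstrap script path is hypothetical):

```yaml
# Read-only command: suppress the false "changed" report
- name: Check current nginx version
  ansible.builtin.command: nginx -v
  changed_when: false

# Inherently non-idempotent step: excluded from the idempotence phase
- name: Run one-time provisioning script
  ansible.builtin.command: /usr/local/bin/bootstrap.sh   # hypothetical path
  tags: ["molecule-notest"]
```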
Multiple Test Scenarios
A single "default" scenario is often insufficient. Molecule supports multiple named scenarios to test different configurations, edge cases, or integration contexts.
```shell
# Create an additional scenario for SSL testing
$ molecule init scenario --scenario-name with_ssl -d docker

# Run a specific scenario
$ molecule test -s with_ssl

# Run all scenarios
$ molecule test --all

# Development workflow: converge without full teardown
$ molecule converge                     # apply the role
$ molecule verify                       # run tests only
$ molecule login -h nginx-ubuntu2204    # open a shell in the instance
$ molecule idempotence                  # check idempotence only
$ molecule destroy                      # clean up when done
```
The development workflow of converge, verify, login, iterate is far faster than running molecule test each time, which destroys and recreates everything. Use molecule test for CI pipelines and final validation, and the individual subcommands for rapid development.
Handler Strategies That Won't Bite You
Handlers are one of Ansible's more elegant concepts: tasks that only run when notified, and that only run once regardless of how many tasks notify them. According to the official Ansible documentation, handlers are efficient because a handler only executes once even if multiple tasks trigger it -- preventing, for example, Apache from being bounced multiple times during a single playbook run.
But handlers have several behaviors that catch people off guard in production.
Handlers Run at the End of the Play
By default, all notified handlers run after the last task in the play completes. This means if task 3 notifies a handler to restart nginx, and task 15 tries to hit the nginx endpoint, the restart hasn't happened yet. The service is still running the old configuration.
The fix is meta: flush_handlers, which forces all pending handlers to execute immediately at that point in the play.
```yaml
# roles/nginx/tasks/configure.yml
---
- name: Template nginx.conf
  ansible.builtin.template:
    src: nginx.conf.j2
    dest: "{{ nginx_conf_dir }}/nginx.conf"
    owner: root
    group: root
    mode: '0644'
    validate: "nginx -t -c %s"
  notify: reload nginx

- name: Flush handlers to apply config now
  ansible.builtin.meta: flush_handlers

- name: Verify nginx is responding
  ansible.builtin.uri:
    url: http://localhost
    status_code: 200
  retries: 3
  delay: 2
```
Add a meta: flush_handlers at the end of every role's tasks/main.yml. This ensures that handlers triggered by your role execute before the next role in the play begins, preventing cross-role ordering issues where role B depends on a service restart that role A notified but hasn't flushed yet.
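In the role from this article, that's a single task at the bottom of tasks/main.yml:

```yaml
# roles/nginx/tasks/main.yml (final task)
- name: Flush handlers before the next role starts
  ansible.builtin.meta: flush_handlers
```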
The listen Keyword: Decoupled Handler Groups
Ansible 2.2 introduced the listen keyword, which allows multiple handlers to subscribe to a single topic. This is significantly more flexible than notifying handlers by name, and it decouples tasks from specific handler implementations. The official documentation notes that this feature is particularly useful when sharing handlers among playbooks and roles.
```yaml
# roles/nginx/handlers/main.yml
---
- name: validate nginx config
  ansible.builtin.command: nginx -t
  changed_when: false
  listen: "reload nginx"

- name: reload nginx service
  ansible.builtin.systemd:
    name: "{{ nginx_service_name }}"
    state: reloaded
  listen: "reload nginx"

- name: restart nginx service
  ansible.builtin.systemd:
    name: "{{ nginx_service_name }}"
    state: restarted
  listen: "restart nginx"
```
When a task notifies reload nginx, both the validation handler and the reload handler execute -- in the order they're defined, not the order they were notified. The validation runs first, and if it fails, the reload never fires. This is a critical safety pattern: always validate configuration before restarting or reloading a service.
Handler Name Collisions Across Roles
Here's a subtle trap: handlers from roles are inserted into the global scope of the play, not scoped to their role. This means if you have two roles -- say nginx and apache -- and both define a handler named restart webserver, only the last one loaded will actually run. The Ansible documentation explicitly warns about this, recommending that you use the form role_name : handler_name when notifying handlers to ensure you trigger the correct one.
```yaml
# BAD: generic handler names will collide across roles
notify: restart webserver

# GOOD: role-prefixed handler names
notify: restart nginx

# ALSO GOOD: fully qualified role:handler syntax
notify: nginx : restart nginx service
```
Prefer reload Over restart
When a service supports graceful reload (SIGHUP or an equivalent mechanism), always prefer state: reloaded over state: restarted. A reload picks up configuration changes without dropping active connections. A restart terminates all connections and starts the process fresh. For web servers, database proxies, and load balancers in production, the difference between a reload and a restart is the difference between zero-downtime deployment and a page of alerts.
CI/CD Integration
Molecule tests are only useful if they run automatically. Integrating Molecule into your CI pipeline ensures that every pull request against a role is validated before merge. GitHub Actions is the simplest path for open-source roles. For teams using Ansible Automation Platform, ansible-navigator provides an alternative execution model that wraps playbook runs in execution environments (container images with bundled collections and dependencies), which can also be integrated into CI pipelines for more production-representative testing.
```yaml
# .github/workflows/molecule.yml
---
name: Molecule Test

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  molecule:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        scenario:
          - default
          - with_ssl
      fail-fast: false
    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install dependencies
        run: |
          pip install molecule molecule-plugins[docker] ansible-lint yamllint

      - name: Run Molecule
        run: molecule test -s ${{ matrix.scenario }}
        env:
          MOLECULE_DISTRO: ubuntu2204
```
For GitLab CI, the approach is similar. The Sysbee engineering team documented their approach using service containers to run Molecule within GitLab CI, noting that systemd integration within Docker containers was the primary challenge they had to solve for reliable testing.
```yaml
# .gitlab-ci.yml
---
stages:
  - test

molecule_test:
  stage: test
  image: docker:24-dind
  services:
    - docker:24-dind
  variables:
    DOCKER_HOST: tcp://docker:2376
    DOCKER_TLS_CERTDIR: "/certs"
    DOCKER_CERT_PATH: "/certs/client"
    DOCKER_TLS_VERIFY: "1"
  before_script:
    - apk add --no-cache python3 py3-pip gcc musl-dev python3-dev libffi-dev
    - pip install molecule molecule-plugins[docker] ansible-lint --break-system-packages
  script:
    - molecule test
  rules:
    - changes:
        - roles/nginx/**
```
Patterns That Work at Scale
When you're managing a handful of roles, you can get away with ad-hoc conventions. When you're managing dozens -- across multiple teams, products, and environments -- you need enforced patterns.
Pin Everything
Use a requirements.yml file with explicit version pins for all role dependencies. Unpinned dependencies are the Ansible equivalent of npm install with no lockfile -- your playbook works today and breaks tomorrow because an upstream author pushed a breaking change.
```yaml
# requirements.yml -- pin all the things
# Check Ansible Galaxy for current versions before using
---
roles:
  - name: geerlingguy.docker
    version: "7.4.1"
  - name: geerlingguy.certbot
    version: "5.2.0"

collections:
  - name: ansible.posix
    version: "1.6.0"
  - name: community.general
    version: "9.0.0"
```
Use ansible-lint in CI
The ansible-lint tool catches deprecated module usage, style violations, and common mistakes. Run it alongside Molecule in your CI pipeline. It is an official Ansible-maintained project and has become a standard part of the Ansible development toolchain.
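A small configuration file keeps the signal high. A sketch, assuming a recent ansible-lint release where the profile key is available:

```yaml
# .ansible-lint
profile: production          # opt into the strictest built-in rule profile
exclude_paths:
  - .cache/
  - molecule/
warn_list:
  - experimental             # downgrade experimental rules to warnings
```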
Use FQCNs Everywhere
Always use Fully Qualified Collection Names for modules: ansible.builtin.template instead of template, ansible.builtin.service instead of service. This eliminates ambiguity when multiple collections provide modules with the same short name, makes your roles forward-compatible with future Ansible versions, and is enforced by ansible-lint by default.
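The difference is purely in how the module is referenced:

```yaml
# Ambiguous short name -- resolution depends on which collections are installed
- name: Deploy nginx.conf
  template:
    src: nginx.conf.j2
    dest: /etc/nginx/nginx.conf

# FQCN -- always resolves to the built-in module
- name: Deploy nginx.conf
  ansible.builtin.template:
    src: nginx.conf.j2
    dest: /etc/nginx/nginx.conf
```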
Loose Coupling Through Role Dependencies
You can declare role dependencies in meta/main.yml, but use this sparingly. Hard dependencies on external roles reduce flexibility and create version management headaches. The Red Hat Automation Good Practices guide warns that roles with hard dependencies on external roles have limited flexibility and increased risk that changes to the dependency will result in unexpected behavior or failures. Prefer soft dependency patterns where the consuming playbook explicitly includes both roles in the correct order.
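Concretely, the two styles look like this (role names borrowed from the earlier requirements.yml example):

```yaml
# HARD dependency -- roles/nginx/meta/main.yml
# Pulls in certbot automatically every time the nginx role runs.
dependencies:
  - role: geerlingguy.certbot

# SOFT dependency -- the consuming playbook orders the roles explicitly
- name: Configure web tier
  hosts: webservers
  roles:
    - geerlingguy.certbot
    - nginx
```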
Consider Collections for Distribution
The Ansible ecosystem has been steadily shifting toward collections as the primary distribution and packaging mechanism. While standalone Galaxy roles still work, collections bundle roles alongside plugins, modules, and documentation into a single versioned, namespaced package. If your organization is distributing roles across teams or publishing them externally, packaging them inside a collection gives you a unified versioning strategy, namespace isolation, and better dependency management. The ansible-creator tool can scaffold both standalone roles and collection structures (replacing the older ansible-galaxy init command, which is being phased out), and Molecule supports testing roles within collections natively.
You don't have to choose one or the other. Many teams keep standalone roles during development and package them into a collection for distribution. The role structure itself doesn't change -- collections are a packaging layer, not a rewrite.
Wrapping Up
Scalable Ansible roles aren't about clever YAML tricks -- they're about discipline. Put user-configurable variables in defaults/, internal constants in vars/, and prefix everything with the role name. Define argument_specs so your role validates its inputs and documents its API automatically. Test with Molecule across every platform you claim to support, and run those tests in CI on every pull request. Use meta: flush_handlers to prevent cross-role timing issues, and always validate configuration before restarting a service.
The investment in proper role structure pays compound interest. A role that's tested, documented, and predictable gets reused. A role that's a fragile snowflake gets copied, forked, and eventually abandoned. Build for the former.
Use roles for everything. Even small projects benefit from role structure. It forces you to organize variables, templates, and handlers logically. -- DevToolbox, "Ansible: The Complete Guide for 2026"