If you spend time on Linux dealing with data -- logs, exports, pipeline outputs, intelligence feeds -- you are constantly handling CSV files. Spreadsheet applications are the wrong tool for this. They corrupt delimiters, mangle encoding, and cap out at a million rows. The shell is the right tool, and the ecosystem around CSV processing on Linux is richer than many people realize.
This article covers the full range: the coreutils primitives you already have, the Python-based csvkit that most sysadmins eventually reach for, and the modern Rust-based tools -- qsv and xan -- that process multi-gigabyte files as fast as your disk can feed them. It also covers q, which lets you run SQL directly against CSV files from the command line. Along the way there are notes on where each tool fits, and where it breaks down. If you are already using Python for Linux system administration, csvkit slots cleanly into those workflows.
Why Not Just Use awk?
The honest answer is: for simple CSV, awk works fine. If your data has no quoted fields, no embedded commas, no embedded newlines, and uses a single consistent delimiter, awk -F',' will get you where you need to go. It is fast, universal, and already installed on every Linux box you will ever touch.
The problem is that real-world CSV is rarely that clean. RFC 4180 -- the closest thing CSV has to a standard -- allows fields to contain commas, newlines, and double-quote characters, as long as the field itself is enclosed in double quotes. awk does not understand this. The moment someone exports data with a field like "New York, NY", your field splitting breaks silently. You get wrong output with no error, which is the worst kind of wrong.
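The failure is easy to reproduce with a two-line file (`demo.csv` is a throwaway name):

```sh
# A quoted field with an embedded comma: awk splits it anyway
printf 'city,state\n"New York, NY",NY\n' > demo.csv

# Ask awk for the second field of each row
awk -F',' '{print $2}' demo.csv
# Row 1 prints "state" as expected; row 2 prints ` NY"` --
# the quoted field was split at the embedded comma, with no error
```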
Piping CSV through awk, cut, or sed is safe only when you fully control the data and can guarantee no quoted fields exist. For data from external sources -- API exports, database dumps, user-generated content -- always use a tool that parses RFC 4180 correctly.
That said, awk and cut are worth knowing for the cases where they do apply. They are zero-dependency, available everywhere, and extremely fast for the simple cases they cover. A quick cut -d',' -f1,3 to pull columns 1 and 3 from a clean file is much faster to type than installing a toolkit.
```sh
# Extract columns 1 and 3 (only safe for unquoted CSV)
$ cut -d',' -f1,3 data.csv

# Print the second field of every line with awk
$ awk -F',' '{print $2}' data.csv

# Count rows (subtract 1 for header)
$ wc -l < data.csv

# Sort by the second column, skip header
$ { head -1 data.csv; tail -n +2 data.csv | sort -t',' -k2; }
```
csvkit: The Python Workhorse
csvkit is a suite of command-line utilities for working with CSV, written in Python. It has been the standard recommendation for CSV work on Linux for well over a decade, and for good reason: it is RFC 4180-correct, has a clean interface, handles type inference, and ships with enough subcommands to cover most day-to-day needs without scripting.
Install it with pip: `pip install csvkit` (on recent Debian/Ubuntu releases with PEP 668 protections, use a virtual environment or add `--break-system-packages`).
The main commands you will use are csvcut for column selection, csvgrep for row filtering, csvstat for summary statistics, csvjoin for joining two files on a key column, csvsort for ordering, and csvsql for running SQL queries against a CSV file or loading it into a database. csvlook renders a file as a formatted table in the terminal, which is useful for quick inspection.
```sh
# Show column headers and their positions
$ csvcut -n data.csv

# Select columns by name
$ csvcut -c id,username,email data.csv

# Filter rows where 'status' equals 'active'
$ csvgrep -c status -m active data.csv

# Regex filter: rows where email ends in .gov
$ csvgrep -c email -r '\.gov$' data.csv

# Summary statistics for all columns
$ csvstat data.csv

# Render as a pretty table
$ csvlook data.csv | head -20

# Run a SQL query directly against the file
$ csvsql --query "SELECT username, email FROM data WHERE status = 'active'" data.csv

# Join two files on a shared column
$ csvjoin -c user_id users.csv orders.csv
```
csvkit also ships with in2csv, which converts other formats -- Excel, JSON, fixed-width -- into CSV. This makes it particularly useful when data arrives in .xlsx format and you need it in a pipeline.
csvkit is Python, which means it loads the entire file into memory for most operations. For files under a few hundred megabytes this is not a problem. For multi-gigabyte files, you will want qsv or xan instead.
q: SQL Directly on CSV Files
The q tool takes a different approach: instead of giving you CSV-specific subcommands, it lets you write standard SQL SELECT statements and treats CSV files as tables. If you already know SQL, the learning curve is nearly zero.
q is distributed as a compiled standalone binary with no Python dependency. Download the appropriate package from the project's GitHub releases page and place the binary somewhere on your PATH, such as /usr/local/bin.
```sh
# Basic SELECT with header detection (-H) and comma delimiter (-d ,)
$ q -H -d , "SELECT username, email FROM data.csv WHERE status = 'active'"

# Aggregate: count rows grouped by status
$ q -H -d , "SELECT status, COUNT(*) as n FROM data.csv GROUP BY status ORDER BY n DESC"

# JOIN two CSV files
$ q -H -d , "SELECT u.username, o.amount FROM users.csv u JOIN orders.csv o ON u.id = o.user_id"

# Write a cache to disk so the next query on the same file is faster
$ q -H -d , -C readwrite "SELECT * FROM data.csv WHERE amount > 1000"
$ q -H -d , -C read "SELECT COUNT(*) FROM data.csv"
```
The caching mechanism is genuinely useful when you are running multiple exploratory queries against the same large file. On the first run with -C readwrite, q parses the CSV and stores a cache alongside it. Subsequent queries with -C read skip parsing entirely and hit the cache, which is dramatically faster on files of any significant size.
q is the right tool when your task is analytical -- you are trying to answer a question about the data -- and you think in SQL rather than in Unix pipelines.
qsv: The Rust Powerhouse
qsv (pronounced "Quicksilver") is a fork and heavy expansion of the original xsv toolkit, rewritten and extended in Rust. Where csvkit is a toolkit for day-to-day use on moderate-sized files, qsv is engineered for production data pipelines, automated processing, and files that are simply too large for Python-based tools to handle efficiently.
Most qsv commands are streaming: they process the input row by row and use constant memory regardless of file size. The quickest install on a Debian/Ubuntu system is to grab the latest release binary directly:
```sh
# Download the latest release (adjust version and arch as needed)
$ curl -L https://github.com/jqnatividad/qsv/releases/latest/download/qsv-x86_64-unknown-linux-musl.zip \
    -o qsv.zip
$ unzip qsv.zip qsv -d /usr/local/bin/
$ chmod +x /usr/local/bin/qsv

# Or via cargo if you prefer to build from source
$ cargo install qsv --locked --features all_features
```
Once installed, run qsv --list to see the full set of available subcommands -- there are over fifty of them at this point. The core ones map to familiar operations: headers, stats, select, search, filter, sort, dedup, join, split, slice, and frequency. Beyond those, qsv adds operations you rarely find elsewhere: validate (check a CSV against a JSON Schema), schema (infer a JSON Schema from a CSV), excel (convert directly from xlsx/xls), and luau (apply Lua transformations per-row).
```sh
# Show headers
$ qsv headers data.csv

# Descriptive statistics for all columns
$ qsv stats data.csv

# Select specific columns
$ qsv select id,username,email data.csv

# Search for rows where 'status' column matches a regex
$ qsv search -s status '^active$' data.csv

# Sort by a column (descending) -- loads into memory; use extsort for huge files
$ qsv sort -s amount --reverse data.csv

# Remove duplicate rows (by all columns)
$ qsv dedup data.csv

# Build a frequency table for the 'country' column
$ qsv frequency -s country data.csv

# Convert an Excel file to CSV
$ qsv excel report.xlsx

# Infer a JSON Schema from the data (writes data.csv.schema.json automatically)
$ qsv schema data.csv

# Validate a CSV against that schema
$ qsv validate data.csv data.csv.schema.json
```
qsv's stats command is notably fast because it builds an index on first run and can operate in parallel on subsequent runs. The same stats operation that takes minutes in Python runs in seconds against a multi-gigabyte file.
Most qsv commands that touch all rows run in time proportional to file size -- unavoidable when the data must be read. But qsv index data.csv writes a data.csv.idx file rapidly and enables a different class of operation. qsv slice with an index is constant time regardless of file size -- extracting any range of rows from a 15 GB file takes the same time as extracting row 1. qsv stats, qsv frequency, and qsv schema with an index can exploit all available CPU cores in parallel. According to the qsv documentation, indexing the 15 GB, 28-million-row NYC 311 dataset takes 14 seconds. The same stats operation that runs slowly as a full sequential scan completes in seconds with the index. Index files are invalidated automatically when the source file is modified, so there is no stale-index risk in pipelines that overwrite files.
```sh
# Build the index (done once; invalidated automatically on file change)
$ qsv index network_flows.csv

# Extract rows 5000000-5001000 in constant time regardless of file size
$ qsv slice --start 5000000 --len 1000 network_flows.csv | qsv table

# Parallel stats across all columns -- exploits all CPU cores with index
$ qsv stats --everything network_flows.csv | qsv table

# Sample 10000 random rows for quick exploratory analysis
$ qsv sample 10000 network_flows.csv | xan view
```
qsv's luau command deserves special mention. It embeds the Luau scripting language (the fast, statically typed dialect of Lua used by Roblox) and lets you apply arbitrary per-row transformations without leaving the pipeline. This is the escape hatch for operations that do not fit neatly into any built-in subcommand.
```sh
# Add a new 'full_name' column by concatenating first and last
$ qsv luau map full_name 'first_name .. " " .. last_name' data.csv

# Compute a normalized score column
$ qsv luau map score_norm 'tonumber(score) / 100.0' data.csv
```
xan: The CSV Magician
xan is a newer Rust-based CSV toolkit developed by the médialab at Sciences Po Paris. It started as a fork of xsv but has been almost entirely rewritten to fit the lab's use cases in social science research and web data analysis. The xsv project's own README now recommends xan as one of its two successors.
What sets xan apart from qsv is its expression language -- called Moonblade -- and its SIMD-accelerated CSV parser. Moonblade is a minimalistic language designed specifically for CSV row transformations. It is faster than evaluating Python, Lua, or JavaScript per-row, and its syntax is concise enough to use directly on the command line without a separate script file.
Install via Cargo with cargo install xan --locked.
Or, on NixOS/Nix, xan is available in nixpkgs as of the 25.05 release. Pre-built binaries are also attached to each GitHub release.
```sh
# View a CSV as a formatted table in the terminal
$ xan view data.csv

# Show headers
$ xan headers data.csv

# Filter rows using the Moonblade expression language
$ xan filter 'status == "active"' data.csv

# Add a computed column (expression first, then new column name)
$ xan map 'first_name + " " + last_name' full_name data.csv

# Group by a column and sum another
$ xan groupby country 'sum(amount) as total' data.csv | xan view

# Join two files on matching column names
$ xan join user_id users.csv orders.csv

# Draw a terminal bar chart from a frequency table
$ xan frequency -s country data.csv | xan hist

# Read a gzip-compressed CSV directly without decompressing first
$ xan stats data.csv.gz
```
xan's native gzip support is genuinely convenient: it reads .gz and .zst compressed files transparently, which matters when your CSV data comes from pipelines that compress everything by default. It also supports a wide range of adjacent formats: web archival CDX files, bioinformatics formats (VCF, GTF, SAM, BED), and direct conversion to and from JSON, Excel, and NumPy arrays via xan to and xan from.
The Moonblade expression language is xan's strongest differentiator. It has a proper cheatsheet accessible via xan help cheatsheet, a full function reference via xan help functions, and aggregation functions via xan help aggs. For teams doing complex data transformation at the command line, this is significantly more ergonomic than reaching for a Python script.
xan and qsv are both excellent tools. qsv has a larger command set and stronger focus on data pipeline automation, including schema validation and self-updating binaries. xan has a more expressive inline language and is particularly well suited to exploratory analysis and complex transforms. Neither is strictly better -- they target slightly different workflows.
What Happened to xsv?
If you have been doing CSV work on Linux for a few years, you have almost certainly encountered xsv -- the Rust CSV toolkit by Andrew Gallant (BurntSushi) that was the standard recommendation from roughly 2015 through the early 2020s. xsv set the template for what a fast, streaming, composable CSV toolkit should look like, and it remains installable today via cargo install xsv.
The project is now in maintenance mode. The xsv README explicitly points readers toward its two successors: qsv (the fork focused on production pipelines and breadth of features) and xan (the fork focused on expression-language transforms and exploratory analysis). Both started from xsv's codebase and share its core architecture, so the transition from xsv is smooth -- most subcommand names are identical or close.
If you have existing shell scripts that call xsv, substituting qsv will work for the majority of commands without modification. The main differences are that qsv has renamed a small number of flags for consistency and added commands that did not exist in xsv. Run qsv --list and compare against your xsv usage to find anything that needs adjusting.
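Before migrating, a quick inventory of call sites helps. This sketch assumes your scripts live under ~/bin; point the path at wherever they actually are:

```sh
# List every line in your shell scripts that invokes xsv,
# with file name and line number for each match
grep -rnw --include='*.sh' 'xsv' ~/bin
```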
Building Real Pipelines
The power of all these tools comes from composing them via Unix pipes. They all read from stdin and write to stdout by default, which means they chain cleanly. Here are some practical pipeline patterns.
Process multiple files at once
Sysadmins rarely deal with a single CSV. Log rotation produces dozens of daily exports; vulnerability scanners write one file per host. The standard pattern is shell globbing into a pipeline, but the details matter when headers are involved -- you typically want to include the header from the first file only and strip it from all subsequent files. If you are already automating Linux server tasks with cron jobs and shell scripts, these CSV pipeline patterns compose directly with that work.
```sh
# Concatenate multiple CSVs with consistent headers into one file
$ qsv cat rows logs/access-*.csv > combined.csv

# csvkit equivalent (handles header deduplication automatically)
$ csvstack logs/access-*.csv > combined.csv

# Run stats across all CSV files in a directory
$ for f in scans/*.csv; do echo "=== $f ==="; qsv stats "$f" | qsv table; done

# With xan: concatenate a glob (header-aware) and pipe into a single view
$ xan cat rows logs/*.csv | xan dedup | xan frequency -s severity | xan view
```
qsv cat rows is the right tool for concatenating same-schema files: it streams each file in order and handles the header correctly by default. csvstack does the same within csvkit and additionally validates that all input files share the same column names, raising an error if they diverge -- useful when you cannot guarantee schema consistency across files.
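If neither toolkit is installed, the same header-from-first-file pattern can be expressed with coreutils alone. This is a sketch that assumes the files share a schema and contain no quoted fields with embedded newlines:

```sh
# Take the header from the first matching file, then append
# only the data rows (everything after line 1) from every file
set -- logs/access-*.csv
head -1 "$1" > combined.csv
for f in "$@"; do
    tail -n +2 "$f" >> combined.csv
done
```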
In practice, log exports collected over weeks or months often exhibit schema drift: a column gets renamed upstream, a new field is added, or columns are reordered when a different team generates the export. Concatenating files with different schemas silently will corrupt your data. The correct approach is to normalize column order and selection before stacking:
```sh
# Inspect what headers each file actually has before concatenating
$ for f in logs/*.csv; do echo "=== $f ==="; qsv headers "$f"; done

# Select only the columns you need in a consistent order, then stack:
# output the first file with its header, then remaining files without
$ first=1
$ for f in logs/*.csv; do
    if [ "$first" = "1" ]; then
      qsv select date,src_ip,dst_port,action "$f"
      first=0
    else
      qsv select date,src_ip,dst_port,action "$f" | tail -n +2
    fi
  done > normalized.csv

# csvkit approach: csvstack will error on schema mismatch (useful as a guard)
# Use --skipinitialspace to handle sloppy upstream formatting
$ csvstack --skipinitialspace logs/*.csv > combined.csv 2> schema-errors.txt
$ [ -s schema-errors.txt ] && echo "Schema drift detected:" && cat schema-errors.txt
```
Filter, transform, and summarize in one pass
```sh
# Select columns, filter rows, sort, output as pretty table
$ csvcut -c date,user,amount data.csv \
    | csvgrep -c amount -r '^[0-9]{4,}' \
    | csvsort -c amount -r \
    | csvlook | head -30
```
Process a large file with qsv and stream output to another tool
```sh
# Select rows, deduplicate, sort, and write to a new file
# Note: qsv sort loads into memory; use qsv extsort for files larger than RAM
$ qsv search -s status '^active$' users.csv \
    | qsv dedup \
    | qsv sort -s created_at \
    > active_users_sorted.csv

# Frequency table for top 20 countries, piped to less
$ qsv frequency -s country --limit 20 data.csv | qsv table | less
```
Combine xan transforms with standard Unix tools
```sh
# Compute a new column, filter, then count with wc
$ xan map 'round(score / max_score * 100, 1)' score_pct data.csv \
    | xan filter 'score_pct >= 90' \
    | wc -l

# Convert from gzipped newline-delimited JSON to CSV, then analyze
$ xan from -f ndjson events.jsonl.gz | xan groupby event_type 'count() as n' | xan view
```
Delimiters, TSV, and Headerless Files
Not everything that looks like CSV uses commas. Tab-separated values (TSV) from database exports, pipe-delimited files from legacy systems, and semicolon-separated exports from European spreadsheet applications are all common in practice. Every tool covered here handles alternate delimiters, but the syntax differs.
```sh
# csvkit: use -d to set delimiter, -t shorthand for TSV
$ csvcut -t -c 1,3 data.tsv
$ csvstat -d '|' data.psv

# qsv: -d flag for delimiter (use $'\t' so the shell passes a real tab byte)
$ qsv stats -d $'\t' data.tsv
$ qsv select -d ';' id,name,amount data.csv

# xan: --delimiter flag
$ xan view --delimiter $'\t' data.tsv

# q: -d flag for delimiter (-H needed so 'status' resolves as a column name)
$ q -H -d $'\t' "SELECT * FROM data.tsv WHERE status = 'active'"
```
Headerless files are a separate problem. Many system log exports and legacy pipeline outputs omit a header row entirely. Again, each tool has a convention:
```sh
# csvkit: --no-header-row (-H); generated column names are a, b, c...
$ csvcut --no-header-row -c 1,3 data.csv
$ csvgrep --no-header-row -c 2 -m active data.csv

# qsv: --no-headers flag
$ qsv stats --no-headers data.csv
$ qsv select --no-headers 1,3 data.csv

# xan: --no-headers flag
$ xan filter --no-headers 'col(2) == "active"' data.csv

# q: omit -H (without header detection, q names the columns c1, c2...)
$ q -d , "SELECT c1, c3 FROM data.csv WHERE c2 = 'active'"
```
One csvkit detail worth knowing: -H is simply the short form of --no-header-row. It tells the tool the input lacks a header, and csvkit generates placeholder column names (a, b, c, and so on) that you can reference with -c. It does not suppress the header in output -- the generated names appear as a header row in the result, which surprises people piping csvkit output into other tools.
Choosing the Right Tool
The decision tree is not complicated. Use awk or cut for simple, unquoted CSV when you know the data is clean and you want zero dependencies. Use csvkit when you need a reliable, batteries-included toolkit for moderate-sized files and you want Python's type inference and in2csv format conversion. Use q when your task is analytical and you think in SQL. Use qsv when you are building automated pipelines, handling large files, or need schema validation. Use xan when you need complex per-row transformations, exploratory analysis in the terminal, or are working with compressed or adjacent data formats.
| Tool | Language | RFC 4180 | Large files | SQL queries | Per-row transforms | Compressed input | Zero install |
|---|---|---|---|---|---|---|---|
| awk / cut | C (coreutils) | no | yes | no | limited | no | yes |
| csvkit | Python | yes | ~500 MB | yes (csvsql) | limited | no | no |
| q | binary | yes | with cache | yes (native) | no | no | no |
| qsv | Rust | yes | yes (streaming) | yes (luau) | yes (luau) | via pipe | no |
| xan | Rust | yes | yes (SIMD) | via groupby | yes (Moonblade) | yes (.gz/.zst) | no |
These tools are not mutually exclusive in a pipeline. It is entirely reasonable to convert a file with in2csv, validate it with qsv validate, run a quick analytical query with q, and render the result with xan view -- each tool doing what it does best.
Both qsv and xan require UTF-8 input. The qsv-native way to transcode is qsv input --output utf8.csv latin1.csv -- the input command handles transcoding and replaces invalid UTF-8 sequences with the replacement character (U+FFFD) rather than aborting. For more control over what happens to unmappable bytes, iconv is the standard fallback.

A single iconv call handles files with a uniform non-UTF-8 encoding, but production data frequently contains mixed encoding -- some rows latin1, some UTF-8, some Windows-1252, all in the same file. In that case iconv will abort at the first byte it cannot map. The more robust approach is uchardet data.csv to detect the dominant encoding, then iconv -f $(uchardet data.csv) -t UTF-8 -c data.csv > clean.csv. The -c flag silently drops unconvertible characters rather than failing, preserving throughput. For pipelines where encoding errors must be logged rather than discarded, redirect stderr: iconv -f WINDOWS-1252 -t UTF-8 data.csv > clean.csv 2>encoding-errors.log.
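As a runnable sketch (assuming the input really is Latin-1 throughout; in practice substitute the encoding uchardet reports, and data.csv / clean.csv are placeholder names):

```sh
# Transcode a Latin-1 CSV to UTF-8; -c silently drops any bytes
# that cannot be mapped instead of aborting the whole conversion
iconv -f LATIN1 -t UTF-8 -c data.csv > clean.csv
```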
CSV in a Security Context
CSV files show up constantly in security work: exported firewall logs, SIEM query results, vulnerability scanner outputs, asset inventories, and IOC feeds. Knowing your way around these tools pays dividends when you are triaging a large log export or correlating data across multiple sources at the command line. The volume of exported log data is often driven by systemd-journald configuration -- understanding how journald manages retention and rotation will help you predict the size and frequency of log exports your pipelines need to handle.
Malformed files and error output
Externally sourced CSV is frequently malformed: mismatched quote characters, rows with the wrong field count, encoding issues, or embedded newlines that confuse row detection. Understanding how each tool surfaces these problems saves time when a pipeline silently produces wrong output.
```sh
# qsv validate: check RFC 4180 compliance
# Writes invalid rows to data.invalid, valid rows to data.valid; exits non-zero on failure
$ qsv validate data.csv

# Infer a JSON Schema from a reference file (writes data.csv.schema.json)
$ qsv schema data.csv

# Validate incoming file against the committed schema
# Invalid rows go to incoming.invalid, valid rows to incoming.valid
$ qsv validate incoming.csv data.csv.schema.json
$ qsv count incoming.invalid              # how many rows failed
$ qsv stats incoming.invalid | qsv table  # inspect the bad rows

# Check field count consistency with awk (fast pre-check, no dependencies)
$ awk -F',' 'NF != 7 {print NR": "NF" fields"}' data.csv | head -20

# csvclean: csvkit's repair tool -- writes data_out.csv and data_err.csv
$ csvclean data.csv
```
qsv validate exits with a non-zero status code on any error, which makes it usable as a hard gate in automated pipelines. When validation fails, it writes invalid rows to <input>.invalid and valid rows to <input>.valid — both are plain CSV files you can process immediately with the same toolchain. A practical pattern for a cron-driven ingestion pipeline:
```sh
# Step 1: Infer a schema from a known-good reference file
# (run once, commit the .schema.json)
# qsv schema writes <input>.schema.json automatically; no --stdout flag needed
$ qsv schema reference.csv   # produces: reference.csv.schema.json

# Step 2: Validate incoming file against that schema
# On failure, writes invalid rows to incoming.invalid and valid rows to incoming.valid
$ qsv validate incoming.csv reference.csv.schema.json
if [ $? -ne 0 ]; then
    invalid_count=$(qsv count incoming.invalid 2>/dev/null || echo "unknown")
    echo "ALERT: $invalid_count invalid rows in incoming.csv" \
        | mail -s "CSV validation failed" ops@example.com
    # Process the valid portion immediately -- don't discard it
    mv incoming.valid incoming_clean.csv
    exit 1
fi

# Step 3: Inspect invalid rows to distinguish schema drift from data quality problems
$ qsv stats incoming.invalid | qsv table
$ csvlook incoming.invalid | head -20
```
The distinction between schema drift and data quality problems matters for remediation routing. Schema drift — an upstream column being renamed, reordered, or dropped — will cause many rows to fail in a consistent pattern visible in qsv stats incoming.invalid. Individual data quality problems like a numeric column receiving free-text values will scatter across random rows. Separating these lets you route them to different teams without reading every failure row manually.
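A zero-dependency way to see the distinction is a field-count histogram with awk (naive comma splitting, so quoted commas will inflate counts -- treat it as triage, not truth):

```sh
# Tally rows by field count: one dominant wrong count suggests schema drift,
# while scattered counts suggest row-level data quality problems
awk -F',' '{c[NF]++} END {for (n in c) print n " fields: " c[n] " rows"}' incoming.csv
```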
csvclean takes a different approach within csvkit: it repairs what it can and writes clean rows to data_out.csv and bad rows to data_err.csv, which is useful when you want to process the good portion of a file without discarding the whole thing.
It is worth noting that CSV itself is not an inert format from a security perspective. Commercial surveillance vendors and exploit chain operators have long used data formats including CSV as delivery vectors -- particularly CSV injection, where a cell value beginning with =, +, -, or @ can trigger formula execution when opened in a spreadsheet application. The less-cited vectors are DDE payloads (tab-prefixed commands that Excel and LibreOffice will prompt to execute) and row injection via embedded \r characters, which cause a single cell value to be parsed as multiple rows by applications that handle line endings inconsistently. When generating CSV output that might be imported into Excel or LibreOffice Calc, sanitizing at the command line before the file leaves the server is the right point of control. CSV exports from asset inventories and user databases are particularly sensitive here -- if you need to audit Linux user permissions and export the results, sanitizing the output before it reaches a spreadsheet application is an important step.
```sh
# Prefix dangerous leading characters with a single quote using qsv luau
# Covers =, +, -, @, tab (DDE), and \r (row injection)
# In Luau patterns, \9 is the tab character; \r is the carriage return literal
$ qsv luau map username \
    'if string.match(username, "^[=+%-@\9\r]") then "\x27" .. username else username end' \
    users.csv > users_safe.csv

# Simpler approach with xan: flag rows where any key field starts with injection chars
# Moonblade uses function syntax: match(column, pattern)
$ xan filter 'match(username, "^[=+\\-@]") or match(email, "^[=+\\-@]")' \
    users.csv | xan view

# Strip \r from all fields (safe for pure terminal use)
$ qsv fmt --quote-always data.csv | sed 's/\r//g' > data_clean.csv
```
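Where qsv and xan are unavailable, a rough coreutils-only approximation can be sketched with awk. This uses naive comma splitting, so it is safe only for unquoted input; the quote prefix is the same convention spreadsheets use to force a cell to be treated as text:

```sh
# Prefix fields starting with =, +, @, or - with a single quote so spreadsheet
# applications treat them as text rather than formulas (unquoted CSV only)
awk -F',' -v q="'" 'BEGIN { OFS = "," }
    { for (i = 1; i <= NF; i++) if ($i ~ /^[=+@-]/) $i = q $i; print }' data.csv
```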
Similarly, if your CSV tooling is pulling data from web sources or external APIs, be aware that maliciously crafted responses can exploit parsing edge cases. Browser-level delivery of crafted data has been a reliable attack surface -- as illustrated by browser exploits delivered via crafted HTML pages. The same principle applies to any parser that consumes external input: CSV parsers that are not strictly RFC 4180 compliant can misbehave on adversarially crafted input.
For a broader understanding of how commercial surveillance vendors operate, and for context on how zero-day vulnerabilities in browser engines factor into modern threat actor toolchains, those links are worth reading alongside any work you are doing with data that crosses trust boundaries.
How to Set Up CSV Tooling on Linux
Step 1: Install a dedicated CSV toolkit
Install csvkit with pip install csvkit --break-system-packages, or install qsv by downloading the prebuilt binary from the GitHub releases page and placing it on your PATH. For xan, use cargo install xan --locked or install via your distribution's package manager if available.
Step 2: Inspect and filter your CSV data
Use csvkit's csvstat to get a column summary, or qsv stats to get descriptive statistics. To filter rows by a column value, use csvkit's csvgrep -c column_name -m value file.csv, or with qsv use qsv search -s column_name pattern file.csv.
Step 3: Join and transform CSV files in a pipeline
Chain commands together using Unix pipes. For example, use csvcut to select columns, pipe the result into csvgrep to filter rows, then pipe into csvstat for a summary. With qsv or xan, compose commands the same way: slice, filter, and aggregate in a single pipeline without writing intermediate files.
Step 4: Validate and clean malformed CSV
Use qsv validate to check RFC 4180 compliance -- it exits non-zero on any error and writes invalid rows to <input>.invalid and valid rows to <input>.valid, both as plain CSV. To validate against expected data types, first run qsv schema data.csv to infer a JSON Schema (written automatically to data.csv.schema.json), then run qsv validate data.csv data.csv.schema.json against subsequent files. For csvkit-based repair, csvclean writes clean rows to data_out.csv and error rows to data_err.csv.
Frequently Asked Questions
What is the fastest CSV tool for Linux?
For raw throughput on large files, qsv and xan are the fastest options available. Both are written in Rust, use streaming by default, and can exploit CPU parallelism. xan additionally leverages SIMD instructions for its CSV parser. For files under a few megabytes, the performance difference between any of these tools is negligible.
Can I run SQL queries directly on CSV files in Linux?
Yes. The q tool allows you to run standard SQL SELECT statements directly against CSV and TSV files from the command line, treating each file as a table. It supports caching parsed results to disk for repeated queries on the same file. csvkit's csvsql command also supports SQL queries and can push data directly into SQLite or PostgreSQL.
Is awk good enough for CSV processing on Linux?
awk works reliably when your CSV data is simple: no quoted fields, no embedded commas or newlines. The moment a field contains a comma inside quotes, awk's field splitting breaks. For anything beyond clean, unquoted CSV, a dedicated tool like csvkit, qsv, or xan will handle edge cases correctly and save you from writing fragile workarounds.
How do I process TSV or pipe-delimited files with these tools?
All the dedicated tools support alternate delimiters via a flag: -d for qsv and q, --delimiter for xan, and -t (TSV shorthand) or -d for csvkit; pass a literal tab as $'\t' so the shell expands the escape. For awk, set -F'\t'; for cut, the delimiter flag is -d, as in -d'|'. The key is that the underlying RFC 4180 parser in qsv, xan, and csvkit handles quoted fields correctly even with non-comma delimiters, which awk does not.
I have scripts that use xsv. Should I migrate?
xsv is no longer actively developed and its maintainer now points users to qsv and xan. For most scripts, substituting qsv for xsv works without modification -- the core subcommands share names and flags. Check a few edge cases around sort and join flags which had minor changes. xan is less drop-in compatible with xsv scripts but is worth evaluating if you do heavy per-row transformation work. There is no urgency to migrate immediately since xsv still functions correctly, but you will not receive bug fixes or support for parsing edge cases.
How do I concatenate many CSV files into one?
Use qsv cat rows *.csv or csvkit's csvstack *.csv. Both handle header deduplication correctly -- they include the header from the first file and strip it from all subsequent files. csvstack additionally validates that all files share the same column names and will error if they do not, which is useful when schema consistency is not guaranteed. Avoid using plain cat on CSV files: it will concatenate the header rows from every file into the output.