Parsing a comma-separated values (CSV) file from the Bash shell can be challenging and error-prone, depending on the complexity of the CSV file. Yet, it is a frequent task in automation shell scripts, or when you need to quickly process and reformat data from a downloaded file.
This post covers how to parse a CSV file with builtin Bash commands, or with the awk command line to handle more complex formats. The solutions presented below can easily be adapted to other file formats such as tab-separated values (TSV) files.
For the examples of this article, I use a CSV file from datahub.io with a list of countries and their two-digit codes (ISO 3166-1). The CSV file contains two fields, Name and Code, with 249 entries plus 1 header line, which makes it a 250-line file.
$ head -n5 countries.csv
Name,Code
Afghanistan,AF
Åland Islands,AX
Albania,AL
Algeria,DZ
...
What is the CSV file format?
A CSV file is a file containing Comma Separated Values. Despite being used for decades and across a large range of software, this file format is actually not a standard file format. There is no formal specification for CSV files which leads to the complexity of parsing them.
Some software may tolerate different complex use cases like the support for multiline fields or some custom delimiters. The closest to a specification are the IETF RFC 4180 and IETF RFC 7111, which specify the IANA media type for CSV as text/csv.
You can find a good overview of a tentative common definition of the CSV format in section 2 of RFC 4180. A high-level summary may be as follows:
- Each record is on a separate line separated by a line break
- The last record in a file may or may not end with a line break
- There may be an optional header line
- Within the header and records, there may be one or more fields separated by a comma (,)
- Each field may or may not be enclosed in double quotes
- Fields should be enclosed in double quotes when containing line breaks, double quotes, or commas
- Fields containing double quotes should escape them with another double quote
A representation of such a file given the above criteria may look like below. The CRLF notation indicates a line break in the CSV file. The record line with row3 represents fields with escaped double quotes, blank space, and a line break.
header_column_1,header_column_2,header_column_3 CRLF
row1_field1,row1_field2,row1_field3 CRLF
row2_field1,row2_field2,row2_field3 CRLF
"row3""field1","row3 field2","row3 CRLF
field3" CRLF
row4_field1,row4_field2,row4_field3
When trying to parse a CSV file in bash, it is important to understand the data source origin and whether you should expect to support complex formats. In some cases, you may have no other choice but to use an external library to support some of the more complex formatting.
How to parse a CSV file in Bash?
Using Bash Builtins
To loop over our sample data, the easiest way is to read over the file and use the Internal Field Separator (IFS). You can read more about the IFS variable in my post on bash environment variables. To read each line of the CSV file, you can use the builtin command read, which reads a line from the standard input and splits it into fields, assigning each word to a variable. The -r option prevents backslashes (\) from escaping any characters. Without the -r option, unescaped backslashes would be discarded instead of being represented as literal characters.
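As a quick, made-up illustration of the difference (the sample line below is not from the dataset):
printf 'back\\slash,XX\n' | while IFS=, read name code; do
    echo "without -r: $name"   # prints "backslash", the backslash is eaten
done

printf 'back\\slash,XX\n' | while IFS=, read -r name code; do
    echo "with -r: $name"      # prints "back\slash", the backslash is kept
done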
Note that read will need a variable name for each field that you want to capture, and the last one specified will simply be a catch-all for the remaining fields.
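For instance, here is a small sketch with made-up data showing the catch-all behavior of the last variable:
echo "a,b,c,d" | while IFS=, read -r first second rest; do
    # "rest" swallows everything left on the line, including the commas
    echo "first=$first second=$second rest=$rest"
done
# prints: first=a second=b rest=c,d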
📑 This method is recommended only for simple CSV files with no text fields containing extra comma (,) delimiters or line breaks. For more complex CSV support, see the next section to parse CSV with awk.
Below is a simple example with IFS set to the comma (,) field separator of the CSV format, and read set with the two expected field names name and code, which are accessible inside the while loop as the variables $name and $code.
while IFS=, read -r name code; do
# do something... Don't forget to skip the header line!
[[ "$name" != "Name" ]] && echo "$name"
done < countries.csv
⚠️ There is a catch with this methodology though. It doesn’t support the full CSV specification and won’t work as you would expect with the given dataset. If you look carefully at the output data, some of it returns incomplete values, as some fields in the CSV file are text fields that contain the comma (,) separator and are enclosed in double quotes (").
...
United States
United States Minor Outlying Islands
Uruguay
Uzbekistan
Vanuatu
"Venezuela
Viet Nam
"Virgin Islands
"Virgin Islands
Wallis and Futuna
Western Sahara
...
You can figure out how many bad entries we have with another while loop, a simple regex, and a counter using Arithmetic Expansion.
📑 If you are not familiar with doing simple math in Bash, check the post Performing Math Calculation In Bash. The example below uses the compound command to increment the count variable.
count=0
while IFS=, read -r name code; do
# do something...
[[ "$code" == *","* ]] && echo "$name $code" && ((++count))
done < countries.csv; \
echo ">> we found ${count} bad entries"
"Bolivia Plurinational State of",BO
"Bonaire Sint Eustatius and Saba",BQ
"Congo the Democratic Republic of the",CD
"Iran Islamic Republic of",IR
"Korea Democratic People's Republic of",KP
"Korea Republic of",KR
"Macedonia the Former Yugoslav Republic of",MK
"Micronesia Federated States of",FM
"Moldova Republic of",MD
"Palestine State of",PS
"Saint Helena Ascension and Tristan da Cunha",SH
"Taiwan Province of China",TW
"Tanzania United Republic of",TZ
"Venezuela Bolivarian Republic of",VE
"Virgin Islands British",VG
"Virgin Islands U.S.",VI
>> we found 16 bad entries
Over 6% of the entries will return incomplete data. So, unless you are sure not to have such text fields, I would not recommend using this first method.
📑 This example used a Bash If Statement construct. You can read more about the use of the double square brackets [[ in my post How To Script Powerful Bash If Statement?.
Using the awk Command Line
Awk is a domain-specific language designed for text processing. It is available on most Unix-like systems; unfortunately, there may be a lot of variation between implementations and versions. In our example, we will use the powerful GNU awk (gawk), which is probably the most complete implementation of awk.
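If you want to confirm which implementation is installed on your system, checking the version string is a reasonable first step (on some distributions, GNU awk is also available as plain awk):
# Print the version of the installed GNU awk, e.g. "GNU Awk 5.1.0, ..."
gawk --version | head -n 1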
📑 This method is recommended for complex CSV files with no text fields containing newline characters like \n or \r. If you need to support fields with line breaks, then see the next section about full CSV specification support.
With the same countries.csv dataset as in our first example, we are now going to parse our CSV with an implementation using the gawk Field Pattern variable (FPAT). We will be careful to consider that fields are separated by commas (,) while ignoring the commas inside fields surrounded by double quotes ("). The FPAT = "([^,]+)|(\"[^\"]+\")" definition can be broken down as follows:
- The first part ([^,]+) matches any field made of one or more characters other than the comma (,), OR
- The second part (\"[^\"]+\") matches any field enclosed in double quotes (") and made of one or more characters other than the double quote.
Below is an example implementation using gawk to parse a CSV file using FPAT.
gawk '
BEGIN {
FPAT = "([^,]+)|(\"[^\"]+\")"
count=0
}
{
if ($1 != "Name") { # Do not forget to skip the header line!
printf("%s\n", $1)
++count
}
}
END {
printf("Number of entries: %s\n", count)
}
' countries.csv
Success! We are now getting all the country names properly populated.
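Note that fields enclosed in double quotes keep those quotes as part of the field value. If you also want them stripped, below is a sketch of a variation of the same FPAT approach that removes the enclosing quotes with gsub before printing both the code and the name fields.
gawk '
BEGIN {
    FPAT = "([^,]+)|(\"[^\"]+\")"
}
NR > 1 { # skip the header line
    # Remove the enclosing double quotes from the Name field, if any
    gsub(/^"|"$/, "", $1)
    printf("%s => %s\n", $2, $1)
}
' countries.csv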
⚠️ This approach still does not support the complete CSV specification. If your text fields contain line breaks or other oddities, then this parsing will fail. It may be fine in most cases when the content format is known in advance, but if some fields contain manual user entries, then you may be at risk of errors.
Advanced CSV support
There is no simple way to fully support CSV with just bash builtins or awk, given the numerous CSV specifications and implementations. Despite the wide adoption of this format, it is complex to properly support the many edge cases. To get full CSV support from your shell script, you will need to use a more advanced solution.
A first alternative to parse complex CSV files from a shell script is to use csvkit. For example, you can process the data with csvkit to turn it into a JSON format, and then do more advanced work with a tool like jq, a lightweight and flexible command-line JSON processor. csvkit provides multiple command-line utilities to import, export, parse, sort, join, clean, and format CSV files. The official tutorial is quite complete.
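As a quick illustration, assuming csvkit and jq are installed, the sketch below uses the csvcut and csvjson utilities from csvkit on our sample file:
# Extract the Name column with proper handling of quoted fields,
# then drop the header line
csvcut -c Name countries.csv | tail -n +2

# Convert the CSV to JSON and query it with jq
csvjson countries.csv | jq -r '.[] | "\(.Code) => \(.Name)"'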
Another option is to use the standard csv module in Python. It is fairly straightforward to implement. You can use the reader or DictReader class from the Python csv module. Alternatively, if you don’t want to implement everything in Python, you can just preprocess your CSV files and clean the fields to ensure they are formatted the way you expect. Then you can still process the clean CSV output with bash or awk, as in our previous examples.
#!/usr/bin/env python3
# csv-reader.py: Example of CSV parsing in python
import csv
with open('countries.csv', 'r') as csvfile:
reader = csv.DictReader(csvfile)
for row in reader:
print('Country code', row['Code'], 'is for', row['Name'])
Below is an example output of the csv-reader.py script on the sample CSV file with country names and codes.
[me@linux: ~]$ python3 csv-reader.py | head -n 5
Country code AF is for Afghanistan
Country code AX is for Åland Islands
Country code AL is for Albania
Country code DZ is for Algeria
Country code AS is for American Samoa