Parsing a comma-separated values (CSV) file from the Bash shell can be challenging and error-prone, depending on the complexity of the CSV file. Yet, it is a frequent task in automation shell scripts, or when you need to quickly process and reformat data from a downloaded file.
This post covers how to parse a CSV file with builtin Bash commands, or with the awk command line to handle more complex formats. The solutions presented below can easily be adapted to other file formats such as tab-separated values (TSV) files.
For the examples of this article, I use a CSV file from datahub.io with a list of countries and their two-digit codes (ISO 3166-1). The CSV file contains two fields, Name and Code, with 249 entries plus 1 header line, which makes it a 250-line file.
$ head -n5 countries.csv
Name,Code
Afghanistan,AF
Åland Islands,AX
Albania,AL
Algeria,DZ
...
What is the CSV file format?
A CSV file is a file containing Comma Separated Values. Despite being used for decades and across a large range of software, this file format is actually not a standard file format. There is no formal specification for CSV files which leads to the complexity of parsing them.
Some software may tolerate different complex use cases like the support for multiline fields or some custom delimiters. The closest to a specification are the IETF RFC 4180 and IETF RFC 7111, which specify the IANA media type for CSV as text/csv.
You can find a good overview of a tentative common definition of the CSV format in section 2 of RFC 4180. A high-level summary may be as follows:
- Each record is on a separate line separated by a line break
- The last record in a file may or may not end with a line break
- There may be an optional header line
- Within the header and records, there may be one or more fields separated by a comma (,)
- Each field may or may not be enclosed in double quotes
- Fields should be enclosed in double quotes when containing line breaks, double quotes, or commas
- Fields containing double quotes should escape them with another double quote
A representation of such a file given the above criteria may look like below. The CRLF notation indicates a line break in the CSV file. The record line with row3 represents fields with escaped double quotes, blank space, and a line break.
header_column_1,header_column_2,header_column_3 CRLF
row1_field1,row1_field2,row1_field3 CRLF
row2_field1,row2_field2,row2_field3 CRLF
"row3""field1","row3 field2","row3 CRLF
field3" CRLF
row4_field1,row4_field2,row4_field3
When trying to parse a CSV file in bash, it is important to understand the data source origin and whether you should expect to support complex formats. In some cases, you may have no other choice but to use an external library to support some of the more complex formatting.
How to parse a CSV file in Bash?
Using Bash Builtins
To loop over our sample data, the easiest way is to read over the file and use the Internal Field Separator (IFS). You can read more about the IFS variable in my post on bash environment variables. To read each line of the CSV file, you can use the builtin command read, which reads a line from the standard input and splits it into fields, assigning each word to a variable. The -r option prevents backslashes (\) from escaping any characters. Without the -r option, unescaped backslashes would be discarded instead of being represented as literal characters.
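As a quick, made-up illustration of the difference (the sample line below is not from the dataset):
printf 'back\\slash,XX\n' | while IFS=, read name code; do
    echo "without -r: $name"   # prints "backslash", the backslash is eaten
done

printf 'back\\slash,XX\n' | while IFS=, read -r name code; do
    echo "with -r: $name"      # prints "back\slash", the backslash is kept
done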
Note that read will need a variable name for each field that you want to capture, and the last one specified will simply be a catch-all for the remaining fields.
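For instance, here is a small sketch with made-up data showing the catch-all behavior of the last variable:
echo "a,b,c,d" | while IFS=, read -r first second rest; do
    # "rest" swallows everything left on the line, including the commas
    echo "first=$first second=$second rest=$rest"
done
# prints: first=a second=b rest=c,d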
📑 This method is recommended only for simple CSV files with no text fields containing extra comma (,) delimiters or line breaks. For more complex CSV support, see the next section to parse CSV with awk.
Below is a simple example with IFS set to the comma (,) field separator of the CSV format, and read set with the two expected field names name and code, which are accessible inside the while loop as the variables $name and $code.
while IFS=, read -r name code; do
# do something... Don't forget to skip the header line!
[[ "$name" != "Name" ]] && echo "$name"
done < countries.csv
⚠️ There is a catch with this methodology though. It doesn’t support the full CSV specification and won’t work as you would expect with the given dataset. If you look carefully at the output data, some of it returns incomplete values, as some fields in the CSV file are text fields that contain the comma (,) separator and are enclosed in double quotes (").
...
United States
United States Minor Outlying Islands
Uruguay
Uzbekistan
Vanuatu
"Venezuela
Viet Nam
"Virgin Islands
"Virgin Islands
Wallis and Futuna
Western Sahara
...
You can figure out how many bad entries we have with another while loop, a simple regex, and a counter using Arithmetic Expansion.
📑 If you are not familiar with doing simple math in Bash, check the post Performing Math Calculation In Bash. The example below uses the compound command to increment the count variable.
count=0
while IFS=, read -r name code; do
# do something...
[[ "$code" == *","* ]] && echo "$name $code" && ((++count))
done < countries.csv; \
echo ">> we found ${count} bad entries"
"Bolivia Plurinational State of",BO
"Bonaire Sint Eustatius and Saba",BQ
"Congo the Democratic Republic of the",CD
"Iran Islamic Republic of",IR
"Korea Democratic People's Republic of",KP
"Korea Republic of",KR
"Macedonia the Former Yugoslav Republic of",MK
"Micronesia Federated States of",FM
"Moldova Republic of",MD
"Palestine State of",PS
"Saint Helena Ascension and Tristan da Cunha",SH
"Taiwan Province of China",TW
"Tanzania United Republic of",TZ
"Venezuela Bolivarian Republic of",VE
"Virgin Islands British",VG
"Virgin Islands U.S.",VI
>> we found 16 bad entries
Over 6% of the entries will return incomplete data. So, unless you are sure not to have such text fields, I would not recommend using this first method.
📑 This example used a Bash If Statement construct. You can read more about the use of the double square brackets [[ in my post How To Script Powerful Bash If Statement?.
Using the awk Command Line
Awk is a domain-specific language designed for text processing. It is available on most Unix-like systems; unfortunately, there may be a lot of variation between implementations and versions. In our example, we will use the powerful GNU awk (gawk), which is probably the most complete implementation of awk.
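If you want to confirm which implementation is installed on your system, checking the version string is a reasonable first step (on some distributions, GNU awk is also available as plain awk):
# Print the version of the installed GNU awk, e.g. "GNU Awk 5.1.0, ..."
gawk --version | head -n 1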
📑 This method is recommended for complex CSV files with no text fields containing newline characters like \n or \r. If you need to support fields with line breaks, then see the next section about full CSV specification support.
With the same countries.csv dataset as in our first example, we are now going to parse our CSV with an implementation using the gawk Field Pattern variable (FPAT). We will be careful to consider that fields are separated by commas (,) while ignoring the commas inside fields surrounded by double quotes ("). The FPAT = "([^,]+)|(\"[^\"]+\")" definition can be broken down as follows:
- The first part ([^,]+) matches any field made of one or more characters other than the comma (,), OR
- The second part (\"[^\"]+\") matches any field enclosed in double quotes (") and made of one or more characters other than the double quote.
Below is an example implementation using gawk to parse a CSV file using FPAT.
gawk '
BEGIN {
FPAT = "([^,]+)|(\"[^\"]+\")"
count=0
}
{
if ($1 != "Name") { # Do not forget to skip the header line!
printf("%s\n", $1)
++count
}
}
END {
printf("Number of entries: %s\n", count)
}
' countries.csv
Success! We are now getting all the country names properly populated.
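Note that fields enclosed in double quotes keep those quotes as part of the field value. If you also want them stripped, below is a sketch of a variation of the same FPAT approach that removes the enclosing quotes with gsub before printing both the code and the name fields.
gawk '
BEGIN {
    FPAT = "([^,]+)|(\"[^\"]+\")"
}
NR > 1 { # skip the header line
    # Remove the enclosing double quotes from the Name field, if any
    gsub(/^"|"$/, "", $1)
    printf("%s => %s\n", $2, $1)
}
' countries.csv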
⚠️ This approach still does not support the complete CSV specification. If your text fields contain line breaks or other oddities, then this parsing will fail. It may be fine in most cases when the content format is known in advance, but if some fields contain manual user entries, then you may be at risk of errors.
Advanced CSV support
There is no simple way to fully support CSV with just bash builtins or awk, given the numerous CSV specifications and implementations. Despite the wide adoption of this format, it is complex to properly support the many edge cases. To get full CSV support from your shell script, you will need to use a more advanced solution.
A first alternative to parse complex CSV files from a shell script is to use csvkit. For example, you can process the data with csvkit to turn it into a JSON format, and then do more advanced work with a tool like jq, a lightweight and flexible command-line JSON processor. csvkit provides multiple command-line utilities to import, export, parse, sort, join, clean, and format CSV files. The official tutorial is quite complete.
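As a quick illustration, assuming csvkit and jq are installed, the sketch below uses the csvcut and csvjson utilities from csvkit on our sample file:
# Extract the Name column with proper handling of quoted fields,
# then drop the header line
csvcut -c Name countries.csv | tail -n +2

# Convert the CSV to JSON and query it with jq
csvjson countries.csv | jq -r '.[] | "\(.Code) => \(.Name)"'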
Another option is to use the standard csv module in Python. It is fairly straightforward to implement. You can use the reader or DictReader class from the Python csv module. Alternatively, if you don’t want to implement everything in Python, you can just preprocess your CSV files and clean the fields to ensure they are formatted the way you expect. Then you can still process the clean CSV output with bash or awk, as in our previous examples.
#!/usr/bin/env python3
# csv-reader.py: Example of CSV parsing in python
import csv
with open('countries.csv', 'r') as csvfile:
reader = csv.DictReader(csvfile)
for row in reader:
print('Country code', row['Code'], 'is for', row['Name'])
Below is an example output of the csv-reader.py script on the sample CSV file with country names and codes.
[me@linux: ~]$ python3 csv-reader.py | head -n 5
Country code AF is for Afghanistan
Country code AX is for Åland Islands
Country code AL is for Albania
Country code DZ is for Algeria
Country code AS is for American Samoa