How to Parse a CSV File in Bash?


Parsing a comma-separated values (CSV) file from the command line can be challenging and error-prone, depending on the complexity of the CSV file. Yet it is a frequent task in automation scripts, or when you need to quickly process and reformat data from downloaded files. In this post we will cover some common ways to parse simple files with builtin Bash commands or with AWK, and how to parse more complex CSV files.

For the examples in this article, we will use a CSV file with a list of countries and their two-letter codes (ISO 3166-1). The CSV file contains two fields, Name and Code, with 249 entries plus one header line, making it a 250-line file.

$ head -n5 countries.csv
Name,Code
Afghanistan,AF
Åland Islands,AX
Albania,AL
Algeria,DZ

Extract Specific Columns Using Bash Builtin

If you were to loop over that data, the easiest approach that may come to mind is to read the file line by line and use the Internal Field Separator (IFS). To read each line we will simply use the builtin command read, which reads a line from the standard input and splits it into fields, assigning each word to a variable. The -r option prevents backslashes from escaping any characters. Also, remember that read needs a variable name for each field that you want to capture, and the last one specified acts as a catch-all for the remaining fields. Check help read for more details on this builtin command.
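As a quick illustration of that catch-all behavior, here is a minimal sketch (the sample string is made up for the demo):

```shell
# With three variables, the last one captures all remaining
# fields, commas included
IFS=, read -r first second rest <<< "one,two,three,four"
echo "$first / $second / $rest"
# → one / two / three,four
```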

Below is a simple example with IFS set to the comma (,) field separator of the CSV format, and read given the two expected fields, name and code, which are accessible inside the while loop as the variables $name and $code.

while IFS=, read -r name code; do
  # do something... Don't forget to skip the header line!
  [[ "$name" != "Name" ]] && echo "$name"
done < countries.csv

There is a catch with this method though. It doesn't support the full CSV specification and won't work as you would expect with the given dataset. If you look carefully at the output, some entries return incomplete values. Indeed, some fields in the CSV are text fields that contain the comma (,) separator and are surrounded by quotes (").

United States
United States Minor Outlying Islands
Viet Nam
"Virgin Islands
"Virgin Islands
Wallis and Futuna
Western Sahara

Let's try to figure out how many bad entries we have with another while loop, a simple pattern match, and a counter using Arithmetic Expansion.

👉 If you are not familiar with doing simple math in Bash, check the post Performing Math Calculation In Bash. The example below uses the compound command ((...)) to increment the count variable.

count=0
while IFS=, read -r name code; do
  # a broken entry leaves part of the name, commas included, in $code
  [[ "$code" == *","* ]] && echo "$name $code" && ((++count))
done < countries.csv; \
echo ">> we found ${count} bad entries"
"Bolivia  Plurinational State of",BO
"Bonaire  Sint Eustatius and Saba",BQ
"Congo  the Democratic Republic of the",CD
"Iran  Islamic Republic of",IR
"Korea  Democratic People's Republic of",KP
"Korea  Republic of",KR
"Macedonia  the Former Yugoslav Republic of",MK
"Micronesia  Federated States of",FM
"Moldova  Republic of",MD
"Palestine  State of",PS
"Saint Helena  Ascension and Tristan da Cunha",SH
"Taiwan  Province of China",TW
"Tanzania  United Republic of",TZ
"Venezuela  Bolivarian Republic of",VE
"Virgin Islands  British",VG
"Virgin Islands  U.S.",VI
>> we found 16 bad entries

Over 6% of the entries return incomplete data. So, unless you are sure your data does not contain such quoted text fields, I would not recommend using this first method.

Extract Specific Columns Using GNU awk

Awk is a domain-specific language designed for text processing. It is available on most Unix-like systems; unfortunately, there can be a lot of variation between implementations and versions. In our example, we will use the powerful GNU awk (gawk), which is probably the most complete implementation of awk.

With the same countries.csv dataset as in our first example, we are now going to parse our CSV with an implementation using Field Patterns (FPAT). We will be careful to consider that fields are separated by commas (,) while ignoring the commas inside fields surrounded by quotes ("). The FPAT = "([^,]+)|(\"[^\"]+\")" definition can be broken down as follows:

  • The first part ([^,]+) matches fields made of any characters except the comma (,).

  • The second part (\"[^\"]+\") matches fields made of any characters except the quote (") and surrounded by quotes (").

Let's see a full example implementation below.

gawk '
  BEGIN {
    FPAT = "([^,]+)|(\"[^\"]+\")"
  }
  {
    if ($1 != "Name") { # Do not forget to skip the header line!
      printf("%s\n", $1)
      count++
    }
  }
  END {
    printf("Number of entries: %s\n", count)
  }
' countries.csv

Success! We are now getting all the country names properly populated.

⚠️ This approach still does not provide full CSV support. If your text fields contain newlines or other oddities, this parsing will fail. It may be fine in most cases when the content format is known in advance, but if some fields contain manual user entries then you may be at risk of errors.
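For instance, a quoted field containing a newline is a single CSV record but spans two physical lines, so any line-oriented tool (read, or awk with the default record separator) sees two broken records. A minimal illustration:

```shell
# One CSV record, but two physical lines: line-oriented tools see 2
printf '"line one\nline two",XX\n' | wc -l
# prints 2
```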

Implementations for the complete CSV specification

To support the full CSV specification with all its edge cases, you will need a more advanced solution, like the csv module in Python, or a tool like csvkit.
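As a quick sketch of the Python route from the shell (assumes python3 is available; the standard csv module handles quoted commas and embedded newlines correctly):

```shell
# Print the first column of countries.csv, header included
python3 -c '
import csv, sys
for row in csv.reader(sys.stdin):
    print(row[0])
' < countries.csv
```

With csvkit installed, the equivalent would be along the lines of csvcut -c Name countries.csv.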
