How to Parse a CSV File in Bash?

  • HOME
  • >
  • BASH
  • >
  • How to Parse a CSV File in Bash?
Last Updated: 
subscribe to my newsletter

Parsing a comma-separated values (CSV) file from the command line can be challenging and prone to errors depending on the complexity of the CSV file. Though, this is a frequent task in many automation bash shell scripts or to quickly process and reformat data from a file download in bash.

This post covers how to parse a CSV file with builtin Bash commands or using AWK to parse more complex ones.

For the examples of this article, we will use a CSV file from datahub.io with a list of countries and their 2 digit code (ISO 3166-1). The CSV file contains two fields Name and Code with 249 entries + 1 headers line which makes it a 250 lines file.

$ head -n5 countries.csv 
Name,Code
Afghanistan,AF
Åland Islands,AX
Albania,AL
Algeria,DZ
...

Extract Specific Columns Using Bash Builtins

👉 This method is recommended only for simple CSV file with no text fields containing extra comma , delimiter or return lines. For more complex CSV support, see the next section to parse CSV with AWK.

If you were to loop over that data, the easiest way you may think of would be to read over the file and use the Internal Field Separator (IFS). To read over the line we will simply use the builtin command read which read a line from the standard input and split it into fields, assigning each words to a variable. The -r option prevent backslashes to escape any characters. Also, remember that read will need a variable name for each field that you want to capture and the last one specified would simply be a catch-all of the remaining fields. Check help read for more details on this builtin command.

Below is a simple example with IFS set with the comma (,) field separator of the CSV format, and read set with the two expected field name and code which would be accessible inside the while loop as variables $name and $code.

while IFS=, read -r name code; do
  # do something... Don't forget to skip the header line!
  [[ "$name" != "Name" ]] && echo "$name"
done < countries.csv

There is a catch with this methodology though. It doesn't support the full CSV specification and won't work as you would expect with the given dataset. If you look carefully at the output data, some of it return incomplete values. Indeed, some fields in the CSV are text fields that contain the comma , separator and are in quotes ".

...
United States
United States Minor Outlying Islands
Uruguay
Uzbekistan
Vanuatu
"Venezuela
Viet Nam
"Virgin Islands
"Virgin Islands
Wallis and Futuna
Western Sahara
...

Let's try to figure out how many bad entries we have with another while loop, a simple regex and a counter using Arithmetic Expansion.

👉 If you are not familiar with doing simple Math in Bash, check the post Performing Math Calculation In Bash. The example below uses the compound command to increment the count variable.

count=0
while IFS=, read -r name code; do
  # do something...
  [[ "$code" == *","* ]] && echo "$name $code" && ((++count))
done < countries.csv; \
echo ">> we found ${count} bad entires"
"Bolivia  Plurinational State of",BO
"Bonaire  Sint Eustatius and Saba",BQ
"Congo  the Democratic Republic of the",CD
"Iran  Islamic Republic of",IR
"Korea  Democratic People's Republic of",KP
"Korea  Republic of",KR
"Macedonia  the Former Yugoslav Republic of",MK
"Micronesia  Federated States of",FM
"Moldova  Republic of",MD
"Palestine  State of",PS
"Saint Helena  Ascension and Tristan da Cunha",SH
"Taiwan  Province of China",TW
"Tanzania  United Republic of",TZ
"Venezuela  Bolivarian Republic of",VE
"Virgin Islands  British",VG
"Virgin Islands  U.S.",VI
>> we found 16 bad entires

Over 6% of the entries will return incomplete data. So, unless you are sure to not have such text fields, I would not recommend using this first method.

👉 Read more about the Bash If Statement constructs and the use of [[ in my post How To Script Powerful Bash If Statement?.

Extract Specific Columns Using the awk Command Line

👉 This method is recommended for complex CSV file with no text fields containing newline delimiters like the \n or \r characters. See the next section about full CSV specification support.

Awk is a domain-specific language designed for text processing. It is available on most Unix-like system, unfortunately, there may be a lot of variation between implementations and versions. In our example, we will use the powerful GNU awk which is probably the most complete implementation of awk.

With the same countries.csv dataset as in our first example, we are now going to parse our CSV with an implementation using Fields Patterns (FPAT). We will be careful to consider that fields are separated by commas (,) while ignoring the ones that are in fields surrounded by quotes ". The FPAT = "([^,]+)|(\"[^\"]+\")" definition can be break down as follow:

  • The first part ([^,]+) is looking up for all fields that match any characters but the comma ,

OR

  • The second part (\"[^\"]+\") is looking up for all fields that match any characters but the quote " and is surrounded by quotes ".

Let see a full example implementation below.

gawk '
BEGIN {
    FPAT = "([^,]+)|(\"[^\"]+\")"
    count=0
}
{
  if ($1 != "Name") { # Do not forget to skip the header line!
    printf("%s\n", $1)
    ++count
  }
}
END {
  printf("Number of entries: %s\n", count)
}
' countries.csv

Success! We are now getting all the countries' name properly populated.

⚠️ This approach is still not a full CSV support. If your text fields contain return lines or other oddities, then this parsing will fail. It may be ok in most cases when the content format is known in advance. If some fields contain manual user entries then you may be at risk of errors.

Implementations for the complete CSV specifications

There is no simple way to support the full CSV implementation with just bash builtins or AWK. The CSV specification is complex and contains many edge cases.

To support a full CSV implementation from your shell script, you will need to use a more advanced solution like the CSV support in Python or by using come command-line tool like csvkit.

Related bash posts that you may like...
How To Create Simple Menu with the Shell Select Loop?
The select loop is not a regular shell loop. It can be used in Bash to generate a simple menu from which a user can select numbered options.
What is the Right Way to Loop in Bash?
Looping over a list of numbers or words is a building block in shell scripts. Learn how to write Bash loops, including for loop, while loop, and until loop.
What is the Best Way to Count Files in a Directory?
Learn how to count the number of files in a directory using the Linux command line ls, find, and a native bash shell solution with globs and arrays.
5 Mistakes To Avoid For Writing High-Quality Bash Comments
Adding comments in your Bash scripts is necessary to ensure maintainability over time. This post covers 5 Bash comments mistakes to avoid in your shell scripts.