How to Parse CSV Data in Bash


Comma-Separated Values (CSV) files are one of the most common formats for exported data. On Linux, we can read CSV files with Bash commands, but things can get complicated very quickly. We’ll help.

What is a CSV file?

A comma-separated values file is a text file that contains tabulated data. CSV is a type of delimited data. As the name suggests, a comma “,” is used to separate each field of data, or value, from its neighbors.

CSV is everywhere. If an application has import and export capabilities, it almost always supports CSV. CSV files are human-readable. You can peek inside with less, open them in any text editor, and move them from program to program. For example, you can export the data from a SQLite database and open it in LibreOffice Calc.

But even CSV can get complicated. Want a comma inside a data field? That field needs to have quotation marks “"” wrapped around it. And to include quotation marks inside a field, each embedded quotation mark must be entered twice.
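As a quick illustration of those two rules, here’s a hypothetical one-line record (not from our sample file) whose second field contains both a comma and embedded quotation marks:

```shell
# Stored CSV form: the field is wrapped in quotes because it contains
# a comma, and each embedded quotation mark is doubled.
echo '1,"He said ""hi"", then left",Ohio'
# The value that second field actually represents is:
#   He said "hi", then left
```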

If you’re working with CSV generated by a program or script that you’ve written, the CSV format is likely to be simple and straightforward. If you’re forced to work with more complex CSV formats, then, Linux being Linux, there are solutions we can use for that too.

Some example data

Sites like Online Data Generator make it easy to generate some sample CSV data. You can define the fields you want and choose how many rows of data you want. Your data is generated using realistic dummy values and downloaded to your computer.

We created a file containing 50 lines of dummy employee information:

  • id: A simple unique integer value.
  • firstname: The first name of the person.
  • lastname: The last name of the person.
  • job-title: The person’s job title.
  • email-address: The person’s email address.
  • branch: The corporate branch they work in.
  • state: The state where the branch is located.

Some CSV files include a header line that lists the names of the fields. Our sample file has one. Here’s the top of our file:

The sample CSV file

The first line contains the field names as comma-separated values.

Analyzing data from the CSV file

Let’s write a script that reads the CSV file and extracts the fields from each record. Copy this script into an editor and save it in a file called “field.sh”.

#!/bin/bash

while IFS="," read -r id firstname lastname jobtitle email branch state
do
  echo "Record ID: $id"
  echo "Firstname: $firstname"
  echo " Lastname: $lastname"
  echo "Job Title: $jobtitle"
  echo "Email add: $email"
  echo " Branch: $branch"
  echo " State: $state"
  echo ""
done < <(tail -n +2 sample.csv)

There is a lot packed into our little script. Let’s break it down.

We’re using a while loop. As long as the while loop condition resolves to true, the body of the while loop is executed. The body of the loop is quite simple. A collection of echo statements is used to print the values of some variables in the terminal window.

The while loop condition is more interesting than the body of the loop. We specify that a comma should be used as the internal field separator, with the IFS="," expression. IFS is an environment variable. The read command refers to its value when it is parsing sequences of text.

We’re using the read command’s -r (preserve backslashes) option so that any backslashes in the data are treated as regular characters rather than as escape characters.

The text that the read command parses is stored in a set of variables named after the CSV fields. They could just as easily have been named field1, field2, ... field7, but meaningful names make life easier.

The data is obtained as the output from the tail command. We’re using tail because it gives us a simple way to skip over the header line of the CSV file. The -n +2 (line number) option tells tail to start reading at line number two.

The <(...) construct is called process substitution. It causes Bash to accept the output of a process as though it were coming from a file descriptor. This is then redirected into the while loop, providing the text that the read command will parse.
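Here’s the same pattern as a minimal, self-contained sketch, using a throwaway two-line file instead of sample.csv (the /tmp path and field names are placeholders of our own, not from the article):

```shell
#!/bin/bash
# Build a tiny CSV with a header line, then parse it the same way field.sh does.
printf 'id,name\n1,Alice\n2,Bob\n' > /tmp/mini.csv

while IFS="," read -r id name
do
  echo "$id -> $name"
done < <(tail -n +2 /tmp/mini.csv)   # tail skips the header line
# Prints:
# 1 -> Alice
# 2 -> Bob
```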

Make the script executable with the chmod command. You’ll need to do this each time you copy a script from this article. Substitute the name of the appropriate script in each case.

chmod +x field.sh

Make a script executable with chmod

When we run the script, the records are correctly split into their individual fields, with each field being stored in a different variable.

./field.sh

The CSV file parsed by the field.sh script.

Each record is printed as a set of fields.

Selecting fields

Perhaps we don’t want or need to retrieve every field. We can obtain a selection of fields by incorporating the cut command.

This script is called “select.sh”.

#!/bin/bash

while IFS="," read -r id jobtitle branch state
do
  echo "Record ID: $id"
  echo "Job Title: $jobtitle"
  echo " Branch: $branch"
  echo " State: $state"
  echo ""
done < <(cut -d "," -f1,4,6,7 sample.csv | tail -n +2)

We’ve added the cut command into the process substitution clause. We’re using the -d (delimiter) option to tell cut to use commas “,” as the delimiter. The -f (field) option tells cut we want fields one, four, six, and seven. Those four fields are read into four variables, which get printed in the body of the while loop.

This is what we get when we run the script.

./select.sh

Parsing the CSV file with select.sh to extract a specific selection of fields

By adding the cut command, we can select the fields we want and ignore the fields we don’t want.
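The field selection on its own can be sketched with an inline two-line sample (hypothetical data, not our sample.csv):

```shell
# -d sets the delimiter, -f picks the fields; here, fields one and three.
printf 'a,b,c,d\n1,2,3,4\n' | cut -d "," -f1,3
# Prints:
# a,c
# 1,3
```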

So far, so good. But…

If the CSV you’re dealing with is straightforward and doesn’t contain commas or quotes in field data, then what we’ve covered will likely meet your CSV parsing needs. To show the problems we can encounter, we modified a small sample of the data to look like the following.

id,firstname,lastname,job-title,email-address,branch,state
1,Rosalyn,Brennan,"Steward, Senior",[email protected],Minneapolis,Maryland
2,Danny,Redden,"Analyst ""Budget""",[email protected],Venice,North Carolina
3,Lexi,Roscoe,Pharmacist,,Irlington,Vermont

  • Record one has a comma in the job-title field, so the field had to be wrapped in quotation marks.
  • Record two has a word wrapped in two sets of quotation marks in its job-title field.
  • Record three has no data in its email-address field.

This data was saved as “sample2.csv”. Modify your field.sh script to read sample2.csv, and save it as field2.sh.

#!/bin/bash

while IFS="," read -r id firstname lastname jobtitle email branch state
do
  echo "Record ID: $id"
  echo "Firstname: $firstname"
  echo " Lastname: $lastname"
  echo "Job Title: $jobtitle"
  echo "Email add: $email"
  echo " Branch: $branch"
  echo " State: $state"
  echo ""
done < <(tail -n +2 sample2.csv)

When we run this script, we can see cracks appearing in our simple CSV parser.

./field2.sh

Running field2.sh

The first record splits the job-title field into two fields, treating the second part as the email address. Every field after that is shifted one place to the right. The last field contains both the branch and the state values.

A record with a field split into two fields

The second record retains all of the quotation marks. There should only be a single pair of quotation marks around the word “Budget”.

A record with mishandled quotes

The third record actually handles the missing field as it should. The email address is missing, but everything else is as it should be.

A record with a missing field that is handled correctly

Counterintuitively for such a simple data format, it’s very difficult to write a robust, general-case CSV parser. Tools like awk will let you get close, but there are always edge cases and exceptions that slip through.
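For example, splitting on commas with awk miscounts the fields in our first problem record, because awk has no idea about CSV quoting (a hypothetical one-liner of our own, not from the article):

```shell
# awk's naive comma split breaks the quoted "Steward, Senior" field in two,
# so it reports four fields where a CSV-aware parser would see three.
echo '1,"Steward, Senior",Maryland' | awk -F, '{print NF}'
# Prints: 4
```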

Trying to write an infallible CSV parser is probably not the best way forward. An alternative approach, especially if you’re working to some sort of deadline, is to employ one of two different strategies.

One is to use a purpose-built tool to manipulate and extract your data. The second is to sanitize your data, replacing the problem scenarios such as embedded commas and quotation marks. Your simple Bash parsers can then cope with the Bash-friendly CSV.

The CSV toolkit csvkit is a collection of utilities specially created to work with CSV files. You need to install it on your computer.

Use this command to install it on Ubuntu:

sudo apt install csvkit

Install csvkit on Ubuntu

To install it on Fedora you need to type:

sudo dnf install python3-csvkit

Installing csvkit on Fedora

On Manjaro the command is:

sudo pacman -S csvkit

Installing csvkit on Manjaro

If we pass it the name of a CSV file, the csvlook command displays a table showing the contents of each field. The field contents are displayed to show what they represent, not how they’re stored in the CSV file.

Let’s try csvlook with our problematic sample2.csv file.

csvlook sample2.csv

annoying CSV correctly parsed by csvlook

All fields are displayed correctly. This proves that the problem is not the CSV. The problem is that our scripts are too simple to interpret the CSV file correctly.

To select specific columns, we can use the csvcut command. The -c (column) option can be used with field names or column numbers, or a mix of both.

Suppose we need to extract the first and last names, job titles, and email addresses from each record, but we want the name order to be “Last Name, First Name”. All we have to do is specify the field names or numbers in the order we want.

These three commands are all equivalent.

csvcut -c lastname,firstname,job-title,email-address sample2.csv
csvcut -c lastname,firstname,4,5 sample2.csv
csvcut -c 3,2,4,5 sample2.csv

Selecting fields in a preferred order using csvcut

We can add the csvsort command to sort the output by a field. We’re using the -c (column) option to specify the column to sort by, and the -r (reverse) option to sort in descending order.

csvcut -c 3,2,4,5 sample2.csv | csvsort -c 1 -r

Select fields and sort by a single column

To make the output prettier, we can pipe it through csvlook.

csvcut -c 3,2,4,5 sample2.csv | csvsort -c 1 -r | csvlook

Using csvlook to pretty print the sorted selection of fields

A nice touch is that, even though the records have been sorted, the header line with the field names is retained as the first line. Once we’re happy that we have the data the way we want it, we can remove csvlook from the command chain and create a new CSV file by redirecting the output into a file.

We removed the csvsort command from the chain and created a new file called “sample3.csv”.

csvcut -c 3,2,4,5 sample2.csv > sample3.csv

A safe way to clean CSV data

When you open a CSV file in LibreOffice Calc, each field is placed in a cell. You can use the find and replace function to search for commas. You could replace them with “nothing”, so that they vanish, or with a character that won’t affect the CSV parsing, such as a semicolon “;”.

You won’t see the quotation marks around quoted fields. The only quotation marks you’ll see are the embedded quotation marks inside the field data, and these are shown as single quotation marks. Finding these and replacing them with a single apostrophe “'” replaces the doubled-up quotation marks in the CSV file.

Using Find and Replace in LibreOffice Calc to replace quotation marks with apostrophes

Performing the find and replace in an application like LibreOffice Calc means you can’t accidentally delete any of the field-separator commas, nor delete the quotation marks around quoted fields. You’ll only change the data values of fields.

We changed all of the commas inside fields to semicolons, and all of the embedded quotation marks to apostrophes, and saved our changes.

The modified CSV file
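Once the embedded commas are gone, a record like our first problem record parses cleanly with the simple IFS approach. Here’s a sketch using a here-string and a cleaned-up record of our own (the email address is a placeholder, not from the sample data):

```shell
#!/bin/bash
# The cleaned record: the job title now uses a semicolon, not a comma.
line='1,Rosalyn,Brennan,Steward; Senior,someone@example.com,Minneapolis,Maryland'

# read now splits the line into exactly seven fields, as intended.
IFS="," read -r id firstname lastname jobtitle email branch state <<< "$line"
echo "$jobtitle"
# Prints: Steward; Senior
```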

We then created a script called field3.sh to parse sample3.csv.

#!/bin/bash

while IFS="," read -r lastname firstname jobtitle email
do
  echo " Lastname: $lastname"
  echo "Firstname: $firstname"
  echo "Job Title: $jobtitle"
  echo "Email add: $email"
  echo ""
done < <(tail -n +2 sample3.csv)

Let’s see what we get when we run it.

./field3.sh

A section of the correctly parsed CSV file

Our simple parser can now handle our previously problematic datasets.

You will see a lot of CSV

CSV is probably the closest thing there is to a common language for application data. Most applications that handle some form of data support importing and exporting CSV. Knowing how to handle CSV in a realistic and practical way will stand you in good stead.

RELATED: 9 Bash Script Examples to Get You Started on Linux
