Working With CSV Files For Data Engineers
Comma-Separated Values (CSV) files are notorious among data engineers for being tricky to work with. In this post, I explain some of their quirks.
Pros of CSVs
To start with, CSV is a very common format for data interchange, especially between different organizations.
These are likely the 3 main reasons for CSV’s popularity:
- CSV is a textual format, so the contents are human-readable when the files are opened. Unlike binary formats, both the sender and the receiver of the files can peek at the records, spot problems, and rectify them quickly.
- CSV is a more lightweight format than JSON or XML. The headers are stored once at the top of the file instead of being repeated per record as in JSON or XML, so CSV files are generally smaller than the other textual formats.
- CSV can be easily opened by many text editors and processed by many popular libraries. Many storage systems or processing frameworks such as Pandas or Spark also provide native APIs to read and write data as CSV.
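As an illustration, here is a minimal sketch of a round trip with Pandas (the file name users.csv is just an assumption for this example):

import pandas as pd

# Read a CSV file into a DataFrame; the header row is inferred by default
df = pd.read_csv("users.csv")

# Write it back out, omitting the numeric index column that Pandas adds
df.to_csv("users_copy.csv", index=False)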
Cons of CSVs
Using CSV is sometimes troublesome because of these reasons:
- Lack of a standard
- Special characters like delimiter, quote, escape
- No schema or data types
- Huge file size
No Standard Specification
The first and biggest problem with CSV is that there is no formal specification.
The following is an excerpt of the RFC-4180 document by the Internet Engineering Task Force (IETF) regarding the CSV format:
While there are various specifications and implementations for the CSV format, there is no formal specification in existence, which allows for a wide variety of interpretations of CSV files.
This means that there are different interpretations of the CSV format, which in turn lead to different library implementations that may deviate from one another's expectations.
Basic Rules
Different CSV parsers typically agree on the basic standards for CSV files:
- Each record is located on a separate line, delimited by a line break.
- There may be an optional header line appearing as the first line of the file.
- Each line should contain the same number of fields throughout the file.
user_id,name,favorite_food
1,Alicia,Burger
2,Brandon,Mac and Cheese
However, things become complicated when it comes to more complex rules involving special values.
Use of Special Characters
In the CSV file format, there are certain characters that are considered special:
1. Delimiter
2. Quote
3. Escape
Delimiter
The delimiter character is probably the easiest to understand. It is the character that separates the fields (or columns) in the header and in each record.
The RFC-4180 document states that the delimiter should be a comma. After all, CSV is an abbreviation for Comma-Separated Values.
user_id,name,favorite_food
1,Alicia,Burger
2,Brandon,Mac and Cheese
However, some data producers like to provide their CSV files with a different delimiter. A common reason is that the field values within the source systems contain commas, which would be problematic if the comma also served as the delimiter. Hence, it is quite common to see CSV records with a different delimiter:
user_id|name|favorite_food
1|Alicia|Burger
2|Brandon|Mac and Cheese
Many CSV libraries, albeit not all of them, support other delimiter characters such as pipes, tabs, and colons. It is advisable to pick a parser that is configurable.
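For instance, Python's built-in csv module accepts a configurable delimiter. A minimal sketch, assuming the pipe-delimited records above are saved as users.csv:

import csv

with open("users.csv", newline="") as f:
    # Tell the parser that fields are separated by pipes instead of commas
    reader = csv.reader(f, delimiter="|")
    for row in reader:
        print(row)  # e.g. ['1', 'Alicia', 'Burger']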
Quote
The quote character is the next important character for the CSV format. In CSV files, each field of a row may or may not be enclosed in a quote character. The default quote character is a double-quote.
For example, the following is also valid CSV that will result in the exact same data in memory:
"user_id","name","favorite_food"
"1","Alicia","Burger"
"2","Brandon","Mac and Cheese"
Notice how we have wrapped all headers and values in pairs of double quotes. The surrounding " character is called the quote character, and it is always used as a pair.
While the quote character is optional and typically left out in CSV files, it becomes mandatory when your field values in the row contain the delimiter, double-quote, or line breaks (i.e. multiline values). In these cases, you must enclose the field values with the quote character for the record to be properly parsed:
"user_id","name","favorite_food"
"1","Alicia","Burger, Fries and Everything Nice"
"2","Brandon","Mac
and
Cheese" --> This is a multiline record
In this example, there is one header row along with two records. The second record is a multiline record with line breaks within the last field. Without the quote character, this example would be read as four separate records (excluding the header), since the data spans four lines.
Multiline records and records containing double quotes, like these, are fairly common these days.
For example, imagine that you have a website or an app with a form with free-text fields for the users to fill in. The data from the form is saved to a database and eventually extracted to CSVs to be processed by some data pipelines. Hence, if in doubt, always export your data to CSVs with quotes!
Note: some libraries allow you to change the quote character from a double quote to something else. However, the double quote remains the default.
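With Python's csv module, for example, you can force quotes around every field on export. A minimal sketch:

import csv

rows = [
    ["user_id", "name", "favorite_food"],
    ["1", "Alicia", "Burger, Fries and Everything Nice"],
    ["2", "Brandon", "Mac\nand\nCheese"],
]

with open("users_quoted.csv", "w", newline="") as f:
    # QUOTE_ALL wraps every field in the quote character, so embedded
    # commas and line breaks cannot break the record structure
    writer = csv.writer(f, quoting=csv.QUOTE_ALL)
    writer.writerows(rows)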
Escape
Last but not least, we have the escape character in CSV format.
RFC-4180 states that if double-quotes are used to enclose fields, then a double-quote appearing within a field must be escaped by preceding it with another double quote.
For example, in the following snippet, one of the field values contains a quoted “Pizza Hut”:
"user_id","name","favorite_food"
"1","Alicia","I like pizzas from ""Pizza Hut"""
In this scenario, a proper CSV library should parse this field value into memory as I like pizzas from “Pizza Hut”, with single double quotes around the restaurant name. The first double quote in each doubled pair “disappears” after being read into memory, because it acts as the escape character for the double quote that follows it.
The default escape character, just like the quote character, is also a double-quote.
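A quick sketch of this behavior with Python's csv module, which handles doubled quotes by default:

import csv
import io

data = '"user_id","name","favorite_food"\n"1","Alicia","I like pizzas from ""Pizza Hut"""\n'

# The doubled quotes collapse into single quotes when parsed
for row in csv.reader(io.StringIO(data)):
    print(row)
# ['user_id', 'name', 'favorite_food']
# ['1', 'Alicia', 'I like pizzas from "Pizza Hut"']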
Summary of Special Characters
If you are confused at this point, just remember that:
- The delimiter character is used to separate a field from others.
- The quote character is used to wrap a field value.
- The escape character is used to escape the quote character if it appears within the field value.
The delimiter defaults to a comma. Both quote and escape characters default to a double-quote.
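In Python's csv module, for instance, these three concepts map onto the delimiter, quotechar, and doublequote/escapechar parameters. A minimal sketch with the defaults spelled out explicitly:

import csv

with open("users.csv", newline="") as f:
    reader = csv.reader(
        f,
        delimiter=",",     # separates one field from the next
        quotechar='"',     # wraps field values
        doublequote=True,  # a quote inside a field is escaped by doubling it
    )
    for row in reader:
        print(row)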
Lack of Data Types
Another problem is that CSV cannot capture data types.
For example, if you have a CSV file with the following values:
user_id,name,age
1,Alicia,25
2,Brandon,30
A CSV parser cannot tell whether 25 is the string “25” or the number 25. It is up to your application to interpret it correctly in memory.
This can pose a problem when downstream consumers expect certain data types for some columns but the data is dirty:
user_id,name,age
1,Alicia,25
2,Brandon,30
3,Cindy,thirty three
Field values with incorrect types can slip through unnoticed. One way to fix this is to supply an external schema along with the CSV files to interpret the types, and to add a validation stage in your Staging layer that checks the records against the schema.
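A lightweight way to sketch this in Pandas is to load everything as strings first and validate the expected types yourself (the column names follow the example above):

import pandas as pd

# Read every column as a string so dirty values do not crash the load
df = pd.read_csv("users.csv", dtype=str)

# Coerce the age column to numeric; invalid values such as "thirty three" become NaN
df["age"] = pd.to_numeric(df["age"], errors="coerce")

# Rows that failed validation can be quarantined for inspection
bad_rows = df[df["age"].isna()]
print(bad_rows)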
Large File Sizes
CSV files beyond a certain size are difficult to work with.
As an example, I recently worked with files up to 2GB, which is frankly still quite small in the world of Big Data. The files contained only a few million records each, but most text editors and popular programs like Excel or Apple Numbers were already unable to open them.
Some developer tools like IntelliJ or Sublime Text could open these files in a read-only mode, but helpful plugins such as Rainbow CSV no longer work at that size. I have since given up on opening such huge CSV files directly and now typically just peek at them via bash:
$ head large-file.csv
Sometimes, due to their size, CSVs also have to be compressed and sent as ZIP files. Huge CSVs also take more time to process than other file formats. Such scenarios often make me think of replacing CSV with other encoding formats such as Avro that handle these problems out of the box.
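When a huge CSV does have to be processed as-is, reading it in chunks keeps memory usage bounded. A minimal Pandas sketch (the file name is an assumption; Pandas can also read compressed files directly by inferring the compression from the file extension):

import pandas as pd

total = 0
# chunksize returns an iterator of DataFrames instead of loading the whole file
for chunk in pd.read_csv("large-file.csv", chunksize=100_000):
    total += len(chunk)

print(f"{total} records")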
Conclusion
In short, CSVs can be tricky to work with, especially for large applications.
If CSV ingestion is part of the core logic of your data pipelines, I would recommend exploring the data first to identify the special cases within it. Following that, you can experiment with different libraries to determine if they suit your needs.