Introduction
System engineers are used to CSV files. They can be seen as a good bridge between Excel and the CLI. They are also a common way to format the output of a script so the data can be processed easily with command-line tools such as sort, awk, uniq, grep, and so on.
The problem is that when there is a significant amount of data, parsing it in shell can be painful and extremely slow.
This is a very simple and quick post about parsing CSV files in Python, Perl and Go.
The use case
Consider a CSV file with 4 fields per row. The first field is a server name, and I may have 700 different servers. The second field is a supposed disk size for a certain partition. The other fields are only there to tell the rows apart in my example.
What I would like to know is the total disk size per server.
I will implement three versions of the parser, and I will look at the result for one particular server to check that the computation is correct. Then I will compare the execution time of each implementation.
Generating the samples
I’m using a very simple shell loop to generate the samples. I’m generating a file with 600000 lines.
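For readers who prefer not to use a shell loop, here is a minimal Python sketch that produces a comparable file. The exact row layout of the original loop is not shown here, so everything beyond "server name first, disk size second, four fields in total" is an assumption: the function name generate_samples, the numbering range (SERVER100 to SERVER799, picked only to be consistent with 700 servers and SERVER788), the size range, and the two trailing fields are all hypothetical.

```python
import csv
import io
import random


def generate_samples(out, lines=600_000, first=100, servers=700, seed=None):
    """Write `lines` CSV rows (server, size, extra, extra) to the text stream `out`."""
    rng = random.Random(seed)
    writer = csv.writer(out, lineterminator="\n")
    for i in range(lines):
        writer.writerow([
            f"SERVER{first + rng.randrange(servers)}",  # assumed numbering scheme
            rng.randint(1, 500),                        # assumed size range, unit unspecified
            f"part{i % 10}",                            # hypothetical discriminating field
            i,                                          # hypothetical discriminating field
        ])


# Small in-memory demonstration; pass an open file to write the real sample.
demo = io.StringIO()
generate_samples(demo, lines=3, seed=0)
print(demo.getvalue(), end="")
```

To write the full sample, one would call it with a file opened as open("sample.csv", "w", newline="").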
I’ve randomly chosen to check the size of SERVER788 (but I will compute the size for all the servers).
I have a lot of entries for my server.
The implementations
Here is the implementation in each language:
The Go implementation
The Go implementation relies on the encoding/csv package.
The package provides a Reader type, created with NewReader, which takes the famous io.Reader as input. Therefore I read a stream of data and do not load the whole file in memory.
The Perl implementation
I did not find a CSV implementation in Perl that would be more efficient than the code below. Any pointers are appreciated.
The Python implementation
Python does have a csv module. This module is optimized and seems to be as flexible as the Go implementation. It reads a stream as well.
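As an illustration of the approach described above, here is a minimal sketch of a per-server sum built on csv.reader. It is not necessarily the script that was benchmarked. The function name total_size_per_server is hypothetical, and the sketch assumes the size in the second field is an integer and that the file has no header row.

```python
import csv
import io
from collections import defaultdict


def total_size_per_server(stream):
    """Sum the second field (disk size) for each server name found in the first field."""
    totals = defaultdict(int)
    for row in csv.reader(stream):  # rows are consumed one at a time
        if len(row) < 2:
            continue                # skip blank or malformed lines
        totals[row[0]] += int(row[1])
    return totals


# Tiny in-memory sample standing in for the 600000-line file.
sample = io.StringIO(
    "SERVER788,10,a,1\n"
    "SERVER1,5,b,2\n"
    "SERVER788,32,c,3\n"
)
totals = total_size_per_server(sample)
print(totals["SERVER788"])  # prints 42
```

With a real file, the stream would be a file object opened with newline="", as the csv module documentation recommends.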
The results
I’ve run all the scripts through the GNU time command. I didn’t use the shell’s built-in time command because I wanted to check the memory footprint as well as the execution time.
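The same two numbers can be approximated from Python, which may help where GNU time is not installed. This is only a sketch and not the method used for the measurements in this post: the helper name measure is hypothetical, the resource module is Unix-only, and ru_maxrss is reported in kilobytes on Linux but in bytes on macOS.

```python
import resource
import subprocess
import sys
import time


def measure(cmd):
    """Run cmd and return (elapsed seconds, peak RSS over waited-for children)."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    elapsed = time.perf_counter() - start
    # Peak resident set size of the largest child reaped so far by this process.
    peak = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return elapsed, peak


# Placeholder command; replace it with the script to benchmark.
elapsed, peak = measure([sys.executable, "-c", "pass"])
print(f"{elapsed:.3f}s elapsed, peak RSS {peak}")
```

Because RUSAGE_CHILDREN reports a maximum over all children, each script should be measured from a fresh Python process to keep the memory figures comparable.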
Here are the results
Conclusion
All of the languages have very good execution times: below 4 seconds to process the sample file. Go gives the best performance, but the difference is insignificant as long as the files do not exceed millions of records. The memory footprint is low for each implementation.
It’s definitely worth a bit of work to implement a decent parser in a “modern language” instead of relying on a while read loop or a for i in $(cat... in shell.
I didn’t write a shell implementation, but it would have taken ages to run on my Chromebook anyway.