Using bash to extract certain columns from a CSV

So here’s a handy dataset: https://npiregistry.cms.hhs.gov

It contains a registry of medical providers (clinics, physicians, etc.) and their National Provider Identifier, called an NPI. Really useful if you need to look that information up or add it to your own datasets, for example if you have the NPI but not the first or last name, or vice versa.

Here’s the problem. This freaking dataset is HUGE! The CSV file that I needed to match against my tables had over 100 columns and something like 4 million rows. That is roughly 400,000,000 cells. My ’puter coughed and sputtered. Here comes bash to save the day.

Check out this post at Stack Overflow.

User702403 had a question, but for me, the question was the answer.


awk -F "," '{print $1 "," $2}' infile.csv > outfile.csv

To a bash noobie such as myself, this is gold. And it just about says it all. -F "," tells awk to split each line on commas, $1 will be the first column, and $2 will be the next. Use, for example, awk -F "," '{print $1 "," $6 "," $7}' infile.csv > outfile.csv if you want to print the 1st, 6th, and 7th columns. Fortunately, all of my needed columns were near the front, but if you need to extract the 3006th column and you don’t know which one it is, you could copy the first row into a spreadsheet and then pivot (transpose) it, so that each header’s row number is its column number.
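
If a spreadsheet feels like overkill, the header row can probably be numbered straight from the shell with the same awk trick. This is only a rough sketch: list_columns is a name I made up for illustration, and it assumes the header is plain comma-separated text with no commas hiding inside quoted fields (a naive -F "," split would miscount those).

```shell
# list_columns is a hypothetical helper name, not part of any tool.
# Prints each header name next to its column number, one per line.
# Assumption: no quoted header fields that themselves contain commas.
list_columns() {
  # head -n 1 reads only the first line, so the 4 million rows are never scanned
  head -n 1 "$1" | awk -F "," '{ for (i = 1; i <= NF; i++) print i ": " $i }'
}
```

Running list_columns infile.csv should print lines like "1: <first header>", "2: <second header>", and so on, and the number on the left is the one to use after the $ in the awk command.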