What's the most robust way to efficiently parse CSV using awk?

If your CSV cannot contain newlines then all you need is (with GNU awk for FPAT):

$ echo 'foo,"field,""with"",commas",bar' |
    awk -v FPAT='[^,]*|("([^"]|"")*")' '{for (i=1; i<=NF;i++) print i " <" $i ">"}'
1 <foo>
2 <"field,""with"",commas">
3 <bar>

or the equivalent using any awk:

$ echo 'foo,"field,""with"",commas",bar' |
    awk -v fpat='[^,]*|("([^"]|"")*")' -v OFS=',' '{
        rec = $0
        $0 = ""
        i = 0
        while ( (rec!="") && match(rec,fpat) ) {
            $(++i) = substr(rec,RSTART,RLENGTH)
            rec = substr(rec,RSTART+RLENGTH+1)
        }
        for (i=1; i<=NF;i++) print i " <" $i ">"
    }'
1 <foo>
2 <"field,""with"",commas">
3 <bar>

See https://www.gnu.org/software/gawk/manual/gawk.html#More-CSV for info on the specific FPAT setting I use above.
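
Both versions above leave the enclosing quotes and the doubled "" escapes in each field. As a minimal sketch, the any-awk loop can be extended to unquote each field using the same gsub() calls that decsv.awk uses further down. It assumes no newlines within fields and no empty field at the end of the line, and the function name csv_fields is made up for this illustration:

```shell
# Hypothetical helper: split one CSV line per input record and print
# each field with its enclosing quotes removed and "" collapsed to ".
csv_fields() {
    awk -v fpat='[^,]*|("([^"]|"")*")' '{
        rec = $0
        n = 0
        while ( (rec != "") && match(rec, fpat) ) {
            fld = substr(rec, RSTART, RLENGTH)
            # Convert <"foo"> to <foo> and <"foo""bar"> to <foo"bar>
            if ( gsub(/^"|"$/, "", fld) ) {
                gsub(/""/, "\"", fld)
            }
            print ++n " <" fld ">"
            # Skip past the field and the comma that follows it
            rec = substr(rec, RSTART+RLENGTH+1)
        }
    }'
}

echo 'foo,"field,""with"",commas",bar' | csv_fields
```

With the sample line used above this should print <foo>, <field,"with",commas> and <bar>, numbered 1 to 3.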

If all you actually want to do is convert your CSV to individual lines by, say, replacing newlines with blanks and commas with semi-colons inside quoted fields then all you need is this, again using GNU awk for multi-char RS and RT:

$ awk -v RS='"([^"]|"")*"' -v ORS= '{gsub(/\n/," ",RT); gsub(/,/,";",RT); print $0 RT}' file.csv
"rec1; fld1",,"rec1"";""fld3.1 ""; fld3.2","rec1 fld4"
"rec2; fld1.1  fld1.2","rec2 fld2.1""fld2.2""fld2.3","",rec2 fld4
"""""","""rec3;fld2""",

Otherwise, though, the general, robust, portable solution to identify the fields that will work with any modern awk* is:

$ cat decsv.awk
function buildRec(      fpat,fldNr,fldStr,done) {
    CurrRec = CurrRec $0
    if ( gsub(/"/,"&",CurrRec) % 2 ) {
        # The string built so far in CurrRec has an odd number
        # of "s and so is not yet a complete record.
        CurrRec = CurrRec RS
        done = 0
    }
    else {
        # If CurrRec ended with a null field we would exit the
        # loop below before handling it so ensure that cannot happen.
        # We use a regexp comparison using a bracket expression here
        # and in fpat so it will work even if FS is a regexp metachar
        # or a multi-char string like "\\\\" for \-separated fields.
        CurrRec = CurrRec ( CurrRec ~ ("[" FS "]$") ? "\"\"" : "" )
        $0 = ""
        fpat = "([^" FS "]*)|(\"([^\"]|\"\")+\")"
        while ( (CurrRec != "") && match(CurrRec,fpat) ) {
            fldStr = substr(CurrRec,RSTART,RLENGTH)
            # Convert <"foo"> to <foo> and <"foo""bar"> to <foo"bar>
            if ( gsub(/^"|"$/,"",fldStr) ) {
                gsub(/""/, "\"", fldStr)
            }
            $(++fldNr) = fldStr
            CurrRec = substr(CurrRec,RSTART+RLENGTH+1)
        }
        CurrRec = ""
        done = 1
    }
    return done
}

# If your input has \-separated fields, use FS="\\\\"; OFS="\\"
BEGIN { FS=OFS="," }
!buildRec() { next }
{
    printf "Record %d:\n", ++recNr
    for (i=1;i<=NF;i++) {
        # To replace newlines with blanks add gsub(/\n/," ",$i) here
        printf "    $%d=<%s>\n", i, $i
    }
    print "----"
}

$ awk -f decsv.awk file.csv
Record 1:
    $1=<rec1, fld1>
    $2=<>
    $3=<rec1","fld3.1
",
fld3.2>
    $4=<rec1
fld4>
----
Record 2:
    $1=<rec2, fld1.1

fld1.2>
    $2=<rec2 fld2.1"fld2.2"fld2.3>
    $3=<>
    $4=<rec2 fld4>
----
Record 3:
    $1=<"">
    $2=<"rec3,fld2">
    $3=<>
----

The above assumes UNIX line endings of \n. With Windows \r\n line endings it’s much simpler as the “newlines” within each field will actually just be line feeds (i.e. \ns) and so you can set RS="\r\n" (using GNU awk for multi-char RS) and then the \ns within fields will not be treated as line endings.
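
As a rough illustration of the \r\n case, assuming GNU awk (or another awk that accepts a multi-char RS) and input where records end in \r\n while newlines inside fields are bare \n, records can be counted with no quote counting at all. The function name count_crlf_records is made up for this example, and the fields would still need to be split with FPAT or similar:

```shell
# Hypothetical helper: count CSV records in CRLF-terminated input.
# Requires an awk with multi-char RS support (e.g. GNU awk).
count_crlf_records() {
    awk -v RS='\r\n' 'END { print NR }'
}

# Two records; the first has a bare \n inside a quoted field.
printf 'a,"line1\nline2",b\r\nc,d,e\r\n' | count_crlf_records
```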

The decsv.awk script works by counting how many "s are present so far in the current record whenever it encounters the RS. If the count is odd then the RS (presumably \n, but it doesn’t have to be) is mid-field, so we keep building the current record. If it’s even then we have reached the end of the current record and the rest of the script can process the now-complete record.
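
The quote-counting test can be tried in isolation. This sketch is not part of decsv.awk and the function name count_csv_records is hypothetical. It only counts complete records and does not split them into fields:

```shell
# Hypothetical helper: count logical CSV records, treating a newline as
# part of a field whenever the text read so far has an odd number of "s.
count_csv_records() {
    awk '
    {
        rec = rec $0
        # gsub() replaces each " with itself and returns how many it found
        if ( gsub(/"/, "&", rec) % 2 ) {
            rec = rec RS    # odd count: the newline was mid-field
            next
        }
        n++                 # even count: the record is complete
        rec = ""
    }
    END { print n+0 }
    ' "$@"
}

# Three physical lines, two logical records
printf '%s\n' 'a,"multi' 'line",b' 'c,d,e' | count_csv_records
```

Here the first two input lines are joined into one record because the first line on its own contains a single ".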

*I say “modern awk” above because there are apparently extremely old (i.e. circa 2000) versions of tawk and mawk1 still around whose gsub() implementations have bugs such that gsub(/^"|"$/,"",fldStr) would not remove the start/end "s from fldStr. If you’re using one of those then get a new awk, preferably gawk, as there could be other issues with them too. If that’s not an option then I expect you can work around that particular bug by changing this:

        if ( gsub(/^"|"$/,"",fldStr) ) {

to this:

        if ( sub(/^"/,"",fldStr) && sub(/"$/,"",fldStr) ) {

Thanks to the following people for identifying and suggesting solutions to the stated issues with the original version of this answer:

@mosvy for escaped double quotes within fields.
@datatraveller1 for multiple contiguous pairs of escaped quotes in a field and null fields at the end of records.

Related: also see How do I use awk under cygwin to print fields from an excel spreadsheet? for how to generate CSVs from Excel spreadsheets.
