Reading 40 GB csv file into R using bigmemory

I can’t speak to bigmemory, but for what you describe you don’t need to read the whole file into R at all. Simply pipe the file through some bash/awk/sed/python/whatever preprocessing that does the steps you want, i.e. throws out the NULL lines and randomly selects N lines, and then read the result in.

Here’s an example using awk (assuming you want 100 random lines from a file that has 1M lines).

# header = FALSE because the randomly sampled lines include no header row
read.csv(pipe("awk -F, 'BEGIN {srand(); m = 100; n = 1000000}
                        # skip lines containing NULL; keep each surviving line with
                        # probability (samples still needed) / (lines left to read)
                        !/NULL/ {if (rand() < m / (n - NR + 1)) {
                                   print; m--
                                   if (m == 0) exit
                                }}' filename"),
         header = FALSE) -> df
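
If a one-liner is enough, the same filter-then-sample idea can be sketched without awk, assuming a GNU coreutils shuf whose -n option reservoir-samples the stream (recent versions do; older ones may buffer the whole input, which matters for a 40 GB file). filename is again a placeholder:

# grep -v drops every line containing NULL; shuf -n 100 picks 100 at random
read.csv(pipe("grep -v NULL filename | shuf -n 100"),
         header = FALSE) -> df

The awk version still has the advantage of a guaranteed single pass with an early exit once it has collected its 100 lines.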

It wasn’t obvious to me what you meant by NULL, so I took it literally (any line containing the string NULL is dropped), but it should be easy to modify the filter to fit your needs.
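
For example, if NULL only counts when it appears in a particular column, say the third (a hypothetical position; adjust -F and the field index to match your data), the regex pattern becomes a field test:

read.csv(pipe("awk -F, 'BEGIN {srand(); m = 100; n = 1000000}
                        # $3 is a placeholder: change the field index to your file
                        $3 != \"NULL\" {if (rand() < m / (n - NR + 1)) {
                                          print; m--
                                          if (m == 0) exit
                                       }}' filename"),
         header = FALSE) -> df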
