Binary search in a sorted (memory-mapped ?) file in Java

I am a big fan of Java’s MappedByteBuffers for situations like this. It is blazing fast. Below is a snippet I put together for you that maps a buffer to the file, seeks to the middle, and then searches backwards to a newline character. This should be enough to get you going?

I have similar code (seek, read, repeat until done) in my own application, benchmarked
java.io streams against MappedByteBuffer in a production environment and posted the results on my blog (Geekomatic posts tagged ‘java.nio’ ) with raw data, graphs and all.

Two second summary? My MappedByteBuffer-based implementation was about 275% faster. YMMV.

To work for files larger than ~2GB, which is a problem because of the cast and .position(int pos), I’ve crafted paging algorithm backed by an array of MappedByteBuffers. You’ll need to be working on a 64-bit system for this to work with files larger than 2-4GB because MBB’s use the OS’s virtual memory system to work their magic.

public class StusMagicLargeFileReader  {
    private static final long PAGE_SIZE = Integer.MAX_VALUE;
    private List<MappedByteBuffer> buffers = new ArrayList<MappedByteBuffer>();
    private final byte raw[] = new byte[1];

    public static void main(String[] args) throws IOException {
        File file = new File("/Users/stu/test.txt");
        FileChannel fc = (new FileInputStream(file)).getChannel(); 
        StusMagicLargeFileReader buffer = new StusMagicLargeFileReader(fc);
        long position = file.length() / 2;
        String candidate = buffer.getString(position--);
        while (position >=0 && !candidate.equals('\n')) 
            candidate = buffer.getString(position--);
        //have newline position or start of file...do other stuff    
    }
    StusMagicLargeFileReader(FileChannel channel) throws IOException {
        long start = 0, length = 0;
        for (long index = 0; start + length < channel.size(); index++) {
            if ((channel.size() / PAGE_SIZE) == index)
                length = (channel.size() - index *  PAGE_SIZE) ;
            else
                length = PAGE_SIZE;
            start = index * PAGE_SIZE;
            buffers.add(index, channel.map(READ_ONLY, start, length));
        }    
    }
    public String getString(long bytePosition) {
        int page  = (int) (bytePosition / PAGE_SIZE);
        int index = (int) (bytePosition % PAGE_SIZE);
        raw[0] = buffers.get(page).get(index);
        return new String(raw);
    }
}

Leave a Comment