Iterate twice on values (MapReduce)

Unfortunately this is not possible without caching the values as in Andreas_D’s answer.

Even using the new API, where the Reducer receives an Iterable rather than an Iterator, you cannot iterate twice. It’s very tempting to try something like:

for (IntWritable value : values) {
    // first loop
}

for (IntWritable value : values) {
    // second loop
}

But this won’t actually work. The Iterator you receive from that Iterable’s iterator() method is special. The values may not all be in memory; Hadoop may be streaming them from disk. They aren’t really backed by a Collection, so it’s nontrivial to allow multiple iterations.
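To make the failure mode concrete, here is a small plain-Java simulation. It does not use Hadoop's classes: SinglePassIterable and countTwice are hypothetical names, and the sketch assumes the Iterable keeps handing back one shared, already-consumed iterator, which is one way a streaming source can behave.

```java
import java.util.Arrays;
import java.util.Iterator;

public class SinglePassDemo {

    /** Hypothetical stand-in for the reducer's values: hands out one shared iterator. */
    static final class SinglePassIterable<T> implements Iterable<T> {
        private final Iterator<T> shared;

        SinglePassIterable(Iterator<T> source) {
            this.shared = source;
        }

        @Override
        public Iterator<T> iterator() {
            return shared; // same iterator every time, never rewound
        }
    }

    /** Loops over the values twice and returns how many elements each loop saw. */
    static int[] countTwice(Iterable<Integer> values) {
        int first = 0;
        for (Integer ignored : values) {
            first++;
        }
        int second = 0;
        for (Integer ignored : values) {
            second++;
        }
        return new int[] {first, second};
    }

    public static void main(String[] args) {
        Iterable<Integer> values =
                new SinglePassIterable<>(Arrays.asList(1, 2, 3).iterator());
        int[] counts = countTwice(values);
        // prints "first loop: 3, second loop: 0"
        System.out.println("first loop: " + counts[0] + ", second loop: " + counts[1]);
    }
}
```

The second loop compiles and runs without any error; it simply sees nothing, which is why this bug is easy to miss.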

You can see this for yourself in the Reducer and ReduceContext code.

Caching the values in a Collection of some sort may be the easiest answer, but you can easily blow the heap if you are operating on large datasets. If you can give us more specifics on your problem, we may be able to help you find a solution that doesn’t involve multiple iterations.
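If you do cache, a rough sketch of the idea in plain Java follows. Box and reusingSource are hypothetical stand-ins, not Hadoop classes. They mimic one extra wrinkle worth knowing about: Hadoop reuses the same Writable instance as it advances through the values, so you need to copy each value out (for example with get()) rather than store the reference.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

public class CacheValuesDemo {

    /** Hypothetical mutable holder, standing in for a Writable such as IntWritable. */
    static final class Box {
        int value;
    }

    /** Hypothetical single-pass source that reuses one Box for every element. */
    static Iterable<Box> reusingSource(int... data) {
        Box reused = new Box();
        Iterator<Box> it = new Iterator<Box>() {
            private int next = 0;

            @Override
            public boolean hasNext() {
                return next < data.length;
            }

            @Override
            public Box next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                reused.value = data[next++]; // overwrite the same object in place
                return reused;
            }
        };
        return () -> it; // every call returns the same iterator
    }

    /** Copies each value out of the reused object so the cache stays valid. */
    static List<Integer> cache(Iterable<Box> values) {
        List<Integer> cached = new ArrayList<>();
        for (Box b : values) {
            cached.add(b.value); // store a copy, not the Box reference
        }
        return cached;
    }

    public static void main(String[] args) {
        List<Integer> cached = cache(reusingSource(4, 5, 6));

        int sum = 0;
        for (int v : cached) {   // first pass over the cache
            sum += v;
        }
        int max = Integer.MIN_VALUE;
        for (int v : cached) {   // second pass over the cache
            max = Math.max(max, v);
        }
        // prints "[4, 5, 6] sum=15 max=6"
        System.out.println(cached + " sum=" + sum + " max=" + max);
    }
}
```

The list holds every value for the key at once, so the heap warning above still applies.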
