Merging multiple files into one within Hadoop

To keep everything on the grid, use Hadoop streaming with a single reducer and cat as both the mapper and the reducer (basically a no-op), and add compression using MapReduce flags.

    hadoop jar \
        $HADOOP_PREFIX/share/hadoop/tools/lib/hadoop-streaming.jar \
        -Dmapred.reduce.tasks=1 \
        -Dmapred.job.queue.name=$QUEUE \
        -input "$INPUT" \
        -output "$OUTPUT" \
        -mapper cat \
        -reducer cat

If you want compression …
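To see why a single reducer with cat on both sides produces one merged file, here is a minimal local sketch in Python. It is not Hadoop code: the function names and the sample data are hypothetical, and treating the whole line as the sort key is a simplification of how streaming splits keys and values. It also shows a side effect of the shuffle, namely that the merged lines come out sorted rather than in their original order.

```python
from typing import Iterable, List


def identity(lines: Iterable[str]) -> List[str]:
    """Stand-in for `cat`: pass every line through unchanged."""
    return list(lines)


def merge_with_single_reducer(input_files: List[List[str]]) -> List[str]:
    """Simulate the job locally (hypothetical helper, not a Hadoop API)."""
    # Map phase: every input file goes through an identity mapper.
    mapped = [line for split in input_files for line in identity(split)]
    # Shuffle: with one reducer all records land in the same partition,
    # sorted by key (assumption: the whole line acts as the key).
    shuffled = sorted(mapped)
    # Reduce phase: the lone identity reducer writes a single output.
    return identity(shuffled)


# Hypothetical contents of three small input files.
parts = [["b 2", "a 1"], ["d 4"], ["c 3"]]
merged = merge_with_single_reducer(parts)
print(merged)
```

Because every record is funneled through one reducer, this approach trades parallelism for a single output file, which is usually acceptable only when the combined data is small enough for one task.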

Pivot table with Apache Pig

You can do it in 2 ways:

1. Write a UDF which returns a bag of tuples. It will be the most flexible solution, but requires Java code;
2. Write a rigid script like this:

    inpt = load '/pig_fun/input/pivot.txt' as (Id, Column1, Column2, Column3);
    bagged = foreach inpt generate Id, TOBAG(TOTUPLE('Column1', Column1), TOTUPLE('Column2', Column2), TOTUPLE('Column3', …
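To make the shape of the bagged relation concrete, the following Python sketch mimics what the TOBAG(TOTUPLE(...)) expression builds for each row: the Id paired with a bag of (column name, value) tuples. The sample rows and the helper name are hypothetical, and the final flattening step is an assumption about where such a script would go next, since the original is cut off before that point.

```python
from typing import Dict, List, Tuple

Bag = List[Tuple[str, str]]


def to_bag(row: Dict[str, str], columns: List[str]) -> Tuple[str, Bag]:
    """Mimic `generate Id, TOBAG(TOTUPLE('Column1', Column1), ...)` for one row."""
    return row["Id"], [(name, row[name]) for name in columns]


columns = ["Column1", "Column2", "Column3"]

# Hypothetical sample data standing in for pivot.txt.
rows = [
    {"Id": "1", "Column1": "a", "Column2": "b", "Column3": "c"},
    {"Id": "2", "Column1": "d", "Column2": "e", "Column3": "f"},
]

bagged = [to_bag(row, columns) for row in rows]

# Assumed next step: flatten each bag so every (Id, column name, value)
# combination becomes its own row.
pivoted = [(id_, name, value) for id_, bag in bagged for name, value in bag]

for record in pivoted:
    print(record)
```

The column list is hard-coded here just as it is in the script, which is what makes this approach rigid: adding a column means editing the code.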