How to make plots from distributed data from R

I know you write that no aggregation is (or should be?) done, but I’d wager that aggregation is precisely what you need and want. The point of distributed computing is largely that partial results are computed, well, distributed across the nodes. For very big data sets, each node (often) sees only a subset of the data.

As regards the plotting: a scatter plot of more than even a few thousand points (not to mention 100 million) will contain a significant amount of overplotting. You can ‘fix’ that by making the points transparent, by doing a density estimate, or by binning the data (e.g. a hexbin plot or a heatmap). The last option can be done in a distributed fashion: each node bins its own subset on a shared grid, and the master node then aggregates the returned bin counts into a final result and plots it.
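To make the binning idea concrete, here is a minimal sketch in base R. It is only an illustration under assumptions: the "nodes" are simulated as a list of local data frames, the grid limits and the helper name `bin_counts` are made up for the example, and a rectangular grid stands in for hexagonal bins. How you actually ship the function to your workers depends on your cluster setup.

```r
# Hypothetical setup: each "node" holds one chunk of the data.
# The chunks are simulated locally here; in practice each lives on a worker.
set.seed(1)
chunks <- lapply(1:4, function(i) {
  data.frame(x = rnorm(25000), y = rnorm(25000))
})

# The bin edges must be agreed on up front so every node uses the same grid.
# (The range and the 100 x 100 resolution are assumptions for this example.)
x_breaks <- seq(-6, 6, length.out = 101)
y_breaks <- seq(-6, 6, length.out = 101)

# Runs on each node: count the points that fall in each grid cell.
bin_counts <- function(df, x_breaks, y_breaks) {
  xi <- cut(df$x, breaks = x_breaks, include.lowest = TRUE)
  yi <- cut(df$y, breaks = y_breaks, include.lowest = TRUE)
  # Matrix of counts; points outside the grid become NA and are dropped.
  unclass(table(xi, yi))
}

partial <- lapply(chunks, bin_counts, x_breaks = x_breaks, y_breaks = y_breaks)

# Runs on the master: counts are additive, so aggregating is a plain sum.
total <- Reduce(`+`, partial)

# Plot the aggregated counts as a heatmap (log scale tames the dense centre).
image(x_breaks, y_breaks, log1p(total), xlab = "x", ylab = "y")
```

The key point is that each node only returns a 100 x 100 matrix, no matter how many points it holds, and summing the matrices gives the same result as binning all the data in one place.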

Even if you somehow had a node making a scatter plot of 100 million points, what is your output format? Vector graphics (e.g. pdf/svg) would create a huge file, since every point is stored as its own object. Raster graphics (e.g. jpg, png) will effectively aggregate on your behalf when the plot is rasterized, so you might as well control that yourself with bins the size of pixels.
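As a rough sketch of what "bins the size of pixels" could look like in base R: the output size, the plotting range, and the helper name `pixel_counts` below are all assumptions for illustration, and the data are simulated.

```r
# Assumed output size in pixels and a pre-agreed plotting range.
width_px  <- 800
height_px <- 600
x_range <- c(-6, 6)
y_range <- c(-6, 6)

# Map each point to the pixel it would land in and count points per pixel.
pixel_counts <- function(x, y) {
  keep <- x >= x_range[1] & x <= x_range[2] &
          y >= y_range[1] & y <= y_range[2]
  # pmin() puts points lying exactly on the upper edge into the last pixel.
  col <- pmin(floor((x[keep] - x_range[1]) / diff(x_range) * width_px) + 1,
              width_px)
  row <- pmin(floor((y[keep] - y_range[1]) / diff(y_range) * height_px) + 1,
              height_px)
  # Column-major fill, so the result is indexed as counts[x pixel, y pixel].
  matrix(tabulate((row - 1) * width_px + col, nbins = width_px * height_px),
         nrow = width_px, ncol = height_px)
}

# Simulated data standing in for one node's subset.
set.seed(2)
x <- rnorm(1e5)
y <- rnorm(1e5)
counts <- pixel_counts(x, y)

# One cell per pixel; wrap this call in png(...) and dev.off() to write a file.
image(seq(x_range[1], x_range[2], length.out = width_px + 1),
      seq(y_range[1], y_range[2], length.out = height_px + 1),
      log1p(counts), xlab = "x", ylab = "y")
```

The count matrix always has `width_px * height_px` cells, so what needs to be sent to the master and drawn is bounded by the image size rather than by the number of points.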
