Job queue for Hive action in Oozie

A. Oozie specifics: Oozie propagates the “regular” Hadoop properties to a “regular” MapReduce action. But for other action types (Shell, Hive, Java, etc.), where Oozie runs a single mapper task in YARN as a launcher, it does not consider that to be a real MapReduce job. Hence it uses a different set of undocumented properties, always prefixed with … Read more
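A minimal sketch of how such a launcher-level override typically sits inside a Hive action's configuration block; the oozie.launcher. prefix and the queue names below are assumptions drawn from common Oozie usage, not from the truncated excerpt:

```xml
<action name="hive-job">
  <hive xmlns="uri:oozie:hive-action:0.5">
    <!-- job-tracker / name-node elements omitted for brevity -->
    <configuration>
      <!-- Queue for the single-mapper launcher job that Oozie itself submits
           (assumed oozie.launcher. prefix) -->
      <property>
        <name>oozie.launcher.mapred.job.queue.name</name>
        <value>launcher_queue</value>
      </property>
      <!-- Queue for the MapReduce jobs the Hive query actually spawns -->
      <property>
        <name>mapred.job.queue.name</name>
        <value>hive_queue</value>
      </property>
    </configuration>
    <script>query.hql</script>
  </hive>
  <ok to="end"/>
  <error to="fail"/>
</action>
```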

Determining optimal number of Spark partitions based on workers, cores and DataFrame size

Yes, a Spark application has one and only one Driver. What is the relationship between numWorkerNodes and numExecutors? A worker can host multiple executors: think of the worker as a machine/node of your cluster and of the executor as a process (executing on a core) that runs on that worker. So … Read more
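A back-of-the-envelope sketch of that hierarchy; the cluster sizes are assumed values for illustration, and the 2-3 tasks-per-core rule of thumb comes from Spark's tuning guide, not from the excerpt:

```python
# Cluster topology (assumed values for illustration)
num_worker_nodes = 4        # machines in the cluster
executors_per_worker = 2    # executor processes hosted on each worker
cores_per_executor = 4      # concurrent tasks each executor can run

num_executors = num_worker_nodes * executors_per_worker   # 8
total_cores = num_executors * cores_per_executor          # 32

# Spark's tuning guide suggests 2-3 tasks per CPU core, so every core
# stays busy and slow tasks can be rebalanced onto idle ones.
num_partitions = total_cores * 3                          # 96

print(num_executors, total_cores, num_partitions)
# df = df.repartition(num_partitions)  # applied to a DataFrame `df`
```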

What methods can we use to reshape VERY large data sets?

If your real data is as regular as your sample data, we can be quite efficient by noticing that reshaping a matrix is really just changing its dim attribute. First, on very small data, load library(data.table), library(microbenchmark), and library(tidyr), then define the matrix_spread(df1, key, value) helper, reflowed in the sketch below. … Read more
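The helper from the excerpt, reflowed into runnable form. The excerpt truncates at `df2 <-`, so the last three lines (wrapping the matrix in a data.frame named by the unique keys) are a plausible completion, not the author's verbatim code:

```r
library(data.table)       # loaded in the excerpt, presumably for the
library(microbenchmark)   # benchmark comparisons that follow the cut
library(tidyr)

matrix_spread <- function(df1, key, value) {
  unique_ids <- unique(df1[[key]])
  # The core trick: "spreading" a regular long table is just laying the
  # value vector out row-wise with one column per unique key.
  mat <- matrix(df1[[value]], ncol = length(unique_ids), byrow = TRUE)
  df2 <- data.frame(mat)        # assumed completion
  colnames(df2) <- unique_ids   # assumed completion
  df2                           # assumed completion
}
```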

Is Spark’s KMeans unable to handle big data?

I think the ‘hanging’ is because your executors keep dying. As I mentioned in a side conversation, this code runs fine for me, locally and on a cluster, in PySpark and Scala. However, it takes a lot longer than it should, and almost all of that time is spent in k-means|| initialization. I opened https://issues.apache.org/jira/browse/SPARK-17389 to track … Read more
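Given that diagnosis, one mitigation consistent with it is to sidestep the expensive k-means|| phase entirely. A hedged PySpark sketch: initMode is a real pyspark.ml.clustering.KMeans parameter, but the input path, k, and the rest of the wiring are assumptions for illustration:

```python
from pyspark.ml.clustering import KMeans
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kmeans-init").getOrCreate()

# Hypothetical input: a DataFrame with a precomputed 'features' vector column.
df = spark.read.parquet("features.parquet")

# The default initMode is "k-means||"; switching to "random" (or lowering
# initSteps) avoids the initialization phase that dominated the runtime.
kmeans = KMeans(k=100, maxIter=20, initMode="random", seed=42)
model = kmeans.fit(df)
centers = model.clusterCenters()
```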

Strategies for reading in CSV files in pieces?

After reviewing this thread, I noticed a conspicuous solution to this problem was not mentioned: use connections! 1) Open a connection to your file: con <- file("file.csv", "r") 2) Read chunks with read.csv: read.csv(con, nrows = chunk_size, ...) (each call picks up where the previous one stopped; see the sketch below). Side note: defining colClasses will greatly speed things up. Make sure to define unwanted columns as … Read more
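The excerpt's two steps made concrete; the chunk size and the header handling on the second read are assumptions for illustration. Because the connection stays open between calls, each read.csv resumes at the next unread row (the first read consumes the header, so later reads pass header = FALSE and reuse the column names):

```r
con <- file("file.csv", "r")
chunk1 <- read.csv(con, nrows = 10000)             # rows 1-10000, header consumed
chunk2 <- read.csv(con, nrows = 10000,             # resumes at row 10001
                   header = FALSE, col.names = names(chunk1))
close(con)
```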