bigdata
How to insert big data in Laravel?
As stated, chunks won’t really help you here if the problem is execution time. I think the bulk insert you are trying to use cannot handle that amount of data, so I see 2 options: 1 – Reorganise your code to properly use chunks; this will look something like … Read more
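The chunking idea itself is language-agnostic: batch the rows so no single INSERT statement carries the whole data set. A minimal Python sketch of that pattern using sqlite3 (the `items` table and batch size are illustrative, not from the original question):

```python
import sqlite3
from itertools import islice

def insert_in_chunks(conn, rows, chunk_size=500):
    """Insert an iterable of (name, value) rows in fixed-size batches,
    so memory stays bounded and no statement exceeds engine limits."""
    it = iter(rows)
    cur = conn.cursor()
    while True:
        batch = list(islice(it, chunk_size))
        if not batch:
            break
        cur.executemany("INSERT INTO items (name, value) VALUES (?, ?)", batch)
        conn.commit()  # one commit per chunk, not one giant transaction

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (name TEXT, value INTEGER)")
rows = ((f"row{i}", i) for i in range(2000))  # a lazy generator, never fully in memory
insert_in_chunks(conn, rows, chunk_size=500)
```

Laravel’s `chunk()` / bulk `insert()` helpers follow the same shape: a lazy source feeding fixed-size batches, each committed separately.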
Job queue for Hive action in oozie
A. Oozie specifics Oozie propagates the “regular” Hadoop properties to a “regular” MapReduce Action. But for other types of Action (Shell, Hive, Java, etc.), where Oozie runs a single Mapper task in YARN, it does not consider that a real MapReduce job. Hence it uses a different set of undocumented properties, always prefixed with … Read more
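A hedged sketch of how this looks in a workflow: the launcher (the single-Mapper job Oozie starts) is steered with an `oozie.launcher.`-prefixed property, while the jobs Hive itself spawns use the plain property. Queue names and the action name below are illustrative:

```xml
<action name="hive-node">
  <hive xmlns="uri:oozie:hive-action:0.2">
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <configuration>
      <!-- queue for the Oozie launcher job (the single Mapper) -->
      <property>
        <name>oozie.launcher.mapred.job.queue.name</name>
        <value>launcher_queue</value>
      </property>
      <!-- queue for the MapReduce jobs Hive actually launches -->
      <property>
        <name>mapred.job.queue.name</name>
        <value>hive_queue</value>
      </property>
    </configuration>
    <script>my_script.hql</script>
  </hive>
  <ok to="end"/>
  <error to="fail"/>
</action>
```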
Determining optimal number of Spark partitions based on workers, cores and DataFrame size
Yes, a Spark application has one and only one Driver. What is the relationship between numWorkerNodes and numExecutors? A worker can host multiple executors: think of the worker as the machine/node of your cluster, and the executor as a process (running on a core) that lives on that worker. So … Read more
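A common heuristic (an assumption here, not something the excerpt states outright) is to size the number of partitions at roughly 2–4 tasks per available core, so every core stays busy and slow tasks can be rebalanced. As plain arithmetic:

```python
def suggested_partitions(num_executors, cores_per_executor, factor=3):
    """Heuristic partition count: a small multiple (2-4x) of the total
    cores available across all executors."""
    total_cores = num_executors * cores_per_executor
    return total_cores * factor

# e.g. 5 executors with 4 cores each, factor 3 -> 60 partitions
print(suggested_partitions(5, 4))  # → 60
```

In PySpark you would then pass this number to e.g. `df.repartition(n)`; the right `factor` depends on skew and DataFrame size, which is why it stays a tunable.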
What methods can we use to reshape VERY large data sets?
If your real data is as regular as your sample data, we can be quite efficient by noticing that reshaping a matrix is really just changing its dim attribute. First, on very small data: library(data.table); library(microbenchmark); library(tidyr); matrix_spread <- function(df1, key, value){ unique_ids <- unique(df1[[key]]); mat <- matrix(df1[[value]], ncol = length(unique_ids), byrow = TRUE); df2 <- … Read more
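The same trick exists outside R: if the long-format values are already laid out in a regular order, "reshaping" is just reinterpreting the flat vector's dimensions, with no data movement. A Python/numpy sketch of that idea (the key-fastest row ordering mirrors `byrow = TRUE`; the function name echoes the R excerpt but is otherwise an illustration):

```python
import numpy as np

def matrix_spread(values, n_keys):
    """Long-to-wide by changing shape metadata only: reshape on a
    contiguous array does not copy or shuffle the underlying data."""
    return np.asarray(values).reshape(-1, n_keys)

long_values = [1, 2, 3, 4, 5, 6]      # two ids x three keys, key-fastest order
wide = matrix_spread(long_values, n_keys=3)
# wide is [[1, 2, 3], [4, 5, 6]] -- same buffer, new dimensions
```

This is why it beats generic spread/pivot machinery on very large, regular data: the generic tools must match keys row by row, while this is O(1) metadata surgery.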
PySpark DataFrames – way to enumerate without converting to Pandas?
It doesn’t work because: the second argument of withColumn should be a Column, not a collection, and np.array won’t work here; and when you pass “index in indexes” as a SQL expression to where, indexes is out of scope and is not resolved as a valid identifier. PySpark >= 1.4.0: you can add row numbers using … Read more
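One standard way to enumerate rows without going through Pandas is `RDD.zipWithIndex()`, which pairs each row with a stable 0-based index. A pure-Python sketch of what it produces (the dict rows are illustrative; the PySpark call is shown in a comment):

```python
def zip_with_index(rows):
    """Mimics RDD.zipWithIndex(): pair each element with its position.
    In PySpark: df.rdd.zipWithIndex() yields (Row, index) tuples."""
    return [(row, i) for i, row in enumerate(rows)]

rows = [{"name": "a"}, {"name": "b"}, {"name": "c"}]
indexed = zip_with_index(rows)
# [({'name': 'a'}, 0), ({'name': 'b'}, 1), ({'name': 'c'}, 2)]
```

With the index attached as a real column, a filter like “index in indexes” becomes an ordinary column comparison instead of an unresolved identifier in a SQL string.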
Is Spark’s KMeans unable to handle bigdata?
I think the ‘hanging’ is because your executors keep dying. As I mentioned in a side conversation, this code runs fine for me, locally and on a cluster, in PySpark and Scala. However, it takes a lot longer than it should; almost all of the time is spent in k-means|| initialization. I opened https://issues.apache.org/jira/browse/SPARK-17389 to track … Read more
Strategies for reading in CSV files in pieces?
After reviewing this thread, I noticed a conspicuous solution to this problem was not mentioned: use connections! 1) Open a connection to your file: con = file(“file.csv”, “r”) 2) Read the file in chunks with read.csv: read.csv(con, nrows = CHUNK_SIZE, …) Side note: defining colClasses will greatly speed things up. Make sure to define unwanted columns as … Read more
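The key property of a connection is that it keeps its read position between calls, so each chunk resumes where the last one stopped. The same pattern in Python, sketched with the stdlib csv module (the sample data is made up):

```python
import csv
import io
from itertools import islice

def read_in_chunks(fileobj, chunk_size):
    """Keep one open reader (the 'connection') and yield fixed-size
    chunks of rows; each call to islice resumes at the current position."""
    reader = csv.reader(fileobj)
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            break
        yield chunk

data = io.StringIO("a,1\nb,2\nc,3\nd,4\ne,5\n")
chunks = list(read_in_chunks(data, chunk_size=2))
# three chunks: two full ones of 2 rows, then a final chunk of 1 row
```

This is also why re-calling `read.csv("file.csv", nrows=...)` without a connection is wasteful: a fresh path re-opens the file at the start every time, while a connection advances through it once.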