Link Spark with IPython Notebook

I have Jupyter installed, and indeed it is simpler than you think:
Install Anaconda for OSX.
Install Jupyter by typing the following line in your terminal:
ilovejobs@mymac:~$ conda install jupyter
Update Jupyter just in case:
ilovejobs@mymac:~$ conda update jupyter
Download Apache Spark and compile it, or download and uncompress Apache Spark … Read more
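Once Spark is uncompressed, a minimal Python sketch of the final linking step might look like the lines below. It assumes Spark lives at ~/spark (adjust the path) and uses the optional findspark package, which is not mentioned in the excerpt, to make pyspark importable from inside the notebook.

# Minimal sketch: make pyspark importable from a Jupyter notebook.
import os

os.environ["SPARK_HOME"] = os.path.expanduser("~/spark")  # hypothetical install path

import findspark
findspark.init()  # adds pyspark to sys.path based on SPARK_HOME

from pyspark import SparkContext
sc = SparkContext("local[*]", "notebook-test")
print(sc.parallelize(range(10)).sum())  # quick sanity check: prints 45
sc.stop()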

Spark: Reading files using different delimiter than new line

You can use textinputformat.record.delimiter to set the delimiter for TextInputFormat, e.g.,
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
val conf = new Configuration(sc.hadoopConfiguration)
conf.set("textinputformat.record.delimiter", "X")
val input = sc.newAPIHadoopFile("file_path", classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
val lines = input.map { case (_, text) => text.toString }
println(lines.collect)
For example, my input is a file containing one … Read more
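For PySpark, the same Hadoop property can be passed through the conf argument of SparkContext.newAPIHadoopFile; a minimal sketch, with the input path /tmp/records.txt and the delimiter "X" as placeholder assumptions:

from pyspark import SparkContext

sc = SparkContext("local[*]", "custom-delimiter")

# Pass the record delimiter through the conf dict; the class names are the
# standard Hadoop TextInputFormat / LongWritable / Text classes.
conf = {"textinputformat.record.delimiter": "X"}
rdd = sc.newAPIHadoopFile(
    "/tmp/records.txt",  # hypothetical input path
    "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "org.apache.hadoop.io.Text",
    conf=conf,
)
# Keep only the record text; the key is the byte offset, as in the Scala version.
lines = rdd.map(lambda kv: kv[1])
print(lines.collect())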

Apache Spark does not delete temporary directories

Three SPARK_WORKER_OPTS exist to support worker application folder cleanup, copied here from the Spark documentation for further reference:
spark.worker.cleanup.enabled, default value false: enables periodic cleanup of worker / application directories. Note that this only affects standalone mode, as YARN works differently. Only the directories of stopped applications are cleaned up.
spark.worker.cleanup.interval, default is 1800, … Read more
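In standalone mode these properties are passed to the worker daemon through SPARK_WORKER_OPTS in conf/spark-env.sh rather than through the application's SparkConf; a minimal sketch (the interval shown is the default from above, and the appDataTtl value is just an illustrative choice):

# conf/spark-env.sh on each worker
export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.interval=1800 -Dspark.worker.cleanup.appDataTtl=604800"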

Filtering a spark dataframe based on date

The following solutions are applicable since Spark 1.5:
For lower than:
// filter data where the date is less than 2015-03-14
data.filter(data("date").lt(lit("2015-03-14")))
For greater than:
// filter data where the date is greater than 2015-03-14
data.filter(data("date").gt(lit("2015-03-14")))
For equality, you can use either equalTo or ===:
data.filter(data("date") === lit("2015-03-14"))
If your DataFrame … Read more
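The same comparisons can be written in PySpark with plain comparison operators on a column; a minimal sketch, where the SparkSession setup and sample rows are assumptions and the string comparison works because the dates are ISO-formatted:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit

spark = SparkSession.builder.appName("date-filter").getOrCreate()

# Hypothetical sample data with an ISO-formatted date string column.
df = spark.createDataFrame(
    [("a", "2015-03-10"), ("b", "2015-03-14"), ("c", "2015-03-20")],
    ["id", "date"],
)

df.filter(col("date") < lit("2015-03-14")).show()   # lower than
df.filter(col("date") > lit("2015-03-14")).show()   # greater than
df.filter(col("date") == lit("2015-03-14")).show()  # equality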

pyspark dataframe filter or include based on list

What it says is that "df.score in l" cannot be evaluated, because df.score gives you a column and "in" is not defined on that column type; use "isin" instead. The code should be like this:
# define a dataframe
rdd = sc.parallelize([(0,1), (0,1), (0,2), (1,2), (1,10), (1,20), (3,18), (3,18), (3,18)])
df = sqlContext.createDataFrame(rdd, ["id", "score"])
# … Read more
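A sketch of how the isin filter might continue from there; since the excerpt is truncated, the SparkSession setup and the list l = [1, 2, 18] are assumptions:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("isin-filter").getOrCreate()

# Same hypothetical data as in the excerpt, built as a DataFrame directly.
df = spark.createDataFrame(
    [(0, 1), (0, 1), (0, 2), (1, 2), (1, 10), (1, 20), (3, 18), (3, 18), (3, 18)],
    ["id", "score"],
)

l = [1, 2, 18]                      # hypothetical list of scores to keep
df.filter(df.score.isin(l)).show()  # isin works where "df.score in l" does not

# The negation uses the ~ operator on the column expression.
df.filter(~df.score.isin(l)).show()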

How to split a list to multiple columns in Pyspark?

It depends on the type of your "list". If it is of type ArrayType():
df = hc.createDataFrame(sc.parallelize([['a', [1,2,3]], ['b', [2,3,4]]]), ["key", "value"])
df.printSchema()
df.show()
root
 |-- key: string (nullable = true)
 |-- value: array (nullable = true)
 |    |-- element: long (containsNull = true)
you can access the values like you would with python using … Read more
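Pulling the array elements out into separate columns might then look like the sketch below; the SparkSession setup and the output column names are assumptions, while getItem is the standard Column method for indexing into an array column:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("split-array").getOrCreate()

# Same hypothetical ArrayType data as in the excerpt.
df = spark.createDataFrame([('a', [1, 2, 3]), ('b', [2, 3, 4])], ["key", "value"])

# getItem(i) pulls one element out of the array column, so each index
# becomes its own top-level column.
split_df = df.select(
    col("key"),
    col("value").getItem(0).alias("value_0"),
    col("value").getItem(1).alias("value_1"),
    col("value").getItem(2).alias("value_2"),
)
split_df.show()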