Link Spark with IPython Notebook

I have Jupyter installed, and indeed it is simpler than you think:
Install Anaconda for OSX.
Install Jupyter by typing the following line in your terminal:
ilovejobs@mymac:~$ conda install jupyter
Update Jupyter just in case:
ilovejobs@mymac:~$ conda update jupyter
Download Apache Spark and compile it, or download and uncompress Apache Spark … Read more
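Once Spark is uncompressed, a minimal Python sketch of the final linking step might look like the lines below. It assumes Spark lives at ~/spark (adjust the path) and uses the optional findspark package, which is not mentioned in the excerpt, to make pyspark importable from inside the notebook.

# Minimal sketch: make pyspark importable from a Jupyter notebook.
import os

os.environ["SPARK_HOME"] = os.path.expanduser("~/spark")  # hypothetical install path

import findspark
findspark.init()  # adds pyspark to sys.path based on SPARK_HOME

from pyspark import SparkContext
sc = SparkContext("local[*]", "notebook-test")
print(sc.parallelize(range(10)).sum())  # quick sanity check: prints 45
sc.stop()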

Spark: Reading files using different delimiter than new line

You can use textinputformat.record.delimiter to set the delimiter for TextInputFormat, e.g.,
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
val conf = new Configuration(sc.hadoopConfiguration)
conf.set("textinputformat.record.delimiter", "X")
val input = sc.newAPIHadoopFile("file_path", classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
val lines = input.map { case (_, text) => text.toString }
println(lines.collect)
For example, my input is a file containing one … Read more
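For PySpark, the same Hadoop property can be passed through the conf argument of SparkContext.newAPIHadoopFile; a minimal sketch, with the input path /tmp/records.txt and the delimiter "X" as placeholder assumptions:

from pyspark import SparkContext

sc = SparkContext("local[*]", "custom-delimiter")

# Pass the record delimiter through the conf dict; the class names are the
# standard Hadoop TextInputFormat / LongWritable / Text classes.
conf = {"textinputformat.record.delimiter": "X"}
rdd = sc.newAPIHadoopFile(
    "/tmp/records.txt",  # hypothetical input path
    "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "org.apache.hadoop.io.Text",
    conf=conf,
)
# Keep only the record text; the key is the byte offset, as in the Scala version.
lines = rdd.map(lambda kv: kv[1])
print(lines.collect())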

Apache Spark does not delete temporary directories

Three SPARK_WORKER_OPTS exist to support worker application folder cleanup, copied here from the Spark documentation for further reference:
spark.worker.cleanup.enabled, default value false: enables periodic cleanup of worker / application directories. Note that this only affects standalone mode, as YARN works differently. Only the directories of stopped applications are cleaned up.
spark.worker.cleanup.interval, default is 1800, … Read more
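In standalone mode these properties are passed to the worker daemon through SPARK_WORKER_OPTS in conf/spark-env.sh rather than through the application's SparkConf; a minimal sketch (the interval shown is the default from above, and the appDataTtl value is just an illustrative choice):

# conf/spark-env.sh on each worker
export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.interval=1800 -Dspark.worker.cleanup.appDataTtl=604800"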

Filtering a spark dataframe based on date

The following solutions are applicable since Spark 1.5:
For lower than:
// filter data where the date is less than 2015-03-14
data.filter(data("date").lt(lit("2015-03-14")))
For greater than:
// filter data where the date is greater than 2015-03-14
data.filter(data("date").gt(lit("2015-03-14")))
For equality, you can use either equalTo or ===:
data.filter(data("date") === lit("2015-03-14"))
If your DataFrame … Read more
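The same comparisons can be written in PySpark with plain comparison operators on a column; a minimal sketch, where the SparkSession setup and sample rows are assumptions and the string comparison works because the dates are ISO-formatted:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit

spark = SparkSession.builder.appName("date-filter").getOrCreate()

# Hypothetical sample data with an ISO-formatted date string column.
df = spark.createDataFrame(
    [("a", "2015-03-10"), ("b", "2015-03-14"), ("c", "2015-03-20")],
    ["id", "date"],
)

df.filter(col("date") < lit("2015-03-14")).show()   # lower than
df.filter(col("date") > lit("2015-03-14")).show()   # greater than
df.filter(col("date") == lit("2015-03-14")).show()  # equality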

pyspark dataframe filter or include based on list

What it says is that "df.score in l" cannot be evaluated, because df.score gives you a column and "in" is not defined on that column type; use "isin" instead. The code should be like this:
# define a dataframe
rdd = sc.parallelize([(0,1), (0,1), (0,2), (1,2), (1,10), (1,20), (3,18), (3,18), (3,18)])
df = sqlContext.createDataFrame(rdd, ["id", "score"])
# … Read more
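A sketch of how the isin filter might continue from there; since the excerpt is truncated, the SparkSession setup and the list l = [1, 2, 18] are assumptions:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("isin-filter").getOrCreate()

# Same hypothetical data as in the excerpt, built as a DataFrame directly.
df = spark.createDataFrame(
    [(0, 1), (0, 1), (0, 2), (1, 2), (1, 10), (1, 20), (3, 18), (3, 18), (3, 18)],
    ["id", "score"],
)

l = [1, 2, 18]                      # hypothetical list of scores to keep
df.filter(df.score.isin(l)).show()  # isin works where "df.score in l" does not

# The negation uses the ~ operator on the column expression.
df.filter(~df.score.isin(l)).show()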

How to split a list to multiple columns in Pyspark?

It depends on the type of your "list". If it is of type ArrayType():
df = hc.createDataFrame(sc.parallelize([['a', [1,2,3]], ['b', [2,3,4]]]), ["key", "value"])
df.printSchema()
df.show()
root
 |-- key: string (nullable = true)
 |-- value: array (nullable = true)
 |    |-- element: long (containsNull = true)
you can access the values like you would with python using … Read more
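Pulling the array elements out into separate columns might then look like the sketch below; the SparkSession setup and the output column names are assumptions, while getItem is the standard Column method for indexing into an array column:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("split-array").getOrCreate()

# Same hypothetical ArrayType data as in the excerpt.
df = spark.createDataFrame([('a', [1, 2, 3]), ('b', [2, 3, 4])], ["key", "value"])

# getItem(i) pulls one element out of the array column, so each index
# becomes its own top-level column.
split_df = df.select(
    col("key"),
    col("value").getItem(0).alias("value_0"),
    col("value").getItem(1).alias("value_1"),
    col("value").getItem(2).alias("value_2"),
)
split_df.show()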