How to manually set group.id and commit kafka offsets in spark structured streaming?

tl;dr It is not possible to commit any messages to Kafka. Starting with Spark version 3.x you can define the name of the Kafka consumer group; however, this still does not allow you to commit any messages. Since Spark 3.0.0, according to the Structured Kafka Integration Guide, you can provide the ConsumerGroup as an … Read more
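
As a minimal PySpark sketch of that Spark 3.x option (the broker address, topic name and group name below are placeholders, not taken from the answer):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-group-id-demo").getOrCreate()

stream = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder broker
    .option("subscribe", "events")                         # placeholder topic
    # Names the consumer group (Spark 3.0+); Spark still tracks progress via
    # its own checkpoints and does not commit offsets back to Kafka.
    .option("kafka.group.id", "my-consumer-group")
    .load())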

Use collect_list and collect_set in Spark SQL

Spark 2.0+: SPARK-10605 introduced a native collect_list and collect_set implementation. SparkSession with Hive support or HiveContext is no longer required. Spark 2.0-SNAPSHOT (before 2016-05-03): You have to enable Hive support for a given SparkSession: In Scala: val spark = SparkSession.builder .master("local") .appName("testing") .enableHiveSupport() // <- enable Hive support. .getOrCreate() In Python: spark = (SparkSession.builder .enableHiveSupport() .getOrCreate()) … Read more
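
A short runnable PySpark sketch of the two aggregations (the toy data and column names are illustrative, not from the excerpt):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local").appName("collect-demo").getOrCreate()
df = spark.createDataFrame([("a", 1), ("a", 1), ("a", 2), ("b", 3)], ["key", "value"])

df.groupBy("key").agg(
    F.collect_list("value").alias("all_values"),      # keeps duplicates, e.g. [1, 1, 2]
    F.collect_set("value").alias("distinct_values")   # drops duplicates, e.g. [1, 2]
).show()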

Adding a group count column to a PySpark dataframe

When you do a groupBy(), you have to specify the aggregation before you can display the results. For example: import pyspark.sql.functions as f data = [ ('a', 5), ('a', 8), ('a', 7), ('b', 1), ] df = sqlCtx.createDataFrame(data, ["x", "y"]) df.groupBy('x').count().select('x', f.col('count').alias('n')).show() #+---+---+ #| x| n| #+---+---+ #| b| 1| #| a| 3| #+---+---+ Here … Read more
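
The excerpt's example, made self-contained with a SparkSession (sqlCtx in the excerpt is a SQLContext), plus one possible join back to the original rows, which is my addition rather than part of the quoted answer:

from pyspark.sql import SparkSession
import pyspark.sql.functions as f

spark = SparkSession.builder.master("local").appName("group-count-demo").getOrCreate()
data = [("a", 5), ("a", 8), ("a", 7), ("b", 1)]
df = spark.createDataFrame(data, ["x", "y"])

# Per-group counts, as in the excerpt
counts = df.groupBy("x").count().select("x", f.col("count").alias("n"))
counts.show()

# One way to attach the per-group count to every original row (my addition):
df.join(counts, on="x", how="left").show()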

Spark MLlib LDA, how to infer the topics distribution of a new unseen document?

As of Spark 1.5 this functionality has not been implemented for the DistributedLDAModel. What you're going to need to do is convert your model to a LocalLDAModel using the toLocal method and then call the topicDistributions(documents: RDD[(Long, Vector)]) method, where documents are the new (i.e. out-of-training) documents, something like this: newDocuments: RDD[(Long, Vector)] = … … Read more
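
The answer above targets the RDD-based MLlib API in Scala. Purely as a related illustration, here is a sketch of inferring topic distributions for unseen documents with the newer DataFrame-based pyspark.ml API (Spark 2.0+), where transform() adds a topicDistribution column; the toy vectors are made up:

from pyspark.sql import SparkSession
from pyspark.ml.clustering import LDA
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.master("local").appName("lda-demo").getOrCreate()

# Tiny made-up term-count vectors, just to have something to fit
train = spark.createDataFrame(
    [(0, Vectors.dense([1.0, 0.0, 3.0])),
     (1, Vectors.dense([0.0, 2.0, 1.0]))],
    ["id", "features"])
model = LDA(k=2, maxIter=10).fit(train)

# An unseen document: transform() yields its per-topic distribution
unseen = spark.createDataFrame([(2, Vectors.dense([2.0, 1.0, 0.0]))], ["id", "features"])
model.transform(unseen).select("id", "topicDistribution").show(truncate=False)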

Pyspark : forward fill with last observation for a DataFrame

Another workaround to get this working is to try something like this: from pyspark.sql import functions as F from pyspark.sql.window import Window window = ( Window .partitionBy('cookie_id') .orderBy('Time') .rowsBetween(Window.unboundedPreceding, Window.currentRow) ) final = ( joined .withColumn('UserIDFilled', F.last('User_ID', ignorenulls=True).over(window)) ) What this does is construct your window based on the partition key … Read more
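
A self-contained sketch of that forward fill; the column names follow the excerpt, while the sample rows are invented:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.master("local").appName("ffill-demo").getOrCreate()

# Invented sample data with missing User_ID values to fill forward
joined = spark.createDataFrame(
    [("c1", 1, "u1"), ("c1", 2, None), ("c1", 3, None),
     ("c2", 1, None), ("c2", 2, "u2")],
    ["cookie_id", "Time", "User_ID"])

# Window over each cookie, ordered by time, from the first row up to the current one
window = (Window
    .partitionBy("cookie_id")
    .orderBy("Time")
    .rowsBetween(Window.unboundedPreceding, Window.currentRow))

# last(..., ignorenulls=True) carries the most recent non-null User_ID forward
final = joined.withColumn("UserIDFilled", F.last("User_ID", ignorenulls=True).over(window))
final.show()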