Spark DataFrame: count distinct values of every column

In PySpark you could do something like this, using countDistinct():

from pyspark.sql.functions import col, countDistinct

# one countDistinct aggregate per column, aliased with the column name
df.agg(*(countDistinct(col(c)).alias(c) for c in df.columns))
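
The result is a single-row DataFrame holding one distinct count per column. A minimal, self-contained sketch (the sample data and session setup here are illustrative, not from the original snippet):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, countDistinct

spark = SparkSession.builder.getOrCreate()

# hypothetical two-column sample data
df = spark.createDataFrame([("a", 1), ("b", 1), ("a", 2)], ["k", "v"])

df.agg(*(countDistinct(col(c)).alias(c) for c in df.columns)).show()
# +---+---+
# |  k|  v|
# +---+---+
# |  2|  2|
# +---+---+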

Similarly, in Scala:

import org.apache.spark.sql.functions.{col, countDistinct}

// one countDistinct aggregate per column, aliased with the column name
df.select(df.columns.map(c => countDistinct(col(c)).alias(c)): _*)

If you want to speed things up at the potential cost of some accuracy, you can use approx_count_distinct() instead (it replaces the older approxCountDistinct(), which was deprecated in Spark 2.1). It returns a HyperLogLog++ estimate rather than an exact count.
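
For example, a PySpark sketch of the same per-column aggregation (the optional rsd argument bounds the maximum relative standard deviation of each estimate; 0.05 is the documented default):

from pyspark.sql.functions import approx_count_distinct, col

# approximate distinct counts; rsd caps the estimation error
df.agg(*(approx_count_distinct(col(c), rsd=0.05).alias(c) for c in df.columns)).show()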
