The simplest way I can think of is to use collect_list together with concat_ws:
import pyspark.sql.functions as f

# Collect all col2 values per group, then join them into a single comma-separated string
df.groupBy("col1").agg(f.concat_ws(", ", f.collect_list(df.col2)))
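For context, here is a minimal self-contained sketch of the same idea, assuming a local SparkSession and the column names col1/col2 from the snippet above (the sample data is made up for illustration):

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as f

    spark = SparkSession.builder.master("local[*]").getOrCreate()

    df = spark.createDataFrame(
        [("a", "x"), ("a", "y"), ("b", "z")],
        ["col1", "col2"],
    )

    result = df.groupBy("col1").agg(
        f.concat_ws(", ", f.collect_list("col2")).alias("col2_concat")
    )
    result.show()
    # +----+-----------+
    # |col1|col2_concat|
    # +----+-----------+
    # |   a|       x, y|
    # |   b|          z|
    # +----+-----------+
    # (row order may vary)

Keep in mind that collect_list does not guarantee any particular ordering of the collected values; if the order of the concatenated strings matters, you need to enforce it separately.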