Scala-Spark Dynamically call groupby and agg with parameter values

Your code is almost correct, with two issues. First, the return type of your function is DataFrame, but the last line is aggregated.show(), which returns Unit; remove the call to show so that aggregated itself is returned, or simply return the result of agg directly. Second, DataFrame.groupBy expects its arguments as col1: String, cols: String*, so you … Read more
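The Scala answer is truncated above, but the underlying idea of passing group-by columns and aggregations as run-time parameters can be sketched in pandas for illustration. The `aggregate` helper, column names, and sample data below are all hypothetical, not part of the original answer:

```python
import pandas as pd

def aggregate(df, group_cols, agg_map):
    # group_cols: list of column names to group by, supplied at run time
    # agg_map: dict mapping column name -> aggregation function name
    return df.groupby(group_cols, as_index=False).agg(agg_map)

df = pd.DataFrame({
    "shop":   ["A", "A", "B", "B"],
    "item":   ["x", "y", "x", "y"],
    "amount": [10, 20, 30, 40],
})

result = aggregate(df, ["shop"], {"amount": "sum"})
# shop A -> 30, shop B -> 70
```

In Spark's Scala API the analogous move is splatting a list into the varargs signature, e.g. `df.groupBy(cols.head, cols.tail: _*)`.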

Crosstab with a large or undefined number of categories

create table vote (Photo integer, Voter text, Decision text); insert into vote values (1, 'Alex', 'Cat'), (1, 'Bob', 'Dog'), (1, 'Carol', 'Cat'), (1, 'Dave', 'Cat'), (1, 'Ed', 'Cat'), (2, 'Alex', 'Cat'), (2, 'Bob', 'Dog'), (2, 'Carol', 'Cat'), (2, 'Dave', 'Cat'), (2, 'Ed', 'Dog'), (3, 'Alex', 'Horse'), (3, 'Bob', 'Horse'), (3, 'Carol', 'Dog'), (3, 'Dave', 'Horse'), … Read more
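When the set of categories is not known in advance, one common approach is to query the distinct categories first and then build the pivot query dynamically. A sketch using Python's sqlite3 with the vote table above (only the rows shown in the excerpt are inserted; the truncated remainder is omitted):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table vote (Photo integer, Voter text, Decision text)")
conn.executemany("insert into vote values (?, ?, ?)", [
    (1, "Alex", "Cat"), (1, "Bob", "Dog"), (1, "Carol", "Cat"),
    (1, "Dave", "Cat"), (1, "Ed", "Cat"),
    (2, "Alex", "Cat"), (2, "Bob", "Dog"), (2, "Carol", "Cat"),
    (2, "Dave", "Cat"), (2, "Ed", "Dog"),
    (3, "Alex", "Horse"), (3, "Bob", "Horse"), (3, "Carol", "Dog"),
    (3, "Dave", "Horse"),
])

# 1. discover the categories at run time
decisions = [row[0] for row in
             conn.execute("select distinct Decision from vote order by Decision")]

# 2. build one conditional-count column per category
#    (category values are interpolated into the SQL, so they must be trusted)
cols = ", ".join(
    f"sum(case when Decision = '{d}' then 1 else 0 end) as {d}"
    for d in decisions)
query = f"select Photo, {cols} from vote group by Photo order by Photo"

rows = conn.execute(query).fetchall()
# each row: (Photo, Cat count, Dog count, Horse count)
```

The same two-step pattern (fetch distinct values, generate the pivot statement) carries over to server-side dynamic SQL in databases that support it.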

Pandas aggregate count distinct

How about either of: >>> df date duration user_id 0 2013-04-01 30 0001 1 2013-04-01 15 0001 2 2013-04-01 20 0002 3 2013-04-02 15 0002 4 2013-04-02 30 0002 >>> df.groupby("date").agg({"duration": np.sum, "user_id": pd.Series.nunique}) duration user_id date 2013-04-01 65 2 2013-04-02 45 1 >>> df.groupby("date").agg({"duration": np.sum, "user_id": lambda x: x.nunique()}) duration user_id date 2013-04-01 65 … Read more
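The excerpt above can be reproduced end to end as a self-contained script; here the string aliases "sum" and "nunique" are used, which name the same aggregations as np.sum and pd.Series.nunique:

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2013-04-01", "2013-04-01", "2013-04-01",
             "2013-04-02", "2013-04-02"],
    "duration": [30, 15, 20, 15, 30],
    "user_id": ["0001", "0001", "0002", "0002", "0002"],
})

# total duration plus count of distinct users per date
result = df.groupby("date").agg({"duration": "sum", "user_id": "nunique"})
# 2013-04-01 -> duration 65, user_id 2
# 2013-04-02 -> duration 45, user_id 1
```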