The default aggfunc
in pivot_table
is np.sum
and it doesn’t know what to do with strings and you haven’t indicated what the index should be properly. Trying something like:
pivot_table = unified_df.pivot_table(index=['id', 'contact_id'],
columns="question",
values="response_answer",
aggfunc=lambda x: ' '.join(x))
This explicitly sets one row per id, contact_id
pair and pivots the set of response_answer
values on question
. The aggfunc
just assures that if you have multiple answers to the same question in the raw data that we just concatenate them together with spaces. The syntax of pivot_table
might vary depending on your pandas version.
Here’s a quick example:
In [24]: import pandas as pd
In [25]: import random
In [26]: df = pd.DataFrame({'id':[100*random.randint(10, 50) for _ in range(100)], 'question': [str(random.randint(0,3)) for _ in range(100)], 'response': [str(random.randint(100,120)) for _ in range(100)]})
In [27]: df.head()
Out[27]:
id question response
0 3100 1 116
1 4500 2 113
2 5000 1 120
3 3900 2 103
4 4300 0 117
In [28]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100 entries, 0 to 99
Data columns (total 3 columns):
id 100 non-null int64
question 100 non-null object
response 100 non-null object
dtypes: int64(1), object(2)
memory usage: 3.1+ KB
In [29]: df.pivot_table(index='id', columns="question", values="response", aggfunc=lambda x: ' '.join(x)).head()
Out[29]:
question 0 1 2 3
id
1000 110 120 NaN 100 NaN
1100 NaN 106 108 104 NaN
1200 104 113 119 NaN 101
1300 102 NaN 116 108 120
1400 NaN NaN 116 NaN