How to split a dataframe string column into two columns?

TL;DR version:

For the simple case of:

  • I have a text column with a delimiter and I want two columns

The simplest solution is:

df[['A', 'B']] = df['AB'].str.split(' ', 1, expand=True)

You must use expand=True if your strings have a non-uniform number of splits and you want None to replace the missing values.

Notice how, in either case, the .tolist() method is not necessary. Neither is zip().

In detail:

Andy Hayden’s solution is most excellent in demonstrating the power of the str.extract() method.

But for a simple split over a known separator (like, splitting by dashes, or splitting by whitespace), the .str.split() method is enough1. It operates on a column (Series) of strings, and returns a column (Series) of lists:

>>> import pandas as pd
>>> df = pd.DataFrame({'AB': ['A1-B1', 'A2-B2']})
>>> df

      AB
0  A1-B1
1  A2-B2
>>> df['AB_split'] = df['AB'].str.split('-')
>>> df

      AB  AB_split
0  A1-B1  [A1, B1]
1  A2-B2  [A2, B2]

1: If you’re unsure what the first two parameters of .str.split() do,
I recommend the docs for the plain Python version of the method.

But how do you go from:

  • a column containing two-element lists

to:

  • two columns, each containing the respective element of the lists?

Well, we need to take a closer look at the .str attribute of a column.

It’s a magical object that is used to collect methods that treat each element in a column as a string, and then apply the respective method in each element as efficient as possible:

>>> upper_lower_df = pd.DataFrame({"U": ["A", "B", "C"]})
>>> upper_lower_df

   U
0  A
1  B
2  C
>>> upper_lower_df["L"] = upper_lower_df["U"].str.lower()
>>> upper_lower_df

   U  L
0  A  a
1  B  b
2  C  c

But it also has an “indexing” interface for getting each element of a string by its index:

>>> df['AB'].str[0]

0    A
1    A
Name: AB, dtype: object

>>> df['AB'].str[1]

0    1
1    2
Name: AB, dtype: object

Of course, this indexing interface of .str doesn’t really care if each element it’s indexing is actually a string, as long as it can be indexed, so:

>>> df['AB'].str.split('-', 1).str[0]

0    A1
1    A2
Name: AB, dtype: object

>>> df['AB'].str.split('-', 1).str[1]

0    B1
1    B2
Name: AB, dtype: object

Then, it’s a simple matter of taking advantage of the Python tuple unpacking of iterables to do

>>> df['A'], df['B'] = df['AB'].str.split('-', 1).str
>>> df

      AB  AB_split   A   B
0  A1-B1  [A1, B1]  A1  B1
1  A2-B2  [A2, B2]  A2  B2

Of course, getting a DataFrame out of splitting a column of strings is so useful that the .str.split() method can do it for you with the expand=True parameter:

>>> df['AB'].str.split('-', 1, expand=True)

    0   1
0  A1  B1
1  A2  B2

So, another way of accomplishing what we wanted is to do:

>>> df = df[['AB']]
>>> df

      AB
0  A1-B1
1  A2-B2

>>> df.join(df['AB'].str.split('-', 1, expand=True).rename(columns={0:'A', 1:'B'}))

      AB   A   B
0  A1-B1  A1  B1
1  A2-B2  A2  B2

The expand=True version, although longer, has a distinct advantage over the tuple unpacking method. Tuple unpacking doesn’t deal well with splits of different lengths:

>>> df = pd.DataFrame({'AB': ['A1-B1', 'A2-B2', 'A3-B3-C3']})
>>> df
         AB
0     A1-B1
1     A2-B2
2  A3-B3-C3
>>> df['A'], df['B'], df['C'] = df['AB'].str.split('-')
Traceback (most recent call last):
  [...]    
ValueError: Length of values does not match length of index
>>> 

But expand=True handles it nicely by placing None in the columns for which there aren’t enough “splits”:

>>> df.join(
...     df['AB'].str.split('-', expand=True).rename(
...         columns={0:'A', 1:'B', 2:'C'}
...     )
... )
         AB   A   B     C
0     A1-B1  A1  B1  None
1     A2-B2  A2  B2  None
2  A3-B3-C3  A3  B3    C3

Leave a Comment