Pandas dataframe groupby text value that occurs in two columns

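This looks like a graph problem: treat each row as an edge between the values in v1 and v2, and the groups you want are the connected components of that graph.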
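To make the idea concrete, here is a minimal sketch of the grouping in plain Python, using a small union-find. The function name `connected_groups` and the sample pairs are made up for illustration; they are not from the question's data.

```python
def connected_groups(pairs):
    """Group values that are linked, directly or indirectly, by any pair."""
    parent = {}

    def find(x):
        # Walk up to the representative of x's group, shortening the path.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        # Each pair is an edge: merge the two groups it connects.
        parent[find(a)] = find(b)

    groups = {}
    for value in list(parent):
        groups.setdefault(find(value), set()).add(value)
    return list(groups.values())


# Hypothetical sample rows (v1, v2), not the asker's real data.
pairs = [("delay", "increase"), ("increase", "decrease"), ("be", "belong")]
groups = connected_groups(pairs)
```

Here "delay" and "decrease" never share a row, but they end up in the same group because both are linked to "increase".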

You could try to use networkx:

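import networkx as nx

# Each row of df becomes an edge between its 'v1' and 'v2' values.
G = nx.from_pandas_edgelist(df, 'v1', 'v2')

# connected_components returns a generator, so materialize it as a list of sets.
clusters = list(nx.connected_components(G))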

output:

[{'be', 'belong'}, {'delay', 'increase', 'decrease'}, {'analyze', 'assay'},
 {'report', 'bespeak', 'circulate'}, {'induce', 'generate'}, {'trip', 'cause'},
 {'distinguish', 'isolate'}, {'infect', 'give'}, {'prove', 'result'},
 {'intercede', 'describe', 'explain'}, {'affect', 'expose'}, {'restrict', 'suppress'}]
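To get from these sets back to a groupby on the dataframe, one option is to label each row with the index of the component it falls in. The following is only a sketch under assumptions: the sample rows, the `cluster` column name, and the `word_to_cluster` helper dict are invented for illustration; only the column names v1 and v2 come from the code above.

```python
import networkx as nx
import pandas as pd

# Hypothetical sample data; only the column names 'v1' and 'v2' come from the answer.
df = pd.DataFrame({
    "v1": ["delay", "increase", "be", "report"],
    "v2": ["increase", "decrease", "belong", "circulate"],
})

G = nx.from_pandas_edgelist(df, "v1", "v2")

# Map every word to the index of the component it belongs to.
word_to_cluster = {
    word: cluster_id
    for cluster_id, component in enumerate(nx.connected_components(G))
    for word in component
}

# Both ends of a row lie in the same component, so 'v1' alone determines the label.
df["cluster"] = df["v1"].map(word_to_cluster)

# Any groupby aggregation now works per cluster, for example counting rows.
rows_per_cluster = df.groupby("cluster").size()
```

The cluster ids themselves are arbitrary (they follow the order in which networkx yields the components), so they are useful as group keys but carry no meaning of their own.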

As a graph:

[Image: the connected components drawn as a graph]

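Small function to plot the graph in Jupyter (requires pygraphviz):

def nxplot(G):
    # to_agraph needs the pygraphviz package and a Graphviz installation.
    from networkx.drawing.nx_agraph import to_agraph
    from IPython.display import Image
    A = to_agraph(G)
    A.layout('dot')  # layout computed by Graphviz's dot program
    A.draw('/tmp/graph.png')
    return Image(filename='/tmp/graph.png')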
