Counting the number of common words between two columns in Python pandas

Time: 2017-12-10 22:22:39

Tags: python pandas

Let's say I have the following table in Python pandas:

friend_description  friend_definition
    James is dumb      dumb dude
    Jacob is smart     smart guy
    Jane is pretty     she looks pretty
    Susan is rich      she is rich

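Here, in the first row, the word 'dumb' appears in both columns. In the second row, 'smart' appears in both columns. In the third row, 'pretty' appears in both columns, and in the last row, 'is' and 'rich' appear in both columns. I want to create the following columns: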

friend_description  friend_definition      word_overlap    overlap_count
    James is dumb      dumb dude              dumb             1
    Jacob is smart     smart guy              smart            1
    Jane is pretty     she looks pretty       pretty           1
    Susan is rich      she is rich            is rich          2

I could use a for loop to build the new columns by hand, but I'm wondering whether pandas has a function that makes this kind of operation smoother.
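To try the answers below, the example table can be rebuilt along these lines. This is a minimal sketch; the column names and values are copied from the table in the question, and the variable name `df` is the one the answers assume.

```python
import pandas as pd

# Example data copied from the table in the question.
df = pd.DataFrame({
    "friend_description": [
        "James is dumb",
        "Jacob is smart",
        "Jane is pretty",
        "Susan is rich",
    ],
    "friend_definition": [
        "dumb dude",
        "smart guy",
        "she looks pretty",
        "she is rich",
    ],
})
print(df)
```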

3 Answers:

Answer 0: (score: 4)

A simple list comprehension seems to be the fastest approach when working with strings like this:


In [112]: df['word_overlap'] = [set(x[0].split()) & set(x[1].split()) for x in df.values]

In [113]: df['overlap_count'] = df['word_overlap'].str.len()

In [114]: df
Out[114]:
  friend_description friend_definition word_overlap  overlap_count
0      James is dumb         dumb dude       {dumb}              1
1     Jacob is smart         smart guy      {smart}              1
2     Jane is pretty  she looks pretty     {pretty}              1
3      Susan is rich       she is rich   {rich, is}              2

The apply(..., axis=1) method:

In [85]: df['word_overlap'] = df.apply(lambda r: set(r['friend_description'].split()) &
    ...:                                          set(r['friend_definition'].split()),
    ...:                                axis=1)
    ...:

In [86]: df['overlap_count'] = df['word_overlap'].str.len()

In [87]: df
Out[87]:
  friend_description friend_definition word_overlap  overlap_count
0      James is dumb         dumb dude       {dumb}              1
1     Jacob is smart         smart guy      {smart}              1
2     Jane is pretty  she looks pretty     {pretty}              1
3      Susan is rich       she is rich   {rich, is}              2

The apply().apply(..., axis=1) method, shown on the four-row example (the speed comparison itself was timed against a 40,000-row DF):

In [23]: df['word_overlap'] = (df.apply(lambda x: x.str.split(expand=False))
    ...:                         .apply(lambda r: set(r['friend_description']) & set(r['friend_definition']),
    ...:                                axis=1))
    ...:

In [24]: df['overlap_count'] = df['word_overlap'].str.len()

In [25]: df
Out[25]:
  friend_description friend_definition word_overlap  overlap_count
0      James is dumb         dumb dude       {dumb}              1
1     Jacob is smart         smart guy      {smart}              1
2     Jane is pretty  she looks pretty     {pretty}              1
3      Susan is rich       she is rich   {is, rich}              2
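One rough way to compare the speed of these approaches yourself is sketched below. It assumes the four example rows repeated 10,000 times are a fair stand-in for a 40,000-row DF; the helper names `with_list_comprehension` and `with_apply` are made up for this illustration, and the numbers will vary by machine and pandas version.

```python
import timeit

import pandas as pd

small = pd.DataFrame({
    "friend_description": ["James is dumb", "Jacob is smart", "Jane is pretty", "Susan is rich"],
    "friend_definition": ["dumb dude", "smart guy", "she looks pretty", "she is rich"],
})
# Repeat the four example rows 10,000 times to get 40,000 rows.
big = pd.concat([small] * 10_000, ignore_index=True)


def with_list_comprehension(df):
    # Iterate over the underlying 2-column array, one (description, definition) pair per row.
    return [set(a.split()) & set(b.split()) for a, b in df.values]


def with_apply(df):
    # Row-wise apply: each r is a Series holding one row.
    return df.apply(
        lambda r: set(r["friend_description"].split()) & set(r["friend_definition"].split()),
        axis=1,
    )


runs = 3
t_listcomp = timeit.timeit(lambda: with_list_comprehension(big), number=runs)
t_apply = timeit.timeit(lambda: with_apply(big), number=runs)
print(f"list comprehension: {t_listcomp / runs:.3f} s per run")
print(f"apply(axis=1):      {t_apply / runs:.3f} s per run")
```

Both helpers compute the same sets, so any difference in the printed timings comes from how the rows are iterated rather than from the set intersection itself.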

Answer 1: (score: 3)

A one-liner... because, why not? In any case, I've upvoted @MaxU's answer. I might as well leave one of my own.

df.join(
    df.applymap(lambda x: set(x.split())).pipe(
        lambda d: d.friend_definition - (d.friend_definition - d.friend_description)
    ).pipe(lambda s: pd.DataFrame(dict(word_overlap=s, overlap_count=s.str.len())))
)

  friend_description friend_definition  overlap_count word_overlap
0      James is dumb         dumb dude              1       {dumb}
1     Jacob is smart         smart guy              1      {smart}
2     Jane is pretty  she looks pretty              1     {pretty}
3      Susan is rich       she is rich              2   {rich, is}
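The one-liner leans on a set identity: removing from A everything that is not in B leaves exactly the elements A shares with B, so A - (A - B) equals A & B. A small plain-Python check on the last row of the example (the variable names here are only for illustration):

```python
a = set("she is rich".split())    # words in friend_definition
b = set("Susan is rich".split())  # words in friend_description

# a - b keeps the words found only in a; removing those from a leaves the shared words.
via_difference = a - (a - b)
via_intersection = a & b

print(via_difference == via_intersection)  # True
print(sorted(via_difference))              # ['is', 'rich']
```

The subtraction form is used in the answer because `-` between two object-dtype Series of sets is applied element by element.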

Answer 2: (score: 1)

Easier for mere mortals (like me) to understand?

>>> import pandas as pd
>>> df = pd.read_csv('user98235.csv', sep='\t')
>>> def f(columns):
...     f_desc, f_def = columns[0], columns[1]
...     common = set(f_desc.split()).intersection(set(f_def.split()))
...     return common, len(common)
... 
>>> df[['word_overlap', 'overlap_count']] = df.apply(f, axis=1, raw=True).apply(pd.Series)
>>> df
  friend_description friend_definition word_overlap  overlap_count
0      James is dumb         dumb dude       {dumb}              1
1     Jacob is smart         smart guy      {smart}              1
2     Jane is pretty  she looks pretty     {pretty}              1
3      Susan is rich       she is rich   {is, rich}              2
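The file 'user98235.csv' is not part of this page, so the sketch below assumes it is a tab-separated file with the two columns from the question and reads an in-memory copy instead. It also returns the shared words as a space-joined string in the order they occur in friend_description, which matches the 'is rich' form the question asked for rather than a set. The helper name `ordered_overlap` is hypothetical.

```python
import io

import pandas as pd

# Hypothetical stand-in for 'user98235.csv': tab-separated with one header row.
tsv = (
    "friend_description\tfriend_definition\n"
    "James is dumb\tdumb dude\n"
    "Jacob is smart\tsmart guy\n"
    "Jane is pretty\tshe looks pretty\n"
    "Susan is rich\tshe is rich\n"
)
df = pd.read_csv(io.StringIO(tsv), sep="\t")


def ordered_overlap(description, definition):
    """Return the shared words as one string, in the order they appear in description."""
    definition_words = set(definition.split())
    shared = []
    for word in description.split():
        # Skip repeats so each shared word is counted once.
        if word in definition_words and word not in shared:
            shared.append(word)
    return " ".join(shared)


df["word_overlap"] = [
    ordered_overlap(a, b)
    for a, b in zip(df["friend_description"], df["friend_definition"])
]
# Count the words in each overlap string; an empty string gives 0.
df["overlap_count"] = df["word_overlap"].str.split().str.len()
print(df)
```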