Suppose I have the following table in Python pandas:
friend_description    friend_definition
James is dumb         dumb dude
Jacob is smart        smart guy
Jane is pretty        she looks pretty
Susan is rich         she is rich
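Here, in the first row, the word 'dumb' appears in both columns. In the second row, 'smart' appears in both columns. In the third row, 'pretty' appears in both columns, and in the last row, 'is' and 'rich' appear in both columns. I would like to create the following columns: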
friend_description    friend_definition    word_overlap    overlap_count
James is dumb         dumb dude            dumb            1
Jacob is smart        smart guy            smart           1
Jane is pretty        she looks pretty     pretty          1
Susan is rich         she is rich          is rich         2
I could use a for loop to build these new columns by hand, but I was wondering whether pandas has a function that makes this type of operation smoother.
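For anyone who wants to run the answers below, the table can be rebuilt along these lines. This is only a sketch: the variable name df is an assumption that matches what the answers use, and the values are copied from the table in the question.

```python
import pandas as pd

# Rebuild the example table from the question.
df = pd.DataFrame({
    "friend_description": [
        "James is dumb",
        "Jacob is smart",
        "Jane is pretty",
        "Susan is rich",
    ],
    "friend_definition": [
        "dumb dude",
        "smart guy",
        "she looks pretty",
        "she is rich",
    ],
})

print(df)
```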
Answer 0 (score: 4)
A simple list comprehension seems to be the fastest way when working with strings like these:
In [112]: df['word_overlap'] = [set(x[0].split()) & set(x[1].split()) for x in df.values]

In [113]: df['overlap_count'] = df['word_overlap'].str.len()

In [114]: df
Out[114]:
   friend_description  friend_definition  word_overlap  overlap_count
0       James is dumb          dumb dude        {dumb}              1
1      Jacob is smart          smart guy       {smart}              1
2      Jane is pretty   she looks pretty      {pretty}              1
3       Susan is rich        she is rich    {rich, is}              2
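Note that the output above stores each overlap as a set, whereas the question shows plain strings such as "is rich". If a string in the original word order is wanted, something like the following might do. It is a sketch, not part of the original answer: the helper name ordered_overlap is made up for illustration, and it assumes words are separated by whitespace and compared case-sensitively.

```python
import pandas as pd


def ordered_overlap(description, definition):
    """Return the words of `description` that also occur in `definition`,
    in their original order and without duplicates (hypothetical helper)."""
    definition_words = set(definition.split())
    seen = set()
    shared = []
    for word in description.split():
        if word in definition_words and word not in seen:
            seen.add(word)
            shared.append(word)
    return shared


df = pd.DataFrame({
    "friend_description": ["James is dumb", "Jacob is smart",
                           "Jane is pretty", "Susan is rich"],
    "friend_definition": ["dumb dude", "smart guy",
                          "she looks pretty", "she is rich"],
})

# Same list-comprehension idea as above, iterating over the two columns in step.
overlaps = [ordered_overlap(a, b)
            for a, b in zip(df["friend_description"], df["friend_definition"])]

df["word_overlap"] = [" ".join(words) for words in overlaps]
df["overlap_count"] = [len(words) for words in overlaps]
print(df)
```

Because the words are collected in a list rather than a set, the last row comes out as "is rich" every time instead of depending on set ordering.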
The apply(..., axis=1) approach:
In [85]: df['word_overlap'] = df.apply(lambda r: set(r['friend_description'].split()) &
    ...:                                          set(r['friend_definition'].split()),
    ...:                                axis=1)
    ...:

In [86]: df['overlap_count'] = df['word_overlap'].str.len()

In [87]: df
Out[87]:
   friend_description  friend_definition  word_overlap  overlap_count
0       James is dumb          dumb dude        {dumb}              1
1      Jacob is smart          smart guy       {smart}              1
2      Jane is pretty   she looks pretty      {pretty}              1
3       Susan is rich        she is rich    {rich, is}              2
The apply().apply(..., axis=1) approach (timings were taken against a 40,000-row DF):
In [23]: df['word_overlap'] = (df.apply(lambda x: x.str.split(expand=False))
    ...:                         .apply(lambda r: set(r['friend_description']) & set(r['friend_definition']),
    ...:                                axis=1))
    ...:

In [24]: df['overlap_count'] = df['word_overlap'].str.len()

In [25]: df
Out[25]:
   friend_description  friend_definition  word_overlap  overlap_count
0       James is dumb          dumb dude        {dumb}              1
1      Jacob is smart          smart guy       {smart}              1
2      Jane is pretty   she looks pretty      {pretty}              1
3       Susan is rich        she is rich    {is, rich}              2
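The timing figures for the 40,000-row DF are not shown here, so the following is a rough way to run such a comparison yourself. It is a sketch under assumptions: the 40,000 rows are produced by repeating the four example rows, the function names are made up for illustration, and the numbers will vary with machine and pandas version.

```python
import timeit

import pandas as pd

small = pd.DataFrame({
    "friend_description": ["James is dumb", "Jacob is smart",
                           "Jane is pretty", "Susan is rich"],
    "friend_definition": ["dumb dude", "smart guy",
                          "she looks pretty", "she is rich"],
})

# Assumption: repeat the 4 example rows 10,000 times to get 40,000 rows.
big = pd.concat([small] * 10_000, ignore_index=True)


def with_list_comprehension(frame):
    # Plain Python loop over the two columns.
    return [set(a.split()) & set(b.split())
            for a, b in zip(frame["friend_description"],
                            frame["friend_definition"])]


def with_apply(frame):
    # Row-wise DataFrame.apply, as in the second approach above.
    return frame.apply(
        lambda r: set(r["friend_description"].split())
        & set(r["friend_definition"].split()),
        axis=1,
    ).tolist()


# Both approaches should agree before their speed is compared.
assert with_list_comprehension(big) == with_apply(big)

runs = 3
for func in (with_list_comprehension, with_apply):
    seconds = timeit.timeit(lambda: func(big), number=runs)
    print(f"{func.__name__}: {seconds / runs:.3f} s per run")
```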
Answer 1 (score: 3)
A one-liner... because, why not? Anyway, I upvoted @MaxU's answer here. I might as well leave one of my own.
# Assumes `import pandas as pd`. On pandas >= 2.1, DataFrame.map replaces the deprecated applymap.
df.join(
    # Turn every cell into the set of its words.
    df.applymap(lambda x: set(x.split())).pipe(
        # A - (A - B) is the intersection of A and B, computed row by row.
        lambda d: d.friend_definition - (d.friend_definition - d.friend_description)
    ).pipe(lambda s: pd.DataFrame(dict(word_overlap=s, overlap_count=s.str.len())))
)

   friend_description  friend_definition  overlap_count  word_overlap
0       James is dumb          dumb dude              1        {dumb}
1      Jacob is smart          smart guy              1       {smart}
2      Jane is pretty   she looks pretty              1      {pretty}
3       Susan is rich        she is rich              2    {rich, is}
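The one-liner gets the overlap through set difference rather than the & operator: removing from the definition everything that is not in the description leaves exactly the shared words. A small standalone check on the last row of the example, with variable names chosen only for illustration:

```python
# Word sets for the last row of the example table.
description = set("Susan is rich".split())
definition = set("she is rich".split())

# Words that occur only in the definition.
only_in_definition = definition - description
# Removing those from the definition leaves the shared words.
overlap = definition - only_in_definition

assert only_in_definition == {"she"}
assert overlap == (definition & description)
print(sorted(overlap))  # ['is', 'rich']
```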
Answer 2 (score: 1)
Easier to understand for mere mortals (like me)?
>>> import pandas as pd
>>> df = pd.read_csv('user98235.csv', sep='\t')
>>> def f(columns):
...     f_desc, f_def = columns[0], columns[1]
...     common = set(f_desc.split()).intersection(set(f_def.split()))
...     return common, len(common)
...
>>> df[['word_overlap', 'overlap_count']] = df.apply(f, axis=1, raw=True).apply(pd.Series)
>>> df
   friend_description  friend_definition  word_overlap  overlap_count
0       James is dumb          dumb dude        {dumb}              1
1      Jacob is smart          smart guy       {smart}              1
2      Jane is pretty   she looks pretty      {pretty}              1
3       Susan is rich        she is rich    {is, rich}              2