I have a DataFrame like this:
df = pd.DataFrame(np.array(['This here is text', 'My Text was here', 'This was not ready', 'nothing common']), columns=['Text'])
                 Text
0   This here is text
1    My Text was here
2  This was not ready
3      nothing common
I want to create a new DataFrame with the following result:
row1  row2  common_text
   0     1  here,text
   0     2  this
   1     2  was
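The new DataFrame lists, for each pair of rows, all the words they have in common. If two rows share no words, the pair is left out, as with pairs 1,3 and 0,3.
My question: is there a faster or more Pythonic way to do this than iterating over all rows twice to extract the common terms and store them together?
Answer 0: (score: 1)
If you want just a single loop, use itertools.product.
It may not be all that Pythonic, though.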
import itertools

# new_data_frame = ...
for row1, row2 in itertools.product(range(len(df)), range(len(df))):
    # possibly add the pair (row1, row2) to new_data_frame here
    pass
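One thing to watch with this approach: itertools.product visits every ordered pair, so each row is also paired with itself and every pair shows up twice. The sketch below is only an illustration, with n = 4 assumed from the four-row example; it filters the product down to the pairs the question asks for and uses plain integers instead of the DataFrame.

```python
from itertools import combinations, product

n = 4  # assumed: number of rows in the example DataFrame

# product() yields all n * n ordered pairs, including (0, 0) and both (0, 1) and (1, 0).
all_pairs = list(product(range(n), repeat=2))

# Keeping only row1 < row2 drops the self-pairs and the mirrored duplicates.
unique_pairs = [(i, j) for i, j in all_pairs if i < j]

print(len(all_pairs))
print(unique_pairs)
```

The filtered list is the same sequence that itertools.combinations(range(n), 2) produces directly.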
To get the common words, you can use
set(text1.lower().split()) & set(text2.lower().split())
That is fairly Pythonic. For performance, I would store each sentence as a set in an intermediate list, and then intersect those sets later.
temp = [set(s.lower().split()) for s in df['Text']]
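Putting the pieces together, a minimal sketch of how the cached sets might be used could look like the following. The function name common_words_by_pair is made up for illustration, it takes a plain list of strings rather than the DataFrame column, and it reports row positions (0, 1, 2, ...) rather than index labels, so it assumes a default integer index.

```python
from itertools import combinations

def common_words_by_pair(texts):
    """Return (row1, row2, common_text) for every pair of rows sharing a word."""
    # Tokenize each row exactly once instead of once per pair.
    word_sets = [set(text.lower().split()) for text in texts]
    rows = []
    # combinations() visits each unordered pair of row positions once.
    for (i, s1), (j, s2) in combinations(enumerate(word_sets), 2):
        common = s1 & s2
        if common:
            # Sort so the joined string does not depend on set ordering.
            rows.append((i, j, ",".join(sorted(common))))
    return rows

texts = ['This here is text', 'My Text was here',
         'This was not ready', 'nothing common']
rows = common_words_by_pair(texts)
print(rows)
```

With the DataFrame from the question, something like pd.DataFrame(common_words_by_pair(df['Text']), columns=['row1', 'row2', 'common_text']) should give the requested table. This still compares every pair of rows; it only avoids re-splitting the text each time.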
Answer 1: (score: 1)
from itertools import combinations

result = []
# Iterate through each pair of rows.
for row_1, row_2 in combinations(df['Text'].index, 2):
    # Build the set of lower-case words, stripped of surrounding whitespace, for each row in the pair.
    s1, s2 = [set(df.loc[row, 'Text'].lower().strip().split()) for row in (row_1, row_2)]
    # Find the words common to the pair of rows.
    common = s1.intersection(s2)
    if common:
        # If there are words in common, append them to the results as a comma-separated string
        # (could also append the set or a list of words).
        result.append([row_1, row_2, ",".join(common)])
>>> pd.DataFrame(result, columns=['row1', 'row2', 'common_text'])
   row1  row2 common_text
0     0     1   text,here
1     0     2        this
2     1     2         was
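Both answers still look at every pair of rows. If most rows share nothing, one alternative (not taken from either answer, just a sketch under that assumption) is to build a word-to-rows index first and only pair up rows that actually share a word. The function name common_words_via_index is hypothetical, and it works on row positions from a plain list of strings.

```python
from collections import defaultdict
from itertools import combinations

def common_words_via_index(texts):
    """Return (row1, row2, common_text) using a word -> row positions index."""
    # Map each lower-cased word to the set of row positions that contain it.
    rows_by_word = defaultdict(set)
    for pos, text in enumerate(texts):
        for word in text.lower().split():
            rows_by_word[word].add(pos)

    # Only rows that share a given word are ever paired with each other.
    words_by_pair = defaultdict(set)
    for word, positions in rows_by_word.items():
        for pair in combinations(sorted(positions), 2):
            words_by_pair[pair].add(word)

    # Sort pairs and words so the output order is deterministic.
    return [(i, j, ",".join(sorted(words)))
            for (i, j), words in sorted(words_by_pair.items())]

texts = ['This here is text', 'My Text was here',
         'This was not ready', 'nothing common']
indexed_rows = common_words_via_index(texts)
print(indexed_rows)
```

Whether this is actually faster depends on the data: a word that appears in nearly every row still generates nearly every pair, so it is worth timing against the combinations loop before switching.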