I have the following DataFrame:
df = pd.DataFrame(np.array([['This here is text',{1:10,2:20}],['My Text was here',{2:5,3:30}],['This was not ready',{5:9,1:2}]]), columns=['Text','Other info'])
                 Text      Other info
0   This here is text  {1: 10, 2: 20}
1    My Text was here   {2: 5, 3: 30}
2  This was not ready    {1: 2, 5: 9}
For every pair of rows, I need to find the words they have in common and reduce the dictionaries, like this:
row1  row2  common_text  other_info
0     1     here,text    {2: 25}
0     2     this         {1: 12}
1     2     was          {}
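Is there a Pythonic way to do this, rather than splitting each pair of rows and comparing them? Also, my data is large (> 20000 rows), so I would appreciate any help towards a faster solution.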
Answer 0 (score: 1)
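What you can do is take the text from each row, split it into words, and then populate a new defaultdict(list) with the words and row numbers, where each word is a key and the list of row numbers containing it is the value.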
In [1]: from collections import defaultdict
In [2]: rows = ['This here is text', 'My Text was here', 'This was not ready']
In [3]: where = defaultdict(list)
In [4]: for n, line in enumerate(rows):  # A mockup for the pandas table.
   ...:     for word in line.split():
   ...:         where[word.lower()].append(n)
   ...:
In [5]: where
Out[5]:
defaultdict(list,
            {'here': [0, 1],
             'is': [0],
             'my': [1],
             'not': [2],
             'ready': [2],
             'text': [0, 1],
             'this': [0, 2],
             'was': [1, 2]})
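In a larger dataset, you will find words that occur in more than two rows.
But given the list of row numbers, you can use permutations to easily generate all possible pairs of rows. Note that permutations yields each pair in both orders; itertools.combinations yields each unordered pair only once.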
In [6]: from itertools import permutations
In [7]: list(permutations([1,3,7], 2))
Out[7]: [(1, 3), (1, 7), (3, 1), (3, 7), (7, 1), (7, 3)]
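To connect the word index with the row pairs, here is a minimal sketch that collects the shared words for every pair of rows. It is not part of the original answer: it swaps permutations for itertools.combinations so that each pair appears only once (with row1 < row2), and the name `common` is purely illustrative. It also assumes whitespace splitting and lowercasing are good enough for matching words.

```python
from collections import defaultdict
from itertools import combinations

# A mockup for the 'Text' column of the pandas table.
rows = ['This here is text', 'My Text was here', 'This was not ready']

# Inverted index: word -> list of row numbers that contain it.
where = defaultdict(list)
for n, line in enumerate(rows):
    # set() keeps a row number from being added twice for a repeated word.
    for word in set(line.lower().split()):
        where[word].append(n)

# For each word, every pair of rows containing it shares that word.
common = defaultdict(set)  # hypothetical name: (row1, row2) -> shared words
for word, row_numbers in where.items():
    for pair in combinations(row_numbers, 2):
        common[pair].add(word)

for pair in sorted(common):
    print(pair, ','.join(sorted(common[pair])))
```

Only pairs that actually share a word are ever visited, which is where the saving over comparing all row pairs comes from. One caveat: a word that occurs in k rows still produces k*(k-1)/2 pairs, so very frequent words may need to be filtered out on a large table.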
As a final step, you can merge the dictionaries from the original DataFrame.
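The merging step could be sketched as below. The answer does not spell out the merge rule, so this assumes the one implied by the expected output in the question: keep only the keys present in both dictionaries and add their values ({2: 25} from 20 + 5, and an empty dict when no key is shared). The helper `merge_common` and the hard-coded `pairs` list are hypothetical; with a real DataFrame, the list of dictionaries could be obtained with df['Other info'].tolist().

```python
def merge_common(a, b):
    """Sum the values of keys found in both dicts (assumed merge rule)."""
    return {key: a[key] + b[key] for key in a.keys() & b.keys()}

# A mockup for the 'Other info' column of the pandas table.
other_info = [{1: 10, 2: 20}, {2: 5, 3: 30}, {5: 9, 1: 2}]

# Row pairs that share at least one word, as found via the word index.
pairs = [(0, 1), (0, 2), (1, 2)]

merged = {(i, j): merge_common(other_info[i], other_info[j]) for i, j in pairs}

for (i, j), info in merged.items():
    print(i, j, info)
```

If a different reduction is wanted (for example, keeping keys from either dictionary), only `merge_common` needs to change.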