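I'm using Python and pandas and am trying to efficiently sum a value across rows that are linked by a list of related IDs, rather than by a single unique ID.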
df:
Name - ID - Related IDs - Value
z - 123 - ['aaa','bbb','ccc'] - 10
w - 456 - ['aaa'] - 20
y - 789 - ['ggg','hhh','jjj'] - 50
x - 012 - ['jjj','hhh'] - 60
r - 015 - ['hhh'] - 15
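I could try exploding each row by the elements of its list, but that may duplicate the values to be summed, and it may not be an efficient solution in terms of time and resources.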
```python
f = {'Sum': 'sum'}
df = df.groupby(['Related IDs']).agg(f)
# this does not work: it matches on the whole cell value
# rather than on each element of the list
df = df.reset_index()
```
What I expect is a new column "Sum" that totals "Value" over all rows that share one or more related IDs, as follows:
Name - ID - Related IDs - Value - Sum
z - 123 - ['aaa','bbb','ccc'] - 10 - 30
w - 456 - ['aaa'] - 20 - 30
y - 789 - ['ggg','hhh','jjj'] - 50 - 125
x - 012 - ['jjj','hhh'] - 60 - 125
r - 015 - ['hhh'] - 15 - 125
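For anyone who wants to reproduce this, here is a minimal sketch of how the sample frame might be built. It assumes the "Related IDs" column already holds real Python lists and that "ID" is stored as a string so the leading zeros survive; neither detail is stated above, so adjust as needed.

```python
import pandas as pd

# Sample data from the question.
# Assumption: ID is kept as a string ('012', '015') to preserve leading zeros.
df = pd.DataFrame({
    'Name': ['z', 'w', 'y', 'x', 'r'],
    'ID': ['123', '456', '789', '012', '015'],
    'Related IDs': [
        ['aaa', 'bbb', 'ccc'],
        ['aaa'],
        ['ggg', 'hhh', 'jjj'],
        ['jjj', 'hhh'],
        ['hhh'],
    ],
    'Value': [10, 20, 50, 60, 15],
})
```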
Answer 0 (score: 1)
Use networkx with connected_components:
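import ast
from itertools import combinations, chain
import networkx as nx
#only if the column holds strings such as "['aaa','bbb']":
#parse them into lists (skip this step when it already holds lists)
df['Related IDs'] = df['Related IDs'].apply(ast.literal_eval)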
#create edges (can only connect two nodes)
L2_nested = [list(combinations(l,2)) for l in df['Related IDs']]
L2 = list(chain.from_iterable(L2_nested))
print(L2)
# [('aaa', 'bbb'), ('aaa', 'ccc'), ('bbb', 'ccc'),
#  ('ggg', 'hhh'), ('ggg', 'jjj'), ('hhh', 'jjj'), ('jjj', 'hhh')]
#create the graph from the dataframe
G=nx.Graph()
G.add_edges_from(L2)
connected_comp = nx.connected_components(G)
#create dict for common values
node2id = {x: cid for cid, c in enumerate(connected_comp) for x in c}
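#create groups by mapping the first value of each Related IDs list
#to its component id; an ID that occurs only in single-element lists
#forms no edge, is missing from the graph, and maps to None here
groups = [node2id.get(x[0]) for x in df['Related IDs']]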
print(groups)
# [0, 0, 1, 1, 1]
#get sum to new column
df['Sum'] = df.groupby(groups)['Value'].transform('sum')
print(df)
#   Name   ID      Related IDs  Value  Sum
# 0    z  123  [aaa, bbb, ccc]     10   30
# 1    w  456            [aaa]     20   30
# 2    y  789  [ggg, hhh, jjj]     50  125
# 3    x   12       [jjj, hhh]     60  125
# 4    r   15            [hhh]     15  125
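If adding networkx as a dependency is not an option, the same grouping can probably be done with a small union-find (disjoint-set) structure. This is a different technique from the connected_components call above, sketched here under the assumption that every row holds a list of hashable IDs. The helper name `group_labels` is hypothetical. Unlike the edge-based version, it also registers IDs that appear only in single-element lists.

```python
def group_labels(id_lists):
    """Return one group label per list; lists that share an ID,
    directly or through other lists, receive the same label."""
    parent = {}

    def find(x):
        # Register unseen IDs as their own root, then walk up to the root.
        parent.setdefault(x, x)
        root = x
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every node on the path at the root.
        while parent[x] != root:
            parent[x], x = root, parent[x]
        return root

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    for ids in id_lists:
        if not ids:
            continue
        find(ids[0])  # makes sure single-element lists are registered
        for other in ids[1:]:
            union(ids[0], other)

    # Number the groups in order of first appearance.
    roots = {}
    labels = []
    for ids in id_lists:
        if not ids:
            labels.append(None)  # an empty list belongs to no group
        else:
            labels.append(roots.setdefault(find(ids[0]), len(roots)))
    return labels


# Sample data from the question, as plain Python lists.
related = [['aaa', 'bbb', 'ccc'], ['aaa'], ['ggg', 'hhh', 'jjj'],
           ['jjj', 'hhh'], ['hhh']]
values = [10, 20, 50, 60, 15]

labels = group_labels(related)

# Total per group, then broadcast back to each row.
totals = {}
for label, value in zip(labels, values):
    totals[label] = totals.get(label, 0) + value
sums = [totals[label] for label in labels]
```

With a DataFrame, the labels should slot into the same final step as above: `df['Sum'] = df.groupby(group_labels(df['Related IDs']))['Value'].transform('sum')`.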