I am trying to group values based on their order relationship across two columns.
import pandas as pd

d = {'df1': [10, 20, 30, 60, 80, 40, 30, 70], 'df2': [20, 30, 40, 80, 70, 50, 90, 100]}
df = pd.DataFrame(data=d)
df
df1 df2
0 10 20
1 20 30
2 30 40
3 60 80
4 80 70
5 40 50
6 30 90
7 70 100
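To be clearer: df1 and df2 are related through their order. For example, 10 is directly related to 20, and 10 is indirectly related to 30 through 20. 10 is also indirectly related to 40 through 20 and 30. As another example, take 80: it is directly related to 70 and indirectly related to 100 through 70. The same applies to the rest of the column values.
I expect the following result: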
df1 | df2
-----|-------------------
0 10 | 20, 30, 40, 50, 90
1 20 | 30, 40, 50, 90
2 30 | 40, 50, 90
3 60 | 80, 70, 100
4 80 | 70, 100
5 40 | 50
6 70 | 100
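One way to read the relationship described above is as a directed graph: each row is an edge from df1 to df2, and the wanted df2 cell lists every value reachable from the df1 value. The following is only a minimal sketch of that reading, not something from the original question; the names `edges` and `reachable` are made up for illustration.

```python
from collections import deque

# Rows of the example frame as (df1, df2) pairs.
rows = [(10, 20), (20, 30), (30, 40), (60, 80),
        (80, 70), (40, 50), (30, 90), (70, 100)]

# Direct relations: df1 value -> list of df2 values, in row order.
edges = {}
for src, dst in rows:
    edges.setdefault(src, []).append(dst)

def reachable(start):
    """Return every value reachable from `start`, breadth-first."""
    seen = {start}            # also guards against self-links and cycles
    order = []
    queue = deque(edges.get(start, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        order.append(node)
        queue.extend(edges.get(node, []))
    return order

for key in edges:
    print(key, '|', ', '.join(map(str, reachable(key))))
```

The values come out in traversal order, so a row may list the same values as the table above in a slightly different order.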
I am trying the following script, but without success.
(df.groupby('df1')
.agg({ 'df2' : ','.join})
.reset_index()
.reindex(columns=df.columns))
Can anyone help with this challenge? If there is a similar solution on Stack Overflow, please let me know.
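For reference, a hedged sketch of why the attempt above falls short: `','.join` only accepts strings, so on an integer column it typically fails with a `TypeError`, and even after casting to strings, `groupby` gathers only the direct relations. The sample frame is rebuilt here so the snippet runs on its own; the name `direct` is an assumption.

```python
import pandas as pd

df = pd.DataFrame({'df1': [10, 20, 30, 60, 80, 40, 30, 70],
                   'df2': [20, 30, 40, 80, 70, 50, 90, 100]})

# Cast df2 to str so that ', '.join works, and keep first-appearance order.
direct = (df.assign(df2=df['df2'].astype(str))
            .groupby('df1', sort=False)
            .agg({'df2': ', '.join})
            .reset_index())

print(direct)
```

Here 30 is paired with both 40 and 90, but 10 is still paired only with 20, so the indirect relations need an extra traversal step.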
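Edit: The first answer works perfectly with the example above, but it does not work properly when I try it on my actual data. My real data looks like this.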
df1 df2
0 10 20
1 10 30
2 10 80
3 10 90
4 10 120
5 10 140
6 10 170
7 20 180
8 30 40
9 30 165
10 30 175
11 40 20
12 40 50
13 50 60
14 60 70
15 70 180
16 80 180
17 90 100
18 100 110
19 110 180
20 120 130
21 130 180
22 140 150
23 150 160
24 160 165
25 165 180
26 165 200
27 170 175
28 175 180
29 175 200
30 180 190
31 190 200
32 200 210
33 210 220
34 220 230
35 230 240
36 240 -
Answer 0 (score: 3)
One possible solution:
import pandas as pd
from itertools import chain

l1 = [10, 20, 30, 60, 80, 40, 30, 70]
l2 = [20, 30, 40, 80, 70, 50, 90, 100]

# map each df1 value to its direct df2 values, skipping self-links
d = dict()
for i, j in zip(l1, l2):
    if i == j:
        continue
    d.setdefault(i, []).append(j)

# follow the links: the generator reads d[k] while extend() grows it,
# so indirectly related values are appended as well
for k in d:
    d[k].extend(chain.from_iterable(d.get(v, []) for v in d[k]))

df = pd.DataFrame({'df1': list(d.keys()), 'df2': [', '.join(str(v) for v in d[k]) for k in d]})
print(df)
Prints:
df1 df2
0 10 20, 30, 40, 90, 50
1 20 30, 40, 90, 50
2 30 40, 90, 50
3 60 80, 70, 100
4 80 70, 100
5 40 50
6 70 100
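The second loop in the code above works because the generator passed to `extend` reads `d[k]` lazily, so values appended during the call are themselves visited later. A small sketch of that behaviour as observed in CPython, using a hypothetical three-step chain rather than the question's data:

```python
from itertools import chain

# Hypothetical chain 1 -> 2 -> 3 -> 4, stored as direct relations only.
d = {1: [2], 2: [3], 3: [4]}

# The generator walks d[1] while extend() is appending to it,
# so 3 (added via 2) and then 4 (added via 3) are picked up too.
d[1].extend(chain.from_iterable(d.get(v, []) for v in d[1]))

print(d[1])  # [2, 3, 4]
```

The same property means this form can keep growing the list forever if the data contains a cycle such as 1 -> 2 -> 1.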
Edit: another solution based on the new input data. This time I check for possible cycles in the path:
import pandas as pd
data = '''
0 10 20
1 10 30
2 10 80
3 10 90
4 10 120
5 10 140
6 10 170
7 20 180
8 30 40
9 30 165
10 30 175
11 40 20
12 40 50
13 50 60
14 60 70
15 70 180
16 80 180
17 90 100
18 100 110
19 110 180
20 120 130
21 130 180
22 140 150
23 150 160
24 160 165
25 165 180
26 165 200
27 170 175
28 175 180
29 175 200
30 180 190
31 190 200
32 200 210
33 210 220
34 220 230
35 230 240
36 240 -
'''
df1, df2 = [], []
for line in data.splitlines()[:-1]:  # <--- drop the last row, which holds the `-` character
    line = line.strip().split()
    if not line:
        continue
    df1.append(int(line[1]))
    df2.append(int(line[2]))

from pprint import pprint

d = dict()
for i, j in zip(df1, df2):
    if i == j:
        continue
    d.setdefault(i, []).append(j)

for k in d:
    # `seen` stops a value from being appended twice in this pass, so a cycle cannot loop forever;
    # the direct values already in d[k] are not in `seen`, so a few of them may be repeated
    seen = set()
    for v in d[k]:
        for val in d.get(v, []):
            if val not in seen:
                seen.add(val)
                d[k].append(val)

df = pd.DataFrame({'df1': list(d.keys()), 'df2': [', '.join(str(v) for v in d[k]) for k in d]})
print(df)
Prints:
df1 df2
0 10 20, 30, 80, 90, 120, 140, 170, 180, 40, 165, 1...
1 20 180, 190, 200, 210, 220, 230, 240
2 30 40, 165, 175, 20, 50, 180, 200, 190, 210, 220,...
3 40 20, 50, 180, 190, 200, 210, 220, 230, 240, 60, 70
4 50 60, 70, 180, 190, 200, 210, 220, 230, 240
5 60 70, 180, 190, 200, 210, 220, 230, 240
6 70 180, 190, 200, 210, 220, 230, 240
7 80 180, 190, 200, 210, 220, 230, 240
8 90 100, 110, 180, 190, 200, 210, 220, 230, 240
9 100 110, 180, 190, 200, 210, 220, 230, 240
10 110 180, 190, 200, 210, 220, 230, 240
11 120 130, 180, 190, 200, 210, 220, 230, 240
12 130 180, 190, 200, 210, 220, 230, 240
13 140 150, 160, 165, 180, 200, 190, 210, 220, 230, 240
14 150 160, 165, 180, 200, 190, 210, 220, 230, 240
15 160 165, 180, 200, 190, 210, 220, 230, 240
16 165 180, 200, 190, 210, 200, 220, 230, 240
17 170 175, 180, 200, 190, 210, 220, 230, 240
18 175 180, 200, 190, 210, 200, 220, 230, 240
19 180 190, 200, 210, 220, 230, 240
20 190 200, 210, 220, 230, 240
21 200 210, 220, 230, 240
22 210 220, 230, 240
23 220 230, 240
24 230 240
Or with pprint(d, width=250):
{10: [20, 30, 80, 90, 120, 140, 170, 180, 40, 165, 175, 100, 130, 150, 190, 20, 50, 200, 110, 160, 60, 210, 70, 220, 230, 240],
20: [180, 190, 200, 210, 220, 230, 240],
30: [40, 165, 175, 20, 50, 180, 200, 190, 210, 220, 230, 240, 60, 70],
40: [20, 50, 180, 190, 200, 210, 220, 230, 240, 60, 70],
50: [60, 70, 180, 190, 200, 210, 220, 230, 240],
60: [70, 180, 190, 200, 210, 220, 230, 240],
70: [180, 190, 200, 210, 220, 230, 240],
80: [180, 190, 200, 210, 220, 230, 240],
90: [100, 110, 180, 190, 200, 210, 220, 230, 240],
100: [110, 180, 190, 200, 210, 220, 230, 240],
110: [180, 190, 200, 210, 220, 230, 240],
120: [130, 180, 190, 200, 210, 220, 230, 240],
130: [180, 190, 200, 210, 220, 230, 240],
140: [150, 160, 165, 180, 200, 190, 210, 220, 230, 240],
150: [160, 165, 180, 200, 190, 210, 220, 230, 240],
160: [165, 180, 200, 190, 210, 220, 230, 240],
165: [180, 200, 190, 210, 200, 220, 230, 240],
170: [175, 180, 200, 190, 210, 220, 230, 240],
175: [180, 200, 190, 210, 200, 220, 230, 240],
180: [190, 200, 210, 220, 230, 240],
190: [200, 210, 220, 230, 240],
200: [210, 220, 230, 240],
210: [220, 230, 240],
220: [230, 240],
230: [240]}
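In the output above a few values appear twice (for example 200 under 165, and 20 under 10), because `seen` starts empty and does not know about the direct values already in the list. A possible variant, offered as a sketch rather than as the answerer's code, seeds `seen` with the key and its direct values. The helper name `expand` is an assumption, and the sample dictionary is a small subset of the edges from the question.

```python
def expand(direct):
    """Return {key: every value reachable from key}, without duplicates."""
    closure = {}
    for key, firsts in direct.items():
        # Start from the direct values, de-duplicated, order preserved.
        out = list(dict.fromkeys(v for v in firsts if v != key))
        seen = {key, *out}
        for v in out:                      # `out` grows while we iterate over it
            for nxt in direct.get(v, []):
                if nxt not in seen:
                    seen.add(nxt)
                    out.append(nxt)
        closure[key] = out
    return closure

# Subset of the question's edges around 165.
direct = {165: [180, 200], 180: [190], 190: [200], 200: [210]}
print(expand(direct)[165])  # [180, 200, 190, 210]

# A toy cycle (hypothetical data) terminates as well.
print(expand({1: [2], 2: [1]}))  # {1: [2], 2: [1]}
```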
Edit 2: If df is your input DataFrame with columns "df1" and "df2":
from pprint import pprint

d = dict()
for i, j in zip(df.df1, df.df2):
    if i == j:
        continue
    if j == '-':  # <-- this will remove the `-` character in df2
        continue
    d.setdefault(i, []).append(j)

for k in d:
    seen = set()
    for v in d[k]:
        for val in d.get(v, []):
            if val not in seen:
                seen.add(val)
                d[k].append(val)

df = pd.DataFrame({'df1': list(d.keys()), 'df2': [', '.join(str(v) for v in d[k]) for k in d]})
print(df)
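One caveat with the check for `'-'`: if that placeholder is present, pandas would typically hold `df2` as strings, and a string such as `'20'` will not match an integer key in `d`. A hedged sketch of one way to normalise the column first, using `pd.to_numeric` with `errors='coerce'`; the three-row frame and the names `clean` and `pairs` are illustrative only.

```python
import pandas as pd

# Tail of the question's data, with df2 read as strings because of the '-'.
df = pd.DataFrame({'df1': [220, 230, 240], 'df2': ['230', '240', '-']})

# Non-numeric entries become NaN, are dropped, and the rest go back to int.
clean = df.assign(df2=pd.to_numeric(df['df2'], errors='coerce')).dropna(subset=['df2'])
clean = clean.astype({'df2': int})

pairs = list(zip(clean['df1'].tolist(), clean['df2'].tolist()))
print(pairs)  # [(220, 230), (230, 240)]
```

The resulting `pairs` can then feed the same dictionary-building loop as before, with both sides of each pair being integers.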
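Answer 1 (score: 1)
Hello, and thanks for the clarification. I have a solution using a recursive function that you can try. It may not be efficient for large DataFrames, but it works well. The function returns a list, but you can edit the resulting Series to join that list into a string if needed.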
def get_related(df1, related):
    # get directly related values
    next_vals = df.loc[df['df1'] == df1, 'df2'].values.tolist()
    # remove links to self (will cause recursion issues)
    next_vals = list(set(next_vals) - set([df1]))
    # add to running list
    related = related + next_vals
    # continue to next level
    if any(next_val in df['df1'].unique() for next_val in next_vals):
        for next_val in next_vals:
            related = related + get_related(next_val, related)
    # get unique list
    return list(set(related))

df['df1'].apply(lambda x: get_related(x, []))
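Since the answer notes that the recursion may be slow on large frames, here is a sketch of one possible speed-up: build a dictionary of direct relations once and cache each node's result with `functools.lru_cache`. It assumes the data has no cycles (as in the samples in this thread), and the names `direct` and `related` are made up for illustration.

```python
from functools import lru_cache

# Rows of the small example frame as (df1, df2) pairs.
rows = [(10, 20), (20, 30), (30, 40), (60, 80),
        (80, 70), (40, 50), (30, 90), (70, 100)]

# One pass to collect direct relations, skipping self-links.
direct = {}
for src, dst in rows:
    if src != dst:
        direct.setdefault(src, []).append(dst)

@lru_cache(maxsize=None)
def related(node):
    """Frozenset of all values reachable from `node` (acyclic data only)."""
    found = set()
    for nxt in direct.get(node, []):
        found.add(nxt)
        found |= related(nxt)   # cached, so each node is expanded once
    return frozenset(found)

print(sorted(related(10)))  # [20, 30, 40, 50, 90]
```

With a DataFrame, something like `df['df1'].apply(lambda x: sorted(related(x)))` should give one list per row.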
Answer 2 (score: 0)
This should do the trick:
def recursive_walk(df, node):
    parents = df.loc[(df['df1'] == node) & (df['df2'] != node), 'df2'].tolist()
    if len(parents) == 0:
        yield node
    else:
        for parent in parents:
            yield parent
            lst = [el for el in recursive_walk(df, parent)]
            for el in lst:
                yield el

df['tree'] = df.apply(lambda x: list(set([el for el in recursive_walk(df, x['df2'])] + [x['df2']])), axis=1)
Output:
df1 df2 tree
0 10 20 [40, 50, 20, 90, 30]
1 20 30 [40, 50, 90, 30]
2 30 40 [40, 50]
3 60 80 [80]
4 70 70 [100, 70]
5 40 50 [50]
6 30 90 [90]
7 70 100 [100]
(*) I also checked this on the extended DataFrame. It ran fairly fast; I am not sharing the output because my IDE truncates it ;)