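Suppose I have a two-column data frame in which the first column is the ID of a meeting and the second column is the ID of one participant in that meeting. Like this: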
meeting_id,person_id
meeting0,person1234
meeting0,person4321
meeting0,person5555
meeting1,person4321
meeting1,person9999
# ... ~1 million rows
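I want to find each person's top 15 co-participants. For example, I want to know which 15 people most often attend meetings with Brad.
As an intermediate step, I wrote a script that takes the original data frame and builds a person-to-person data frame, like this: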
person1234,person4321
person1234,person5555
person4321,person5555
person4321,person9999
...
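But I'm not sure this intermediate step is necessary. It also takes a very long time to run (by my estimate, it would take weeks!). Here is the monster: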
import pandas as pd

links = []
lic = pd.read_csv('meetings.csv', sep=';', names=['meeting_id', 'person_id'],
                  dtype={'meeting_id': str, 'person_id': str})
grouped = lic.groupby('person_id')
for i, group in enumerate(grouped):
    print(i, 'of', len(grouped))
    person_id = group[0].strip()
    if len(person_id) == 14:
        meetings = set(group[1]['meeting_id'])
        for meeting in meetings:
            # This filter scans the whole data frame once per (person, meeting).
            lic_sub = lic[lic['meeting_id'] == meeting]
            people = set(lic_sub['person_id'])
            for person in people:
                if person != person_id:
                    tup = (person_id, person)
                    links.append(tup)
df = pd.DataFrame(links)
df.to_csv('links.csv', index=False)
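Any ideas?
Answer 0 (score: 1)
Here is one way to do it: merge the data frame with itself on meeting_id, then sort the two person columns within each row so that every pair appears in a single order.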
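import numpy as np

# df is the original two-column frame (meeting_id, person_id)
s = df.merge(df, on='meeting_id')
# sort each row's pair so (a, b) and (b, a) become the same row
s[['person_id_x', 'person_id_y']] = np.sort(s[['person_id_x', 'person_id_y']].values, axis=1)
# drop self-pairs and the mirrored duplicates
s = s.query('person_id_x != person_id_y').drop_duplicates()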
s
meeting_id person_id_x person_id_y
1 meeting0 person1234 person4321
2 meeting0 person1234 person5555
5 meeting0 person4321 person5555
10 meeting1 person4321 person9999
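
To get from pairs like these to the top 15 co-participants per person that the question asks for, one option is to count how many meetings each ordered pair shares and keep the largest counts for each person. The following is a minimal sketch under stated assumptions, not something tested on the full data set: the function name `top_co_participants`, the column names `co_participant_id` and `shared_meetings`, and the small `meetings` frame are made up for illustration. Unlike the sorted pairs above, it keeps both directions of every pair so that each person gets their own ranking.

```python
import pandas as pd


def top_co_participants(df: pd.DataFrame, n: int = 15) -> pd.DataFrame:
    """Return, for each person, the n people they share the most meetings with."""
    # Drop repeated (meeting, person) rows so nobody is counted twice per meeting.
    df = df[["meeting_id", "person_id"]].drop_duplicates()

    # Self-merge on meeting_id: every ordered pair of people in the same meeting.
    # (Hypothetical column name: co_participant_id.)
    other = df.rename(columns={"person_id": "co_participant_id"})
    pairs = df.merge(other, on="meeting_id")
    pairs = pairs[pairs["person_id"] != pairs["co_participant_id"]]

    # Count the meetings each ordered pair has in common.
    counts = (
        pairs.groupby(["person_id", "co_participant_id"])
        .size()
        .reset_index(name="shared_meetings")
    )

    # Highest counts first within each person; ties broken by ID for determinism.
    counts = counts.sort_values(
        ["person_id", "shared_meetings", "co_participant_id"],
        ascending=[True, False, True],
    )
    return counts.groupby("person_id").head(n).reset_index(drop=True)


# Illustrative data copied from the sample rows in the question.
meetings = pd.DataFrame(
    {
        "meeting_id": ["meeting0", "meeting0", "meeting0", "meeting1", "meeting1"],
        "person_id": ["person1234", "person4321", "person5555", "person4321", "person9999"],
    }
)
top = top_co_participants(meetings, n=15)
print(top)
```

Note that the self-merge produces on the order of k squared rows for a meeting with k participants, so a few very large meetings could dominate memory use; in that case it may be worth processing the meetings in chunks and summing the counts.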