This is my DataFrame:
Cites_Dogs Dog_Number
DOG45555 DOG123
DOG127 DOG123
DOG7760 DOG126
DOG45 DOG126
DOG559 DOG126
DOG760 DOG126
DOG123 DOG127
DOG789 DOG127
DOG860 DOG127
I converted it to lists with the following code:
all_cites_dog = all_cites_dog.groupby('Dog_Number')['Cites_Dogs'].apply(list)
I want to remove the items in each list that do not match one of the index values DOG123, DOG126, DOG127.
DOG123 [ 'DOG45555' , 'DOG127']
DOG126 [ 'DOG7760', 'DOG456' , 'DOG559' , 'DOG760']
DOG127 [ 'DOG123' , 'DOG789' , 'DOG860']
I would like to see a result like this:
DOG123 [ 'DOG127']
DOG126 ['']
DOG127 [ 'DOG123']
How should I do this?
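For anyone who wants to reproduce the setup, here is a minimal sketch. It assumes pandas is imported as pd; the names df and grouped are illustrative and do not appear in the question.

```python
import pandas as pd

# Sample data copied from the table in the question.
# The variable names `df` and `grouped` are assumptions for illustration.
df = pd.DataFrame({
    "Cites_Dogs": ["DOG45555", "DOG127", "DOG7760", "DOG45", "DOG559",
                   "DOG760", "DOG123", "DOG789", "DOG860"],
    "Dog_Number": ["DOG123", "DOG123", "DOG126", "DOG126", "DOG126",
                   "DOG126", "DOG127", "DOG127", "DOG127"],
})

# One list of cited dogs per Dog_Number, as in the question's groupby call.
grouped = df.groupby("Dog_Number")["Cites_Dogs"].apply(list)
print(grouped)
```

Keeping the original frame under its own name makes it easier to tell whether a given answer starts from the DataFrame or from the grouped Series.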
Answer 0 (score: 1)
Use a filter inside groupby + apply:
idx = set(all_cites_dog['Dog_Number'])
all_cites_dog = (all_cites_dog.groupby('Dog_Number')['Cites_Dogs']
                 .apply(lambda x: [y for y in x if y in idx]))
print (all_cites_dog)
Dog_Number
DOG123 [DOG127]
DOG126 []
DOG127 [DOG123]
Name: Cites_Dogs, dtype: object
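The same idea as a self-contained sketch, starting from the original DataFrame. The names df, valid and keep_valid are illustrative assumptions, not part of the answer.

```python
import pandas as pd

# Sample data from the question; `df` is an assumed name for the raw frame.
df = pd.DataFrame({
    "Cites_Dogs": ["DOG45555", "DOG127", "DOG7760", "DOG45", "DOG559",
                   "DOG760", "DOG123", "DOG789", "DOG860"],
    "Dog_Number": ["DOG123", "DOG123", "DOG126", "DOG126", "DOG126",
                   "DOG126", "DOG127", "DOG127", "DOG127"],
})

# A set gives constant-time membership checks for each cited dog.
valid = set(df["Dog_Number"])


def keep_valid(cites):
    """Return only the cited dogs that are themselves a Dog_Number."""
    return [c for c in cites if c in valid]


result = df.groupby("Dog_Number")["Cites_Dogs"].apply(keep_valid)
print(result)
```

Because the filtering happens inside each group, a group with no matches still appears in the result with an empty list.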
For better performance, first filter with boolean indexing and isin, then apply groupby, and finally add back the groups that had no matches as empty lists:
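import numpy as np
import pandas as pd

# Keep only rows whose Cites_Dogs value is itself a Dog_Number, then build one list per group.
s = (all_cites_dog[all_cites_dog['Cites_Dogs'].isin(all_cites_dog['Dog_Number'].unique())]
     .groupby('Dog_Number')['Cites_Dogs']
     .apply(list))
# Dog numbers that lost every row in the filter; each gets its own empty list.
idx = np.setdiff1d(all_cites_dog['Dog_Number'].unique(), s.index)
s1 = pd.Series([[] for _ in idx], index=idx)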
print (s1)
DOG126 []
dtype: object
s = pd.concat([s, s1]).sort_index()  # Series.append was removed in pandas 2.0
print (s)
DOG123 [DOG127]
DOG126 []
DOG127 [DOG123]
dtype: object
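A self-contained sketch of this filter-first approach follows. It assumes a recent pandas, so it combines the two Series with pd.concat; the names df, dog_numbers, matched, missing and empties are illustrative only.

```python
import numpy as np
import pandas as pd

# Sample data from the question; `df` is an assumed name for the raw frame.
df = pd.DataFrame({
    "Cites_Dogs": ["DOG45555", "DOG127", "DOG7760", "DOG45", "DOG559",
                   "DOG760", "DOG123", "DOG789", "DOG860"],
    "Dog_Number": ["DOG123", "DOG123", "DOG126", "DOG126", "DOG126",
                   "DOG126", "DOG127", "DOG127", "DOG127"],
})

dog_numbers = df["Dog_Number"].unique()

# Vectorised filter first, so groupby only sees rows that will be kept.
matched = (
    df[df["Cites_Dogs"].isin(dog_numbers)]
    .groupby("Dog_Number")["Cites_Dogs"]
    .apply(list)
)

# Groups that vanished in the filter need an explicit empty list.
missing = np.setdiff1d(dog_numbers, matched.index)
# Build a fresh list per group so the entries do not share one object.
empties = pd.Series([[] for _ in missing], index=missing, dtype=object)

result = pd.concat([matched, empties]).sort_index()
print(result)
```

The list comprehension is a deliberate choice over `[[]] * n`, which would repeat a single list object n times, so mutating one entry would change them all.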
Answer 1 (score: 1)
You can use apply with a list comprehension to keep only the elements that are in the index:
l = all_cites_dog.index
all_cites_dog.apply(lambda x: [i for i in x if i in l])
Dog_Number
DOG123 [DOG127]
DOG126 []
DOG127 [DOG123]
Name: Cites_Dogs, dtype: object
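Note that this answer starts from the already grouped Series, not from the raw DataFrame. A minimal sketch under that assumption, with the illustrative names df, grouped, valid and filtered:

```python
import pandas as pd

# Sample data from the question; `df` is an assumed name for the raw frame.
df = pd.DataFrame({
    "Cites_Dogs": ["DOG45555", "DOG127", "DOG7760", "DOG45", "DOG559",
                   "DOG760", "DOG123", "DOG789", "DOG860"],
    "Dog_Number": ["DOG123", "DOG123", "DOG126", "DOG126", "DOG126",
                   "DOG126", "DOG127", "DOG127", "DOG127"],
})

# The grouped Series of lists that the question ends up with.
grouped = df.groupby("Dog_Number")["Cites_Dogs"].apply(list)

# The index of the grouped Series holds exactly the valid dog numbers.
valid = set(grouped.index)

# Series.apply visits each list; the comprehension drops unknown dogs.
filtered = grouped.apply(lambda cites: [c for c in cites if c in valid])
print(filtered)
```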
Answer 2 (score: 1)
You can filter with an isin check.
(df.set_index('Dog_Number')
.query("Cites_Dogs in index")
.reindex(df.Dog_Number.unique()))
Cites_Dogs
Dog_Number
DOG123 DOG127
DOG126 NaN
DOG127 DOG123
If you need to reduce the result further, you can chain a groupby.
(df.set_index('Dog_Number')
.query("Cites_Dogs in index")
.reindex(df.Dog_Number.unique())
.groupby(level=0)['Cites_Dogs']
.apply(pd.Series.tolist))
Dog_Number
DOG123 [DOG127]
DOG126 [nan]
DOG127 [DOG123]
Name: Cites_Dogs, dtype: object
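The chained version above leaves [nan] for DOG126 rather than an empty list. If an empty list is wanted, one possible variation is to drop the NaN values inside apply. This is a sketch of that variation, not part of the original answer:

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "Cites_Dogs": ["DOG45555", "DOG127", "DOG7760", "DOG45", "DOG559",
                   "DOG760", "DOG123", "DOG789", "DOG860"],
    "Dog_Number": ["DOG123", "DOG123", "DOG126", "DOG126", "DOG126",
                   "DOG126", "DOG127", "DOG127", "DOG127"],
})

result = (
    df.set_index("Dog_Number")
    .query("Cites_Dogs in index")         # keep rows citing a known dog
    .reindex(df["Dog_Number"].unique())   # restore groups with no matches
    .groupby(level=0)["Cites_Dogs"]
    .apply(lambda x: x.dropna().tolist())  # NaN placeholders become []
)
print(result)
```

As in the answer, the reindex step relies on the filtered index being unique, which holds for this sample data.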
Another option is groupby and apply with a set membership check.
s = set(df.Dog_Number)
df.groupby('Dog_Number').Cites_Dogs.apply(lambda x: x[x.isin(s)].tolist())
Dog_Number
DOG123 [DOG127]
DOG126 []
DOG127 [DOG123]
Name: Cites_Dogs, dtype: object
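Answer 3 (score: 1)
You can follow these main steps:
1. Filter the DataFrame by Cites_Dogs.
2. Run groupby followed by apply + list.
3. Reindex and fill the NaN values for consistency.
Here is a demonstration: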
unq_dogs = df['Dog_Number'].unique()
res = df.loc[df['Cites_Dogs'].isin(unq_dogs)]\
.groupby('Dog_Number')['Cites_Dogs'].apply(list)\
.reindex(unq_dogs)\
.fillna(pd.Series([[] for _ in range(len(unq_dogs))], index=unq_dogs))\
.reset_index()
print(res)
Dog_Number Cites_Dogs
0 DOG123 [DOG127]
1 DOG126 []
2 DOG127 [DOG123]
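fillna does not accept a bare list as its value, which is why the demonstration passes a Series of empty lists. A possible alternative, sketched here as an assumption rather than the answer's own method, is to replace the missing entries with a plain apply after reindexing:

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "Cites_Dogs": ["DOG45555", "DOG127", "DOG7760", "DOG45", "DOG559",
                   "DOG760", "DOG123", "DOG789", "DOG860"],
    "Dog_Number": ["DOG123", "DOG123", "DOG126", "DOG126", "DOG126",
                   "DOG126", "DOG127", "DOG127", "DOG127"],
})

unq_dogs = df["Dog_Number"].unique()

res = (
    df.loc[df["Cites_Dogs"].isin(unq_dogs)]
    .groupby("Dog_Number")["Cites_Dogs"]
    .apply(list)
    .reindex(unq_dogs)
    # Groups with no matches come back as NaN; give each its own empty list.
    .apply(lambda v: v if isinstance(v, list) else [])
)
print(res)
```

Calling reset_index() on the result would turn it back into a two-column DataFrame, as in the demonstration above.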
Answer 4 (score: 0)
Try this and see whether it works for you as a near one-liner solution:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Cites_Dogs': ['DOG45555', 'DOG127', 'DOG7760', 'DOG45', 'DOG559', 'DOG760', 'DOG123', 'DOG789', 'DOG860'],
                   'Dog_Number': ['DOG123', 'DOG123', 'DOG126', 'DOG126', 'DOG126', 'DOG126', 'DOG127', 'DOG127', 'DOG127']})
a = ['DOG123', 'DOG126', 'DOG127']
# Blank out citations that are not a known dog number; .loc avoids chained assignment.
df.loc[~df['Cites_Dogs'].isin(a), 'Cites_Dogs'] = np.nan
df.replace([np.nan], '', inplace=True)
df = df.groupby('Dog_Number')['Cites_Dogs'].apply(list)
# and output looks like this
Dog_Number
DOG123 [, DOG127]
DOG126 [, , , ]
DOG127 [DOG123, , ]
Name: Cites_Dogs, dtype: object
Thank you!