How do I remove items from lists by comparing them against the index in pandas?

Time: 2018-12-20 10:03:34

Tags: python pandas dataframe pandas-groupby

This is my DataFrame:

Cites_Dogs  Dog_Number
DOG45555    DOG123
DOG127      DOG123
DOG7760     DOG126
DOG45       DOG126
DOG559      DOG126
DOG760      DOG126
DOG123      DOG127
DOG789      DOG127
DOG860      DOG127

I converted it to a Series of lists with the following code:

all_cites_dog = all_cites_dog.groupby('Dog_Number')['Cites_Dogs'].apply(list)

I want to remove every item in the lists that does not match one of the index values DOG123, DOG126, DOG127.

DOG123   [ 'DOG45555' ,  'DOG127']
DOG126   [ 'DOG7760', 'DOG45' ,  'DOG559' ,  'DOG760']
DOG127   [ 'DOG123' ,  'DOG789' ,  'DOG860']

I would like to see a result like this:

DOG123   [ 'DOG127']
DOG126   ['']
DOG127   [ 'DOG123']

How should I do this?

5 answers:

Answer 0: (score: 1)

Use a filter inside groupby + apply:

idx = set(all_cites_dog['Dog_Number'])
all_cites_dog = (all_cites_dog.groupby('Dog_Number')['Cites_Dogs']
                             .apply(lambda x: list([y for y in x if y in idx])))

print (all_cites_dog)
Dog_Number
DOG123    [DOG127]
DOG126          []
DOG127    [DOG123]
Name: Cites_Dogs, dtype: object

For better performance, filter first with boolean indexing and isin, then apply groupby, and finally add back the groups that had no matches as empty lists:

s = (all_cites_dog[all_cites_dog['Cites_Dogs'].isin(all_cites_dog['Dog_Number'].unique())]
             .groupby('Dog_Number')['Cites_Dogs']
             .apply(list))

idx = np.setdiff1d(all_cites_dog['Dog_Number'].unique(), s.index)
# one separate empty list per missing group ([[]] * n would share a single list object)
s1 = pd.Series([[] for _ in idx], index=idx)
print (s1)
DOG126    []
dtype: object

# Series.append was removed in pandas 2.0; pd.concat does the same job
s = pd.concat([s, s1]).sort_index()
print (s)
DOG123    [DOG127]
DOG126          []
DOG127    [DOG123]
dtype: object
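
As a minimal, self-contained sketch of the faster approach above, assuming a recent pandas where pd.concat is used instead of Series.append. The sample frame is copied from the question; the names dog_numbers, matched, missing and empty are illustrative and not part of the original answer.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    'Cites_Dogs': ['DOG45555', 'DOG127', 'DOG7760', 'DOG45', 'DOG559',
                   'DOG760', 'DOG123', 'DOG789', 'DOG860'],
    'Dog_Number': ['DOG123', 'DOG123', 'DOG126', 'DOG126', 'DOG126',
                   'DOG126', 'DOG127', 'DOG127', 'DOG127'],
})

dog_numbers = df['Dog_Number'].unique()

# Keep only rows whose cited dog is itself a Dog_Number, then build the lists.
matched = (df[df['Cites_Dogs'].isin(dog_numbers)]
           .groupby('Dog_Number')['Cites_Dogs']
           .apply(list))

# Groups with no surviving rows disappear from the groupby, so add them back
# with one fresh empty list each.
missing = [d for d in dog_numbers if d not in matched.index]
empty = pd.Series([[] for _ in missing], index=missing, dtype=object)

result = pd.concat([matched, empty]).sort_index()
print(result)
```

The row filter runs once over the whole column, so the per-group Python work is limited to building the lists.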

Answer 1: (score: 1)

You can use apply with a list comprehension to keep only the elements that appear in the index:

l = all_cites_dog.index
all_cites_dog.apply(lambda x: [i for i in x if i in l])

Dog_Number
DOG123    [DOG127]
DOG126          []
DOG127    [DOG123]
Name: Cites_Dogs, dtype: object
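
A runnable sketch of the same idea, assuming the grouped Series from the question is rebuilt by hand. The names cites, valid and filtered are illustrative; converting the index to a set is an optional choice made here, since testing membership directly against the index also works.

```python
import pandas as pd

# The grouped Series from the question, written out explicitly.
cites = pd.Series(
    [['DOG45555', 'DOG127'],
     ['DOG7760', 'DOG45', 'DOG559', 'DOG760'],
     ['DOG123', 'DOG789', 'DOG860']],
    index=pd.Index(['DOG123', 'DOG126', 'DOG127'], name='Dog_Number'),
    name='Cites_Dogs',
)

# Index labels that an item must match to be kept.
valid = set(cites.index)

# Rebuild each list, keeping only items that are also index labels.
filtered = cites.apply(lambda items: [i for i in items if i in valid])
print(filtered)
```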

Answer 2: (score: 1)

You can filter with an isin-style check.

(df.set_index('Dog_Number')
   .query("Cites_Dogs in index")
   .reindex(df.Dog_Number.unique()))

           Cites_Dogs
Dog_Number           
DOG123         DOG127
DOG126            NaN
DOG127         DOG123

If you need to reduce the result further, you can chain a groupby:

(df.set_index('Dog_Number')
   .query("Cites_Dogs in index")
   .reindex(df.Dog_Number.unique())
   .groupby(level=0)['Cites_Dogs']
   .apply(pd.Series.tolist))

Dog_Number
DOG123    [DOG127]
DOG126       [nan]
DOG127    [DOG123]
Name: Cites_Dogs, dtype: object
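
Note that the chained version yields [nan] for DOG126 rather than an empty list. One possible adjustment, shown here as a sketch and not as part of the original answer, is to drop the NaN placeholders inside each group before converting to a list. The sample frame is copied from the question.

```python
import pandas as pd

df = pd.DataFrame({
    'Cites_Dogs': ['DOG45555', 'DOG127', 'DOG7760', 'DOG45', 'DOG559',
                   'DOG760', 'DOG123', 'DOG789', 'DOG860'],
    'Dog_Number': ['DOG123', 'DOG123', 'DOG126', 'DOG126', 'DOG126',
                   'DOG126', 'DOG127', 'DOG127', 'DOG127'],
})

result = (df.set_index('Dog_Number')
            .query("Cites_Dogs in index")          # keep rows citing a known Dog_Number
            .reindex(df['Dog_Number'].unique())    # restore groups that lost all rows (as NaN)
            .groupby(level=0)['Cites_Dogs']
            .apply(lambda s: s.dropna().tolist())) # NaN placeholder becomes an empty list
print(result)
```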

Another option is groupby with apply, using a set for the membership check.

s = set(df.Dog_Number)
df.groupby('Dog_Number').Cites_Dogs.apply(lambda x: x[x.isin(s)].tolist())

Dog_Number
DOG123    [DOG127]
DOG126          []
DOG127    [DOG123]
Name: Cites_Dogs, dtype: object

Answer 3: (score: 1)

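You can follow these main steps:

  1. Filter the DataFrame on Cites_Dogs, keeping rows whose value appears in Dog_Number.
  2. Perform a groupby with apply + list.
  3. Reindex by the unique dog numbers.
  4. Replace NaN values with empty lists for consistency.

Here is a demonstration: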

unq_dogs = df['Dog_Number'].unique()

res = df.loc[df['Cites_Dogs'].isin(unq_dogs)]\
        .groupby('Dog_Number')['Cites_Dogs'].apply(list)\
        .reindex(unq_dogs)\
        .fillna(pd.Series([[] for _ in range(len(unq_dogs))], index=unq_dogs))\
        .reset_index()

print(res)

  Dog_Number Cites_Dogs
0     DOG123   [DOG127]
1     DOG126         []
2     DOG127   [DOG123]
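
The four steps can also be sketched as a self-contained script. This variant is an assumption-laden alternative, not the answer's exact code: step 4 uses an isinstance check in place of fillna with a Series of lists, and rename_axis is added only to make the index name explicit before reset_index.

```python
import pandas as pd

df = pd.DataFrame({
    'Cites_Dogs': ['DOG45555', 'DOG127', 'DOG7760', 'DOG45', 'DOG559',
                   'DOG760', 'DOG123', 'DOG789', 'DOG860'],
    'Dog_Number': ['DOG123', 'DOG123', 'DOG126', 'DOG126', 'DOG126',
                   'DOG126', 'DOG127', 'DOG127', 'DOG127'],
})

unq_dogs = df['Dog_Number'].unique()

res = (df.loc[df['Cites_Dogs'].isin(unq_dogs)]               # 1. filter rows
         .groupby('Dog_Number')['Cites_Dogs'].apply(list)    # 2. collect lists per group
         .reindex(unq_dogs)                                  # 3. bring back missing groups as NaN
         .apply(lambda v: v if isinstance(v, list) else [])  # 4. NaN becomes an empty list
         .rename_axis('Dog_Number')
         .reset_index())
print(res)
```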

Answer 4: (score: 0)

Try this and see whether it works for you as a short solution:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Cites_Dogs':  ['DOG45555' ,'DOG127' , 'DOG7760' ,'DOG45','DOG559','DOG760','DOG123','DOG789','DOG860'],
               'Dog_Number': ['DOG123', 'DOG123', 'DOG126', 'DOG126', 'DOG126', 'DOG126', 'DOG127', 'DOG127', 'DOG127']})
a = ['DOG123', 'DOG126', 'DOG127']

# use .loc instead of chained indexing so the assignment reliably hits df itself
df.loc[~df['Cites_Dogs'].isin(a), 'Cites_Dogs'] = np.nan

df.replace([np.nan], '', inplace=True)

df = df.groupby('Dog_Number')['Cites_Dogs'].apply(list)

# and output looks like this
Dog_Number
DOG123      [, DOG127]
DOG126        [, , , ]
DOG127    [DOG123, , ]
Name: Cites_Dogs, dtype: object

Thanks!