Question

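I have a pandas DataFrame built from a table in which each (t_id, s_id) pair is unique. I want to drop all records for any t_id whose country_date is null for every one of its s_ids.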

Sample data:

t_id s_id country_date
T1   S1   jan
T1   S2   mar
T2   S1   
T2   S2
T3   S2   jan
T3   S3

Expected result:

t_id s_id country_date
T1   S1   jan
T1   S2   mar
T3   S2   jan
T3   S3

I wrote the following line, but it is wrong:

raw_data.groupby("t_id").country_date.max().notnull()

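Please suggest a way to filter the DataFrame records by the criterion above. I would also like to print the t_ids that were filtered out.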

Answer 1

Use isnull and all:

df.groupby('t_id').filter(lambda x: not x.country_date.isnull().all())

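If the blanks are empty strings ('') rather than NaN, you may need to convert them first (pd.np was removed in pandas 2.0, so NumPy is imported directly):

import numpy as np

df.replace('', np.nan).groupby('t_id').filter(lambda x: not x.country_date.isnull().all())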

Output:

  t_id s_id country_date
0   T1   S1          jan
1   T1   S2          mar
4   T3   S2          jan
5   T3   S3          NaN

And to see the IDs that were dropped:

df.groupby('t_id').filter(lambda x: x.country_date.isnull().all())['t_id'].unique()

Output:

array(['T2'], dtype=object)
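
As a possible alternative to calling filter with a lambda, the same criterion can be expressed as a boolean mask built with transform, which gives both the kept rows and the dropped IDs from a single pass. This is only a sketch: it rebuilds the sample data from the question, assumes the blanks are NaN rather than empty strings, and the names has_date, kept and dropped_ids are illustrative.

```python
import numpy as np
import pandas as pd

# Sample data from the question; blanks are assumed to be NaN.
df = pd.DataFrame({
    "t_id": ["T1", "T1", "T2", "T2", "T3", "T3"],
    "s_id": ["S1", "S2", "S1", "S2", "S2", "S3"],
    "country_date": ["jan", "mar", np.nan, np.nan, "jan", np.nan],
})

# For each row: does its t_id group contain at least one non-null country_date?
has_date = df["country_date"].notna().groupby(df["t_id"]).transform("any")

kept = df[has_date]                               # rows of T1 and T3
dropped_ids = df.loc[~has_date, "t_id"].unique()  # t_ids with no date at all

print(kept)
print(dropped_ids)
```

If the blanks are empty strings, applying df.replace('', np.nan) before building the mask should give the same result.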