Question

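I have a DataFrame, and I want a 0 in one column wherever another column is np.nan. This lets me get two different counts based on two different columns that have NaNs in different places. I aggregate the binned data with mean for the averages and sum for the counts. The code below works, but the .loc lines are very slow on my real data.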

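import numpy as np
import pandas as pd

my_df = pd.DataFrame({"a": np.random.random(100),
                      "b": np.random.random(100),
                      "id": np.arange(100)})

# Set the NaNs with .loc rather than chained indexing
my_df.loc[23, 'a'] = np.nan
my_df.loc[56, 'b'] = np.nan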

my_df['count_type1'] = 1
my_df['count_type2'] = 1

my_df.loc[my_df.a.isnull(), 'count_type1'] = 0
my_df.loc[my_df.b.isnull(), 'count_type2'] = 0

bins = np.linspace(0, 1, 10)
groups = my_df.groupby(np.digitize(my_df.a, bins))

binned_data_mean = groups.mean()
binned_data_counts = groups.sum()

binned_data_mean['count_type1'] = binned_data_counts['count_type1']
binned_data_mean['count_type2'] = binned_data_counts['count_type2']

Is there a faster way to achieve what I want?

Answer 1

If you need an indicator variable, I would probably do something like this.

In [28]: %timeit my_df['count_type1'] = my_df.a.where(my_df.a.isnull(),1).fillna(0)
1000 loops, best of 3: 611 µs per loop
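
To see why that expression yields an indicator, here is a small sketch on a made-up Series (the name `s` and its values are illustrative, not from the question). `where` keeps the entries for which the condition is True, which here are the NaNs, and replaces everything else with 1; `fillna(0)` then turns the remaining NaNs into 0.

```python
import numpy as np
import pandas as pd

s = pd.Series([0.2, np.nan, 0.7])  # hypothetical data, not from the question

# Keep entries where the condition holds (the NaNs); replace the rest with 1
step1 = s.where(s.isnull(), 1)

# Turn the remaining NaNs into 0, leaving a 1/0 indicator
indicator = step1.fillna(0)

print(indicator.tolist())
```

Note that the result stays float, since the intermediate Series has to hold NaN.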

This is better:

In [47]: %timeit my_df['count_type1'] = my_df.a.notnull().astype(int)
1000 loops, best of 3: 275 µs per loop
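
Putting it together, the question's pipeline might look roughly like the sketch below with the `.loc` lines swapped for `notnull().astype(int)`. The fixed seed and the names `binned` and `direct_counts` are assumptions added for illustration. The last line shows a possible alternative that skips the indicator columns entirely, since a groupby `count()` tallies only non-NaN entries.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed is an assumption, for reproducibility
my_df = pd.DataFrame({"a": rng.random(100),
                      "b": rng.random(100),
                      "id": np.arange(100)})
my_df.loc[23, "a"] = np.nan
my_df.loc[56, "b"] = np.nan

# Indicator columns: 1 where the value is present, 0 where it is NaN
my_df["count_type1"] = my_df["a"].notnull().astype(int)
my_df["count_type2"] = my_df["b"].notnull().astype(int)

bins = np.linspace(0, 1, 10)
groups = my_df.groupby(np.digitize(my_df["a"], bins))

# Means per bin, with the per-bin sums of the indicators as counts
binned = groups[["a", "b", "id"]].mean()
binned["count_type1"] = groups["count_type1"].sum()
binned["count_type2"] = groups["count_type2"].sum()

# Alternative: count() only counts non-NaN entries in each group
direct_counts = groups[["a", "b"]].count()
```

In this sketch the two approaches give the same per-bin counts, so which one to use is mostly a matter of whether the indicator columns are needed elsewhere.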
