Question

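I have a pandas DataFrame that I want to trim. I want to keep the rows where section is 2 and the identifier does not start with a digit. First, I want to count them. If I run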

len(analytic_events[analytic_events['section']==2].index)

I get the result 1247669.

When I narrow it down and run

len(analytic_events[(analytic_events['section']==2) & ~(analytic_events['identifier'][0].isdigit())].index)

I get exactly the same answer: 1247669.

I know, for example, that ten rows have as their identifier

.help.your_tools.subtopic2

which does not start with a digit, and that 15,000 rows have as their identifier

240.1007

which does start with a digit.

Why is my filter passing all rows, rather than only those whose identifier does not start with a digit?

Answer 1

You should use the str accessor on the identifier Series, like this:

len(analytic_events[(analytic_events['section']==2) & ~(analytic_events['identifier'].str[0].str.isdigit())].index)

Answer 2

Use str for the text functions, str[0] to get the first character of each string, and a final sum to count the True values:

mask= ((analytic_events['section']==2) & 
       ~(analytic_events['identifier'].str[0].str.isdigit()))

print (mask.sum())
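
As a rough check that summing the mask counts the same rows as filtering and taking the length, here is a small sketch. The frame `events` and its contents are hypothetical sample data built from the two identifiers mentioned in the question, not the asker's real data.

```python
import pandas as pd

# Hypothetical sample data, for illustration only.
events = pd.DataFrame(
    {'section': [2, 2, 3, 2],
     'identifier': ['240.1007', '.help.your_tools.subtopic2', 'abc', '7x']})

# .str[0] takes the first character of every identifier,
# .str.isdigit() tests each of those characters, ~ negates element-wise.
mask = (events['section'] == 2) & ~events['identifier'].str[0].str.isdigit()

# True counts as 1 and False as 0, so the sum is the number of kept rows.
count_from_sum = int(mask.sum())
count_from_len = len(events[mask].index)

print(count_from_sum, count_from_len)
```

Both counts agree because a boolean mask selects exactly the rows where it is True.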

If performance matters and there are no missing values, use a list comprehension:

import numpy as np
arr = ~np.array([x[0].isdigit() for x in analytic_events['identifier']])
mask = ((analytic_events['section']==2) & arr)
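
The sketch below compares the list-comprehension mask with the str-accessor mask on a small frame, assuming every identifier is a non-empty string. The frame `events` and the names `mask_list` and `mask_str` are hypothetical and exist only for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with no missing values and no empty strings.
events = pd.DataFrame(
    {'section': [2, 2, 2, 3, 2],
     'identifier': ['4hj', '8hj', 'gh', 'th', 'h6h']})

# Plain Python loop over the strings; ~ on a boolean NumPy array
# is an element-wise logical NOT.
arr = ~np.array([x[0].isdigit() for x in events['identifier']])
mask_list = (events['section'] == 2) & arr

# The same condition written with the vectorized str accessor.
mask_str = (events['section'] == 2) & ~events['identifier'].str[0].str.isdigit()

print(mask_list.tolist())
print(mask_str.tolist())
```

Note that `x[0]` raises an IndexError on an empty string; writing `x[:1].isdigit()` instead would treat empty strings as not starting with a digit.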

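EDIT:

Why is my filter passing all rows, rather than only those whose identifier does not start with a digit?

Test the output of your solution on sample data: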

analytic_events = pd.DataFrame(
                        {'section':[2,2,2,3,2],
                         'identifier':['4hj','8hj','gh','th','h6h']})

print (analytic_events)
   section identifier
0        2        4hj
1        2        8hj
2        2         gh
3        3         th
4        2        h6h

This selects the first value of the column, not the first character of each value:

print ((analytic_events['identifier'][0]))
4hj

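Checking whether that scalar consists only of digits returns a single Python bool, and applying ~ to it gives an integer: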

print ((analytic_events['identifier'][0].isdigit()))
False

print (~(analytic_events['identifier'][0].isdigit()))
-1
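
The difference between ~ on a Python scalar and ~ on a Series can be sketched as follows. `int(False)` stands in for `False` here only because newer Python versions warn when ~ is applied directly to a bool; the value is the same.

```python
import pandas as pd

# bool is a subclass of int, so ~False is computed as ~0.
scalar_result = ~int(False)      # -1, a non-zero (truthy) integer

# Logical negation of a plain Python bool is spelled `not`.
logical_result = not False       # True

# On a boolean Series, ~ is the element-wise logical NOT.
series_result = ~pd.Series([True, False])

print(scalar_result, logical_result, series_result.tolist())
```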

Combining that -1 with the first mask via & treats it as True for every row:

print ((analytic_events['section']==2) & ~(analytic_events['identifier'][0].isdigit()))
0     True
1     True
2     True
3    False
4     True
Name: section, dtype: bool

So it works as if the second mask did not exist:

print (analytic_events['section']==2)
0     True
1     True
2     True
3    False
4     True
Name: section, dtype: bool
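
For comparison, here is a sketch of the vectorized filter applied to the same sample frame. The intermediate names `in_section` and `starts_with_digit` are introduced only for readability and are not part of the original answer.

```python
import pandas as pd

analytic_events = pd.DataFrame(
    {'section': [2, 2, 2, 3, 2],
     'identifier': ['4hj', '8hj', 'gh', 'th', 'h6h']})

# One boolean per row for each condition.
in_section = analytic_events['section'] == 2
starts_with_digit = analytic_events['identifier'].str[0].str.isdigit()

# Both operands are Series, so & and ~ work element by element.
mask = in_section & ~starts_with_digit

print(mask.tolist())
print(analytic_events[mask])
```

Only the rows with identifiers gh and h6h in section 2 remain, which is the result the original filter was meant to produce.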
