Question

I have a dataset loaded into a pandas DataFrame that looks like this:

n=2
np.random.seed(1)
dt = pd.DataFrame((np.random.rand(5, 2*n+1)*1000).round(), columns=['id', 'x_0', 'y_0', 'x_1', 'y_1'])
>>> dt
    id  x_0  y_0  x_1  y_1
0  417  720    0  302  147
1   92  186  346  397  539
2  419  685  204  878   27
3  670  417  559  140  198
4  801  968  313  692  876

[5 rows x 5 columns]

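I know this only works for n = 2, and at this point I don't know how to build the column names for an arbitrary n (but I suppose that is a question for another topic).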
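For reference, one way to generate the column names for an arbitrary n might look like the sketch below. The helper name `make_columns` is hypothetical (it is not part of pandas), and n = 3 is chosen only for illustration.

```python
import numpy as np
import pandas as pd

def make_columns(n):
    """Hypothetical helper: return ['id', 'x_0', 'y_0', ..., 'x_{n-1}', 'y_{n-1}']."""
    columns = ['id']
    for i in range(n):
        # One (x, y) pair of columns per month i.
        columns += [f'x_{i}', f'y_{i}']
    return columns

n = 3  # illustrative value, not from the original post
np.random.seed(1)
dt = pd.DataFrame((np.random.rand(5, 2*n+1)*1000).round(), columns=make_columns(n))
print(list(dt.columns))  # ['id', 'x_0', 'y_0', 'x_1', 'y_1', 'x_2', 'y_2']
```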

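In general, I can have n blocks of x and y columns (this is monthly data).

What I need is to check whether the values of x_i and y_i both exceed certain thresholds in the same month, returning 1 if that happens in any of the n months and 0 otherwise.

So, I am fiddling around with: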

>>> (dt[range(1, 2*n+1, 2)] > 400)
     x_0    x_1
0   True  False
1  False  False
2   True   True
3   True  False
4   True   True

[5 rows x 2 columns]
>>> (dt[range(2, 2*n+1, 2)] > 300)
     y_0    y_1
0  False  False
1   True   True
2  False  False
3   True  False
4   True   True

[5 rows x 2 columns]

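I want to check whether the x_i value exceeds 400 and the y_i value exceeds 300. This produces two DataFrames, one for the x values and one for the y values (each n columns wide), which is fine. But when I try: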

(dt[range(1, 2*n+1, 2)] > 400) & (dt[range(2, 2*n+1, 2)] > 300)

it does not apply the & operator element-wise, but instead returns a DataFrame of NaNs that is 2*n columns wide:

   x_0  x_1  y_0  y_1
0  NaN  NaN  NaN  NaN
1  NaN  NaN  NaN  NaN
2  NaN  NaN  NaN  NaN
3  NaN  NaN  NaN  NaN
4  NaN  NaN  NaN  NaN

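I am obviously missing something here. The question is what the logic behind this is and how to make it work.

If I get this to work, I would then try to use the apply() function with any().

Any suggestions are appreciated.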

*EDIT: Here is also an R snippet that solves the problem. Perhaps my R way of approaching Python code is what is "burdening" me here.

> n=2
> dt <- data.frame(id = c(417, 92, 419, 670, 801),
+                     x_0 = c(720, 186, 685, 417, 968),
+                     y_0 = c(0, 346, 204, 559, 313),
+                     x_1 = c(302, 397, 878, 140, 692),
+                     y_1 = c(147, 539, 27, 198, 876))

> (x <- (dt[,seq(2, 2*n+1, by=2)] > 400) & (dt[,seq(3, 2*n+1, by=2)] > 300))
       x_0   x_1
[1,] FALSE FALSE
[2,] FALSE FALSE
[3,] FALSE FALSE
[4,]  TRUE FALSE
[5,]  TRUE  TRUE
> (result <- apply(x, 1, any, na.rm=T))
[1] FALSE FALSE FALSE  TRUE  TRUE

Answer 1

This uses column-name indexing and, instead of applying logical operators to the columns, iterates over the rows with the apply function:

import numpy as np
import pandas as pd

n = 2
dt = pd.DataFrame((np.random.rand(5, 2*n+1)*1000).round(), columns=['id', 'x_0', 'y_0', 'x_1', 'y_1'])
print(dt)

def check_x(row):
    # Return 1 if, in any single month i, x_i > 400 and y_i > 300; otherwise 0.
    columns_with_x = [col for col in row.index if col.startswith('x_')]
    for col_x in columns_with_x:
        # Pair each x column with the y column of the same month (x_0 with y_0, ...).
        col_y = 'y_' + col_x[len('x_'):]
        if row[col_x] > 400 and row[col_y] > 300:
            return 1
    return 0

# axis=1 passes each row to check_x as a Series indexed by column name.
checked = dt.apply(check_x, axis=1)

print(checked)

Output:

    id  x_0  y_0  x_1  y_1
0  251  525  976  743  206
1  324  354  238  413   93
2   21  999  731  416  431
3  652  926  131  510  627
4  124  387  747  972  678

0    1
1    0
2    1
3    1
4    1
dtype: int64
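
As for why the original expression fails: pandas aligns two DataFrames on their row and column labels before applying &. The left frame has columns x_0, x_1 and the right has y_0, y_1, so no labels match and no x_i is ever paired with its y_i, which is consistent with the all-NaN result shown in the question. A minimal vectorized sketch, assuming the column naming from the question and using the question's own data, drops the labels by converting to NumPy arrays so that the comparison is positional:

```python
import pandas as pd

# Data copied from the question (n = 2).
dt = pd.DataFrame({
    'id':  [417, 92, 419, 670, 801],
    'x_0': [720, 186, 685, 417, 968],
    'y_0': [0, 346, 204, 559, 313],
    'x_1': [302, 397, 878, 140, 692],
    'y_1': [147, 539, 27, 198, 876],
})

n = 2
x_cols = [f'x_{i}' for i in range(n)]
y_cols = [f'y_{i}' for i in range(n)]

# Plain NumPy arrays carry no labels, so & pairs x_i with y_i by position.
both = (dt[x_cols].to_numpy() > 400) & (dt[y_cols].to_numpy() > 300)

# 1 if the condition holds in any month, else 0.
result = pd.Series(both.any(axis=1).astype(int), index=dt.index)
print(result.tolist())  # [0, 0, 0, 1, 1]
```

The result agrees with the R snippet in the question (FALSE FALSE FALSE TRUE TRUE), and it avoids the per-row Python loop of apply.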

Python pandas logical operators on blocks of columns
