Question

I have a DataFrame with 3 columns:

df:

x       y      z
334     290    3350.0
334     291    3350.5
334     292    3360.1
335     292    3360.1
335     292    3360.1
335     290    3351.0
335     290    3352.5
335     291    3333.1
335     291    3333.1
.
.

I want to check the values of every row from row = n to row = n+7 and parse them into a new DataFrame, based on the following conditions:

df[n] != df[n+1]
df[n] != df[n+3]
df[n] != df[n+5]
df['x'][n] < df['x'][n+2]
df['x'][n] > df['x'][n+3]

If all of these are satisfied, I want to write a new DataFrame:

df_new = pd.concat([df[n], df[n+1], df[n+2], df[n+3], 
df[n+4], df[n+5], df[n+6], df[n+7]])

So the algorithm and its output would look like this:

for df[n] = 0:
1) [334     290    3350.0] != [334     291    3350.5]  True
2) [334     290    3350.0] != [335     292    3360.1]  True
3) [334     290    3350.0] != [335     290    3351.0]  True
4) 335 < 334  False
5) 335 > 335  False

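So in this case the window is skipped, and the check moves on until we have traversed the whole length of the DataFrame looking for matches.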

df_new(first iteration) = df_new.concat([....]) = empty row values

Is there a simple way to do this quickly in Pandas?

Answer 1

A. Get the appropriate shifts:

    n1 = df.shift(-1)
    n2 = df.shift(-2)
    n3 = df.shift(-3)
    n5 = df.shift(-5)

B. Satisfy conditions 1, 2 and 3:

    cond = (df != n1) & (df != n3) & (df != n5)

C. Satisfy conditions 4 and 5:

    # Condition 5 in the question is df['x'][n] > df['x'][n+3], hence the ">"
    cond['holder'] = (df.x < n2.x) & (df.x > n3.x)

D. Get a boolean Series (we want the rows where every entry is True):

    boolidx = cond.all(axis=1)

E. Use it to get the result:

    df.loc[boolidx]
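
Steps A to E can be put together roughly as follows. This is only a sketch under a few assumptions that are not in the answer above: "row n differs from row n+k" is read as "at least one column differs" (hence `any(axis=1)` per shift, whereas step B followed by `all(axis=1)` requires every column to differ), the frame has the default 0..N-1 index, and the sample data has one x value changed to 333 so that a match exists. The helper name `rows_differ` is made up for illustration.

```python
import pandas as pd

# Sample data; the x value in row 5 is changed to 333 (assumption) so a match exists
df = pd.DataFrame(
    [[334, 290, 3350.0],
     [334, 291, 3350.5],
     [334, 292, 3360.1],
     [335, 292, 3360.1],
     [335, 292, 3360.1],
     [333, 290, 3351.0],
     [335, 290, 3352.5],
     [335, 291, 3333.1],
     [335, 291, 3333.1]],
    columns=["x", "y", "z"],
)


def rows_differ(frame, k):
    """True where row n differs from row n+k in at least one column.

    Rows shifted in from beyond the end are NaN and therefore compare as
    different; the x comparisons below are False for NaN and filter them out.
    """
    return (frame != frame.shift(-k)).any(axis=1)


# Conditions 1-3: row n differs from rows n+1, n+3 and n+5
row_cond = rows_differ(df, 1) & rows_differ(df, 3) & rows_differ(df, 5)
# Conditions 4-5: x[n] < x[n+2] and x[n] > x[n+3]
x_cond = (df["x"] < df["x"].shift(-2)) & (df["x"] > df["x"].shift(-3))

boolidx = row_cond & x_cond
starts = df.index[boolidx]  # start rows n of the matching windows

# Collect rows n..n+7 for every match; iloc slicing is truncated at the
# end of the frame, so a window near the end may have fewer than 8 rows
windows = [df.iloc[n:n + 8] for n in starts]
df_new = pd.concat(windows) if windows else df.iloc[0:0]
print(df_new)
```

With this sample only n = 2 satisfies all five conditions, and because the frame has just 9 rows the collected window is cut off after row 8.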

Answer 2

I slightly changed your sample data to get a positive result:

import numpy as np
import pandas as pd

df = pd.DataFrame(data=[
    [ 334, 290, 3350.0 ],
    [ 334, 291, 3350.5 ],
    [ 334, 292, 3360.1 ],
    [ 335, 292, 3360.1 ],
    [ 335, 292, 3360.1 ],
    [ 333, 290, 3351.0 ],
    [ 335, 290, 3352.5 ],
    [ 335, 291, 3333.1 ],
    [ 335, 291, 3333.1 ]], columns=['x', 'y', 'z'])

Then, for efficiency reasons, I defined the following function:

def roll_win(a, win):
    shape = (a.shape[0] - win + 1, win, a.shape[1])
    strides = (a.strides[0],) + a.strides
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

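It generates a 3-D array whose 2nd and 3rd dimensions are "rolling windows" taken from the source NumPy array a. Each window has win rows and the window slides vertically, one row at a time. Processing consecutive windows therefore means looping over the first axis of the resulting array (see below).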
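
To make the shape of the result concrete, here is a small sketch on a toy 5 x 2 array (the array and the window size 3 are just for illustration). It also compares the result with `np.lib.stride_tricks.sliding_window_view`, which I assume is available (NumPy 1.20 or later); that function puts the window axis last, so it needs a transpose to match.

```python
import numpy as np


def roll_win(a, win):
    # Same function as above: windows of `win` consecutive rows of `a`
    shape = (a.shape[0] - win + 1, win, a.shape[1])
    strides = (a.strides[0],) + a.strides
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)


a = np.arange(10, dtype=float).reshape(5, 2)  # 5 rows, 2 columns
tbl = roll_win(a, 3)

print(tbl.shape)  # (3, 3, 2): 3 windows, each 3 rows by 2 columns
print(tbl[1])     # the second window, i.e. rows 1..3 of a

# Alternative without hand-written strides (assumes NumPy >= 1.20)
alt = np.lib.stride_tricks.sliding_window_view(a, 3, axis=0).transpose(0, 2, 1)
print(np.array_equal(tbl, alt))
```

Both variants return views on the original data rather than copies, so the windows should be treated as read-only.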

Because it relies on as_strided, it runs significantly faster than any "ordinary" Python loop (compare the execution time with the other solutions).

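I couldn't use the rolling windows provided by Pandas, because they were created to compute statistics, not to call a user function on the whole content of the current window.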

Then I called this function:

tbl = roll_win(df.values, 7)

Note that a NumPy array must have a single element type, so the type is "generalized" to float, because one of the source columns is of float type.
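
A quick way to see this type "generalization" for yourself (a tiny sketch on a two-row frame made up for the purpose):

```python
import pandas as pd

# Two integer columns and one float column, as in the question
df = pd.DataFrame({"x": [334, 335], "y": [290, 292], "z": [3350.0, 3360.1]})

print(df.dtypes)        # x and y are integer columns, z is float64
print(df.values.dtype)  # float64: one common type for the whole array
```

`df.to_numpy()` gives the same array and is the spelling the pandas documentation recommends nowadays.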

Then we need a preparation step for the loop that processes each rolling window:

res = []    # Result container
idx = 0     # Rolling window index

The rest of the program is the loop:

while idx < len(tbl):
    tt = tbl[idx]  # Get the current rolling window (2-D)
    r0 = tt[0]     # Row 0
    # Condition: row 0 must differ from rows 1, 3 and 5 (in at least one
    # column each), and x in row 0 must satisfy x[3] < x[0] < x[2]
    cond = ((r0 != tt[1]).any() and (r0 != tt[3]).any()
            and (r0 != tt[5]).any()
            and tt[0][0] < tt[2][0] and tt[0][0] > tt[3][0])
    if cond:   # OK
        # print(idx)
        # print(tt)
        res.extend(tt)  # Add to result
        idx += 7        # Skip the current result
    else:      # Failed
        idx += 1        # Next loop for the next window

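In the "positive" case, I decided to start the next loop from the row following the current match (idx += 7), to avoid partially overlapping sets of source rows. If you don't want this behaviour, add 1 to idx in both cases.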
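
The difference between the two stepping strategies can be seen in isolation with a small sketch. The function name `collect_starts` and the list `flags` are hypothetical: `flags[i]` stands for "the window starting at row i satisfies the condition".

```python
def collect_starts(flags, win, skip_matched=True):
    """Return the start indices of the accepted windows.

    With skip_matched=True the rows of an accepted window are not reused
    (idx += win); otherwise every matching window is kept (idx += 1).
    """
    starts = []
    idx = 0
    while idx < len(flags):
        if flags[idx]:
            starts.append(idx)
            idx += win if skip_matched else 1
        else:
            idx += 1
    return starts


flags = [False, True, True, False, False, False, False, False, True, True]
print(collect_starts(flags, 7))                      # [1, 8]
print(collect_starts(flags, 7, skip_matched=False))  # [1, 2, 8, 9]
```

Without skipping, the windows starting at rows 1 and 2 share six rows, so those rows would appear twice in the result.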

For demonstration purposes, you can uncomment the test printouts above.

The only thing left is to create the target DataFrame from the rows collected in res:

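df2 = pd.DataFrame(data=res, columns=['x', 'y', 'z']).astype({'x': int, 'y': int})

Note that only the x and y columns are converted to int, because the values in the z column have fractional parts. The cast is applied per column here because passing dtype=int to the constructor may raise an error on recent pandas versions when a column holds non-integer values.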

Slicing a DataFrame under multiple conditions in Python
