How do I select rows of a DataFrame that lie between two values in Python pandas?

Asked: 2015-07-24 18:56:10

Tags: python pandas

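I am trying to modify the DataFrame df so that it contains only the rows whose value in the closing_price column is between 99 and 101, and I am trying to do this with the code below.

However, I get the error

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()

I am wondering if there is a way to do this without using loops.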

df = df[(99 <= df['closing_price'] <= 101)]

8 Answers:

Answer 0: (score: 77)

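Also consider Series.between. Both bounds are included by default; pandas 1.3 and later expect a string such as inclusive="both" rather than a boolean, so the argument is omitted here:

df = df[df['closing_price'].between(99, 101)]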
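As a minimal sketch of how between behaves, here is a run on made-up prices (the data is illustrative, not from the question). It relies only on the default behaviour, which keeps both endpoints:

```python
import pandas as pd

# Illustrative data, not taken from the question.
df = pd.DataFrame({'closing_price': [98.5, 99.0, 100.2, 101.0, 101.5]})

# between() returns a boolean Series; by default both bounds are included.
mask = df['closing_price'].between(99, 101)
filtered = df[mask]
print(filtered)
```

The rows holding 99.0 and 101.0 are kept because the endpoints count as inside the range.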

Answer 1: (score: 53)

You should use () to group your boolean vectors to remove the ambiguity.

df = df[(df['closing_price'] >= 99) & (df['closing_price'] <= 101)]
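To illustrate why the original expression fails, the sketch below (on made-up data) reproduces the ValueError and then applies the parenthesised form. Python evaluates a chained comparison as two comparisons joined by `and`, and `and` needs a single truth value that a Series cannot supply:

```python
import pandas as pd

# Illustrative data, not taken from the question.
df = pd.DataFrame({'closing_price': [98, 99, 100, 101, 102]})
prices = df['closing_price']

# 99 <= prices <= 101 means (99 <= prices) and (prices <= 101);
# `and` asks the first Series for one True/False value, which is ambiguous.
try:
    df[99 <= prices <= 101]
    raised = False
except ValueError:
    raised = True

# Element-wise version: each comparison in its own parentheses, joined by &.
mask = (prices >= 99) & (prices <= 101)
result = df[mask]
print(raised, result['closing_price'].tolist())
```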

Answer 2: (score: 17)

There is a nicer alternative: use the query() method:

In [58]: df = pd.DataFrame({'closing_price': np.random.randint(95, 105, 10)})

In [59]: df
Out[59]:
   closing_price
0            104
1             99
2             98
3             95
4            103
5            101
6            101
7             99
8             95
9             96

In [60]: df.query('99 <= closing_price <= 101')
Out[60]:
   closing_price
1             99
5            101
6            101
7             99

Update: in reply to this comment:

"I like the syntax here, but fell down when trying to combine it with an expression; df.query('(mean + 2 *sd) <= closing_price <=(mean + 2 *sd)')"

In [161]: qry = "(closing_price.mean() - 2*closing_price.std())" +\
     ...:       " <= closing_price <= " + \
     ...:       "(closing_price.mean() + 2*closing_price.std())"
     ...:

In [162]: df.query(qry)
Out[162]:
   closing_price
0             97
1            101
2             97
3             95
4            100
5             99
6            100
7            101
8             99
9             95

Answer 3: (score: 4)

newdf = df.query('closing_price.mean() <= closing_price <= closing_price.std()')

# Compute the bounds from the column first, then reference them with @.
mean = df['closing_price'].mean()
std = df['closing_price'].std()

newdf = df.query('@mean <= closing_price <= @std')
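A self-contained sketch of the @ syntax follows. The data and the bounds lower and upper (mean minus and plus one standard deviation) are assumptions chosen for illustration, not part of the answer:

```python
import pandas as pd

# Illustrative data, not taken from the question.
df = pd.DataFrame({'closing_price': [95, 97, 99, 100, 101, 103, 120]})

mean = df['closing_price'].mean()
std = df['closing_price'].std()
lower = mean - std  # hypothetical lower bound
upper = mean + std  # hypothetical upper bound

# @name refers to a Python variable in the calling scope.
within = df.query('@lower <= closing_price <= @upper')

# The same selection written as an explicit boolean mask.
expected = df[(df['closing_price'] >= lower) & (df['closing_price'] <= upper)]
print(within)
```

Computing the bounds once in Python and passing them with @ keeps the query string short.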

Answer 4: (score: 2)

You can also use the .between() method:

emp = pd.read_csv("C:\\py\\programs\\pandas_2\\pandas\\employees.csv")

emp[emp["Salary"].between(60000, 61000)]
  


Answer 5: (score: 0)

Instead of this

df = df[(99 <= df['closing_price'] <= 101)]

you should use this

df = df[(df['closing_price']>=99 ) & (df['closing_price']<=101)]

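We have to use NumPy's bitwise logic operators |, &, ~, ^ for compound queries. The parentheses are also essential, because these operators bind more tightly than comparisons such as >= and <=.

For more information, see: Comparisons, Masks, and Boolean Logic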
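As a rough sketch of the other operators and of the precedence issue, the example below (made-up data) selects the rows outside the range in two equivalent ways and shows what happens when the parentheses are left out:

```python
import pandas as pd

# Illustrative data, not taken from the question.
df = pd.DataFrame({'closing_price': [98, 99, 100, 101, 102]})
p = df['closing_price']

inside = (p >= 99) & (p <= 101)
outside_or = (p < 99) | (p > 101)  # | is element-wise "or"
outside_not = ~inside              # ~ is element-wise "not"

# Without parentheses, & is applied first: p >= (99 & p) <= 101.
# That is a chained comparison again, so it raises ValueError.
try:
    p >= 99 & p <= 101
    raised = False
except ValueError:
    raised = True

print(df[outside_not])
```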

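Answer 6: (score: 0)

If you have to deal with multiple values and multiple inputs, you can also set up an apply function like this. In this case, it filters a DataFrame for GPS locations that fall within certain ranges.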

def filter_values(lat,lon):
    if abs(lat - 33.77) < .01 and abs(lon - -118.16) < .01:
        return True
    elif abs(lat - 37.79) < .01 and abs(lon - -122.39) < .01:
        return True
    else:
        return False


df = df[df.apply(lambda x: filter_values(x['lat'],x['lon']),axis=1)]
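The same filter can also be written without apply, which avoids calling a Python function once per row. This is a sketch of an alternative and not the answer's own method; the coordinates in the sample frame are made up:

```python
import pandas as pd

# Illustrative coordinates, not taken from the answer.
df = pd.DataFrame({
    'lat': [33.772, 37.795, 40.000, 33.700],
    'lon': [-118.161, -122.385, -100.000, -118.160],
})

# One boolean mask per target location, combined with | ("or").
near_a = ((df['lat'] - 33.77).abs() < .01) & ((df['lon'] - -118.16).abs() < .01)
near_b = ((df['lat'] - 37.79).abs() < .01) & ((df['lon'] - -122.39).abs() < .01)

filtered = df[near_a | near_b]
print(filtered)
```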

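Answer 7: (score: 0)

If you have to call pd.Series.between(l, r) repeatedly (for different bounds l and r), a lot of work is repeated unnecessarily. In that case, it pays to sort the frame/series once and then use pd.Series.searchsorted(). I measured a speedup of up to 25x, see below.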

def between_indices(x, lower, upper, inclusive=True):
    """
    Returns indices (i, j) such that x.iloc[i:j] holds exactly the values
    with lower <= x <= upper (j is exclusive), assuming that x is sorted.
    """
    i = x.searchsorted(lower, side="left" if inclusive else "right")
    j = x.searchsorted(upper, side="right" if inclusive else "left")
    return i, j

# Sort x once before repeated calls of between()
x = x.sort_values().reset_index(drop=True)
# x = x.sort_values(ignore_index=True) # for pandas>=1.0
ret1 = between_indices(x, lower=0.1, upper=0.9)
ret2 = between_indices(x, lower=0.2, upper=0.8)
ret3 = ...
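A short usage sketch on a tiny, made-up Series shows how the returned pair is meant to be used: it is a half-open slice into the sorted data. The function is repeated here in simplified form (inclusive bounds only) so that the block runs on its own:

```python
import pandas as pd

def between_indices(x, lower, upper):
    # Assumes x is sorted; inclusive bounds only in this simplified copy.
    i = x.searchsorted(lower, side="left")
    j = x.searchsorted(upper, side="right")
    return i, j

# Illustrative data, sorted once up front.
x = pd.Series([0.5, 0.1, 0.9, 0.3, 0.7]).sort_values().reset_index(drop=True)

i, j = between_indices(x, lower=0.3, upper=0.7)
selected = x.iloc[i:j]  # rows i .. j-1 of the sorted Series
print(selected.tolist())
```

The slice matches what x[x.between(0.3, 0.7)] returns, but each lookup costs two binary searches instead of a full scan.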

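Benchmark

Measure repeated evaluations (n_reps=100) of pd.Series.between() and of the method based on pd.Series.searchsorted(), for different arguments lower and upper. On a MacBook Pro 2015 with Python v3.8.0 and Pandas v1.0.3, the code below produces the following output
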

# pd.Series.searchsorted()
# 5.87 ms ± 321 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# pd.Series.between(lower, upper)
# 155 ms ± 6.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# Logical expressions: (x>=lower) & (x<=upper)
# 153 ms ± 3.52 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

import numpy as np
import pandas as pd

def between_indices(x, lower, upper, inclusive=True):
    # Assumption: x is sorted.
    i = x.searchsorted(lower, side="left" if inclusive else "right")
    j = x.searchsorted(upper, side="right" if inclusive else "left")
    return i, j

def between_fast(x, lower, upper, inclusive=True):
    """
    Equivalent to pd.Series.between() under the assumption that x is sorted.
    """
    i, j = between_indices(x, lower, upper, inclusive)
    if True:
        return x.iloc[i:j]
    else:
        # Mask creation is slow.
        mask = np.zeros_like(x, dtype=bool)
        mask[i:j] = True
        mask = pd.Series(mask, index=x.index)
        return x[mask]

def between(x, lower, upper, inclusive=True):
    mask = x.between(lower, upper, inclusive=inclusive)
    return x[mask]

def between_expr(x, lower, upper, inclusive=True):
    if inclusive:
        mask = (x>=lower) & (x<=upper)
    else:
        mask = (x>lower) & (x<upper)
    return x[mask]

def benchmark(func, x, lowers, uppers):
    for l,u in zip(lowers, uppers):
        func(x,lower=l,upper=u)

n_samples = 1000
n_reps = 100
x = pd.Series(np.random.randn(n_samples))
# Sort the Series.
# For pandas>=1.0:
# x = x.sort_values(ignore_index=True)
x = x.sort_values().reset_index(drop=True)

# Assert equivalence of different methods.
assert(between_fast(x, 0, 1, True ).equals(between(x, 0, 1, True)))
assert(between_expr(x, 0, 1, True ).equals(between(x, 0, 1, True)))
assert(between_fast(x, 0, 1, False).equals(between(x, 0, 1, False)))
assert(between_expr(x, 0, 1, False).equals(between(x, 0, 1, False)))

# Benchmark repeated evaluations of between().
uppers = np.linspace(0, 3, n_reps)
lowers = -uppers
%timeit benchmark(between_fast, x, lowers, uppers)
%timeit benchmark(between, x, lowers, uppers)
%timeit benchmark(between_expr, x, lowers, uppers)