Question

I have a problem with an if statement applied to data from a dataframe. When I test for years greater than 3Y, all values larger than 9Y are missing from the output, and it is not clear why. The output looks like the following:

4Y
5Y
6Y
7Y
8Y
9Y
4Y
5Y
6Y
7Y
8Y
9Y

My code looks like the following:

import pandas as pd

df = pd.DataFrame([
['2015-02-09', '1Y',    2.241],
['2015-02-09', '1Y',    2.413],
['2015-02-09', '2Y',    2.228],
['2015-02-09', '2Y', 2.289],
['2015-02-09', '3Y', 2.263],
['2015-02-09', '3Y', 2.371],
['2015-02-09', '4Y', 2.413],
['2015-02-09', '5Y', 2.487],
['2015-02-09', '6Y', 2.578],
['2015-02-09', '7Y', 2.655],
['2015-02-09', '8Y', 2.74959],
['2015-02-09', '9Y', 2.81729],
['2015-02-09', '10Y',   2.853],
['2015-02-09', '12Y',   2.942],
['2015-02-09', '15Y',   3.047],
['2015-02-09', '20Y',   3.165],
['2015-02-09', '25Y',   3.225],
['2015-02-09','30Y',    3.225],
['2015-02-09', '1Y',    9.5],
['2015-02-09', '2Y',    8.75],
['2015-02-09', '3Y',    8.5],
['2015-02-09', '4Y',    8.13],
['2015-02-09', '5Y',    7.75],
['2015-02-09', '6Y',    7.63],
['2015-02-09', '7Y',    7.5],
['2015-02-09', '8Y',    7.45],
['2015-02-09','9Y',     7.25],
['2015-02-09', '10Y',   7.125],
['2015-02-09', '12Y',   7.08],
['2015-02-09', '15Y',   7.04],
['2015-02-09', '20Y',   6.435],
['2015-02-09', '25Y',   5.83],
['2015-02-09', '30Y',   5.45]
], columns=['date', 'year', 'values'])

for index, row in df.iterrows():
    if row['year'] > '3Y':
        print(row['year'])

Answer 1

The problem is that you are comparing strings lexicographically, so '10Y' < '3Y'. The solution is to convert the values to integers.

df['mask'] = df['year'].str.extract(r'(\d+)', expand=False).astype(int) > 3

print(df)
          date year   values   mask
0   2015-02-09   1Y  2.24100  False
1   2015-02-09   1Y  2.41300  False
2   2015-02-09   2Y  2.22800  False
3   2015-02-09   2Y  2.28900  False
4   2015-02-09   3Y  2.26300  False
5   2015-02-09   3Y  2.37100  False
6   2015-02-09   4Y  2.41300   True
7   2015-02-09   5Y  2.48700   True
8   2015-02-09   6Y  2.57800   True
9   2015-02-09   7Y  2.65500   True
10  2015-02-09   8Y  2.74959   True
11  2015-02-09   9Y  2.81729   True
12  2015-02-09  10Y  2.85300   True
13  2015-02-09  12Y  2.94200   True
14  2015-02-09  15Y  3.04700   True
15  2015-02-09  20Y  3.16500   True
16  2015-02-09  25Y  3.22500   True
17  2015-02-09  30Y  3.22500   True
18  2015-02-09   1Y  9.50000  False
19  2015-02-09   2Y  8.75000  False
20  2015-02-09   3Y  8.50000  False
21  2015-02-09   4Y  8.13000   True
22  2015-02-09   5Y  7.75000   True
23  2015-02-09   6Y  7.63000   True
24  2015-02-09   7Y  7.50000   True
25  2015-02-09   8Y  7.45000   True
26  2015-02-09   9Y  7.25000   True
27  2015-02-09  10Y  7.12500   True
28  2015-02-09  12Y  7.08000   True
29  2015-02-09  15Y  7.04000   True
30  2015-02-09  20Y  6.43500   True
31  2015-02-09  25Y  5.83000   True
32  2015-02-09  30Y  5.45000   True
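
The same mask can also be used to select the matching rows directly, without a loop. The following is a minimal sketch on a shortened copy of the sample data; the variable name `selected` is only illustrative.

```python
import pandas as pd

# Shortened copy of the sample data from the question
df = pd.DataFrame(
    [['2015-02-09', '3Y', 2.263],
     ['2015-02-09', '4Y', 2.413],
     ['2015-02-09', '9Y', 2.81729],
     ['2015-02-09', '10Y', 2.853],
     ['2015-02-09', '30Y', 3.225]],
    columns=['date', 'year', 'values'])

# Extract the digits and compare them as integers, not as strings
mask = df['year'].str.extract(r'(\d+)', expand=False).astype(int) > 3

# Boolean indexing keeps only the rows where the mask is True
selected = df.loc[mask, 'year'].tolist()
print(selected)  # ['4Y', '9Y', '10Y', '30Y']
```

On a large dataframe this is usually preferable to iterrows, since the comparison runs on the whole column at once.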

Loop solution from @CristiFati's comment:

for index, row in df.iterrows():
    if int(row["year"][:-1]) > 3:
        print(row['year'])

Or use a regular expression:

import re
for index, row in df.iterrows():
    if int(re.search(r'\d+', row["year"]).group()) > 3:
        print(row['year'])

Alternatively, create an integer column first:

df['year-int'] = df['year'].str.extract(r'(\d+)', expand=False).astype(int)
for index, row in df.iterrows():
    if row["year-int"] > 3:
        print(row['year'])

Answer 2

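When comparing with >, different rules apply to strings. Try converting the values to int as a new column in the dataframe, and then print the rows where that column is > 3.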
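
A small sketch of the string rules referred to here: Python compares strings character by character, so '10Y' sorts before '3Y' because '1' < '3'. The column name `year_int` and the four sample values are assumptions chosen for illustration.

```python
import pandas as pd

# Strings are compared character by character: '1' < '3', so '10Y' < '3Y'
print('10Y' > '3Y')  # False
print('4Y' > '3Y')   # True

df = pd.DataFrame({'year': ['3Y', '4Y', '10Y', '25Y']})

# String comparison drops every label whose first character is below '3'
string_result = df[df['year'] > '3Y']['year'].tolist()

# Strip the trailing 'Y' and store the number in a new integer column
df['year_int'] = df['year'].str.rstrip('Y').astype(int)
int_result = df[df['year_int'] > 3]['year'].tolist()

print(string_result)  # ['4Y']
print(int_result)     # ['4Y', '10Y', '25Y']
```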

If statement string data from dataframe does not work for larger years
