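I have a US education dataset: the Unification Project dataset. I want to count the rows that meet certain conditions on one column, but my counters are not updated even when the condition in the if statement is true.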
import pandas as pd
import numpy as np
df = pd.read_csv("C:/Users/akash/Downloads/states_all.csv")
df.shape
df = df.iloc[:, -6]
for key, value in df.iteritems():
    count = 0
    count1 = 0
    if value < 5000:
        count += 1
    elif value < 20000 and value > 10000:
        count1 += 1
print(str(count) + str(count1))
df looks like this:
0 196386.0
1 30847.0
2 175210.0
3 123113.0
4 1372011.0
5 160299.0
6 126917.0
7 28338.0
8 18173.0
9 511557.0
10 315539.0
11 43882.0
12 66541.0
13 495562.0
14 278161.0
15 138907.0
16 120960.0
17 181786.0
18 196891.0
19 59289.0
20 189795.0
21 230299.0
22 419351.0
23 224426.0
24 129554.0
25 235437.0
26 44449.0
27 79975.0
28 57605.0
29 47999.0
...
1462 NaN
1463 NaN
1464 NaN
1465 NaN
1466 NaN
1467 NaN
1468 NaN
1469 NaN
1470 NaN
1471 NaN
1472 NaN
1473 NaN
1474 NaN
1475 NaN
1476 NaN
1477 NaN
1478 NaN
1479 NaN
1480 NaN
1481 NaN
1482 NaN
1483 NaN
1484 NaN
1485 NaN
1486 NaN
1487 NaN
1488 NaN
1489 NaN
1490 NaN
1491 NaN
Name: GRADES_9_12_G, Length: 1492, dtype: float64
In the output I get:
00
Answer 0: (score: 1)
With pandas, using a loop is almost always the wrong approach. You probably want something like this:
print(len(df.loc[df['GRADES_9_12_G'] < 5000]))
print(len(df.loc[(10000 < df['GRADES_9_12_G']) & (df['GRADES_9_12_G'] < 20000)]))
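Answer 1: (score: 0)
I downloaded your dataset, and there are several ways to solve this. First, you don't need to subset the data unless you have to. Your problem can be solved like this: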
import pandas as pd
df = pd.read_csv('states_all.csv')
df.fillna(0, inplace=True)  # fill NaN with 0 (optional); note the filled rows then satisfy "< 5000" and are counted
print(len(df.loc[df['GRADES_9_12_G'] < 5000])) # 184
print(len(df.loc[(df['GRADES_9_12_G'] > 10000) & (df['GRADES_9_12_G'] < 20000)])) # 52
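The line df.loc[df['GRADES_9_12_G'] < 5000] tells pandas to query the DataFrame for all rows whose value in the df['GRADES_9_12_G'] column is less than 5000. I then call Python's built-in len function on the result to get its length, which outputs 184. Essentially, this is boolean masking: it returns the rows of df for which your condition is True.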
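To make the boolean masking step concrete, here is a minimal sketch on a small made-up Series (the values are illustrative and are not taken from states_all.csv). The comparison produces a boolean Series, and indexing with it keeps only the rows where the mask is True.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values, not taken from states_all.csv
grades = pd.Series([196386.0, 4200.0, 18173.0, np.nan, 950.0], name="GRADES_9_12_G")

mask = grades < 5000       # boolean Series; a NaN compares as False
below_5k = grades[mask]    # keeps only the rows where the mask is True

print(mask.tolist())       # [False, True, False, False, True]
print(len(below_5k))       # 2
print(int(mask.sum()))     # 2, because each True counts as 1
```

Summing the mask gives the same count as len of the filtered result, without building the filtered Series first.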
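The second query, df.loc[(df['GRADES_9_12_G'] > 10000) & (df['GRADES_9_12_G'] < 20000)], uses the & operator. It is a bitwise operator that pandas applies element-wise, so a row is returned only when both conditions are met. We then call len on the result to get the number of rows as an integer, which outputs 52.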
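The parentheses around each comparison matter, because & binds more tightly than > and <. As a small sketch on made-up values (illustrative only, not from the dataset), the following combines two masks with & and shows that Python's plain `and` does not work on a Series:

```python
import pandas as pd

# Hypothetical sample values, not taken from states_all.csv
grades = pd.Series([4200.0, 10000.0, 15000.0, 18173.0, 20000.0, 30847.0])

# Each comparison is parenthesised, then the two masks are combined element-wise
in_range = (grades > 10000) & (grades < 20000)
n_in_range = int(in_range.sum())
print(n_in_range)          # 2 (15000.0 and 18173.0; the endpoints are excluded)

# `and` asks for a single truth value of the whole Series, which is ambiguous
try:
    (grades > 10000) and (grades < 20000)
    and_error = None
except ValueError as exc:
    and_error = type(exc).__name__
print(and_error)           # ValueError
```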
An approach closer to yours:
import pandas as pd
df = pd.read_csv('states_all.csv')
df.fillna(0, inplace=True)  # fill NaN with 0 (optional); note the filled rows then satisfy "< 5000" and are counted
df = df.iloc[:, -6] # select all rows for your column -6
print(len(df[df < 5000])) # query your "df" for all values less than 5k and print len
print(len(df[(df > 10000) & (df < 20000)])) # same as above, just for vals in between range
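Why did I change the code in my answer instead of using yours?
Simply put, it is more idiomatic pandas. Where feasible, using pandas built-ins is cleaner than iterating over a DataFrame with a for loop, because that is what pandas was designed for.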
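For comparison, here is a sketch of how the loop could be made to count correctly, next to the vectorized form, on a small made-up Series (illustrative values only). Judging from the posted code and its 00 output, the counters appear to be reset to 0 on every iteration, so after the final row (a NaN, for which both comparisons are False) both end at 0. Initializing them once before the loop avoids that. Series.items() is used because iteritems() was removed in pandas 2.0.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values, not taken from states_all.csv
grades = pd.Series([4200.0, 15000.0, 196386.0, 950.0, np.nan])

# Loop version: the counters are initialised once, before the loop
count = 0
count1 = 0
for _, value in grades.items():
    if value < 5000:
        count += 1
    elif 10000 < value < 20000:
        count1 += 1

# Vectorized version: sum the boolean masks instead of looping
vec_count = int((grades < 5000).sum())
vec_count1 = int(((grades > 10000) & (grades < 20000)).sum())

print(count, count1)           # 2 1
print(vec_count, vec_count1)   # 2 1
```

Both versions agree here, and the NaN row is counted by neither because comparisons against NaN are False.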