Question

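I have the U.S. Education Datasets: Unification Project dataset. I want to find:

the number of rows where enrollment in grades 9 to 12 (column: GRADES_9_12_G) is less than 5000
the number of rows where enrollment in grades 9 to 12 (column: GRADES_9_12_G) is between 10,000 and 20,000.

The counts are not updated even when a value satisfies the condition in the if statement.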

import pandas as pd 
import numpy as np

df = pd.read_csv("C:/Users/akash/Downloads/states_all.csv")
df.shape

df = df.iloc[:, -6] 

for key, value in df.iteritems():
    count = 0
    count1 = 0
    if value < 5000:
        count += 1
    elif value < 20000 and value > 10000:
        count1 += 1

print(str(count) + str(count1))

df looks like this:

0        196386.0
1         30847.0
2        175210.0
3        123113.0
4       1372011.0
5        160299.0
6        126917.0
7         28338.0
8         18173.0
9        511557.0
10       315539.0
11        43882.0
12        66541.0
13       495562.0
14       278161.0
15       138907.0
16       120960.0
17       181786.0
18       196891.0
19        59289.0
20       189795.0
21       230299.0
22       419351.0
23       224426.0
24       129554.0
25       235437.0
26        44449.0
27        79975.0
28        57605.0
29        47999.0
          ...
1462          NaN
1463          NaN
1464          NaN
1465          NaN
1466          NaN
1467          NaN
1468          NaN
1469          NaN
1470          NaN
1471          NaN
1472          NaN
1473          NaN
1474          NaN
1475          NaN
1476          NaN
1477          NaN
1478          NaN
1479          NaN
1480          NaN
1481          NaN
1482          NaN
1483          NaN
1484          NaN
1485          NaN
1486          NaN
1487          NaN
1488          NaN
1489          NaN
1490          NaN
1491          NaN
Name: GRADES_9_12_G, Length: 1492, dtype: float64

In the output I get

Answer 1

With pandas, using a loop is almost always the wrong approach. You probably want something like this:

print(len(df.loc[df['GRADES_9_12_G'] < 5000]))    
print(len(df.loc[(10000 < df['GRADES_9_12_G']) & (df['GRADES_9_12_G'] < 20000)]))
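For what it's worth, the loop in the question seems to never count anything because count and count1 are reset to 0 on every iteration, so only the last row (a NaN in the data shown) decides the result. Below is a minimal sketch on a made-up Series (the values are illustrative, not taken from states_all.csv). It uses .items() because iteritems() was removed in pandas 2.0.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the GRADES_9_12_G column (made-up values, not the real data).
s = pd.Series([196386.0, 4200.0, 18173.0, 12000.0, 300.0, np.nan],
              name="GRADES_9_12_G")

# Loop version with the counters initialised once, before the loop.
count = 0
count1 = 0
for key, value in s.items():
    if value < 5000:
        count += 1
    elif 10000 < value < 20000:
        count1 += 1

# Vectorized version: sum a boolean mask (True counts as 1).
under_5k = int((s < 5000).sum())
between_10k_20k = int(((s > 10000) & (s < 20000)).sum())

print(count, count1)
print(under_5k, between_10k_20k)
```

Both versions should agree; the vectorized one is usually the better fit for pandas.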

Answer 2

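I downloaded your dataset, and there are several ways to solve this. First, you don't need to subset the data unless you have to. Your problem can be solved like this: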

import pandas as pd

df = pd.read_csv('states_all.csv')
df.fillna(0, inplace=True)  # fill NaN with 0; optional, but filled rows then count as < 5000
print(len(df.loc[df['GRADES_9_12_G'] < 5000])) # 184
print(len(df.loc[(df['GRADES_9_12_G'] > 10000) & (df['GRADES_9_12_G'] < 20000)])) # 52

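The line df.loc[df['GRADES_9_12_G'] < 5000] tells pandas to query the DataFrame for all rows where the df['GRADES_9_12_G'] column is less than 5000. I then call Python's built-in len function on the result to get its length, which outputs 184. Essentially, this is boolean masking: the comparison produces a True/False value per row, and df.loc returns only the rows where your condition is True.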
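To make the masking step concrete, here is a small sketch on a hypothetical four-row frame (only the column name comes from the question; the values are made up). It shows the intermediate boolean Series and that summing it gives the same count as len on the filtered rows.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame; only the column name comes from the question.
df = pd.DataFrame({"GRADES_9_12_G": [196386.0, 4200.0, 18173.0, np.nan]})

mask = df["GRADES_9_12_G"] < 5000   # boolean Series, one True/False per row
print(mask.tolist())                # a NaN compares as False

rows = df.loc[mask]                 # keeps only the rows where mask is True
n_rows = len(rows)
n_true = int(mask.sum())            # True counts as 1, so this is the same number
print(n_rows, n_true)
```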

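The second query, df.loc[(df['GRADES_9_12_G'] > 10000) & (df['GRADES_9_12_G'] < 20000)], uses the & operator. It is a bitwise operator that pandas applies element-wise, so a row is returned only if it meets both conditions. We then call len on the result to get the row count as an integer, which outputs 52.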
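A short sketch of the combined condition on made-up values. The parentheses matter because & binds more tightly than > and <, and Python's plain `and` would raise a ValueError on a Series. Series.between is shown as an alternative spelling; inclusive="neither" assumes pandas 1.3 or newer.

```python
import pandas as pd

s = pd.Series([9000.0, 10000.0, 15000.0, 20000.0, 25000.0])  # made-up values

# Parentheses are required: & binds more tightly than > and <.
mask = (s > 10000) & (s < 20000)
n_strict = int(mask.sum())

# Alternative spelling with Series.between; inclusive="neither" excludes
# both endpoints, matching the strict comparisons above.
n_between = int(s.between(10000, 20000, inclusive="neither").sum())

print(n_strict, n_between)
```

The endpoints 10000 and 20000 are excluded in both spellings, so only the middle value is counted.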

A method closer to yours:

import pandas as pd
df = pd.read_csv('states_all.csv')
df.fillna(0, inplace=True)  # fill NaN with 0; optional, but filled rows then count as < 5000
df = df.iloc[:, -6] # select all rows for your column -6
print(len(df[df < 5000])) # query your "df" for all values less than 5k and print len
print(len(df[(df > 10000) & (df < 20000)])) # same as above, just for vals in between range
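One thing that may be worth checking with the fillna(0) step: a NaN compares as False against any threshold, but once it is replaced with 0 it satisfies < 5000. The sketch below uses made-up values to show how the count can shift; whether missing enrollments should count as "fewer than 5000" is a judgment call for your analysis.

```python
import numpy as np
import pandas as pd

s = pd.Series([4200.0, 196386.0, np.nan, np.nan])  # made-up values with two NaNs

n_raw = int((s < 5000).sum())               # NaN < 5000 is False
n_filled = int((s.fillna(0) < 5000).sum())  # the filled zeros now satisfy < 5000
n_missing = int(s.isna().sum())             # how many rows were missing

print(n_raw, n_filled, n_missing)
```

If the two counts differ on the real column, the difference should equal the number of missing rows.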

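Why did I change the code in my answer instead of using yours?

Simply put, it is more idiomatic pandas. Where feasible, using pandas built-ins is cleaner than iterating over a DataFrame with a for loop, because that is what pandas was designed for.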

iteritems() in a DataFrame column
