Why can't I assign a groupby result back to the original DataFrame?

Asked: 2017-10-01 18:26:27

Tags: python pandas csv dataframe

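Filtering the DataFrame with the apply() method works as expected, but when I assign the groupby result to a new column, the new column contains NaN values (screenshot attached).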

But if I comment out the apply() statement, I can see values in the violent_crime_count column. Why is that?

Data source: https://data.cityofchicago.org/Public-Safety/Crimes-2015/vwwp-7yr9/data

#Load data from CSV 
crimes_2015_today_orig = pd.read_csv('/Users/vishnu/data/chicago_crime_dataset/Crimes_-_2015.csv', index_col='Date', parse_dates=True)

# Create the list of descriptions to filter on
various_drug_off =  ['POSS: CANNABIS 30GMS OR LESS', 'POSS: HEROIN(WHITE)']

crimes_2015_drug_possession = crimes_2015_today_orig.copy()
crimes_2015_drug_possession['drug_possession'] = ''
crimes_2015_drug_possession = crimes_2015_drug_possession[crimes_2015_drug_possession.Description.apply(lambda x : x in various_drug_off)]

crimes_2015_drug_possession['drug_possession'] = crimes_2015_drug_possession.groupby(pd.TimeGrouper('D')).count()

# create another column to do count on total count violent crime based on arrest column.
crimes_2015_drug_possession['violent_crime_count'] = ''
crimes_2015_drug_possession['violent_crime_count'] = crimes_2015_drug_possession[crimes_2015_drug_possession.Arrest == True].groupby(pd.TimeGrouper('D')).count()

[screenshot]

1 Answer:

Answer 0 (score: 1)

Using the data from https://data.cityofchicago.org/Public-Safety/Crimes-2015/vwwp-7yr9/data

First, I would suggest using isin, which is much faster than apply with a lambda:

m = crimes_2015_drug_possession.Description.isin(various_drug_off)
m.head(5)
Date
2015-01-01 00:00:00    False
2015-11-24 17:30:00    False
2015-05-19 01:12:00    False
2015-01-01 00:00:00    False
2015-06-24 06:00:00     True
Name: Description, dtype: bool

crimes_2015_drug_possession['drug_possession'] = m
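As a quick check that isin builds the same boolean mask as the apply/lambda version, here is a minimal sketch on made-up data. The `descriptions` Series and the 'SIMPLE' value are placeholders for illustration, not rows from the Chicago dataset.

```python
import pandas as pd

# Hypothetical toy values standing in for the Description column.
descriptions = pd.Series(
    ["POSS: CANNABIS 30GMS OR LESS", "SIMPLE", "POSS: HEROIN(WHITE)", "SIMPLE"]
)
various_drug_off = ["POSS: CANNABIS 30GMS OR LESS", "POSS: HEROIN(WHITE)"]

# Row-by-row membership test, as in the question.
mask_apply = descriptions.apply(lambda x: x in various_drug_off)

# Vectorized membership test.
mask_isin = descriptions.isin(various_drug_off)

print(mask_isin.tolist())
print((mask_apply == mask_isin).all())
```

Either mask can then be used to filter rows (`df[mask]`) or stored as a boolean column, as done above.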

For the groupby operation, observe:

crimes_2015_drug_possession[crimes_2015_drug_possession.Arrest == True].groupby(pd.TimeGrouper('D')).count().shape
(365, 21)
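A side note in case you are on a newer pandas: pd.TimeGrouper was deprecated and later removed, and pd.Grouper(freq='D') is the usual replacement. Below is a small sketch on a made-up three-row frame (the column values are invented for illustration) showing the same daily grouping.

```python
import pandas as pd

# Hypothetical toy frame with a DatetimeIndex named "Date".
idx = pd.to_datetime([
    "2015-01-01 00:00:00", "2015-01-01 13:30:00", "2015-01-02 09:15:00",
])
df = pd.DataFrame(
    {"Description": ["A", "B", "C"], "Arrest": [True, False, True]},
    index=idx,
)
df.index.name = "Date"

# Daily bins over the index; count() is applied to every column.
daily = df.groupby(pd.Grouper(freq="D")).count()

print(type(daily).__name__, daily.shape)
```

Either way, the result of count() on the grouped frame is a DataFrame with one row per day and one column per original column.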

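Note that this is not a single column, yet you are trying to assign it to a single column. Now, I believe what you want is to count the number of Arrests: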

c = crimes_2015_drug_possession.groupby(pd.TimeGrouper('D')).Arrest.count()
c.head(5)     
Date
2015-01-01    1092
2015-01-02     671
2015-01-03     648
2015-01-04     513
2015-01-05     520
Freq: D, Name: Arrest, dtype: int64
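One caveat worth checking against your intent: count() tallies non-null entries, so on a boolean column it counts True and False rows alike. If what you are after is the number of rows where Arrest is True, sum() may be closer. A minimal sketch on made-up data, assuming a newer pandas with pd.Grouper:

```python
import pandas as pd

# Hypothetical toy frame: two records on Jan 1, one on Jan 2.
idx = pd.to_datetime([
    "2015-01-01 00:00:00", "2015-01-01 13:30:00", "2015-01-02 09:15:00",
])
df = pd.DataFrame({"Arrest": [True, False, True]}, index=idx)

per_day = df.groupby(pd.Grouper(freq="D"))["Arrest"]

records_per_day = per_day.count()  # non-null entries, True or False
arrests_per_day = per_day.sum()    # only the True entries add up

print(records_per_day.tolist())
print(arrests_per_day.tolist())
```

Both results are a single Series indexed by day.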

This is now a single column, but...

c.shape
(365,)

crimes_2015_drug_possession.shape
(263447, 21)

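Their shapes are not equal. Assigning a Series of a different size is done by index alignment, and rows whose index has no match in the Series are filled with NaN. So the result of a groupby aggregation cannot be assigned directly back to the original DataFrame.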
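To make the alignment behaviour concrete, and to show one possible way around it, here is a sketch on made-up data. Grouping by `df.index.normalize()` and using transform is my own suggestion rather than something from the question, and the column names `aligned` and `per_row` are placeholders.

```python
import pandas as pd

# Hypothetical toy frame: only the first row is stamped exactly at midnight.
idx = pd.to_datetime([
    "2015-01-01 00:00:00", "2015-01-01 13:30:00", "2015-01-02 09:15:00",
])
df = pd.DataFrame({"Arrest": [True, False, True]}, index=idx)

# Aggregated result: one value per day, indexed at midnight.
daily = df.groupby(pd.Grouper(freq="D"))["Arrest"].count()

# Direct assignment aligns on index labels, so rows whose timestamp
# does not appear in daily's index end up as NaN.
df["aligned"] = daily

# transform returns a result with the same index as df, so every row
# receives the value computed for its own day.
df["per_row"] = df.groupby(df.index.normalize())["Arrest"].transform("count")

print(df)
```

If you only need the per-day totals, keeping `daily` as its own Series (or DataFrame) is usually simpler than pushing it back into the row-level frame.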