I have a DataFrame crimes_df:
>> crimes_df.size
6198374
I need to count the events that share the same "s_lat", "s_lon" and "date". I use groupby:
crimes_count_df = crimes_df\
.groupby(["s_lat", "s_lon", "date"])\
.size()\
.to_frame("crimes")
But it does not give the right answer, because if you compute the sum, you see that most of the events are missing:
>> crimes_count_df.sum()
crimes 476798
dtype: int64
I also tried agg:
crimes_count_df = crimes_df\
.groupby(["s_lat", "s_lon", "date"])\
.agg(['count'])
But the result is the same:
crimes_count_df.sum()
Unnamed: 0 count 476798
area count 476798
arrest count 476798
description count 476798
domestic count 476798
latitude count 476798
location_description count 475712
longitude count 476798
time count 476798
type count 476798
Edit: I found that this aggregation function seems to have a limit! See the following commands:
crimes_df.head(100) \
.groupby(["s_lat", "s_lon", "date"]) \
.size() \
.to_frame("crimes")\
.sum()
crimes 100
dtype: int64
crimes_df.head(1000) \
.groupby(["s_lat", "s_lon", "date"]) \
.size() \
.to_frame("crimes")\
.sum()
crimes 1000
dtype: int64
crimes_df.head(10000) \
.groupby(["s_lat", "s_lon", "date"]) \
.size() \
.to_frame("crimes")\
.sum()
crimes 10000
dtype: int64
crimes_df.head(100000) \
.groupby(["s_lat", "s_lon", "date"]) \
.size() \
.to_frame("crimes")\
.sum()
crimes 100000
dtype: int64
crimes_df.head(1000000) \
.groupby(["s_lat", "s_lon", "date"]) \
.size() \
.to_frame("crimes")\
.sum()
crimes 476798
dtype: int64
crimes_df.head(10000000) \
.groupby(["s_lat", "s_lon", "date"]) \
.size() \
.to_frame("crimes")\
.sum()
crimes 476798
dtype: int64
crimes_df.head(476799) \
.groupby(["s_lat", "s_lon", "date"]) \
.size() \
.to_frame("crimes")\
.sum()
crimes 476798
dtype: int64
If you want to check for yourself, here is the file with the data:
https://www.dropbox.com/s/ib0kq16t4c2e5a2/CrimeDataWithSquare.csv?dl=0
You can load it like this:
from pandas import read_csv, DataFrame
crimes_df = read_csv("CrimeDataWithSquare.csv")
Info:
crimes_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 476798 entries, 0 to 476797
Data columns (total 13 columns):
Unnamed: 0 476798 non-null int64
area 476798 non-null float64
arrest 476798 non-null bool
date 476798 non-null object
description 476798 non-null object
domestic 476798 non-null bool
latitude 476798 non-null float64
location_description 475712 non-null object
longitude 476798 non-null float64
time 476798 non-null object
type 476798 non-null object
s_lon 476798 non-null float64
s_lat 476798 non-null float64
dtypes: bool(2), float64(5), int64(1), object(5)
memory usage: 40.9+ MB
Answer 0 (score: 2)
I don't think this is a bug. The size attribute of a DataFrame is not always equal to the number of rows. Let's look at your case:
import pandas as pd
crimes_df = pd.read_csv("CrimeDataWithSquare.csv")
crimes_df.shape
#(476798, 13)
crimes_df.shape[0] * crimes_df.shape[1]
#6198374
crimes_df.size
#6198374
len(crimes_df)
#476798
What does the documentation say about the size attribute?
number of elements in the NDFrame
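Usually a DataFrame has 2 dimensions (X rows by Y columns). So the DataFrame size attribute returns X times Y (the number of elements in it), here 476798 * 13 = 6198374.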
What if you have a single column?
crimes_df2 = crimes_df.iloc[:, 0]
len(crimes_df2) == crimes_df2.size
#True
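That is the result you were expecting: for a single column, size equals the number of rows. So the groupby total of 476798 matches len(crimes_df), and no events were lost.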
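To make the distinction concrete, here is a minimal sketch on a made-up four-row frame (toy_df and its values are hypothetical, not taken from CrimeDataWithSquare.csv). It compares the DataFrame size attribute with the total of the per-group counts from groupby(...).size():

```python
import pandas as pd

# Hypothetical toy data standing in for crimes_df (values are made up).
toy_df = pd.DataFrame({
    "s_lat": [41.0, 41.0, 41.5, 41.5],
    "s_lon": [-87.0, -87.0, -87.5, -87.5],
    "date": ["2016-01-01", "2016-01-01", "2016-01-01", "2016-01-02"],
    "type": ["THEFT", "BATTERY", "THEFT", "THEFT"],
})

n_rows, n_cols = toy_df.shape

# DataFrame.size is an attribute: number of elements = rows * columns.
element_count = toy_df.size

# GroupBy.size() is a method: number of rows in each group.
group_sizes = toy_df.groupby(["s_lat", "s_lon", "date"]).size()

# Summing the group sizes gives back the number of rows, not the element count.
total_grouped = int(group_sizes.sum())

print(n_rows, n_cols, element_count, total_grouped)
```

When none of the grouping keys is missing, the right sanity check for the grouped total is len(toy_df) (or toy_df.shape[0]), not toy_df.size.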
Answer 1 (score: 0)
Try this:
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame({
'a': [1, 2, 3] * 4,
'b': np.random.choice(['q','w','a'], size=12),
'c': 1
})
df
a b c
0 1 q 1
1 2 w 1
2 3 q 1
3 1 w 1
4 2 w 1
5 3 a 1
6 1 q 1
7 2 a 1
8 3 q 1
9 1 q 1
10 2 q 1
11 3 a 1
df.groupby(['a', 'b']).count()
     c
a b
1 q  3
  w  1
2 a  1
  q  1
  w  2
3 a  2
  q  2
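One thing to keep in mind when using count() instead of size(): count() reports non-null values per column, which is why location_description sums to 475712 in the question while the other columns sum to 476798. A small sketch on made-up data (toy_df and the "key" column are hypothetical) shows the difference:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing location_description.
toy_df = pd.DataFrame({
    "key": ["x", "x", "y"],
    "location_description": ["STREET", np.nan, "APARTMENT"],
})

grouped = toy_df.groupby("key")

# size() counts rows per group, including rows with missing values.
rows_per_group = grouped.size()

# count() counts non-null values per column within each group.
non_null_per_group = grouped.count()

print(rows_per_group)
print(non_null_per_group)
```

If the goal is the number of events per group, size() is usually the safer choice because it does not depend on which columns happen to contain missing values.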
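Answer 2 (score: 0)
Is it possible that some of your data contains missing values, for example in the date column? If I remember correctly, a None is not grouped (though I may be wrong). Have you tried using fillna(0)?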
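# Fill missing values before grouping so that no rows are dropped.
# to_frame is called on the Series returned by size(), then the index is reset.
crimes_count_df = crimes_df\
.fillna(0)\
.groupby(["s_lat", "s_lon", "date"])\
.size()\
.to_frame("crimes")\
.reset_index()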
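For what it's worth, the info() output in the question shows no nulls in s_lat, s_lon or date, so this probably does not apply to this file. Still, the behaviour is real: by default groupby drops rows whose key is missing. A minimal sketch on made-up data (toy_df is hypothetical; the dropna=False argument assumes pandas 1.1 or later):

```python
import pandas as pd

# Hypothetical toy data where two rows have no date.
toy_df = pd.DataFrame({
    "date": ["2016-01-01", None, "2016-01-01", None],
    "type": ["THEFT", "THEFT", "BATTERY", "THEFT"],
})

# Default behaviour: rows with a missing key are silently left out.
default_total = int(toy_df.groupby("date").size().sum())

# Option 1: fill the missing keys before grouping.
filled_total = int(
    toy_df.fillna({"date": "unknown"}).groupby("date").size().sum()
)

# Option 2 (assumes pandas >= 1.1): keep missing keys as their own group.
kept_total = int(toy_df.groupby("date", dropna=False).size().sum())

print(default_total, filled_total, kept_total)
```

Note that the fill has to happen before the groupby; filling the result afterwards cannot bring back rows that were already excluded.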