I have a DataFrame with the columns created_at and entities that looks like this:
created_at entities
2017-10-29 23:06:28 {'hashtags': [{'text': 'OPEC', 'indices': [0, ...
2017-10-29 22:28:20 {'hashtags': [{'text': 'Iraq', 'indices': [21,...
2017-10-29 20:01:37 {'hashtags': [{'text': 'oil', 'indices': [58, ...
2017-10-29 20:00:14 {'hashtags': [{'text': 'oil', 'indices': [38, ...
2017-10-27 08:44:30 {'hashtags': [{'text': 'Iran', 'indices': [19,...
2017-10-27 08:44:10 {'hashtags': [{'text': 'Oil', 'indices': [17, ...
2017-10-27 08:43:13 {'hashtags': [{'text': 'Oil', 'indices': [0, 4...
2017-10-27 08:43:00 {'hashtags': [{'text': 'Iran', 'indices': [19,.
I want to count the number of entities per day. Basically, I want to get something like:
created_at number_of_entities
2017-10-29 4
2017-10-27 4
How can I do this? I am using pandas 0.23.4.
Answer 0 (score: 3)
Given
>>> df
created_at entities
0 2017-10-29 23:06:28 1
1 2017-10-29 22:28:20 2
2 2017-10-29 20:01:37 3
3 2017-10-29 20:00:14 4
4 2017-10-27 08:44:30 5
5 2017-10-27 08:44:10 6
6 2017-10-27 08:43:13 7
7 2017-10-27 08:43:00 8
with
>>> df.dtypes
created_at datetime64[ns]
entities int64
dtype: object
you can issue:
>>> pd.PeriodIndex(df['created_at'], freq='D').value_counts()
2017-10-29 4
2017-10-27 4
Freq: D, Name: created_at, dtype: int64
In the comments, jezrael suggested a better way that avoids the PeriodIndex constructor:
>>> df['created_at'].dt.to_period('D').value_counts()
2017-10-27 4
2017-10-29 4
With some additional renaming to match your output, it starts to look like jezrael's solution. ;)
>>> datecol = 'created_at'
>>> df[datecol].dt.to_period('D').value_counts().rename_axis(datecol).reset_index(name='number_of_entities')
created_at number_of_entities
0 2017-10-27 4
1 2017-10-29 4
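For anyone who wants to run this end to end, here is a minimal sketch. It assumes a frame rebuilt from the timestamps in the question, with placeholder integers standing in for the real entities dicts (as in the df above). The sort_index() call is my addition, only there to make the row order deterministic.

```python
import pandas as pd

# Timestamps copied from the question; the integers are placeholders
# for the real `entities` dicts.
df = pd.DataFrame({
    "created_at": pd.to_datetime([
        "2017-10-29 23:06:28", "2017-10-29 22:28:20",
        "2017-10-29 20:01:37", "2017-10-29 20:00:14",
        "2017-10-27 08:44:30", "2017-10-27 08:44:10",
        "2017-10-27 08:43:13", "2017-10-27 08:43:00",
    ]),
    "entities": list(range(1, 9)),
})

datecol = "created_at"
counts = (
    df[datecol].dt.to_period("D")   # truncate each timestamp to its day
    .value_counts()                 # rows per day
    .sort_index()                   # oldest day first (added for a stable order)
    .rename_axis(datecol)
    .reset_index(name="number_of_entities")
)
print(counts)
```

Note that the created_at column of the result holds Period objects, not timestamps; call .dt.to_timestamp() on it if you need datetimes back.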
Alternatively, you can set the index to the date column and then resample:
>>> df.set_index('created_at').resample('D').size()
created_at
2017-10-27 4
2017-10-28 0
2017-10-29 4
Freq: D, dtype: int64
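..., and, if necessary, convert to the exact output (resample also emits the empty day 2017-10-28, so zero counts are filtered out):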
>>> resampled = df.set_index('created_at').resample('D').size()
>>> resampled[resampled != 0].reset_index().rename(columns={0: 'number_of_entities'})
created_at number_of_entities
0 2017-10-27 4
1 2017-10-29 4
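More context: resample is especially useful for arbitrary time intervals (e.g. "five minutes"). The following example is taken directly from Wes McKinney's book "Python for Data Analysis".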
>>> N = 15
>>> times = pd.date_range('2017-05-20 00:00', freq='1min', periods=N)
>>> df = pd.DataFrame({'time': times, 'value': np.arange(N)})
>>>
>>> df
time value
0 2017-05-20 00:00:00 0
1 2017-05-20 00:01:00 1
2 2017-05-20 00:02:00 2
3 2017-05-20 00:03:00 3
4 2017-05-20 00:04:00 4
5 2017-05-20 00:05:00 5
6 2017-05-20 00:06:00 6
7 2017-05-20 00:07:00 7
8 2017-05-20 00:08:00 8
9 2017-05-20 00:09:00 9
10 2017-05-20 00:10:00 10
11 2017-05-20 00:11:00 11
12 2017-05-20 00:12:00 12
13 2017-05-20 00:13:00 13
14 2017-05-20 00:14:00 14
>>>
>>> df.set_index('time').resample('5min').size()
time
2017-05-20 00:00:00 5
2017-05-20 00:05:00 5
2017-05-20 00:10:00 5
Freq: 5T, dtype: int64
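If you would rather not move the time column into the index, pd.Grouper can bin on the column directly. This is a sketch of an alternative, not something taken from the book; it rebuilds the same 15-row frame and compares both routes.

```python
import numpy as np
import pandas as pd

N = 15
times = pd.date_range("2017-05-20 00:00", freq="1min", periods=N)
df = pd.DataFrame({"time": times, "value": np.arange(N)})

# Bin on the `time` column directly, leaving the index untouched.
by_5min = df.groupby(pd.Grouper(key="time", freq="5min")).size()

# The set_index + resample route from above, for comparison.
via_resample = df.set_index("time").resample("5min").size()

print(by_5min)
```

Both series should carry the same three five-minute bins with five rows each.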
Answer 1 (score: 2)
Use groupby.size:
# Convert to datetime dtype if you haven't.
df1.created_at = pd.to_datetime(df1.created_at)
df2 = df1.groupby(df1.created_at.dt.date).size().reset_index(name='number_of_entities')
print (df2)
created_at number_of_entities
0 2017-10-27 4
1 2017-10-29 4
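One reason to prefer size here: it counts every row in a group, while count skips missing values. A small sketch of the difference, assuming a hypothetical frame in which one entities value is missing (the question's data has no such row):

```python
import pandas as pd

df1 = pd.DataFrame({
    "created_at": pd.to_datetime([
        "2017-10-29 23:06:28", "2017-10-29 22:28:20",
        "2017-10-27 08:44:30", "2017-10-27 08:44:10",
    ]),
    # The None is hypothetical, added only to show the difference.
    "entities": [{"hashtags": []}, None, {"hashtags": []}, {"hashtags": []}],
})

day = df1["created_at"].dt.date

rows_per_day = df1.groupby(day).size()                    # every row counts
non_null_per_day = df1.groupby(day)["entities"].count()   # missing values are skipped

print(rows_per_day)
print(non_null_per_day)
```

So if "number of entities" should exclude rows without an entities value, count on that column is the one to use; otherwise size is the simpler choice.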
Answer 2 (score: 2)
Given your data:
In [3]: df
Out[3]:
created_at entities
0 2017-10-29 23:06:28 {'hashtags': [{'text': 'OPEC', 'indices': [0, ...
1 2017-10-29 22:28:20 {'hashtags': [{'text': 'Iraq', 'indices': [21,...
2 2017-10-29 20:01:37 {'hashtags': [{'text': 'oil', 'indices': [58, ...
3 2017-10-29 20:00:14 {'hashtags': [{'text': 'oil', 'indices': [38, ...
4 2017-10-27 08:44:30 {'hashtags': [{'text': 'Iran', 'indices': [19,...
5 2017-10-27 08:44:10 {'hashtags': [{'text': 'Oil', 'indices': [17, ...
6 2017-10-27 08:43:13 {'hashtags': [{'text': 'Oil', 'indices': [0, 4...
7 2017-10-27 08:43:00 {'hashtags': [{'text': 'Iran', 'indices': [19,.
you can use groupby(..).count() as follows to get what you need:
In [4]: df[["created_at"]].groupby(pd.to_datetime(df["created_at"]).dt.date).count().rename(columns={"created_at":"number_of_entities"}).reset_index()
...:
Out[4]:
created_at number_of_entities
0 2017-10-27 4
1 2017-10-29 4
Note:
If the created_at column is already of datetime dtype, you can simply use the following:
df[["created_at"]].groupby(df.created_at.dt.date).count().rename(columns={"created_at":"number_of_entities"}).reset_index()
Answer 3 (score: 2)
You can use floor or date to strip the time component, then value_counts to count, and finally rename_axis and reset_index to get a two-column DataFrame:
df = (df['created_at'].dt.floor('D')
.value_counts()
.rename_axis('created_at')
.reset_index(name='number_of_entities'))
print (df)
created_at number_of_entities
0 2017-10-29 4
1 2017-10-27 4
Or:
df = (df['created_at'].dt.date
.value_counts()
.rename_axis('created_at')
.reset_index(name='number_of_entities'))
If you want to avoid the default sorting in value_counts, pass the parameter sort=False:
df = (df['created_at'].dt.floor('D')
.value_counts(sort=False)
.rename_axis('created_at')
.reset_index(name='number_of_entities'))
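By default value_counts orders the result by count, largest first, which is why 2017-10-29 appears before 2017-10-27 in the first output. If you want the days in chronological order, a sort_index() after value_counts is a simple way to get it. A sketch with made-up timestamps (unequal counts per day, so the two orderings differ visibly):

```python
import pandas as pd

# Hypothetical data: 3 rows on Oct 29, 2 on Oct 28, 1 on Oct 27.
df = pd.DataFrame({
    "created_at": pd.to_datetime([
        "2017-10-29 23:06:28", "2017-10-29 22:28:20", "2017-10-29 20:01:37",
        "2017-10-28 12:00:00", "2017-10-28 09:30:00",
        "2017-10-27 08:44:30",
    ]),
})

days = df["created_at"].dt.floor("D")

# Default: descending by count.
by_count = (days.value_counts()
                .rename_axis("created_at")
                .reset_index(name="number_of_entities"))

# Chronological: sort the counts by their date index.
by_date = (days.value_counts()
               .sort_index()
               .rename_axis("created_at")
               .reset_index(name="number_of_entities"))

print(by_count)
print(by_date)
```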
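Answer 4 (score: 1)
You can group by day with df.groupby(df.created_at.dt.day). Note that dt.day is the day of the month (1 to 31), so rows from different months that share a day number land in the same group; df.created_at.dt.date avoids this.
As for the function that computes the count, it would need the whole row, since your data structure looks rather unusual.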
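To make both points concrete, here is a rough sketch. It assumes each entities value is an actual dict with a 'hashtags' list (not its string form), and that "number of entities" may mean hashtags rather than rows; neither is confirmed by the question. The November row is hypothetical, and the 'indices' keys are left out because they are truncated in the question.

```python
import pandas as pd

df = pd.DataFrame({
    "created_at": pd.to_datetime([
        "2017-10-29 23:06:28", "2017-10-29 22:28:20",
        "2017-11-29 08:44:30",   # hypothetical: same day of month, different month
    ]),
    "entities": [
        {"hashtags": [{"text": "OPEC"}, {"text": "oil"}]},
        {"hashtags": [{"text": "Iraq"}]},
        {"hashtags": []},
    ],
})

# dt.day is the day of the month, so Oct 29 and Nov 29 collapse into one group.
by_day_of_month = df.groupby(df["created_at"].dt.day).size()

# dt.date keeps the full calendar date, so the two days stay separate.
by_date = df.groupby(df["created_at"].dt.date).size()

# Count hashtags instead of rows: length of each 'hashtags' list, summed per date.
n_hashtags = df["entities"].apply(lambda e: len(e["hashtags"]))
hashtags_per_day = n_hashtags.groupby(df["created_at"].dt.date).sum()

print(by_day_of_month)
print(by_date)
print(hashtags_per_day)
```

If the column actually holds strings that merely look like dicts, they would have to be parsed first (for example with ast.literal_eval) before the apply step.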