Question

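I'm struggling with this. I want to take the data below, group it by date, then inspect the rows within each group to determine whether the group has any location data associated with it and, if so, extract it.

A sample of my data: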

id,dates,text,place
1,2017-01-26 01:06:47,text,"Place(country_code='US', full_name='Manhattan, NY', place_type='city', name='Manhattan', contained_within=[], _api=<tweepy.api.API object at 0x10336f320>, attributes={}, country='United States', bounding_box=BoundingBox(type='Polygon', coordinates=[[[-74, 40], [-73, 40], [-73, 40], [-74, 40]]], _api=<tweepy.api.API object at 0x10336f320>))"
2,2017-01-26 01:05:51,text,"Place(country_code='US', full_name='Manhattan, NY', place_type='city', name='Manhattan', contained_within=[], _api=<tweepy.api.API object at 0x10336f320>, attributes={}, country='United States', bounding_box=BoundingBox(type='Polygon', coordinates=[[[-74, 40], [-73, 40], [-73, 40], [-74, 40]]], _api=<tweepy.api.API object at 0x10336f320>))"
4,2017-01-23 01:38:29,text,
5,2017-01-23 01:36:53,text,

I start by loading the CSV and grouping by date:

import pandas as pd
import matplotlib.pyplot as plt
import datetime

fig = plt.figure(figsize=(5,5))
df1 = pd.read_csv('data.csv')
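# .copy() avoids SettingWithCopyWarning when the column is reassigned below
df = df1[['dates', 'place']].copy()
# the timestamps include a time part, so let pandas infer the format
df['dates'] = pd.to_datetime(df['dates'])
df.index = df['dates']

# pd.groupby() was removed from pandas; call groupby on the DataFrame instead
grp = df.groupby([df.index.year, df.index.month, df.index.day])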
for date,group in grp:
    print(date)
    print(group)

This prints output like the following:

(2017, 1, 26)
                                  dates  \
dates
2017-01-26 01:06:47 2017-01-26 01:06:47
2017-01-26 01:05:51 2017-01-26 01:05:51

                                                                 place
dates
2017-01-26 01:06:47  Place(country_code='US', full_name='Manhattan,...
2017-01-26 01:05:51                                                NaN

This is where I run into trouble with the filtering/conditions. My goal is to end up with a dataframe that I can save to CSV, like this:

date, item_count, has_location, location
2017-01-26, 2, yes, Manhattan
2017-01-23, 2, no, na

What is the best way to do this? Thanks.

Answer 1

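I think you can use the following approach.

First extract name from the place column, then groupby on dt.date (if the dtype of the dates column is already datetime, to_datetime can be dropped) and aggregate, for example with size and first. Finally, create the new column with insert and numpy.where.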
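
The answer's original code did not survive on this page, so what follows is only a minimal sketch of the steps described above, not the answerer's exact code. It assumes a pandas version with named aggregation (0.25 or later), and it reads a shortened inline copy of the question's sample instead of data.csv. The regular expression and the output column names (item_count, has_location, location) are taken from the question's desired output or chosen here for illustration.

```python
import io

import numpy as np
import pandas as pd

# Inline copy of the question's sample; the Place(...) repr is shortened
# here because only the name='...' field is needed (assumption).
csv_text = '''id,dates,text,place
1,2017-01-26 01:06:47,text,"Place(country_code='US', full_name='Manhattan, NY', place_type='city', name='Manhattan')"
2,2017-01-26 01:05:51,text,"Place(country_code='US', full_name='Manhattan, NY', place_type='city', name='Manhattan')"
4,2017-01-23 01:38:29,text,
5,2017-01-23 01:36:53,text,
'''

# parse_dates makes the dates column datetime64, so no to_datetime call is needed
df = pd.read_csv(io.StringIO(csv_text), parse_dates=['dates'])

# Extract the value of name='...'; \b keeps the pattern from matching
# full_name='...'. Rows with no place stay missing.
df['location'] = df['place'].str.extract(r"\bname='([^']*)'", expand=False)

# Group by calendar date, count the rows, and take the first non-missing location
out = (
    df.groupby(df['dates'].dt.date)
      .agg(item_count=('id', 'size'), location=('location', 'first'))
      .sort_index(ascending=False)
      .rename_axis('date')
      .reset_index()
)

# Insert the yes/no flag before the location column, then fill missing locations
out.insert(2, 'has_location', np.where(out['location'].notna(), 'yes', 'no'))
out['location'] = out['location'].fillna('na')

print(out.to_csv(index=False))
```

To save the result instead of printing it, pass a file name, for example out.to_csv('summary.csv', index=False), where summary.csv is a placeholder name. Note that first only reports one location per date; if a single day can contain several different places, a different aggregation would be needed.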
