I'm trying to subset a dataset using criteria on two columns, but my code produces an error. Any idea why? Here is my code:
df[(df['locations'] = 'New York City Metro Area') & (2016-09-01 < df['publication_date'] < 2016-09-30 )]
Here is my error:
f = lambda x, y: lib.ismember(x, set(values))
File "pandas\lib.pyx", line 158, in pandas.lib.ismember (pandas\lib.c:5199)
TypeError: unhashable type: 'list'
If it helps, my data looks like this:
df['publication_date'].head()
0 2017-01-30T04:48:11.929095Z
1 2016-11-15T05:30:03Z
2 2017-01-30T04:45:24.861067Z
3 2017-01-30T04:47:41.419255Z
4 2017-01-30T04:49:36.192148Z
Name: publication_date, dtype: object
df['locations'].head()
0 [{'name': 'Kansas City, MO'}]
1 [{'name': 'Evanston, IL'}]
2 [{'name': 'Stamford, CT'}]
3 [{'name': 'Reno, NV'}]
4 [{'name': 'Boston Metro Area'}]
Name: locations, dtype: object
Answer 0 (score: 1)
I think you can first extract the value of the name key from each dict, then convert the strings with to_datetime. Finally, use boolean indexing with between:
import pandas as pd
df = pd.DataFrame({'locations':[[{'name': 'Kansas City, MO'}], [{'name': 'Evanston, IL'}], [{'name': 'Stamford, CT'}],[{'name': 'Reno, NV'}],[{'name': 'Boston Metro Area'}]],
'publication_date':['2017-01-30T04:48:11.929095Z','2016-11-15T05:30:03Z','2017-01-30T04:45:24.861067Z','2017-01-30T04:47:41.419255Z','2017-01-30T04:49:36.192148Z']})
print (df)
locations publication_date
0 [{'name': 'Kansas City, MO'}] 2017-01-30T04:48:11.929095Z
1 [{'name': 'Evanston, IL'}] 2016-11-15T05:30:03Z
2 [{'name': 'Stamford, CT'}] 2017-01-30T04:45:24.861067Z
3 [{'name': 'Reno, NV'}] 2017-01-30T04:47:41.419255Z
4 [{'name': 'Boston Metro Area'}] 2017-01-30T04:49:36.192148Z
print (type(df.locations.iloc[0]))
<class 'list'>
df.locations = df.locations.apply(lambda x: x[0]['name'])
df.publication_date = pd.to_datetime(df.publication_date)
print (df)
locations publication_date
0 Kansas City, MO 2017-01-30 04:48:11.929095
1 Evanston, IL 2016-11-15 05:30:03.000000
2 Stamford, CT 2017-01-30 04:45:24.861067
3 Reno, NV 2017-01-30 04:47:41.419255
4 Boston Metro Area 2017-01-30 04:49:36.192148
print (df[(df['locations'] == 'Boston Metro Area') &
(df['publication_date'].between('2016-09-01', '2018-09-30'))])
locations publication_date
4 Boston Metro Area 2017-01-30 04:49:36.192148
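As posted, the question's filter uses `=` where `==` is needed and chains two comparisons on a Series (`a < s < b`), which pandas cannot reduce to a single truth value. The sketch below shows one way to write the same intent with element-wise comparisons joined by `&`. The sample rows are hypothetical, made up to match the shape of the question's data, and the dates are parsed one at a time with `pd.Timestamp`, which is just one possible choice.

```python
import pandas as pd

# Hypothetical sample rows in the same shape as the question's data.
df = pd.DataFrame({
    "locations": [
        [{"name": "New York City Metro Area"}],
        [{"name": "New York City Metro Area"}],
        [{"name": "Evanston, IL"}],
    ],
    "publication_date": [
        "2016-09-12T08:15:00Z",
        "2017-01-30T04:48:11.929095Z",
        "2016-09-20T05:30:03Z",
    ],
})

# Pull the name out of each single-element list of dicts.
names = df["locations"].apply(lambda x: x[0]["name"])

# Parse each ISO 8601 string individually; the trailing "Z" marks the values as UTC.
dates = df["publication_date"].apply(pd.Timestamp)

start = pd.Timestamp("2016-09-01", tz="UTC")
end = pd.Timestamp("2016-09-30", tz="UTC")

# A chained comparison needs one boolean for the whole Series, so it raises.
try:
    start < dates < end
    chained_failed = False
except ValueError:
    chained_failed = True

# Two element-wise comparisons combined with & do what the question intended.
mask = (names == "New York City Metro Area") & (dates > start) & (dates < end)
result = df[mask]
print(result)
```

Only the first hypothetical row satisfies both conditions. Each condition is wrapped in parentheses because `&` binds more tightly than the comparison operators.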
Solution with query:
print (df.query('locations == "Boston Metro Area" and "2016-09-01" < publication_date < "2018-09-30"'))
locations publication_date
4 Boston Metro Area 2017-01-30 04:49:36.192148
If you do not need to change the structure of the values in the locations column:
df.publication_date = pd.to_datetime(df.publication_date)
print (df)
locations publication_date
0 [{'name': 'Kansas City, MO'}] 2017-01-30 04:48:11.929095
1 [{'name': 'Evanston, IL'}] 2016-11-15 05:30:03.000000
2 [{'name': 'Stamford, CT'}] 2017-01-30 04:45:24.861067
3 [{'name': 'Reno, NV'}] 2017-01-30 04:47:41.419255
4 [{'name': 'Boston Metro Area'}] 2017-01-30 04:49:36.192148
print (df[(df['locations'].apply(lambda x: x[0]['name']) == 'Boston Metro Area') &
(df['publication_date'].between('2016-09-01', '2018-09-30'))])
locations publication_date
4 [{'name': 'Boston Metro Area'}] 2017-01-30 04:49:36.192148
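The `x[0]['name']` lookup above assumes every row holds exactly one dict. If some rows could hold several locations or an empty list (an assumption, since the question only shows single-element lists), a small helper that checks every entry may be more robust. This is a minimal sketch; `has_location` and the sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical rows: some with two locations, one with an empty list.
df = pd.DataFrame({
    "locations": [
        [{"name": "Boston Metro Area"}, {"name": "Stamford, CT"}],
        [],
        [{"name": "Reno, NV"}],
        [{"name": "Stamford, CT"}, {"name": "Boston Metro Area"}],
    ],
})


def has_location(entries, wanted):
    """Return True if any dict in the list has the wanted name."""
    return any(entry.get("name") == wanted for entry in entries)


# Build a boolean mask without changing the structure of the column.
mask = df["locations"].apply(lambda entries: has_location(entries, "Boston Metro Area"))
print(df[mask])
```

The resulting mask can be combined with the date condition using `&`, exactly as in the filter above.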