Python - Subsetting a dataset using two column criteria

Time: 2017-01-30 07:47:52

Tags: python pandas subset

I am trying to subset a dataset using criteria on two columns, but my attempt raises an error. Any idea why? Here is my code:

df[(df['locations'] = 'New York City Metro Area') & (2016-09-01 < df['publication_date'] < 2016-09-30 )]

Here is my error:

    f = lambda x, y: lib.ismember(x, set(values))
  File "pandas\lib.pyx", line 158, in pandas.lib.ismember (pandas\lib.c:5199)
TypeError: unhashable type: 'list'

If it helps, my data looks like this:

df['publication_date'].head()

0    2017-01-30T04:48:11.929095Z
1           2016-11-15T05:30:03Z
2    2017-01-30T04:45:24.861067Z
3    2017-01-30T04:47:41.419255Z
4    2017-01-30T04:49:36.192148Z
Name: publication_date, dtype: object

df['locations'].head()

0      [{'name': 'Kansas City, MO'}]
1         [{'name': 'Evanston, IL'}]
2         [{'name': 'Stamford, CT'}]
3             [{'name': 'Reno, NV'}]
4    [{'name': 'Boston Metro Area'}]
Name: locations, dtype: object

1 answer:

Answer 0 (score: 1)

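The locations column holds lists of dicts and publication_date holds strings, so neither can be compared directly against the values in your filter. I think you can first extract the value of the name key from each dict, then convert the strings with to_datetime. Finally, use boolean indexing with between: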

import pandas as pd

df = pd.DataFrame({'locations':[[{'name': 'Kansas City, MO'}], [{'name': 'Evanston, IL'}], [{'name': 'Stamford, CT'}],[{'name': 'Reno, NV'}],[{'name': 'Boston Metro Area'}]],
                   'publication_date':['2017-01-30T04:48:11.929095Z','2016-11-15T05:30:03Z','2017-01-30T04:45:24.861067Z','2017-01-30T04:47:41.419255Z','2017-01-30T04:49:36.192148Z']})
print (df)
                         locations             publication_date
0    [{'name': 'Kansas City, MO'}]  2017-01-30T04:48:11.929095Z
1       [{'name': 'Evanston, IL'}]         2016-11-15T05:30:03Z
2       [{'name': 'Stamford, CT'}]  2017-01-30T04:45:24.861067Z
3           [{'name': 'Reno, NV'}]  2017-01-30T04:47:41.419255Z
4  [{'name': 'Boston Metro Area'}]  2017-01-30T04:49:36.192148Z

print (type(df.locations.iloc[0]))
<class 'list'>


df.locations = df.locations.apply(lambda x: x[0]['name'])
df.publication_date = pd.to_datetime(df.publication_date)
print (df)
           locations           publication_date
0    Kansas City, MO 2017-01-30 04:48:11.929095
1       Evanston, IL 2016-11-15 05:30:03.000000
2       Stamford, CT 2017-01-30 04:45:24.861067
3           Reno, NV 2017-01-30 04:47:41.419255
4  Boston Metro Area 2017-01-30 04:49:36.192148
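
The extraction step above assumes every row holds a non-empty list whose first element is a dict with a name key. If that cannot be guaranteed, a more defensive version might look like the sketch below. The helper first_name and the sample rows are hypothetical, not part of the original data.

```python
import pandas as pd

def first_name(entries):
    """Return the 'name' of the first dict in `entries`, or None if it is unavailable."""
    # Guard against missing values, empty lists and non-dict elements.
    if isinstance(entries, list) and entries and isinstance(entries[0], dict):
        return entries[0].get('name')
    return None

# Hypothetical sample: two well-formed rows, one empty list, one missing value.
df = pd.DataFrame({'locations': [[{'name': 'Reno, NV'}], [], None,
                                 [{'name': 'Boston Metro Area'}]]})

names = df['locations'].apply(first_name)
print(names)
```

Rows that cannot be parsed come back as missing values, so a later comparison such as names == 'Boston Metro Area' simply evaluates to False for them instead of raising.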

print (df[(df['locations'] == 'Boston Metro Area')  & 
          (df['publication_date'].between('2016-09-01', '2018-09-30'))])
           locations           publication_date
4  Boston Metro Area 2017-01-30 04:49:36.192148
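
As a side note on the date condition in the question: a chained comparison such as start < series < end does not work on a pandas Series, because Python evaluates it as two comparisons joined by and, which asks the Series for a single truth value. The sketch below, on a small hypothetical frame (the rows and the names start and end are assumptions), shows the failure and two equivalent ways to write the range test.

```python
import pandas as pd

# Hypothetical sample shaped like the cleaned frame above.
df = pd.DataFrame({
    'locations': ['New York City Metro Area', 'New York City Metro Area', 'Reno, NV'],
    'publication_date': pd.to_datetime(['2016-09-15 05:30:03',
                                        '2016-11-15 05:30:03',
                                        '2016-09-20 04:47:41']),
})

start, end = pd.Timestamp('2016-09-01'), pd.Timestamp('2016-09-30')

# A chained comparison needs bool(Series), which pandas refuses with a ValueError.
try:
    start < df['publication_date'] < end
    chained_ok = True
except ValueError:
    chained_ok = False

# Two element-wise comparisons joined with & give the same mask as between().
in_range = (df['publication_date'] >= start) & (df['publication_date'] <= end)
same = in_range.equals(df['publication_date'].between(start, end))

result = df[(df['locations'] == 'New York City Metro Area') & in_range]
print(result)
```

Each condition needs its own parentheses because & binds more tightly than the comparison operators.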

Solution with query:

print (df.query('locations ==  "Boston Metro Area" and  "2016-09-01" < publication_date < "2018-09-30"'))
           locations           publication_date
4  Boston Metro Area 2017-01-30 04:49:36.192148
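
If the location and the date bounds come from variables rather than literals, query can refer to them with the @ prefix, which avoids nesting quotes inside the expression. This is only a sketch on a small hypothetical frame; the variable names loc, start and end are assumptions.

```python
import pandas as pd

# Hypothetical sample shaped like the cleaned frame above.
df = pd.DataFrame({
    'locations': ['Boston Metro Area', 'Evanston, IL'],
    'publication_date': pd.to_datetime(['2017-01-30 04:49:36',
                                        '2016-11-15 05:30:03']),
})

loc = 'Boston Metro Area'
start = pd.Timestamp('2016-09-01')
end = pd.Timestamp('2018-09-30')

# @name looks up a Python variable from the calling scope.
result = df.query('locations == @loc and @start < publication_date < @end')
print(result)
```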

If you do not want to change the structure of the values in the locations column:

df.publication_date = pd.to_datetime(df.publication_date)
print (df)
                         locations           publication_date
0    [{'name': 'Kansas City, MO'}] 2017-01-30 04:48:11.929095
1       [{'name': 'Evanston, IL'}] 2016-11-15 05:30:03.000000
2       [{'name': 'Stamford, CT'}] 2017-01-30 04:45:24.861067
3           [{'name': 'Reno, NV'}] 2017-01-30 04:47:41.419255
4  [{'name': 'Boston Metro Area'}] 2017-01-30 04:49:36.192148

print (df[(df['locations'].apply(lambda x: x[0]['name']) == 'Boston Metro Area')  & 
          (df['publication_date'].between('2016-09-01', '2018-09-30'))])

                         locations           publication_date
4  [{'name': 'Boston Metro Area'}] 2017-01-30 04:49:36.192148
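
The filter above only looks at the first dict in each list. If a row can carry several locations, one possible variation is to test whether any entry matches. The sketch below uses hypothetical rows, and the names target and has_target are assumptions.

```python
import pandas as pd

# Hypothetical sample: the first row has two locations, the last has none.
df = pd.DataFrame({
    'locations': [
        [{'name': 'Stamford, CT'}, {'name': 'Boston Metro Area'}],
        [{'name': 'Reno, NV'}],
        [],
    ],
    'publication_date': pd.to_datetime(['2017-01-30 04:45:24',
                                        '2017-01-30 04:47:41',
                                        '2017-01-30 04:49:36']),
})

target = 'Boston Metro Area'

# True when at least one dict in the list has the wanted name; empty lists give False.
has_target = df['locations'].apply(
    lambda entries: any(d.get('name') == target for d in entries)
)

result = df[has_target & df['publication_date'].between('2016-09-01', '2018-09-30')]
print(result)
```

The locations column itself is left unchanged, so the original list structure stays available for later steps.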