Question

I want to import articles from many sources around the world, starting from a given date.

import pandas as pd
import requests
url = ('https://newsapi.org/v2/top-headlines?'
       'country=us&'
       'apiKey=de9e19b7547e44c4983ad761c104278f')
response = requests.get(url)

response_dataframe = pd.DataFrame(response.json())

articles = {article for article in response_dataframe['articles'] if article['publishedAt'] >= '2019-01-04T11:30:00Z'}
print(articles)

But I get:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-84-0f21f2f50907> in <module>
      2 response_dataframe['articles'][1]['publishedAt']
      3 
----> 4 articles = {article for article in response_dataframe['articles'] if article['publishedAt'] >= '2018-01-04T11:30:00Z'}
      5 print(articles)

<ipython-input-84-0f21f2f50907> in <setcomp>(.0)
      2 response_dataframe['articles'][1]['publishedAt']
      3 
----> 4 articles = {article for article in response_dataframe['articles'] if article['publishedAt'] >= '2018-01-04T11:30:00Z'}
      5 print(articles)

TypeError: unhashable type: 'dict'

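So how can I select a range of articles by filtering on these keys? The expected output is a dataframe that sorts the articles by date and by newspaper.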

              The New York Times                                The Washington Post                                The Financial Times  
2007-01-01    . What Sticks from '06. Somalia Orders Islamis... New Ebola Vaccine Gives 100 Percent Protecti...
2007-01-02    . Heart Health: Vitamin Does Not Prevent Death... Flurry of Settlements Over Toxic Mortgages M...
2007-01-03    . Google Answer to Filling Jobs Is an Algorith... Jason Miller Backs Out of White House Commun...
2007-01-04    . Helping Make the Shift From Combat to Commer... Wielding Claims of ‘Fake News,’ Conservative...
2007-01-05    . Rise in Ethanol Raises Concerns About Corn a... When One Party Has the Governor’s Mansion an
...

My Python version is 3.6.6.

Answer 1

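You are filtering dictionaries and then trying to put them in a set. Dictionaries are mutable and therefore unhashable, so they cannot be set elements, which is what raises the TypeError. Your expected output doesn't need deduplication, so the easiest way to avoid the error is to use a list comprehension instead. Just replace the {...} curly braces with square brackets: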

articles = [article for article in response_dataframe['articles'] if article['publishedAt'] >= '2019-01-04T11:30:00Z']
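
As a minimal sketch of the difference, the snippet below runs the same filter as a set comprehension and as a list comprehension. The records in articles_raw are hypothetical sample data, not real API output, and only carry the two keys needed for the illustration.

```python
# Hypothetical sample records shaped like News API articles.
articles_raw = [
    {"title": "Older story", "publishedAt": "2019-01-03T08:00:00Z"},
    {"title": "Newer story", "publishedAt": "2019-01-05T09:15:00Z"},
]
cutoff = "2019-01-04T11:30:00Z"

# A set has to hash every element it stores, and dicts are unhashable,
# so adding the first matching article raises TypeError.
try:
    {article for article in articles_raw if article["publishedAt"] >= cutoff}
    set_error = None
except TypeError as exc:
    set_error = str(exc)

# A list has no such requirement, so the same filter works unchanged.
# Uniformly formatted ISO 8601 UTC timestamps compare correctly as strings.
recent = [article for article in articles_raw if article["publishedAt"] >= cutoff]

print(set_error)
print([article["title"] for article in recent])
```

The string comparison only works because every timestamp uses the same format and time zone; with mixed formats it would be safer to parse the values into datetime objects first.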

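However, if you are going to put the data into a dataframe for processing, you are better off using the pandas.io.json.json_normalize() function. It can produce a dataframe for you from the list-and-dictionary structures typically loaded from a JSON source.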

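Start by loading just the article data you want into a dataframe; you can filter and rearrange from there. The following code loads all the data into a single dataframe with a new date column derived from the publishedAt information: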

import pandas as pd
from pandas.io.json import json_normalize

df = json_normalize(response.json(), 'articles')

# make the datetime column a native type, and add a date-only column
df['publishedAt'] = pd.to_datetime(df['publishedAt'])
df['date'] = df['publishedAt'].dt.date

# move source dictionary into separate columns rather than dictionaries
source_columns = df['source'].apply(pd.Series).add_prefix('source_')
df = pd.concat([df.drop(['source'], axis=1), source_columns], axis=1)

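This gives you a dataframe with all the article information, using native types, with the columns author, content, description, publishedAt, date, title, url and urlToImage, plus the source_id and source_name columns from the source mapping.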
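
To try these steps without calling the API, here is a rough sketch on a hypothetical payload shaped like a News API response. It builds the dataframe with pd.DataFrame instead of json_normalize, which is a deliberate swap: newer pandas releases may already flatten the nested source mapping when normalizing, whereas pd.DataFrame leaves source as a column of dicts so the expansion step can be shown. The final filter uses the native datetime column rather than string comparison.

```python
import datetime

import pandas as pd

# Hypothetical payload shaped like a News API response (not real data).
payload = {
    "status": "ok",
    "articles": [
        {"source": {"id": "cnn", "name": "CNN"},
         "title": "First headline",
         "publishedAt": "2019-01-04T11:30:00Z"},
        {"source": {"id": "fox-news", "name": "Fox News"},
         "title": "Second headline",
         "publishedAt": "2019-01-05T07:45:00Z"},
    ],
}

# pd.DataFrame (used here instead of json_normalize) keeps 'source' as dicts.
df = pd.DataFrame(payload["articles"])

# Make the datetime column a native type, and add a date-only column.
df["publishedAt"] = pd.to_datetime(df["publishedAt"])
df["date"] = df["publishedAt"].dt.date

# Expand the source dicts into source_id / source_name columns.
source_columns = df["source"].apply(pd.Series).add_prefix("source_")
df = pd.concat([df.drop(["source"], axis=1), source_columns], axis=1)

# Filter on the timezone-aware datetime column.
cutoff = pd.Timestamp("2019-01-05T00:00:00Z")
recent = df[df["publishedAt"] >= cutoff]

print(sorted(df.columns))
print(recent["title"].tolist())
```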

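I note that the API already lets you filter by date. I'd rely on that rather than filtering locally, since having the API return a smaller dataset saves time and bandwidth. The API also lets you apply sorting, which is a good idea too.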
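
A sketch of what server-side filtering might look like is below. The /v2/everything endpoint and the q, from and sortBy parameter names are assumptions based on my reading of the News API documentation, and the search term and key are placeholders, so check the current documentation before relying on any of them. The snippet only builds the URL; it does not send a request.

```python
from urllib.parse import urlencode

# Assumed parameter names for the /v2/everything endpoint; verify them
# against the News API documentation before use.
params = {
    "q": "politics",                # hypothetical search term
    "from": "2019-01-04T11:30:00",  # oldest publication time to include
    "sortBy": "publishedAt",        # sort by publication time
    "apiKey": "YOUR_API_KEY",       # placeholder, not a real key
}

# urlencode percent-escapes the values, e.g. ':' becomes '%3A'.
url = "https://newsapi.org/v2/everything?" + urlencode(params)
print(url)
```

The resulting URL can be passed to requests.get(url) exactly as in the question.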

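To group the rows by date and source name, you have to pivot the dataframe: the dates become the index, the source names become the columns, and the titles become the values:

df.pivot(index='date', columns='source_name', values='title')

However, this fails, because that format has room for only one title per source per day:

ValueError: Index contains duplicate entries, cannot reshape

In the JSON data served to me, there are multiple CNN and Fox News articles for today alone.

You can aggregate the multiple titles into lists instead:

pd.pivot_table(df,
    index='date', columns='source_name', values='title',
    aggfunc=list)

For the default 20 results for "today", this gives me one row per date and one column per source, with a list of titles in each cell.

Personally, I'd just keep the dataframe limited to the date, title and source name columns, with the date as the index, sorted by date and by source so that multiple titles from the same source are grouped together.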
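
The pivot behaviour described above can be reproduced on a small hypothetical dataframe. This is only a sketch: the rows are made up, with CNN deliberately given two titles on the same day, and the last step is one possible way to build the date-indexed long format.

```python
import pandas as pd

# Hypothetical rows: CNN has two titles on 2019-01-04.
df = pd.DataFrame({
    "date": ["2019-01-04", "2019-01-04", "2019-01-04", "2019-01-05"],
    "source_name": ["CNN", "CNN", "Fox News", "CNN"],
    "title": ["A", "B", "C", "D"],
})

# pivot() needs exactly one value per (date, source) cell, so it raises here.
try:
    df.pivot(index="date", columns="source_name", values="title")
    pivot_error = None
except ValueError as exc:
    pivot_error = str(exc)

# pivot_table() aggregates the duplicates; each cell becomes a list of titles.
wide = pd.pivot_table(df, index="date", columns="source_name",
                      values="title", aggfunc=list)

# Long format: one row per article, sorted by date and then by source.
long_format = (df[["date", "source_name", "title"]]
               .sort_values(["date", "source_name"])
               .set_index("date"))

print(pivot_error)
print(wide)
print(long_format)
```

The long format avoids list-valued cells, which tend to be awkward to filter or export.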

Filter dictionaries based on a date key value