Question

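I want to retrieve the different categories from a news website. I am using BeautifulSoup to get the article titles on the right-hand side. How can I loop over the various categories on the left-hand side of the site? I have only just started learning this kind of code and do not yet understand how it works. Any help would be appreciated. This is the website I am working on: http://query.nytimes.com/search/sitesearch/#/*/ Below is my code, which returns the titles of the various articles on the right-hand side: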

import json
from bs4 import BeautifulSoup
import urllib
from urllib2 import urlopen 
from urllib2 import HTTPError 
from urllib2 import URLError
import requests


resp = urlopen("https://query.nytimes.com/svc/add/v1/sitesearch.json")

content = resp.read()
j = json.loads(content)

articles = j['response']['docs']
headlines = [ article['headline']['main'] for article in articles ]
for article in articles:
    print article['headline']['main']

Answer 1

If I understand correctly, you can get these articles by changing the API query:

import requests

data_range = ['24hours', '7days', '30days', '365days']
news_feed = {}

with requests.Session() as s:
    for rng in data_range:
        news_feed[rng] = s.get('http://query.nytimes.com/svc/add/v1/sitesearch.json?begin_date={}ago&facet=true'.format(rng)).json()

You can then access the values like this:

print(news_feed) #or print(news_feed['30days'])
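
To pull only the headlines out of news_feed for each date range, something along these lines may work. It is a minimal sketch that assumes every response has the response / docs / headline / main layout used in the question; headlines_by_range is a hypothetical helper name, not part of requests or of the NYT API.

```python
def headlines_by_range(news_feed):
    """Map each date-range key to the list of headlines in its response.

    Assumes the JSON layout from the question:
    payload['response']['docs'][i]['headline']['main'].
    """
    result = {}
    for rng, payload in news_feed.items():
        docs = payload.get('response', {}).get('docs', [])
        # Skip documents without a headline instead of raising KeyError.
        result[rng] = [
            doc['headline']['main']
            for doc in docs
            if (doc.get('headline') or {}).get('main')
        ]
    return result
```

Calling headlines_by_range(news_feed)['7days'] would then give a plain list of titles for the past seven days.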

Edit

To query additional pages, you can try:

import requests

data_range = ['7days']
news_feed = {}
news_list = []
page = 1

with requests.Session() as s:
    for rng in data_range:
        while page < 20:  # this is limited to 120
            news_list.append(s.get('http://query.nytimes.com/svc/add/v1/sitesearch.json?begin_date={}ago&page={}&facet=true'.format(rng, page)).json())
            page += 1
        news_feed[rng] = news_list

for new in news_feed['7days']:
    print(new)
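
The loop above always requests a fixed number of pages. A possible variation stops at the first page that returns no documents. This is only a sketch: it assumes that a page past the end of the results comes back with an empty docs list, which the answer does not confirm, and collect_docs, nyt_page_fetcher and fetch_page are hypothetical names introduced here for illustration.

```python
def collect_docs(fetch_page, max_pages=19):
    """Gather documents page by page until a page comes back empty.

    fetch_page is any callable that takes a page number and returns the
    decoded JSON payload for that page (hypothetical helper interface).
    """
    docs = []
    for page in range(1, max_pages + 1):
        payload = fetch_page(page)
        batch = payload.get('response', {}).get('docs', [])
        if not batch:
            # Assumption: an empty page means there are no more results.
            break
        docs.extend(batch)
    return docs


def nyt_page_fetcher(rng, session):
    """Build a fetch_page callable for one date range, e.g. '7days'.

    session is expected to behave like a requests.Session; the URL and
    query parameters are the ones used in the answer above.
    """
    url = 'http://query.nytimes.com/svc/add/v1/sitesearch.json'

    def fetch_page(page):
        params = {'begin_date': rng + 'ago', 'page': page, 'facet': 'true'}
        return session.get(url, params=params).json()

    return fetch_page
```

With requests installed, usage would look like collect_docs(nyt_page_fetcher('7days', s)) inside a with requests.Session() as s: block. Passing the fetcher in as a callable keeps the paging logic testable without hitting the network.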

Answer 2

First, you can use the requests module and its built-in .json() method to parse the JSON response, instead of using urllib + json.

Example:

import requests

r = requests.get("https://query.nytimes.com/svc/add/v1/sitesearch.json")
json_data = r.json()
# rest of the code is same

Now, to scrape the Date Range tabs, first go to Developer Tools > Network > XHR. Then click any of the tabs. For example, if you click the Past 24 Hours tab, you will see an AJAX request made to this URL:

http://query.nytimes.com/svc/add/v1/sitesearch.json?begin_date=24hoursago&facet=true

If you click Past 7 Days, you will see this URL:

http://query.nytimes.com/svc/add/v1/sitesearch.json?begin_date=7daysago&facet=true

In general, you can build these URLs from a format string like this:

url = "http://query.nytimes.com/svc/add/v1/sitesearch.json?begin_date={}&facet=true"
past_24_hours = url.format('24hoursago')

r = requests.get(past_24_hours)
data = r.json()

This gives you all the news items in the JSON object data.

For example, you can get the news headlines like this:

for item in data['response']['docs']:
    print(item['headline']['main'])

Output:

Austrian Lawmakers Vote to Hinder Smoking Ban in Restaurants and Bars
Soccer-Argentine World Cup Winner Houseman Dies Aged 64
Response to UK Spy Attack Not Expected at EU Summit: French Source
Florida Man Reunites With Pet Cat Lost 14 Years Ago
Citigroup Puts Restrictions on Gun Sales
EU Exemptions From U.S. Steel Tariffs 'Possible but Not Certain': French Source
Trump Initiates Trade Action Against China
Trump’s Trade Threats Put China’s Leader on the Spot
Poland Plans Concessions in Judicial Reforms to Ease EU Concerns: Lawmaker
Florida Bridge Collapse Victim's Family Latest to Sue
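
To cover all of the date-range tabs rather than a single one, the same URL pattern can be generated in a loop. The following is a rough sketch: the 30daysago and 365daysago values are assumed to follow the pattern of the two URLs observed above (they correspond to the ranges listed in Answer 1), and BASE_URL, DATE_RANGES and build_url are hypothetical names.

```python
from urllib.parse import urlencode

BASE_URL = "http://query.nytimes.com/svc/add/v1/sitesearch.json"

# The first two values were observed in the browser's network tab;
# the last two are assumed to follow the same naming pattern.
DATE_RANGES = ["24hoursago", "7daysago", "30daysago", "365daysago"]


def build_url(begin_date):
    """Return the search URL for one date-range tab."""
    query = urlencode({"begin_date": begin_date, "facet": "true"})
    return BASE_URL + "?" + query


# One URL per tab, keyed by the begin_date value.
urls = {rng: build_url(rng) for rng in DATE_RANGES}
```

Each URL in urls can then be passed to requests.get(...).json() and the headline loop shown earlier applied to the result.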

How to use BeautifulSoup
