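I want to retrieve the different categories from a news website. I am using BeautifulSoup to get the article titles from the right-hand side of the page. How can I loop over the various categories on the left-hand side of the site? I have only just started learning this kind of code and do not yet understand how it works. Any help would be appreciated. This is the site I am working on: http://query.nytimes.com/search/sitesearch/#/*/ Below is my code, which returns the titles of the various articles from the right-hand side: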
import json
from bs4 import BeautifulSoup
import urllib
from urllib2 import urlopen
from urllib2 import HTTPError
from urllib2 import URLError
import requests
resp = urlopen("https://query.nytimes.com/svc/add/v1/sitesearch.json")
content = resp.read()
j = json.loads(content)
articles = j['response']['docs']
headlines = [ article['headline']['main'] for article in articles ]
for article in articles:
    print article['headline']['main']
Answer 0 (score: 2)
If I understand correctly, you can get those articles by changing the API query:
import requests
data_range = ['24hours', '7days', '30days', '365days']
news_feed = {}
with requests.Session() as s:
    for rng in data_range:
        news_feed[rng] = s.get('http://query.nytimes.com/svc/add/v1/sitesearch.json?begin_date={}ago&facet=true'.format(rng)).json()
You can then access the values like this:
print(news_feed) #or print(news_feed['30days'])
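Printing the whole dictionary is noisy. As a minimal sketch, assuming every response has the same response / docs / headline / main layout used in the question's code, the headlines could be pulled out per date range. The helper name headlines_by_range is hypothetical, and the sample data is made up purely to show the expected shape:

```python
def headlines_by_range(news_feed):
    """Map each date range to the list of headlines in its response.

    Assumes every value looks like the JSON from the question:
    {'response': {'docs': [{'headline': {'main': ...}}, ...]}}.
    """
    result = {}
    for rng, payload in news_feed.items():
        docs = payload.get('response', {}).get('docs', [])
        # Skip documents without a main headline instead of raising KeyError.
        result[rng] = [
            doc['headline']['main']
            for doc in docs
            if doc.get('headline', {}).get('main')
        ]
    return result


# Made-up sample with the assumed layout (not real API output).
sample_feed = {
    '24hours': {'response': {'docs': [{'headline': {'main': 'Example A'}}]}},
    '7days': {'response': {'docs': [{'headline': {'main': 'Example B'}},
                                    {'headline': {}}]}},
}
for rng, titles in headlines_by_range(sample_feed).items():
    print(rng, titles)
```

In real use, the news_feed dictionary built above would be passed in place of sample_feed.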
Edit
To query additional pages, you can try:
import requests
data_range = ['7days']
news_feed = {}
with requests.Session() as s:
    for rng in data_range:
        news_list = []  # start a fresh list of pages for each date range
        page = 1  # restart paging for each date range
        while page < 20:  # this is limited to 120
            news_list.append(s.get('http://query.nytimes.com/svc/add/v1/sitesearch.json?begin_date={}ago&page={}&facet=true'.format(rng, page)).json())
            page += 1
        news_feed[rng] = news_list
for new in news_feed['7days']:
    print(new)
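A fixed page count can either stop too early or keep requesting empty pages. One hedged alternative is to stop as soon as a page comes back without documents. The sketch below assumes that an empty docs list marks the end of the results, which this answer does not confirm. The names collect_pages, fetch_page and fake_fetch are hypothetical, and fake_fetch stands in for the real request so the example runs offline:

```python
def collect_pages(fetch_page, max_pages=20):
    """Call fetch_page(page) for page 1, 2, ... and collect the responses.

    Stops after max_pages pages, or earlier if a page has no documents
    (assumption: an empty 'docs' list means there are no more results).
    """
    pages = []
    for page in range(1, max_pages + 1):
        payload = fetch_page(page)
        docs = payload.get('response', {}).get('docs', [])
        if not docs:
            break
        pages.append(payload)
    return pages


# Stand-in for the real request so the sketch runs offline:
# pages 1-3 contain one made-up document each, later pages are empty.
def fake_fetch(page):
    docs = [{'headline': {'main': 'Example %d' % page}}] if page <= 3 else []
    return {'response': {'docs': docs}}


print(len(collect_pages(fake_fetch)))
```

With a real session, fetch_page would wrap the s.get(...).json() call from the loop above, taking the page number as its only argument.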
Answer 1 (score: 1)
First, you can use the requests module and its built-in .json() function instead of urllib + json to parse the JSON response.
Example:
import requests
r = requests.get("https://query.nytimes.com/svc/add/v1/sitesearch.json")
json_data = r.json()
# the rest of the code is the same
Now, to scrape the Date Range tabs, first go to Developer Tools > Network > XHR. Then click any of the tabs. For example, if you click the Past 24 Hours tab, you will see an AJAX request made to this URL:
http://query.nytimes.com/svc/add/v1/sitesearch.json?begin_date=24hoursago&facet=true
If you click Past 7 Days, you will see this URL:
http://query.nytimes.com/svc/add/v1/sitesearch.json?begin_date=7daysago&facet=true
In general, you can build these URLs with a format string:
url = "http://query.nytimes.com/svc/add/v1/sitesearch.json?begin_date={}&facet=true"
past_24_hours = url.format('24hoursago')
r = requests.get(past_24_hours)
data = r.json()
This gives you all the news items in the JSON object data.
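The same URL pattern can also be assembled from its query parameters rather than by string formatting. This is only a sketch: build_search_url and BASE_URL are hypothetical names, the '30daysago' and '365daysago' values are taken from the date ranges in the other answer, and the optional page parameter is assumed to behave as shown there.

```python
from urllib.parse import urlencode

BASE_URL = "http://query.nytimes.com/svc/add/v1/sitesearch.json"


def build_search_url(begin_date, page=None):
    """Build a search URL for a date range such as '24hoursago'."""
    params = [('begin_date', begin_date)]
    if page is not None:
        # Only add the page parameter when a specific page is requested.
        params.append(('page', page))
    params.append(('facet', 'true'))
    return BASE_URL + '?' + urlencode(params)


for begin_date in ('24hoursago', '7daysago', '30daysago', '365daysago'):
    print(build_search_url(begin_date))
```

Each resulting URL can be passed to requests.get() exactly like past_24_hours above.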
For example, you can get the news headlines like this:
for item in data['response']['docs']:
    print(item['headline']['main'])
Output:
Austrian Lawmakers Vote to Hinder Smoking Ban in Restaurants and Bars
Soccer-Argentine World Cup Winner Houseman Dies Aged 64
Response to UK Spy Attack Not Expected at EU Summit: French Source
Florida Man Reunites With Pet Cat Lost 14 Years Ago
Citigroup Puts Restrictions on Gun Sales
EU Exemptions From U.S. Steel Tariffs 'Possible but Not Certain': French Source
Trump Initiates Trade Action Against China
Trump’s Trade Threats Put China’s Leader on the Spot
Poland Plans Concessions in Judicial Reforms to Ease EU Concerns: Lawmaker
Florida Bridge Collapse Victim's Family Latest to Sue