Python urllib does not extract reader comments from a website

Time: 2016-10-14 18:26:27

Tags: python web-scraping urllib

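I am trying to extract the reader comments from the page below using the code shown. However, the output file test.html does not contain any of the comments from the page. How can I get this information with Python?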

http://www.theglobeandmail.com/opinion/it-doesnt-matter-who-won-the-debate-america-has-already-lost/article32314064/comments/

from bs4 import BeautifulSoup
import urllib.request

req = urllib.request.Request('http://www.theglobeandmail.com/opinion/it-doesnt-matter-who-won-the-debate-america-has-already-lost/article32314064/comments/')
response = urllib.request.urlopen(req)
the_page = response.read()

soup = BeautifulSoup(the_page, 'html.parser')
# Write the prettified HTML out so it can be inspected by hand.
with open('test.html', 'w', encoding='utf-8') as f:
    f.write(soup.prettify())

Thanks!

1 answer:

Answer 0 (score: 0)

The comments are not part of the page's HTML; they are retrieved with an AJAX request, which you can mimic.

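The request takes a lot of parameters, but the ones below are enough to get a result. I will leave it to you to figure out how the others affect the results: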

from json import loads
from urllib.request import urlopen
from urllib.parse import urlencode

# These parameters are enough to get a result.
data = {
    "categoryID": "Production",
    "streamID": "32314064",  # the article number from the page URL
    "APIKey": "2_oNjjtSC8Qc250slf83cZSd4sbCzOF4cCiqGIBF8__5dWzOJY_MLAoZvds76cHeQD",
    "callback": "foo",
}
# Passing data= makes urlopen send a POST with a form-encoded body.
r = urlopen("http://comments.us1.gigya.com/comments.getComments", data=urlencode(data).encode("utf-8"))
json_dcts = loads(r.read().decode("utf-8"))["comments"]

print(json_dcts)

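This gives you a list of dicts containing every comment along with its upvotes, downvotes, and so on. If you want to parse the key out of the page, it is in the URL of one of the script tags, src='https://cdns.gigya.com/js/socialize.js?apiKey=2_oNjjtSC8Qc250slf83cZSd4sbCzOF4cCiqGIBF8__5dWzOJY_MLAoZvds76cHeQD', and the streamID is in the original URL.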
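
In case it helps, here is a minimal sketch of pulling both values out programmatically instead of hard-coding them. The helper names extract_api_key and extract_stream_id are made up for illustration, and the regular expressions assume that the script tag and the /article<digits>/ part of the path look like the ones shown above; the site may change either at any time.

```python
import re
from urllib.parse import urlparse, parse_qs


def extract_api_key(html):
    """Return the apiKey query value of the first socialize.js script URL, or None.

    Assumes the key appears in a src attribute pointing at cdns.gigya.com.
    """
    match = re.search(
        r"""src=["'](https?://cdns\.gigya\.com/js/socialize\.js\?[^"']+)["']""",
        html,
    )
    if match is None:
        return None
    # parse_qs maps each query parameter to a list of its values.
    values = parse_qs(urlparse(match.group(1)).query).get("apiKey")
    return values[0] if values else None


def extract_stream_id(article_url):
    """Return the digits that follow 'article' in the URL path, or None."""
    match = re.search(r"/article(\d+)(?:/|$)", urlparse(article_url).path)
    return match.group(1) if match else None
```

You would pass the HTML you already downloaded (decoded to str) to extract_api_key and the page URL to extract_stream_id, then put the two results into the APIKey and streamID entries of the data dict.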