I wrote a Python web scraper like this:
import urllib2,cookielib
from BeautifulSoup import BeautifulSoup
url = 'http://www.nseindia.com/live_market/dynaContent/live_analysis/top_gainers_losers.htm?cat=G&utm_campaign=website&utm_source=sendgrid.com&utm_medium=email'
hdr = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.116 Safari/537.36',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
}
req = urllib2.Request(url, headers=hdr)
try:
    page = urllib2.urlopen(req)
except urllib2.HTTPError, e:
    print e.fp.read()
content = page.read()
print content
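The page at this URL contains a table I need to scrape, but when I run this code, the returned HTML is missing a large number of <tr> and <td> tags. How can I print the complete HTML?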
Answer 0 (score: 0)
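Assuming your question is really "How do I get the data out of the table?" and not "How do I get the HTML as I see it in my web browser?", then, as pointed out in the comments, the solution is to use Firebug or Chrome's developer tools to find where the content you want actually comes from. In this case the rows appear to be loaded separately as JSON, which can be requested directly: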
import requests
import json
r = requests.get("http://www.nseindia.com/live_market/dynaContent/"
                 "live_analysis/gainers/niftyGainers1.json")
data_as_json = json.loads(r.content)
for stock_info in data_as_json['data']:
    for key, value in stock_info.items():
        print key, value
(I prefer using requests over urllib2 for working with HTTP.)
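For readers on Python 3, the same idea might be sketched roughly as follows. This is only an illustration under some assumptions: that the endpoint still returns a JSON object with a top-level `data` list of dictionaries (as the loop above implies), and that the URL from the answer is still reachable, which may no longer be true. The names `iter_fields`, `fetch_text`, and `SAMPLE`, along with the sample field names and values, are hypothetical and not taken from the site.

```python
import json


def iter_fields(payload_text):
    """Yield (row_index, key, value) for every field of every row in payload['data']."""
    payload = json.loads(payload_text)
    for index, row in enumerate(payload.get("data", [])):
        for key, value in row.items():
            yield index, key, value


def fetch_text(url, timeout=10):
    """Download the body of `url` as text (needs the third-party `requests` package)."""
    import requests  # imported here so the parsing helper works without it installed

    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=timeout)
    response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
    return response.text


# Hypothetical payload with made-up field names, shaped like the assumed response.
SAMPLE = '{"data": [{"symbol": "ABC", "ltp": "101.5"}]}'

if __name__ == "__main__":
    # Swap SAMPLE for fetch_text(<the JSON URL from the answer>) to try it live.
    for index, key, value in iter_fields(SAMPLE):
        print(index, key, value)
```

Keeping the download and the parsing in separate functions means the parsing step can be checked against a saved response without hitting the site on every run.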