Scraping a website in Python with Requests and lxml

Date: 2015-09-08 00:56:29

Tags: python lxml python-requests scrape pyquery

Using this as a starting point: http://docs.python-guide.org/en/latest/scenarios/scrape/

from lxml import html
import requests
page = requests.get('http://econpy.pythonanywhere.com/ex/001.html')
tree = html.fromstring(page.text)

Everything works as expected. However, this:

from lxml import html
import requests

page = requests.get('http://www.streetinsider.com/ipo_history.php?type=upcoming')
tree = html.fromstring(page.text)

gives this error:

File "<string>", line unknown
XMLSyntaxError: line 1: Document is empty

Using pyquery instead:

from pyquery import PyQuery as pq
from lxml import etree,html
import requests


response = pq(url='http://www.streetinsider.com/ipo_history.php?type=upcoming')

doc = pq(response.content)

throws this error:

File "<string>", line unknown
XMLSyntaxError: line 1504: Unexpected end tag : h2

Any help with getting the table from this web page would be appreciated.

1 answer:

Answer 0 (score: 2)

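Some websites detect and block certain user agents (web bots, for example). The web application behind www.streetinsider.com appears to detect the python-requests user agent and (passively) block its HTTP requests, which would explain the empty document that lxml complains about.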
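To see what the server is reacting to, it can help to look at the User-Agent header that Requests sends by default. The following is a minimal sketch that only prepares the request and never sends it, so it needs no network access; the variable names are illustrative and not part of the original question.

```python
import requests

# Prepare (but do not send) the request, so the default headers
# that a Session would attach can be inspected offline.
session = requests.Session()
request = requests.Request(
    'GET',
    'http://www.streetinsider.com/ipo_history.php',
    params={'type': 'upcoming'},
)
prepared = session.prepare_request(request)

# Requests identifies itself as "python-requests/<version>" unless told otherwise.
print(prepared.headers['User-Agent'])
print(prepared.url)
```

A server that filters on this header can recognize the client as a script from the first value alone.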

You can set the user agent by passing headers={'User-Agent': '...'} as an argument to the requests.get call.

page = requests.get('http://www.streetinsider.com/ipo_history.php',
                    headers={'User-Agent': 'tester'},
                    params={'type': 'upcoming'})
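Putting the pieces together, a possible end-to-end version might look like the sketch below. It is hedged in several ways: the helper names fetch_tree and table_rows are hypothetical, the XPath '//table//tr' is an assumption about the page layout, and whether the site still accepts the 'tester' user agent is not something this sketch can guarantee.

```python
from lxml import html
import requests


def fetch_tree(url, params=None, user_agent='tester'):
    """Download a page with a custom User-Agent and parse it with lxml.

    fetch_tree is a hypothetical helper; 'tester' is the value used in the answer.
    """
    response = requests.get(url, params=params,
                            headers={'User-Agent': user_agent}, timeout=10)
    response.raise_for_status()
    # A blocked request may come back with an empty body, which is what
    # makes lxml report "Document is empty"; fail early with a clearer message.
    if not response.content.strip():
        raise ValueError('empty response body from %s' % response.url)
    return html.fromstring(response.content)


def table_rows(tree):
    """Return the text of every table row as a list of cell strings.

    The XPath is an assumption: it collects rows from all tables on the page.
    """
    rows = []
    for tr in tree.xpath('//table//tr'):
        cells = [cell.text_content().strip() for cell in tr.xpath('./th|./td')]
        if cells:
            rows.append(cells)
    return rows
```

Calling table_rows(fetch_tree('http://www.streetinsider.com/ipo_history.php', params={'type': 'upcoming'})) should then give one list per table row, which can be narrowed down once the actual table structure is known.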