I'm having some trouble using lxml.html.parse().
Here is my code (abbreviated):
import lxml.html

class Scraper:
    def fetch(self, url):
        tree = None
        try:
            parser = lxml.html.HTMLParser(encoding='utf8')
            tree = lxml.html.parse(url, parser)
        except IOError as e:
            print('ERROR LOADING PAGE: ' + str(e))
        return tree
It works most of the time, but sometimes I get a lot of errors like these:
ERROR LOADING PAGE: Error reading file 'b'http://twitter.com/wordpressdotcom'': b'failed to load external entity "http://twitter.com/wordpressdotcom"'
ERROR LOADING PAGE: Error reading file 'b'http://www.amazon.com/gp/offer-listing/0375714634/ref=la_B001IGSNMM_1_9_cp_1_pap_olp/185-7720102-5178158?s=books&ie=UTF8&qid=1391249475&sr=1-9&condition=collectible'': b'failed to load HTTP resource'
ERROR LOADING PAGE: Error reading file 'b'http://plugins.trac.wordpress.org/changeset/559098'': b'failed to load external entity "http://plugins.trac.wordpress.org/changeset/559098"'
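I've looked at other questions and answers here, but all they suggest is using urllib, and when I tried it, that didn't really help.
What I want is to disable loading of "external entities", whatever that means. All I want is the HTML of the given URL.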
Answer 0 (score: 1)
When I sniffed the traffic with Wireshark, I saw the following.
For http://twitter.com/wordpressdotcom:
GET /wordpressdotcom HTTP/1.0
Host: twitter.com
Accept-Encoding: gzip
HTTP/1.0 301 Moved Permanently
content-length: 0
date: Sat, 01 Feb 2014 12:08:01 UTC
location: https://twitter.com/wordpressdotcom
server: tfe
set-cookie: guest_id=v1%3A139125648190241848; Domain=.twitter.com; Path=/; Expires=Mon, 01-Feb-2016 12:08:01 UTC
For http://www.amazon.com/gp/offer-listing/0375714634/ref=la_B001IGSNMM_1_9_cp_1_pap_olp/185-7720102-5178158?s=books&ie=UTF8&qid=1391249475&sr=1-9&condition=collectible:
GET /gp/offer-listing/0375714634/ref=la_B001IGSNMM_1_9_cp_1_pap_olp/185-7720102-5178158?s=books&ie=UTF8&qid=1391249475&sr=1-9&condition=collectible HTTP/1.0
Host: www.amazon.com
Accept-Encoding: gzip
HTTP/1.1 503 Service Unavailable
Date: Sat, 01 Feb 2014 12:10:49 GMT
Server: Server
Last-Modified: Fri, 30 Nov 2012 01:26:22 GMT
ETag: "3dd-4cfac498acb80-gzip"
Accept-Ranges: bytes
Vary: Accept-Encoding,User-Agent
Content-Encoding: gzip
Content-Length: 599
Connection: close
Content-Type: text/html
[gzip-compressed response body, 599 bytes]
For http://plugins.trac.wordpress.org/changeset/559098:
GET /changeset/559098 HTTP/1.0
Host: plugins.trac.wordpress.org
Accept-Encoding: gzip
HTTP/1.1 302 Found
Date: Sat, 01 Feb 2014 12:13:06 GMT
Server: Apache
Location: https://plugins.trac.wordpress.org/changeset/559098
Vary: Accept-Encoding
Content-Encoding: gzip
Content-Length: 242
Connection: close
Content-Type: text/html; charset=iso-8859-1
[gzip-compressed response body, 242 bytes]
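lxml apparently cannot follow these redirects (both point to https:// URLs), and for the Amazon case, which returns 503 Service Unavailable, you probably need to send a real "User-Agent" header.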
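One possible workaround is to download the page yourself and hand only the bytes to lxml. The following is a minimal sketch, not a definitive fix: the helper names build_request and fetch_html and the USER_AGENT value are made up for illustration, and whether a given site accepts that User-Agent is an assumption. It relies on urllib.request.urlopen following 301/302 redirects on its own.

```python
import urllib.request

# Assumption: an illustrative browser-like string, not a value from the question.
USER_AGENT = "Mozilla/5.0 (compatible; ExampleScraper/1.0)"


def build_request(url, user_agent=USER_AGENT):
    """Return a Request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, timeout=10):
    """Download url and return the raw body bytes, or None on failure.

    urlopen follows 301/302 redirects (including http -> https ones)
    before returning the final response.
    """
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
            return resp.read()
    except OSError as e:
        # URLError, HTTPError (e.g. 503) and socket timeouts are all OSError subclasses.
        print('ERROR LOADING PAGE: ' + str(e))
        return None
```

The returned bytes can then be parsed without lxml touching the network, for example with lxml.html.fromstring(html), or with lxml.html.parse(io.BytesIO(html), parser) if you want to keep the parser from the question.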