How to prevent the lxml error "failed to load external entity"

Time: 2014-02-01 10:26:51

Tags: python html linux parsing lxml

I'm having some trouble using lxml.html.parse().

Here is my code (abbreviated):

import lxml.html

class Scraper:

    def fetch(self, url):
        tree = None
        try:
            # Passing a URL makes lxml (via libxml2) download the page itself.
            parser = lxml.html.HTMLParser(encoding='utf8')
            tree = lxml.html.parse(url, parser)
        except IOError as e:
            print('ERROR LOADING PAGE: ' + str(e))
        return tree

It works most of the time, but for some URLs I get errors like these:

ERROR LOADING PAGE: Error reading file 'b'http://twitter.com/wordpressdotcom'': b'failed to load external entity "http://twitter.com/wordpressdotcom"'

ERROR LOADING PAGE: Error reading file 'b'http://www.amazon.com/gp/offer-listing/0375714634/ref=la_B001IGSNMM_1_9_cp_1_pap_olp/185-7720102-5178158?s=books&ie=UTF8&qid=1391249475&sr=1-9&condition=collectible'': b'failed to load HTTP resource'

ERROR LOADING PAGE: Error reading file 'b'http://plugins.trac.wordpress.org/changeset/559098'': b'failed to load external entity "http://plugins.trac.wordpress.org/changeset/559098"'

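I have looked at other questions and answers here, but all they suggest is to use urllib, and when I tried that it did not really help.

What I want is to disable loading of "external entities", whatever that means. All I want is the HTML of the given URL.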

1 answer:

Answer 0: (score: 1)

When I sniffed the traffic with Wireshark, this is what I saw:

http://twitter.com/wordpressdotcom

GET /wordpressdotcom HTTP/1.0
Host: twitter.com
Accept-Encoding: gzip

HTTP/1.0 301 Moved Permanently
content-length: 0
date: Sat, 01 Feb 2014 12:08:01 UTC
location: https://twitter.com/wordpressdotcom
server: tfe
set-cookie: guest_id=v1%3A139125648190241848; Domain=.twitter.com; Path=/; Expires=Mon, 01-Feb-2016 12:08:01 UTC

http://www.amazon.com/gp/offer-listing/0375714634/ref=la_B001IGSNMM_1_9_cp_1_pap_olp/185-7720102-5178158?s=books&ie=UTF8&qid=1391249475&sr=1-9&condition=collectible

GET /gp/offer-listing/0375714634/ref=la_B001IGSNMM_1_9_cp_1_pap_olp/185-7720102-5178158?s=books&ie=UTF8&qid=1391249475&sr=1-9&condition=collectible HTTP/1.0
Host: www.amazon.com
Accept-Encoding: gzip

HTTP/1.1 503 Service Unavailable
Date: Sat, 01 Feb 2014 12:10:49 GMT
Server: Server
Last-Modified: Fri, 30 Nov 2012 01:26:22 GMT
ETag: "3dd-4cfac498acb80-gzip"
Accept-Ranges: bytes
Vary: Accept-Encoding,User-Agent
Content-Encoding: gzip
Content-Length: 599
Connection: close
Content-Type: text/html

[gzip-compressed response body, 599 bytes]

For http://plugins.trac.wordpress.org/changeset/559098

GET /changeset/559098 HTTP/1.0
Host: plugins.trac.wordpress.org
Accept-Encoding: gzip

HTTP/1.1 302 Found
Date: Sat, 01 Feb 2014 12:13:06 GMT
Server: Apache
Location: https://plugins.trac.wordpress.org/changeset/559098
Vary: Accept-Encoding
Content-Encoding: gzip
Content-Length: 242
Connection: close
Content-Type: text/html; charset=iso-8859-1

[gzip-compressed response body, 242 bytes]

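lxml apparently cannot handle these redirects (both the 301 and the 302 above point to https:// URLs), and for the Amazon case, which answers with a 503, you probably need to send a real "User-Agent" header.

You should use another library to download the page content, such as requests or urllib(2), and then feed that HTML to lxml.html.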
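
As a minimal sketch of that approach, assuming Python 3 and the standard-library urllib.request (requests would work much the same way): the page is downloaded first, with redirects followed by urllib, and only the resulting bytes are handed to lxml. The function names download, parse_page and fetch and the USER_AGENT value are illustrative assumptions, not part of the question's code.

```python
import urllib.request

import lxml.html

# Hypothetical header value (an assumption); any realistic browser-like
# string should serve the same purpose.
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) ExampleScraper/0.1"


def download(url, timeout=10):
    """Return the raw response body; urllib follows 301/302 redirects itself."""
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


def parse_page(html_bytes, base_url=None):
    """Parse HTML that has already been downloaded; no network access here."""
    return lxml.html.fromstring(html_bytes, base_url=base_url)


def fetch(url):
    """Download and parse a page, returning None on network errors."""
    try:
        return parse_page(download(url), base_url=url)
    except OSError as e:  # urllib.error.URLError is a subclass of OSError
        print('ERROR LOADING PAGE: ' + str(e))
        return None
```

With this split, fetch('http://twitter.com/wordpressdotcom') would leave the redirect to urllib instead of lxml. Whether a given site accepts the request still depends on the headers it expects, so the User-Agent string may need adjusting.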