Python / HTML: How do I scrape a web page's content without the cookie notice?

Time: 2015-09-02 14:07:11

Tags: javascript python html cookies

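I am trying to scrape the content of a web page with Python. I can get everything I need, but the returned HTML also contains the cookie notice. I would like to remove it, but I don't know how to exclude it, either in the XPath query or from the HTML content afterwards. You can find the notice in the footer of the page. Webpage here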

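#!C:/Python27/python
from lxml import etree
import requests
import cgi

# Read the target URL from the CGI request parameters
fs = cgi.FieldStorage()
q = fs.getfirst("URL")

page = requests.get(q)

if q.find("http://www.dlib.org") != -1:
    # dlib.org pages: parse as HTML and select the content table by position
    tree = etree.HTML(page.text)
    element = tree.xpath('./body/form/table[3]/tr/td/table[5]')
else:
    # Other pages: parse as XML and select the element with id="content"
    p = etree.XMLParser(remove_blank_text=True, resolve_entities=False)
    tree = etree.fromstring(page.content, p)
    element = tree.xpath('.//*[@id="content"]')

# Serialize the first match and send it back as the CGI response body
content = etree.tostring(element[0])

print "Content-type: text\n\n"
print content.strip()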

1 Answer:

Answer 0 (score: 1)

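For the page you specified, the cookie notice is in a div with id=cookiesAlert. You can search for that div with lxml's xpath() method and remove it from the tree, as shown below: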

if q.find("http://www.dlib.org") != -1:
    tree = etree.HTML(page.text)
    element = tree.xpath('./body/form/table[3]/tr/td/table[5]')
else:
    p = etree.XMLParser(remove_blank_text=True, resolve_entities=False)
    tree = etree.fromstring(page.content, p)
    element = tree.xpath('.//*[@id="content"]')
    # Find the cookie notice inside the selected content...
    cookies_alert = element[0].xpath('.//*[@id="cookiesAlert"]')
    for ca in cookies_alert:
        # ...and detach each match from its parent before serializing
        ca.getparent().remove(ca)
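
To try the removal step without the CGI wrapper or a network request, here is a minimal, self-contained sketch of the same idea. The sample markup and the helper name extract_without_cookie_alert are made up for illustration; only the ids content and cookiesAlert come from the page discussed above. Note that in lxml, removing an element also drops its tail text (the text that follows its closing tag), so the sketch re-attaches that text first. It parses with etree.HTML for simplicity, whereas the snippet above uses an XML parser for this branch.

```python
from lxml import etree

# Hypothetical stand-in for the downloaded page; only the two id values
# ("content" and "cookiesAlert") are taken from the original post.
SAMPLE_HTML = """
<html><body>
  <div id="content">
    <p>Article text</p>
    <div id="cookiesAlert">We use cookies.</div>
  </div>
</body></html>
"""


def extract_without_cookie_alert(html_text, content_id="content",
                                 alert_id="cookiesAlert"):
    """Return the serialized content element with the cookie notice removed,
    or None if no element with content_id exists."""
    tree = etree.HTML(html_text)
    # XPath variables ($cid, $aid) avoid building the query by string formatting
    matches = tree.xpath('.//*[@id=$cid]', cid=content_id)
    if not matches:
        return None
    content = matches[0]

    for node in content.xpath('.//*[@id=$aid]', aid=alert_id):
        parent = node.getparent()
        # remove() would also discard node.tail, so move it to a neighbour first
        if node.tail:
            previous = node.getprevious()
            if previous is not None:
                previous.tail = (previous.tail or "") + node.tail
            else:
                parent.text = (parent.text or "") + node.tail
        parent.remove(node)

    return etree.tostring(content, encoding="unicode", method="html")


if __name__ == "__main__":
    print(extract_without_cookie_alert(SAMPLE_HTML))
```

Returning None when the content element is missing avoids the IndexError that element[0] would raise on an empty XPath result.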