Question

I am using this code to parse a web page:

import requests
from io import StringIO
from lxml import etree

# url is the address of the page being parsed (not shown here)
s = requests.Session()
response = s.get(url, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36'})
# response.text contains both HTML blocks
tex = response.text

parser = etree.HTMLParser()
tree = etree.parse(StringIO(tex), parser)
result = etree.tostring(tree.getroot(), pretty_print=True, method="html")
# result contains only the first HTML block
print(result.decode())

When I print response.text, it gives output like this:

<html><head><script></script></head></html>          
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
</head>
</html>

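This shows that the page contains two <html> blocks. When I run the parser, it parses only the first <html> block and ignores the second, so the output of

result = etree.tostring(tree.getroot(), pretty_print=True, method="html") is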

<html><head><script></script></head></html>

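I need the second <html> block, not the first. Please advise how to skip the first <html> block and parse the second. I have tried every parser I could find and none of them solves it. Can you also explain why this happens?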
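
A possible workaround, offered as a minimal sketch rather than a confirmed fix: if the block of interest always starts at the <!DOCTYPE declaration, as in the sample output above, the leading block can be cut off before the text reaches the parser. The helper name second_document is hypothetical, and the sample string simply copies the output shown above.

```python
import re


def second_document(text):
    """Return text from the first <!DOCTYPE declaration onward.

    Assumption: the block of interest begins at the DOCTYPE declaration.
    If no DOCTYPE is found, the text is returned unchanged.
    """
    match = re.search(r"<!DOCTYPE", text, flags=re.IGNORECASE)
    return text[match.start():] if match else text


# Sample copied from the printed response.text in the question.
sample = (
    '<html><head><script></script></head></html>\n'
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n'
    '<html xmlns="http://www.w3.org/1999/xhtml">\n'
    '<head>\n'
    '</head>\n'
    '</html>\n'
)

trimmed = second_document(sample)
print(trimmed)
```

The trimmed string could then be passed to etree.parse(StringIO(...), parser) in place of tex. This only sidesteps the first block; it does not explain why the parser stops after it.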

Parsing a web page with two <html> tags

0 answers: