Question

Consider the following HTML:

<item>
<title>this is the title</title>
<link>www.linktoawebsite.com</link>
</item>

I am using lxml (Python) with XPath and trying to extract the contents of the title tag as well as the link tag. The code is:

page=urllib.urlopen(url).read()
x=etree.HTML(page)
titles=x.xpath('//item/title/text()')
links=x.xpath('//item/link/text()')

However, the links query returns an empty list. The following, on the other hand, does return a link element:

links=x.xpath('//item/link')        #returns <Element link at 0xb6b0ae0c>

Can someone suggest how to extract the URL from the link tag?

Answer 1

When the content is parsed with etree.HTML(), the <link> tag is closed immediately, so the link element has no text value.

Demo:

>>> from lxml import etree
>>> content = """<item>
... <title>this is the title</title>
... <link>www.linktoawebsite.com</link>
... </item>"""
>>> x = etree.HTML(content)
>>> etree.tostring(x)
'<html><body><item>\n<title>this is the title</title>\n<link/>www.linktoawebsite.com\n</item></body></html>'
>>>

As far as HTML is concerned, a <link> tag with text content is not valid markup.

In HTML, I believe a link tag is structured like this:

<head>
<link rel="stylesheet" type="text/css" href="theme.css">
</head>
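
If you have to stay with the HTML parser, the URL is not lost. As the tostring() output above shows, it ends up as text that follows the empty <link/> element, which lxml exposes as that element's .tail attribute. A minimal sketch, assuming the same sample markup as in the question:

```python
from lxml import etree

content = """<item>
<title>this is the title</title>
<link>www.linktoawebsite.com</link>
</item>"""

x = etree.HTML(content)

# The HTML parser treats <link> as an empty element, so the URL is stored
# as the "tail" text that follows each link element, not as its .text.
urls = [(link.tail or "").strip() for link in x.xpath('//item/link')]
print(urls)
```

This is a workaround rather than a fix; it relies on how the HTML parser happens to recover from the unexpected markup.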

Answer 2

You are using the wrong parser for the job; what you have is not HTML, it is XML.

A proper HTML parser will ignore the contents of a <link> tag, because in the HTML specification that tag is always empty.

Use the etree.parse() function to parse your URL stream (no separate .read() call is needed):

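import urllib                      # Python 2; on Python 3 use urllib.request.urlopen
from lxml import etree

response = urllib.urlopen(url)     # url as defined in the question
tree = etree.parse(response)       # parse as XML straight from the stream

titles = tree.xpath('//item/title/text()')
links = tree.xpath('//item/link/text()')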

You could also use etree.fromstring(page), but leaving the reading to the parser is easier.
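
To illustrate the difference without a network call, here is a small sketch that feeds the sample markup from the question to the XML parser through etree.fromstring(). The content string is just the snippet from the question, standing in for a downloaded page:

```python
from lxml import etree

content = """<item>
<title>this is the title</title>
<link>www.linktoawebsite.com</link>
</item>"""

# Parsed as XML, <link> is an ordinary element and keeps its text content.
root = etree.fromstring(content)

titles = root.xpath('//item/title/text()')
links = root.xpath('//item/link/text()')
print(titles, links)
```

With the XML parser, the original XPath expressions work unchanged.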

Extracting a hyperlink from a link tag using XPath
