Question

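I select an HTML tag with XPath, using some conditions, and currently read its value with text(). Is there a way to get an attribute out of that value (the text())?

The value of text():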

document.write("<a href="http://www...">hello</a>");

At this point I get the whole line (fine so far). Now I want to get the /@href from that value.

Here is my code:

import lxml.html

# single quotes outside, so the double quotes in the markup need no escaping
code = '...<script>document.write("<a href="http://www...">hello</a>"); </script>...'

doc = lxml.html.fromstring(code)
value = doc.xpath("//script[contains(text(), 'document.write') and (contains(text(),'href'))]//text()")

I could try a regular expression, but maybe there is another good way to solve this with XPath.

Thanks

Answer 1

You can avoid using a regex by calling LH.fromstring on the text inside the <script> tag:

import lxml.html as LH
code = '...<script>document.write("<a href="http://www...">hello</a>"); </script>...'

doc = LH.fromstring(code)
for text in doc.xpath( "//script[contains(text(), 'document.write') and (contains(text(),'href'))]//text()" ):
    script = LH.fromstring(text)
    print(script.xpath('//a/@href'))

yields

['http://www...']
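
If the page has several script blocks, the same two-pass idea can be wrapped in a function. This is only a sketch: the name hrefs_written_by_scripts is made up for illustration, and it assumes the markup passed to document.write is well-formed enough for lxml to recover the <a> element.

```python
import lxml.html as LH


def hrefs_written_by_scripts(source):
    """Collect href values from markup embedded in document.write() calls."""
    doc = LH.fromstring(source)
    hrefs = []
    # Same predicate as in the question: only scripts whose text mentions
    # both 'document.write' and 'href' are considered.
    script_texts = doc.xpath(
        "//script[contains(text(), 'document.write')"
        " and contains(text(), 'href')]/text()"
    )
    for text in script_texts:
        # Second pass: parse the JavaScript source as if it were HTML,
        # so the embedded <a> tag becomes a real element.
        fragment = LH.fromstring(text)
        hrefs.extend(fragment.xpath("//a/@href"))
    return hrefs


code = '...<script>document.write("<a href="http://www...">hello</a>"); </script>...'
print(hrefs_written_by_scripts(code))
```

For the sample string this should print the same list as shown above.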

Answer 2

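We have to follow these steps to get the href value of the "a" tag from inside the "script" tag:

Get the text of the "script" tag with the getiterator method.
Create script_root by parsing the text of the "script" tag again.
Find the href attribute of the "a" tag with the getiterator method.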


from lxml import html

code = """<script>document.write("<a href="http://www...">hello</a>"); </script>"""
root = html.fromstring(code)
# iter() is the current name for the deprecated getiterator()
for i in root.iter("script"):
    # parse the script's text a second time, as HTML
    script_root = html.fromstring(i.text)
    for j in script_root.iter("a"):
        try:
            print("href:-", j.attrib["href"])
        except KeyError:  # this <a> has no href attribute
            pass
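
For comparison, the regex route mentioned in the question can be sketched with the standard library alone. The names SCRIPT_RE, HREF_RE and hrefs_from_script_text are made up for illustration, and the patterns assume double-quoted href values and no nested script tags, so treat this as a fallback rather than a general HTML parser.

```python
import re

# Hypothetical patterns: they only cover double-quoted href attributes.
SCRIPT_RE = re.compile(r"<script\b[^>]*>(.*?)</script>", re.IGNORECASE | re.DOTALL)
HREF_RE = re.compile(r'<a\b[^>]*?\bhref="([^"]*)"', re.IGNORECASE)


def hrefs_from_script_text(source):
    """Return href values found in <a> tags inside document.write() scripts."""
    hrefs = []
    for body in SCRIPT_RE.findall(source):
        # Skip scripts that do not write markup into the page.
        if "document.write" in body:
            hrefs.extend(HREF_RE.findall(body))
    return hrefs


code = '...<script>document.write("<a href="http://www...">hello</a>"); </script>...'
print(hrefs_from_script_text(code))
```

Unlike the lxml version, this breaks as soon as the attribute is single-quoted or unquoted, which is the usual reason to prefer re-parsing the script text.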

Extract text() and get an attribute from it
