Question

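I am trying to extract the text contained in HTML tags in order to build a Python defaultdict. To do this, I need to strip all the XPath and/or HTML data and get only the text, which I can do with /text(), except when the element is an href.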

How I select the items:

for item in response.xpath(
    "//*[self::h3 or self::p or self::strong or self::a[@href]]"):

What the items look like if I print them without any extraction attempt:

<Selector xpath='//*[self::h3 or self::p or self::a[@href]]' data='<h3> Some text here ...'>
<Selector xpath='//*[self::h3 or self::p or self::a[@href]]' data='<a href="https://some.url.com...'>

I want to extract "Some text here" and "https://some.url.com".

How I tried to extract the text:

item = item.xpath("./text()").get()
print(item)

Result:

Some text here
None

"None" appears where I expected to see https://some.url.com. After trying various methods suggested online, I could not get it to work.

Answer 1

Try this line to extract either the text or the @href:

item = item.xpath("./text() | ./@href").get()

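./text() returns None for the a element because the URL is stored in the href attribute, not in a text node; ./@href selects the attribute value instead.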
