Question

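I'm working on a project for my IT job that requires me to write a scraper with Scrapy/XPath to pull a fairly simple set of data from a fairly simple HTML page. Everything works the way I want, except that some italicized text does not show up (the site being scraped is for a language education program, and this particular text field contains many italicized examples).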

Here is the code I was using successfully before the italics problem came up:

rawTitles = []
for sel in response.xpath('//h2[@class="video"]'):
    rawTitle = sel.xpath('text()').extract()
    rawTitles.append(rawTitle[0])
print rawTitles

This is what I get from "print rawTitles":

[u'\n', u'\nVariations in Making ', u'\nMaking ', u'\nCommon Rice and Meat Dishes', u'\nRumens and ']

What I want is something like this:

[u'\n<i>Mjadra</i>', u'\nVariations in Making <i>Mansaf</i>', u'\nMaking <i>Maqloobeh</i>', u'\nCommon Rice and Meat Dishes', u'\nRumens and <i>Mahashi</i>']

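If the literal HTML tags can't be included in the output, I would at least like the plain text to be included. Blank space where the words should be surely can't be the best I can do.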

Does anyone know what I should try? If I haven't provided enough information, please let me know. Thanks in advance.

Edit: here is a sample of a table entry that I need to pull the information from:

<td width="25%" valign="top" align="center">
<h2 class="video"><img src="content/pl_makingfood_mjadrah.jpg"     alt="Thumbnail image from video" width="160" height="120" /><br /><br />
<i>Mjadra</i></h2>      <p class="video">Video <br />

<a href="content/pl_makingfood_mjadrah.rm" class="main">real</a>&nbsp;&nbsp;
<a href="content/pl_makingfood_mjadrah.mp4" class="main" target="_blank">mp4</a><br /><br />

Palestinian Arabic &amp; English <br />
<a href="content/pl_makingfood_mjadrah.doc" target="_blank" class="main">  doc </a>&nbsp; &nbsp; 
<a href="content/pl_makingfood_mjadrah.pdf" target="_blank" class="main">  pdf </a></p>
</td>

Answer 1

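When text() is called on an element, it only gets that element's top-level text nodes. Since you want to look inside every child element as well, use .//text():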

rawTitles = response.xpath('//h2[@class="video"]//text()').extract()

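You can then join the items of the rawTitles list with str.join(), but I would recommend looking into Item Loaders and their input and output processors; there is a Join() processor that fits this case.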
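One thing to watch with the expression above: evaluated on the whole response, it returns a single flat list of text fragments for all headings, so joining that list directly would merge every title into one string. Below is a minimal sketch of joining per heading instead. It assumes the fragments have already been collected once per <h2> element; the sample lists and the join_fragments helper are hypothetical and not taken from the actual page.

```python
# Hypothetical per-title fragment lists, shaped like what .//text() returns
# when it is evaluated once per <h2> element rather than on the whole page.
fragments_per_title = [
    ['\n', 'Mjadra'],
    ['\nVariations in Making ', 'Mansaf'],
    ['\nCommon Rice and Meat Dishes'],
]


def join_fragments(fragments):
    """Concatenate the text fragments of one title and trim outer whitespace."""
    return "".join(fragments).strip()


titles = [join_fragments(parts) for parts in fragments_per_title]
print(titles)
```

The strip() call removes the leading newline that precedes each title in the markup while leaving the spaces between words untouched.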

Alternatively, to follow Paul's suggestion in the comments, use the string() XPath function:

rawTitles = response.xpath('string(//h2[@class="video"])').extract_first()
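A caveat on the line above: in XPath 1.0, string() applied to a node-set uses only the first node in document order, so that expression yields the text of the first heading only. To get every title, the string value has to be taken per element. The sketch below illustrates the per-element idea with the standard library's ElementTree rather than Scrapy, where "".join(element.itertext()) plays the role of string(.); the two-heading fragment is a hypothetical, well-formed stand-in for the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment with two titles modelled on the
# question's markup (the real page is HTML and would need an HTML parser).
doc = ET.fromstring(
    '<div>'
    '<h2 class="video"><br />\n<i>Mjadra</i></h2>'
    '<h2 class="video"><br />\nMaking <i>Maqloobeh</i></h2>'
    '</div>'
)

# itertext() walks all descendant text in document order, so joining it
# gives the full text of each heading, including the italicized words.
titles = ["".join(h2.itertext()).strip() for h2 in doc.iter("h2")]
print(titles)
```

In a spider, the equivalent would be to loop over the selected <h2> elements and evaluate string() relative to each one.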

Answer 2

Let's look at the different extraction patterns in the scrapy shell, building a selector from your sample HTML:

>>> import scrapy
>>> t = '''<td width="25%" valign="top" align="center">
... <h2 class="video"><img src="content/pl_makingfood_mjadrah.jpg"     alt="Thumbnail image from video" width="160" height="120" /><br /><br />
... <i>Mjadra</i></h2>      <p class="video">Video <br />
... 
... <a href="content/pl_makingfood_mjadrah.rm" class="main">real</a>&nbsp;&nbsp;
... <a href="content/pl_makingfood_mjadrah.mp4" class="main" target="_blank">mp4</a><br /><br />
... 
... Palestinian Arabic &amp; English <br />
... <a href="content/pl_makingfood_mjadrah.doc" target="_blank" class="main">  doc </a>&nbsp; &nbsp; 
... <a href="content/pl_makingfood_mjadrah.pdf" target="_blank" class="main">  pdf </a></p>
... </td>'''
>>> selector = scrapy.Selector(text=t, type="html")

First, let's loop over the <h2 class="video"> elements (using a CSS selector) and extract the string representation of each heading inside the loop:

>>> for h2 in selector.css('h2.video'):
...     print(h2.xpath('string()').extract())
... 
['\nMjadra']

We lose the <i> information.

Let's try getting only the text nodes (using the text() node test):

>>> for h2 in selector.css('h2.video'):
...     print(h2.xpath('text()').extract())
... 
['\n']

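That is even worse than before: we don't get the text node inside the <i> element. (Indeed, text() selects only the direct child text nodes, not the text nodes of child elements.)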

Let's try .//, which is the shortcut for ./descendant-or-self::node()/:

>>> for h2 in selector.css('h2.video'):
...     print(h2.xpath('.//text()').extract())
... 
['\n', 'Mjadra']

That is not much better than using XPath's string().

Now let's use the node() node test, which captures both elements and text nodes:

>>> for h2 in selector.css('h2.video'):
...     print(h2.xpath('node()').extract())
... 
['<img src="content/pl_makingfood_mjadrah.jpg" alt="Thumbnail image from video" width="160" height="120">', '<br>', '<br>', '\n', '<i>Mjadra</i>']

That's better, but we now have the <img> and <br> tags, which you probably don't want. So let's select only text nodes and <i> elements:

>>> for h2 in selector.css('h2.video'):
...     print(h2.xpath('./node()[self::text() or self::i]').extract())
... 
['\n', '<i>Mjadra</i>']
>>>

You probably want a single string for each heading, so using Python's join() is an option:

>>> for h2 in selector.css('h2.video'):
...     print( "".join(h2.xpath('./node()[self::text() or self::i]').extract()) )
... 

<i>Mjadra</i>
>>>

How to include formatted text in XPath?
