Question

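OK, so I'm very close and need help getting over the finish line. I have two pieces of text that I want to grab with Scrapy. The format looks like this: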

<html>
  <div id="product-description">
    <p>
      Blah blah blah text text text reads: 874620. more text text.
      <br>
      <br>
      <b>Brand:</b>
      " Nintendo"
      <br>
      <b>Condition:</b>
      " Good"
    </p>
  </div>
</html>

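So far I can only grab the bold headings (Brand:, Condition:), not the text I actually want (Nintendo, Good). It's similar with the regex: I only grab "reads:" and not the string that comes right after it (874620). This is where I am: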

response.xpath('//div[@id="product-description"]//p/b').extract_first()
response.xpath('//div[@id="product-description"]//p').re(r'reads.*')

Answer 1

For the Nintendo and Good values, you can use the following-sibling axis:

In [1]: sel.xpath('//div[@id="product-description"]//b/following-sibling::text()[1]').extract()
Out[1]: [u'\n      " Nintendo"\n      ', u'\n      " Good"\n    ']

You can add a regex to get rid of the ugly whitespace:

In [2]: sel.xpath('//div[@id="product-description"]//b/following-sibling::text()[1]').re('"(.+)"')
Out[2]: [u' Nintendo', u' Good']
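
The result above still carries the leading space that sits inside the quotes. As a minimal sketch, assuming the two strings from Out[1] are available as a plain list (copied by hand here into raw_values, a name that is not in the original), one way to drop that space with the standard re module is to match optional whitespace outside the capture group:

```python
import re

# Strings copied by hand from Out[1]; in a spider they would come from .extract().
raw_values = ['\n      " Nintendo"\n      ', '\n      " Good"\n    ']

# \s* sits outside the group, so spaces just inside the quotes are not captured.
quoted = re.compile(r'"\s*(.+?)\s*"')

clean_values = [quoted.search(value).group(1) for value in raw_values]
print(clean_values)  # ['Nintendo', 'Good']
```

The same pattern could presumably be passed to .re() on the selector in place of '"(.+)"'.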

For your second question, about the regex, try this:

In [3]: sel.xpath('//div[@id="product-description"]//p').re(r'reads: (\d+)')
Out[3]: [u'874620']
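
To see how the capture group changes what comes back, it may help to compare the two patterns with Python's re.findall, which, as far as I know, .re() mirrors: with no group the entire match is returned, while with a group only the group's content is. The sample string is taken from the question's HTML; the variable names are illustrative.

```python
import re

text = 'Blah blah blah text text text reads: 874620. more text text.'

# No capture group: the whole match is returned, from "reads" to the end of the line.
whole_match = re.findall(r'reads.*', text)

# One capture group: only the digits inside the parentheses are returned.
digits_only = re.findall(r'reads: (\d+)', text)

print(whole_match)  # ['reads: 874620. more text text.']
print(digits_only)  # ['874620']
```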

Answer 2

You can extract the full text of the tag and then run a regex on it to pull out the relevant information.

Sample code:

import re
from scrapy.selector import Selector

html = '''<html>
  <div id="product-description">
    <p>
      Blah blah blah text text text reads: 874620. more text text.
      <br>
      <br>
      <b>Brand:</b>
      " Nintendo"
      <br>
      <b>Condition:</b>
      " Good"
    </p>
  </div>
</html>'''

extracted_text = Selector(text=html).xpath('//div[@id="product-description"]//p//text()').extract()
text = u''.join(extracted_text)

regex = r'reads:\s*(?P<reads>\d+).*Brand:\s*" (?P<brand>\w+)".*Condition:\s*" (?P<condition>\w+)"'
results = re.search(regex, text, flags=re.DOTALL).groupdict()

results['reads'] = int(results['reads'])
print(results)

This code outputs:

{'reads': 874620, 'brand': u'Nintendo', 'condition': u'Good'}

Update

Let's look at what this code does:

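The xpath

First, extracted_text gets all the text inside the <p> tag using the xpath //div[@id="product-description"]//p//text(). This xpath means:

Give me every div whose id attribute equals "product-description"
Give me all p tags inside those divs
Get the text from those p tags

Note: // instead of / means the search also covers the tag's children, their children, and so on.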
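
The difference between / and // can be tried without Scrapy. This is a rough sketch using the standard library's xml.etree.ElementTree, which supports only a small subset of XPath and needs well-formed XML, so the markup below is a simplified, hand-written stand-in for the page:

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed stand-in for the question's markup.
div = ET.fromstring(
    '<div id="product-description">'
    '<p>reads: 874620. <b>Brand:</b> " Nintendo" <b>Condition:</b> " Good"</p>'
    '</div>'
)

# "./b" looks only at direct children of the div; the b tags sit inside p.
direct_children = div.findall('./b')

# ".//b" searches children, grandchildren, and so on.
any_depth = div.findall('.//b')

print(len(direct_children), len(any_depth))  # 0 2
print([b.text for b in any_depth])           # ['Brand:', 'Condition:']
```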

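Running this xpath returns a list of strings, one for each piece of text found inside the matched tags.

After the xpath, we join this list into one big string with u''.join(extracted_text).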
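
As a rough illustration of the joining step, the list below is a hand-written approximation of what the xpath might return for the sample HTML, with one string per piece of text. It is not real Scrapy output, and the exact whitespace in a real run may differ.

```python
# Hand-written approximation of the extracted text pieces; not real Scrapy output.
extracted_text = [
    '\n      Blah blah blah text text text reads: 874620. more text text.\n      ',
    '\n      ',
    'Brand:',
    '\n      " Nintendo"\n      ',
    'Condition:',
    '\n      " Good"\n    ',
]

# Joining with an empty separator keeps every character, including newlines.
text = ''.join(extracted_text)

print(len(extracted_text))                     # 6
print('Brand:' in text, 'Condition:' in text)  # True True
```

Because "reads:", "Brand:" and "Condition:" live in different list items, a single regex can only see all three once the items are joined.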

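Once we have the full text we want, we can run a regex on it to extract the relevant data.

The regex

Let's break the regex down and see what it means:

reads:\s*(?P<reads>\d+).*Brand:\s*" (?P<brand>\w+)".*Condition:\s*" (?P<condition>\w+)"

reads:\s*(?P<reads>\d+) - find a string that starts with reads:, followed by zero or more whitespace characters (\s*), and create a match group named reads containing \d+, which means one or more digits.

.*Brand:\s*" (?P<brand>\w+)" - the above is followed by zero or more characters (any character), then the string Brand:, then again \s* for zero or more whitespace characters, followed by a double quote and a single space. After that, create another group named brand containing \w+, which means one or more alphanumeric characters.

.*Condition:\s*" (?P<condition>\w+)" - this is the same as the second part above, creating a match group for the condition.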
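
Each of the three parts can be tried on its own. The following sketch, with a hand-written sample string standing in for the joined text, runs the sub-patterns separately (without the leading .* that glues them together) and collects what each named group captures:

```python
import re

# Hand-written stand-in for the joined text.
text = 'reads: 874620. more text.\nBrand:\n " Nintendo"\nCondition:\n " Good"'

parts = [
    r'reads:\s*(?P<reads>\d+)',
    r'Brand:\s*" (?P<brand>\w+)"',
    r'Condition:\s*" (?P<condition>\w+)"',
]

captured = {}
for part in parts:
    # groupdict() maps each group name to the text it matched.
    captured.update(re.search(part, text).groupdict())

print(captured)  # {'reads': '874620', 'brand': 'Nintendo', 'condition': 'Good'}
```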

The regex is run with the DOTALL flag, which means the . character matches every character (including newlines), since our match spans multiple lines.
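
The effect of the flag is easy to check in isolation. A small sketch on a shortened two-line sample (the sample string and variable names are illustrative):

```python
import re

text = 'reads: 874620.\nBrand: " Nintendo"'
pattern = r'reads:\s*(?P<reads>\d+).*Brand:\s*" (?P<brand>\w+)"'

# By default "." stops at a newline, so ".*" cannot reach the second line.
without_flag = re.search(pattern, text)

# With DOTALL, "." also matches "\n" and the pattern can span both lines.
with_flag = re.search(pattern, text, flags=re.DOTALL)

print(without_flag)           # None
print(with_flag.groupdict())  # {'reads': '874620', 'brand': 'Nintendo'}
```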

After running the regex above, we extract the 3 matched groups and convert the reads match from a string to an int.
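
One caveat: re.search returns None when the text does not match, and calling .groupdict() on None raises AttributeError. Here is a hedged sketch of a small helper (parse_description is a hypothetical name, not part of the answer) that performs the same extraction and conversion but returns None on a miss:

```python
import re

PATTERN = re.compile(
    r'reads:\s*(?P<reads>\d+).*Brand:\s*" (?P<brand>\w+)".*Condition:\s*" (?P<condition>\w+)"',
    flags=re.DOTALL,
)

def parse_description(text):
    """Return the three fields as a dict, or None if the text does not match."""
    match = PATTERN.search(text)
    if match is None:
        return None
    fields = match.groupdict()
    fields['reads'] = int(fields['reads'])  # '874620' -> 874620
    return fields

# Hand-written stand-in for the joined text.
sample = 'reads: 874620. more text.\nBrand:\n " Nintendo"\nCondition:\n " Good"'
print(parse_description(sample))      # {'reads': 874620, 'brand': 'Nintendo', 'condition': 'Good'}
print(parse_description('no match'))  # None
```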

I uploaded this example here, with a detailed, interactive breakdown that you can try for yourself.

Grabbing text inside <p> tags between <b> tags, plus a regex question
