Please check this site:
https://www.americanberkshire.com/california.html
Everything on the page is inside <p> tags. I want to split each element out separately, but I can't find an effective way to do it.
# -*- coding: utf-8 -*-
import scrapy

class AmericanberkshireSpider(scrapy.Spider):
    name = 'americanberkshire'
    allowed_domains = ['americanberkshire.com']
    start_urls = ['https://www.americanberkshire.com/california.html']

    def parse(self, response):
        lists = response.css('p')  # grabs every <p>; the original snippet breaks off here
Answer 0 (score: 2)
Maybe if you could use XPath 2.0, you could use a regular expression directly in the selector, e.g. //p[matches(text(),'[\w\s]+\([\w+]\)','i')].
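(Note that Scrapy's selectors are backed by lxml, which implements XPath 1.0 only, so matches() is not available there; Scrapy does, however, support the EXSLT regular-expression extension. A rough equivalent using the same assumed pattern would be:

# EXSLT regex test in an XPath 1.0 selector; Scrapy registers the re:
# namespace automatically, and "i" makes the match case-insensitive
response.xpath(r'//p[re:test(text(), "[\w\s]+\([\w+]\)", "i")]')
)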
Or try iterating like this (not exact code, just an example):
import re

for sel in response.css('p'):
    txt = sel.css('::text').get()
    # skip paragraphs whose text doesn't match the pattern
    if not txt or not re.match(r'[\w\s]+\([\w+]\)', txt):
        continue
    # do whatever you need with the selector sel
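For context, here is a minimal sketch of that loop inside a complete parse method; the yielded 'text' field is an illustrative assumption, not part of the original answer:

import re
import scrapy

class AmericanberkshireSpider(scrapy.Spider):
    name = 'americanberkshire'
    allowed_domains = ['americanberkshire.com']
    start_urls = ['https://www.americanberkshire.com/california.html']

    def parse(self, response):
        for sel in response.css('p'):
            txt = sel.css('::text').get()
            # keep only paragraphs that look like "Name (CODE)"
            if not txt or not re.match(r'[\w\s]+\([\w+]\)', txt):
                continue
            yield {'text': txt.strip()}  # 'text' is an illustrative field name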
Answer 1 (score: 1)
def parse(self, response):
    for red_paragraph in response.xpath(r'//p[re:test(text(), "\([A-Z]{3,}\)")]'):
        paragraphs = [red_paragraph]
        for paragraph in red_paragraph.xpath('./following-sibling::p'):
            # stop at the first empty paragraph; it marks the end of the group
            if not paragraph.xpath('string(.)').extract_first().strip():
                break
            paragraphs.append(paragraph)
        # Each time execution reaches this point, paragraphs holds one group
        # of related paragraphs.
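To actually emit each group, one possible continuation (the 'group' field name is an assumption, not from the original answer) is to yield the collected text right after the inner loop:

        # still inside the outer for-loop, once paragraphs is complete
        yield {
            'group': [p.xpath('string(.)').get().strip() for p in paragraphs],
        }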