How can I generate a list from p tags?

Time: 2018-11-29 13:56:29

Tags: web-scraping scrapy web-crawler

Please check this site:

https://www.americanberkshire.com/california.html

Everything is inside p tags.

I want to separate out each element, but I can't find an effective way to do it.

# -*- coding: utf-8 -*-
import scrapy


class AmericanberkshireSpider(scrapy.Spider):
    name = 'americanberkshire'
    allowed_domains = ['americanberkshire.com']
    start_urls = ['https://www.americanberkshire.com/california.html']

    def parse(self, response):
        lists=

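2 answers:

Answer 0: (score: 2)

If you can use XPath 2.0, you could put a regular expression in the selector, e.g. //p[matches(text(),'[\w\s]+\(\w+\)','i')]. Note that Scrapy's selectors are built on lxml, which implements XPath 1.0, so matches() is not available there. Alternatively, iterate over the paragraphs like this (not exact code, just an example):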

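import re

for sel in response.css('p'):
    # First text node directly inside this <p>, or None if there is none
    txt = sel.css('::text').get()
    # Skip paragraphs that do not start with "some text (WORD)"
    if not txt or not re.match(r'[\w\s]+\(\w+\)', txt):
        continue
    # do what you need with selector sel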
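
To check which paragraph texts a pattern of this kind accepts, here is a minimal standalone sketch using only the standard library. It assumes the pattern is meant to allow a whole word inside the parentheses (`\w+`). The sample strings are made up for illustration and are not taken from the site, and `is_header` is a hypothetical helper name.

```python
import re

# Assumed pattern: some words or spaces, then a parenthesised word.
HEADER_RE = re.compile(r'[\w\s]+\(\w+\)')


def is_header(txt):
    """Return True if txt starts with text followed by '(WORD)'."""
    return bool(txt) and HEADER_RE.match(txt) is not None


# Hypothetical sample texts, not taken from the real page.
samples = ['Some Farm Name (ABC)', '123 Main Street', '', None]
matches = [s for s in samples if is_header(s)]
print(matches)
```

Because `re.match` only anchors at the start of the string, any trailing text after the closing parenthesis is still accepted; `re.fullmatch` would be the stricter choice if that matters.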

Answer 1: (score: 1)

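def parse(self, response):
    # re:test() is the EXSLT regex function supported by Scrapy selectors.
    # The raw string keeps the backslashes in the pattern intact.
    for red_paragraph in response.xpath(r'//p[re:test(text(), "\([A-Z]{3,}\)")]'):
        paragraphs = [red_paragraph]
        # Collect the following <p> siblings until the first empty one.
        for paragraph in red_paragraph.xpath('./following-sibling::p'):
            if not paragraph.xpath('string(.)').extract_first().strip():
                break
            paragraphs.append(paragraph)
        # In each iteration reaching here, paragraphs will contain a list of
        # related paragraphs.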
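
The grouping idea can also be tried without Scrapy. The sketch below is a simplified, standard-library analogue that works on a plain list of paragraph texts, which could be obtained with something like `[p.xpath('string(.)').get() for p in response.xpath('//p')]`. It assumes a record starts at a paragraph containing an upper-case code in parentheses and ends at the first blank paragraph. Unlike the spider code above, it also closes a record when the next header appears. `group_paragraphs` and the sample data are hypothetical.

```python
import re

# Assumption: a header paragraph contains an upper-case code of three
# or more letters in parentheses, e.g. "(ABC)".
HEADER_RE = re.compile(r'\([A-Z]{3,}\)')


def group_paragraphs(texts):
    """Group paragraph texts into records.

    A record starts at a header paragraph and runs until the first
    blank paragraph. Paragraphs outside any record are ignored.
    """
    groups = []
    current = None  # the record being built, or None between records
    for text in texts:
        stripped = text.strip()
        if HEADER_RE.search(stripped):
            # A header starts a new record, closing any open one.
            if current is not None:
                groups.append(current)
            current = [stripped]
        elif not stripped:
            # A blank paragraph ends the current record.
            if current is not None:
                groups.append(current)
                current = None
        elif current is not None:
            current.append(stripped)
    if current is not None:
        groups.append(current)
    return groups


# Hypothetical sample data, not taken from the real page.
texts = [
    'Intro',
    'Some Farm Name (ABC)',
    '123 Main Street',
    'Anytown, CA',
    '   ',
    'Other Farm (XYZ)',
    '555-0100',
    '',
]
groups = group_paragraphs(texts)
print(groups)
```

Each inner list is one record, with the header paragraph first, so it can be turned into an item or dictionary in whatever shape is needed.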