Please check this site:
https://www.americanberkshire.com/california.html
Everything on the page is inside <p> tags. I want to split each element out separately, but I can't find an effective way to do it.
# -*- coding: utf-8 -*-
import scrapy

class AmericanberkshireSpider(scrapy.Spider):
    name = 'americanberkshire'
    allowed_domains = ['americanberkshire.com']
    start_urls = ['https://www.americanberkshire.com/california.html']

    def parse(self, response):
        lists = response.css('p')  # grabs every <p>; the original snippet breaks off here
Answer 0 (score: 2)
Maybe if you could use XPath 2.0, you could use a regular expression directly in the selector, e.g. //p[matches(text(),'[\w\s]+\([\w+]\)','i')].
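(Note that Scrapy's selectors are backed by lxml, which implements XPath 1.0 only, so matches() is not available there; Scrapy does, however, support the EXSLT regular-expression extension. A rough equivalent using the same assumed pattern would be:

# EXSLT regex test in an XPath 1.0 selector; Scrapy registers the re:
# namespace automatically, and "i" makes the match case-insensitive
response.xpath(r'//p[re:test(text(), "[\w\s]+\([\w+]\)", "i")]')
)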
Or try iterating like this (not exact code, just an example):
import re

for sel in response.css('p'):
    txt = sel.css('::text').get()
    # skip paragraphs whose text doesn't match the pattern
    if not txt or not re.match(r'[\w\s]+\([\w+]\)', txt):
        continue
    # do whatever you need with the selector sel
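For context, here is a minimal sketch of that loop inside a complete parse method; the yielded 'text' field is an illustrative assumption, not part of the original answer:

import re
import scrapy

class AmericanberkshireSpider(scrapy.Spider):
    name = 'americanberkshire'
    allowed_domains = ['americanberkshire.com']
    start_urls = ['https://www.americanberkshire.com/california.html']

    def parse(self, response):
        for sel in response.css('p'):
            txt = sel.css('::text').get()
            # keep only paragraphs that look like "Name (CODE)"
            if not txt or not re.match(r'[\w\s]+\([\w+]\)', txt):
                continue
            yield {'text': txt.strip()}  # 'text' is an illustrative field name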
Answer 1 (score: 1)
def parse(self, response):
    for red_paragraph in response.xpath(r'//p[re:test(text(), "\([A-Z]{3,}\)")]'):
        paragraphs = [red_paragraph]
        for paragraph in red_paragraph.xpath('./following-sibling::p'):
            # stop at the first empty paragraph; it marks the end of the group
            if not paragraph.xpath('string(.)').extract_first().strip():
                break
            paragraphs.append(paragraph)
        # Each time execution reaches this point, paragraphs holds one group
        # of related paragraphs.
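To actually emit each group, one possible continuation (the 'group' field name is an assumption, not from the original answer) is to yield the collected text right after the inner loop:

        # still inside the outer for-loop, once paragraphs is complete
        yield {
            'group': [p.xpath('string(.)').get().strip() for p in paragraphs],
        }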