Question

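I'm using a spider to collect some information about events on a website. I'm using CSS selectors rather than XPath, but I'm having trouble removing whitespace.

I tried XPath, but I think I'm doing it wrong. I've only had success with CSS selectors.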

def parse(self, response):
    items = AiaaeventsItem()
    title = response.css('.item-list__title::text').extract()
    date = response.xpath('.//p[@class="item-list__date"]/text()').extract()

'title': ['\n', '\n 2019 AAS/AIAA Astrodynamics Specialist Conference\n ', '\n', '\n 2019 Regional Leadership Conference\n ', '\n',

{'date': ['\n 11 August - 15 August 2019\n', '\n 18 August 2019\n', '\n 19 August - 22 August 2019\n', '\n 22 August - 24 August 2019\n',

Answer 1

Just as a general note: Scrapy now recommends using .get() and .getall() rather than .extract_first() and .extract(). https://docs.scrapy.org/en/latest/topics/selectors.html#extract-and-extract-first

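The general solution for cleaning up extracted text is to use Scrapy's input and output processors. https://doc.scrapy.org/en/latest/topics/loaders.html#declaring-input-and-output-processors gives a good overview. "Cleaning data scraped using Scrapy" is a somewhat related answer.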

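That said, if you only need to clean up a small amount of extracted text and setting up full processor rules would be too much work, I would just iterate over the output and call strip() or replace() on each string. Python list comprehensions work well for this.

Example usage: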

>>> title = ['\n ', '\n 2019 AAS/AIAA Astrodynamics Specialist Conference\n ', '\n ', '\n 2019 Regional Leadership Conference\n ', '\n ']
>>> date = ['\n 11 August - 15 August 2019\n ', '\n 18 August 2019\n ', '\n 19 August - 22 August 2019\n ', '\n 22 August - 24 August 2019\n ']

# Iterate over each item in title and print it as a list
>>> [x for x in title]
['\n ', '\n 2019 AAS/AIAA Astrodynamics Specialist Conference\n ', '\n ', '\n 2019 Regional Leadership Conference\n ', '\n ']

# Iterate over each item but actually run strip() on the string.
>>> [x.strip() for x in title]
['', '2019 AAS/AIAA Astrodynamics Specialist Conference', '', '2019 Regional Leadership Conference', '']

# Same, but skip empty results
>>> [x.strip() for x in title if len(x.strip())]
['2019 AAS/AIAA Astrodynamics Specialist Conference', '2019 Regional Leadership Conference']

# Same for the date results
>>> [x.strip() for x in date if len(x.strip())]
['11 August - 15 August 2019', '18 August 2019', '19 August - 22 August 2019', '22 August - 24 August 2019']
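
If the same cleanup is needed for several fields, the comprehension can be wrapped in a small helper. This is only a sketch; `clean_texts` is a hypothetical name, not something Scrapy provides.

```python
def clean_texts(values):
    """Strip each string and drop the ones that end up empty."""
    stripped = (value.strip() for value in values)
    return [text for text in stripped if text]


title = ['\n ', '\n 2019 AAS/AIAA Astrodynamics Specialist Conference\n ',
         '\n ', '\n 2019 Regional Leadership Conference\n ', '\n ']
date = ['\n 11 August - 15 August 2019\n ', '\n 18 August 2019\n ',
        '\n 19 August - 22 August 2019\n ', '\n 22 August - 24 August 2019\n ']

print(clean_texts(title))
print(clean_texts(date))
```

In the spider's parse() method this could be applied straight to the selector output, for example `clean_texts(response.css('.item-list__title::text').extract())`.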
