Question

I'm having some trouble removing unwanted characters from data scraped with scrapy.

Sample data:

'Premium Terraced Studio', '51 weeks (09/09/2017 - 31/08/2018) Room - Lease', '', '', '', '', '', '', 'Premium Plus Terraced Studio', '51 weeks (09/09/2017 - 31/08/2018) Room - Lease', '',
'', '', '',

It was messy and full of newlines, but I cleaned it up somewhat this way:

[s.strip() for s in response.xpath('//div/div/table/tbody/tr/td/div/text()').extract()]

I also tried this, but it didn't help much:

[s.strip("''\n") for s in response.xpath('//div/div/table/tbody/tr/td/div/text()').extract()]

Any ideas would be appreciated!

Answer 1

You can use None with filter to drop the empty strings, i.e.:

some_list = list(filter(None, response.xpath('//div/div/table/tbody/tr/td/div/text()').extract()))
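
To make the difference between strip and filter concrete, here is a minimal sketch. It assumes the list returned by .extract() looks roughly like the hypothetical `raw` list below (the sample values are illustrative, not taken from the live page). The quotes in the printed data are only the list's repr, so strip("''\n") has nothing to remove; the empty entries have to be filtered out instead.

```python
# Hypothetical stand-in for the list returned by .extract()
raw = [
    "\n  Premium Terraced Studio\n",
    "51 weeks (09/09/2017 - 31/08/2018) Room - Lease",
    "\n",
    "   ",
    "",
]

# filter(None, ...) on its own drops only the truly empty string;
# whitespace-only entries such as "\n" are still truthy and survive.
only_filtered = list(filter(None, raw))

# Stripping first turns whitespace-only entries into "", which are then skipped.
cleaned = [s.strip() for s in raw if s.strip()]

print(only_filtered)
print(cleaned)
```

Stripping before filtering is a design choice: if the extracted strings never contain whitespace-only entries, filter(None, ...) alone is enough.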

Update

I usually use lxml to parse HTML. Here is an example that may help you:

import requests
from lxml import etree

my_url = 'https://www.collegiate-ac.com/uk-student-accommodation/glasgow/claremont-house/rooms-rent'
# Download the page, following any redirects
html = requests.get(my_url, allow_redirects=True).text
tree = etree.HTML(html)
# Select only the text nodes of the div elements with class "lease-type"
divs = tree.xpath("//div[@class='lease-type']/text()")
for div_text in divs:
    print(div_text)
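
Text nodes selected with a text() XPath like the one above can still carry newlines and runs of spaces inside the string, which strip() does not touch. A minimal sketch for collapsing them, assuming a hypothetical `sample` list in place of the real query result and a helper named `normalize` introduced here only for illustration:

```python
def normalize(text):
    # str.split() with no arguments splits on any run of whitespace
    # (spaces, tabs, newlines), so joining with a single space collapses it.
    return " ".join(text.split())

# Hypothetical stand-in for the strings returned by the XPath query
sample = [
    "\n  Premium Plus Terraced \n  Studio ",
    "51 weeks (09/09/2017 - 31/08/2018) Room -   Lease",
    "\n   ",
]

# Normalize every entry, then keep only the non-empty results
normalized = [t for t in (normalize(s) for s in sample) if t]

for line in normalized:
    print(line)
```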
