How to remove '\n' from scrapy output in Python

Time: 2015-07-22 05:50:21

Tags: python regex web-scraping scrapy

I am trying to output to CSV, but I realized that when scraping TripAdvisor I get a lot of carriage returns, so the array grows past 30 entries even though there are only 10 reviews, which leaves me with a lot of extra fields. Is there a way to remove the carriage returns?

The spider:

from scrapy.spiders import Spider
from scrapy.selector import Selector
from scrapy.http import Request
from scrapingtest.items import ScrapingTestingItem
from collections import OrderedDict
import json
from scrapy.selector.lxmlsel import HtmlXPathSelector
import csv
import html2text
import unicodedata


class scrapingtestspider(Spider):
    name = "scrapytesting"
    allowed_domains = ["tripadvisor.in"]
    base_uri = ["tripadvisor.in"]
    start_urls = [
        "http://www.tripadvisor.in/Hotel_Review-g297679-d736080-Reviews-Ooty_Elk_Hill_A_Sterling_Holidays_Resort-Ooty_Tamil_Nadu.html"]



    def parse(self, response):
        item = ScrapingTestingItem()
        sel = HtmlXPathSelector(response)
        converter = html2text.HTML2Text()
        sites = sel.xpath('//a[contains(text(), "Next")]/@href').extract()
##        dummy_test = [ "" for k in range(10)]

        item['reviews'] = sel.xpath('//div[@class="col2of2"]//p[@class="partial_entry"]/text()').extract()
        item['subjects'] = sel.xpath('//span[@class="noQuotes"]/text()').extract()
        item['stars'] = sel.xpath('//*[@class="rating reviewItemInline"]//img/@alt').extract()
        item['names'] = sel.xpath('//*[@class="username mo"]/span/text()').extract()
        item['location'] = sel.xpath('//*[@class="location"]/text()').extract()
        item['date'] = sel.xpath('//*[@class="ratingDate relativeDate"]/@title').extract()
        item['date'] += sel.xpath('//div[@class="col2of2"]//span[@class="ratingDate"]/text()').extract()


        startingrange = len(sel.xpath('//*[@class="ratingDate relativeDate"]/@title').extract())

        for j in range(startingrange,len(item['date'])):
            item['date'][j] = item['date'][j][9:].strip()

        for i in range(len(item['stars'])):
            item['stars'][i] = item['stars'][i][:1].strip()

        for o in range(len(item['reviews'])):
            print unicodedata.normalize('NFKD', unicode(item['reviews'][o])).encode('ascii', 'ignore')

        for y in range(len(item['subjects'])):
            item['subjects'][y] = unicodedata.normalize('NFKD', unicode(item['subjects'][y])).encode('ascii', 'ignore')

        yield item

#        print item['reviews']

        if(sites and len(sites) > 0):
            for site in sites:
                yield Request(url="http://tripadvisor.in" + site, callback=self.parse)        

Is it possible to use a regular expression to do what the for loops do and replace them? I tried replace(), but it did not do anything. Also, why does Scrapy do this?

2 Answers:

Answer 0 (score: 3):

What I usually do to trim and clean up the output is to use Input and/or Output Processors with Item Loaders - it keeps things more modular and clean:

from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose, TakeFirst

class ScrapingTestingLoader(ItemLoader):
    default_input_processor = MapCompose(unicode.strip)
    default_output_processor = TakeFirst()

Then, if you use this item loader to load the items, you will get the extracted values stripped, and as strings rather than lists. For example, if an extracted field is ["my value \n"], you will get my value as the output.
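For illustration, here is a minimal sketch of how such a loader could be wired into the parse method of the spider above (the field names and ScrapingTestingItem come from the question; the exact wiring is an assumption):

def parse(self, response):
    # Every value added below passes through MapCompose(unicode.strip)
    # on input and TakeFirst() on output, so each field ends up as a
    # single stripped string instead of a list with stray "\n" entries.
    loader = ScrapingTestingLoader(item=ScrapingTestingItem(), response=response)
    loader.add_xpath('subjects', '//span[@class="noQuotes"]/text()')
    loader.add_xpath('reviews', '//div[@class="col2of2"]//p[@class="partial_entry"]/text()')
    yield loader.load_item()

Note that TakeFirst() keeps only the first matched value per field; to keep all reviews you would override the output processor for that field (for example with Join() or an identity processor).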

Answer 1 (score: 1):

A simple solution after reading the list documentation:

while "\n" in some_list: some_list.remove("\n")
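Applied to the fields scraped above, that could look like this (a sketch; note it only removes list entries that are exactly "\n", not newlines embedded inside a review):

reviews = item['reviews']
while "\n" in reviews:
    # remove() drops one matching element per call, so loop until none are left
    reviews.remove("\n")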