Question

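I am writing a spider in Scrapy 1.0.3 that scrapes an archive of Unicode pages, extracts the text inside each page's p tags, and dumps it to a JSON file. My code looks like this: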

  def parse(self,response):
    sel = Selector(response)
    list=response.xpath('//p[@class="articletext"]/font').extract()
    list0=response.xpath('//p[@class="titletext"]').extract()
    string = ''.join(list).encode('utf-8').strip('\r\t\n')
    string0 = ''.join(list0).encode('utf-8').strip('\r\t\n')
    fullstring = string0 + string
    stringjson=json.dumps(fullstring)

    with open('output.json', 'w') as f:
        f.write(stringjson)

    try:
        json.loads(stringjson)
        print("Valid JSON")
    except ValueError:
        print("Not valid JSON")

However, I get unwanted \r, \t, and \n character sequences that I cannot remove, despite using strip(). Why doesn't it work, and how can I make it work?

Answer 1

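You will need to use one of several other ways to remove characters from a string in Python. strip() only removes the given characters (whitespace by default) from the beginning and end of the string, never from the middle. Taking an approach similar to what you are already doing: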

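# list and list0 hold whole extracted strings, so join them first, then filter character by character
string = ''.join(c for c in ''.join(list) if c not in '\r\t\n')
string0 = ''.join(c for c in ''.join(list0) if c not in '\r\t\n')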

You could also concatenate string and string0 before doing this, so that you only need to filter once.

Edit (in reply to a comment):

>>> test_string
'This\r\n \tis\t\t \t\t\t(only) a \r\ntest. \r\n\r\n\r\nCarry\t \ton'
>>> ''.join(c for c in test_string if c not in '\r\t\n')
'This is (only) a test. Carry on'
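
To make the difference concrete, here is a minimal sketch comparing strip() with the character filter on a combined string. It assumes Python 3, and the sample fragments (title_parts, body_parts) are made-up stand-ins for what extract() might return, not data from the original page.

```python
# Hypothetical stand-ins for the extracted fragments (not from the original page).
title_parts = ["\r\n\t<p class=\"titletext\">Title\r\n</p>\t"]
body_parts = ["<font>First\r\n\tline</font>", "<font>second\tline</font>\n"]

# Concatenate first so the cleanup only has to run once.
combined = "".join(title_parts) + "".join(body_parts)

# strip() only trims the ends; control characters in the middle survive.
stripped = combined.strip("\r\t\n")

# Filtering character by character removes them everywhere.
cleaned = "".join(c for c in combined if c not in "\r\t\n")

print(repr(stripped))
print(repr(cleaned))
```

Note that deleting the characters outright glues together words that were separated only by a line break or tab ("First\r\n\tline" becomes "Firstline"); replacing them with a space instead may be preferable for running text.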

Answer 2

Alternative solution: XPath's "normalize-space" function.

For example:

list=response.xpath('normalize-space(//p[@class="articletext"]/font)').extract()

instead of

list=response.xpath('//p[@class="articletext"]/font').extract()

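The normalize-space function strips leading and trailing whitespace from a string, replaces each sequence of whitespace characters with a single space, and returns the resulting string.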
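
For readers who want to see the same behaviour outside XPath, here is a rough pure-Python approximation. The helper name normalize_space is hypothetical, and the sketch assumes Python 3. Two things to keep in mind about the XPath version: it returns the text content of the node rather than its markup, and in XPath 1.0 a node-set argument is converted using only its first node in document order, so the expression above yields the text of the first matching font element only.

```python
import re

# XPath 1.0 treats only space, tab, carriage return, and line feed as whitespace.
_XPATH_WS = " \t\r\n"


def normalize_space(text):
    """Approximate XPath normalize-space() on a plain Python string."""
    # Collapse every run of whitespace into one space, then trim the ends.
    collapsed = re.sub("[" + _XPATH_WS + "]+", " ", text)
    return collapsed.strip(_XPATH_WS)


sample = "  This\r\n \tis\t\t (only) a \r\ntest.  "
print(normalize_space(sample))
```

Unlike deleting the characters, this keeps a single space between words that were separated by line breaks or tabs.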

Answer 3

What do you mean by "cannot remove"? Do you have a string containing the content? Removing them is very simple:

str = "Test\r\n\twhatever\r\n\t"
str = str.replace("\r", '')
str = str.replace("\n", '')
str = str.replace("\t", '')
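
The three replace() calls can also be collapsed into a single pass. The sketch below assumes Python 3 (the three-argument str.maketrans form is not available on Python 2 str objects) and uses the variable name text instead of str to avoid shadowing the built-in type.

```python
import re

text = "Test\r\n\twhatever\r\n\t"

# One pass with str.translate: the third argument lists characters to delete.
removed = text.translate(str.maketrans("", "", "\r\n\t"))

# Alternative: replace each run of the characters with one space,
# so words separated only by a line break are not glued together.
spaced = re.sub(r"[\r\n\t]+", " ", text).strip()

print(repr(removed))
print(repr(spaced))
```

Which variant fits depends on whether the line breaks separated words in the source text.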

Getting rid of unwanted characters in Scrapy response
