I am using .rsplit() to separate all the digits after the last comma in a string with further commas. The transformation should look like this:
Before:
,000
After:
,0,0,0
I do this with the following code:
upl = line.rsplit(",",1)[1:]
upl2 = "{}".format(",".join(list(upl[0])))
For comparison, to make sure the correct substring is being selected, I also use this statement:
upl1 = "{}".format("".join(list(upl[0])))
I then print both to make sure they are as expected. In this example I get:
up1 = ,000
up2 = ,0,0,0,
I then use a .replace() call to replace my "before" substring with my "after" one:
new_var = ''
for line in new_var.split("\n"):
    upl = line.rsplit(",",1)[1:]
    upl1 = "{}".format("".join(list(upl[0])))
    upl2 = "{}".format(",".join(list(upl[0])))
    upl2 = str(upl2)
    upl1 = str(upl1)
    new_var += line.replace(upl1, upl2) + '\n'
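In almost every instance of the parsed data, the old substring is correctly overwritten with the new one. In a few strings, however, the substring comes out as:
,0,00 when it should be ,0,0,0,
Can anyone see an obvious reason why this might be happening? I am at a bit of a loss.
Thanks
EDIT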
Here is the Scrapy code I am using to generate the data I am manipulating. The problem comes from this one line:
new_match3g += line.replace(spl1, spl2).replace(tpl1, tpl2).replace(upl1, upl2) + '\n'
The full code is:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.item import Item
from scrapy.spider import BaseSpider
from scrapy import log
from scrapy.cmdline import execute
from scrapy.utils.markup import remove_tags
import time
import re
import json
class ExampleSpider(CrawlSpider):
    name = "mrcrawl2"
    allowed_domains = ["whoscored.com"]
    start_urls = ["http://www.whoscored.com"]
    download_delay = 5
    rules = [Rule(SgmlLinkExtractor(allow=('/Seasons'),deny=('/News', '/Fixtures', '/Graphics', '/Articles', '/Live', '/Matches', '/Explanations', '/Glossary', '/Players', 'ContactUs', 'TermsOfUse', 'Jobs', 'AboutUs', 'RSS'),), follow=False, callback='parse_item')]
    def parse_item(self, response):
        sel = Selector(response)
        regex = re.compile('DataStore\.prime\(\'history\', { stageId: \d+ },\[\[.*?\]\]?\)?;', re.S)
        match2g = re.search(regex, response.body)
        if match2g is not None:
            match3g = match2g.group()
            match3g = str(match3g)
            match3g = match3g.replace("'", '').replace("'", '').replace('[', '').replace(']', '').replace('] );', '')
            match3g = re.sub("DataStore\.prime\(history, { stageId: \d+ },", '', match3g)
            match3g = match3g.replace(');', '')
            #print'-' * 170, '\n', match3g.decode('utf-8'), '-' * 170, '\n'
            new_match3g = ''
            for line in match3g.split("\n"):
                upl = line.rsplit(",",1)[1:]
                if upl:
                    upl1 = "{}".format("".join(list(upl[0])))
                    upl2 = "{}".format(",".join(list(upl[0])))
                    upl2 = str(upl2)
                    upl1 = str(upl1)
                    new_match3g += line.replace(upl1, upl2) + '\n'
                    print "UPL1 = ", upl1
                    print "UPL2 = ", upl2
            print'-' * 170, '\n', new_match3g.decode('utf-8'), '-' * 170, '\n'
            print'-' * 170, '\n', match3g.decode('utf-8'), '-' * 170, '\n'
execute(['scrapy','crawl','mrcrawl2'])
Answer 0 (score: 2)
Since you have given us an example, let's trace through it step by step:
>>> line = ',9243,46,Unterhaching,2,11333,8,13,1,133'
>>> split = line.rsplit(",",1)
>>> split
[',9243,46,Unterhaching,2,11333,8,13,1', '133']
>>> upl = split[1:]
>>> upl
['133']
>>> upl0 = upl[0]
>>> upl0
'133'
>>> upl0_list = list(upl0)
>>> upl0_list
['1', '3', '3']
>>> joined1 = "".join(upl0_list)
>>> joined1
'133'
>>> upl1 = "{}".format(joined1)
>>> upl1
'133'
>>> joined2 = ",".join(upl0_list)
>>> joined2
'1,3,3'
>>> upl2 = "{}".format(joined2)
>>> upl2
'1,3,3'
>>> upl2 = str(upl2)
>>> upl2
'1,3,3'
>>> upl1 = str(upl1)
>>> upl1
'133'
>>> r = line.replace(upl1, upl2)
>>> r
',9243,46,Unterhaching,2,11,3,33,8,13,1,1,3,3'
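Again, notice that more than half of these steps don't actually do anything. You convert strings into the same strings, then convert them into the same strings again; you turn them into lists only to join them straight back together. If you can't explain what each step is supposed to do, why is it there? Your code is meant to be instructions telling the computer to do something; giving it random instructions you don't understand won't do any good.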
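More importantly, this is not the output you described. It has a different problem from the one you described: in addition to correctly replacing the final 133 with 1,3,3, it also replaces the 133 embedded in the middle of 11333, turning that field into 11,3,33. That is exactly what you asked it to do.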
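To make that behaviour concrete, here is a minimal sketch using the same sample line. It only relies on the documented behaviour of str.replace and str.count; the variable names everywhere and first_only are made up for illustration.

```python
line = ',9243,46,Unterhaching,2,11333,8,13,1,133'

# str.replace substitutes every non-overlapping occurrence, scanning left to right.
print(line.count('133'))  # how many non-overlapping matches exist in the line

everywhere = line.replace('133', '1,3,3')
print(everywhere)

# The optional third argument caps the number of replacements, but it counts
# from the left, so it changes the embedded match and leaves the trailing one.
first_only = line.replace('133', '1,3,3', 1)
print(first_only)
```

Neither form can target "only the occurrence after the last comma", which is why replace is the wrong tool for this job.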
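So, assuming that is your actual problem rather than the one you asked about, how do you fix it?
Well, you don't. You don't want to replace every '133' substring with '1,3,3', so don't ask for that. What you want is to build a string made of everything up to the last comma, followed by the processed version of everything after the last comma. In other words: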
>>> ",".join([split[0], upl2])
',9243,46,Unterhaching,2,11333,8,13,1,1,3,3'
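The same idea can be wrapped up for reuse. This is only a sketch: the helper name spread_last_field is hypothetical, and it uses str.rpartition instead of rsplit so that a line without any comma is passed through unchanged rather than raising an error.

```python
def spread_last_field(line):
    """Return line with the characters after its last comma separated by commas."""
    head, sep, tail = line.rpartition(',')
    if not sep:
        # No comma at all: there is no "last field" to process, so keep the line.
        return line
    # Rebuild the line: untouched head, the last comma, then the spread-out tail.
    return head + sep + ','.join(tail)


text = ",1,000\n,2,133"
fixed = "\n".join(spread_last_field(l) for l in text.split("\n"))
print(fixed)
```

Because the head is never searched or modified, digits that happen to repeat earlier in the line (such as the 133 inside 11333) cannot be affected.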
Answer 1 (score: 1)
I would do it like this:
>>> ",000".replace("", ",")[2:]
',0,0,0,'
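For readers wondering why this works, the sketch below breaks the one-liner into steps. Replacing the empty string inserts the separator at every position, including both ends, so the slice is what trims the surplus leading commas. The names s and stepped are illustrative only.

```python
s = ",000"

# Replacing "" puts a comma before every character and one more at the very end.
stepped = s.replace("", ",")
print(repr(stepped))        # ',,,0,0,0,'

# Dropping the first two characters leaves a single leading comma.
print(repr(stepped[2:]))    # ',0,0,0,'

# If the trailing comma is unwanted, slice it off as well.
print(repr(stepped[2:-1]))  # ',0,0,0'
```

Note that this operates on the isolated ",000" fragment; it still has to be attached to the untouched part of the line rather than fed back into replace.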