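Replacing an underscore-delimited substring in the middle of a comma-separated string

Date: 2013-09-16 14:16:39

Tags: python regex string replace

I have a file containing many lines like the following: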

 'AMS_Investigation|txtt.co_BigtittedBlondOtherNight_1371078139195_+14155186442', {'cf:rv': '0'}

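I want to replace 1371078139195 (in this case) with another number. The value I want to replace is always in the first comma-delimited word, and it is always the second-to-last underscore-delimited value in that word. Here is how I am doing it. It works, but it seems unpythonic and clunky.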

>>> line="'AMS_Investigation|txtt.co_BigtittedBlondOtherNight_1371078139195_+14155186442', {'cf:rv': '0'}"
>>> l1=",".join(line.split(",")[1:])
>>> print l1
 {'cf:rv': '0'}
>>> l2=line.split(",")[0]
>>> print l2
'AMS_Investigation|txtt.co_BigtittedBlondOtherNight_1371078139195_+14155186442'
>>> print "_".join(l2.split('_')[:-2])
'AMS_Investigation|txtt.co_BigtittedBlondOtherNight
>>>
>>> print "_".join(l2.split('_')[:-2])+ "_1234567_"+(l2.split('_')[-1])
'AMS_Investigation|txtt.co_BigtittedBlondOtherNight_1234567_+14155186442'
>>> print "_".join(l2.split('_')[:-2])+ "_1234567_"+(l2.split('_')[-1]) + "," + l1
'AMS_Investigation|txtt.co_BigtittedBlondOtherNight_1234567_+14155186442', {'cf:rv': '0'}
>>>

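Is there a simpler way (perhaps using a regex) to replace the value? I can't imagine this is the best way.

I've received a few answers, and I have to stress that it is the second-to-last underscore-delimited value. The following are all valid strings: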

line = "'AMS_Investigation|txtt.co_23456_BigtittedBlondOtherNight_1371078139195_+14155186442', {'cf:rv': '0'}"
line = "'AMS_Investigation|txtt.co_23456_BigtittedBlondOtherNight_1371078139195_14155186442', {'cf:rv': '0'}"
line = "'AMS_Investigation|txtt.co_1371078139195_BigtittedBlondOtherNight_1371078139195_1371078139195', {'cf:rv': '0'}"

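In the examples above, there is a run of digits in the string that does not come after the second-to-last underscore. The last part may or may not be all digits (it could be +14155186442 or 14155186442). Sorry I didn't mention this above.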

5 answers:

Answer 0 (score: 4)

Using a regular expression:

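import re

# s is one input line; new_number is whatever function computes the replacement
m = re.match("([^,]*_)([+]?[0-9]+)(_.*)", s)
if m:
    before = m.group(1)   # everything up to and including the second-to-last underscore
    number = m.group(2)   # the value to replace
    after = m.group(3)    # the rest of the line
    s = before + new_number(number) + after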

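Meaning:

  • [^,]*_ = as many characters as you like, but no commas, followed by an underscore
  • [+]?[0-9]+ = digits, optionally preceded by a +
  • _.* = an underscore, followed by anything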

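This works because regexp matching is "greedy" by default, so [^,]* first consumes all the underscores and then backs off only as far as it has to, ending at the second-to-last one, for the match to succeed.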
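To see the greedy behaviour on a tricky input, here is a small sketch (Python 3 syntax) that runs the same pattern on the third sample string from the question, where the same number appears three times. The variable names are only illustrative.

```python
import re

# Third sample line from the question: the same number appears three times.
s = "'AMS_Investigation|txtt.co_1371078139195_BigtittedBlondOtherNight_1371078139195_1371078139195', {'cf:rv': '0'}"

m = re.match(r"([^,]*_)([+]?[0-9]+)(_.*)", s)

# Group 1 takes as much as it can and only gives back what it must,
# so it ends at the second-to-last underscore before the first comma.
print(m.group(1))  # 'AMS_Investigation|txtt.co_1371078139195_BigtittedBlondOtherNight_
print(m.group(2))  # 1371078139195
print(m.group(3))  # _1371078139195', {'cf:rv': '0'}
```

Only the middle occurrence ends up in group 2, because it is the last number before the comma that is still followed by an underscore.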

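If, for example, you needed the third-to-last underscore-delimited value instead of the second-to-last, the expression could be changed to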

m = re.match("([^,]*_)([+]?[0-9]+)(_[^,]*_.*)", s)

thus requiring at least two underscores after the number and before the comma.
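A quick side-by-side comparison of the two patterns, as a sketch in Python 3 syntax. The input string here is made up purely for illustration and is not from the question. Note that both patterns only match numeric values, so they pick the last number that has the required count of underscores after it.

```python
import re

# Hypothetical input, invented for this illustration only.
s = "a_111_222_333, rest"

second_last = re.match(r"([^,]*_)([+]?[0-9]+)(_.*)", s)
third_last = re.match(r"([^,]*_)([+]?[0-9]+)(_[^,]*_.*)", s)

print(second_last.group(2))  # 222
print(third_last.group(2))   # 111
```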

Answer 1 (score: 3)

A non-regex solution:

>>> strs = " 'AMS_Investigation|txtt.co_BigtittedBlondOtherNight_1371078139195_+14155186442', {'cf:rv': '0'}"
>>> first, sep, rest = strs.partition(',')
>>> lis = first.rsplit('_', 2)
>>> lis[1] = "1111111"
>>> "_".join(lis) + sep + rest
" 'AMS_Investigation|txtt.co_BigtittedBlondOtherNight_1111111_+14155186442', {'cf:rv': '0'}"

Function:

>>> def solve(strs, rep):
...     first, sep, rest = strs.partition(',')
...     lis = first.rsplit('_', 2)
...     lis[1] = rep
...     return "_".join(lis) + sep + rest
...
>>> solve(" 'AMS_Investigation|txtt.co_BigtittedBlondOtherNight_1371078139195_+14155186442', {'cf:rv': '0'}", "1111")
" 'AMS_Investigation|txtt.co_BigtittedBlondOtherNight_1111_+14155186442', {'cf:rv': '0'}"
>>> solve("'AMS_Investigation|txtt.co_23456_BigtittedBlondOtherNight_1371078139195_14155186442', {'cf:rv': '0'}", "2222")
"'AMS_Investigation|txtt.co_23456_BigtittedBlondOtherNight_2222_14155186442', {'cf:rv': '0'}"
>>> solve("'AMS_Investigation|txtt.co_1371078139195_BigtittedBlondOtherNight_1371078139195_1371078139195', {'cf:rv': '0'}", "2222")
"'AMS_Investigation|txtt.co_1371078139195_BigtittedBlondOtherNight_2222_1371078139195', {'cf:rv': '0'}"

Answer 2 (score: 1)

Like this?

>>> line = "'AMS_Investigation|txtt.co_BigtittedBlondOtherNight_1371078139195_+14155186442', {'cf:rv': '0'}"
>>> re.subn('_(\d+)_', '_mynewnumber_', line, count=1) 
("'AMS_Investigation|txtt.co_BigtittedBlondOtherNight_mynewnumber_+14155186442', {'cf:rv': '0'}",
1)
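One caveat worth checking against the question's later clarification: with count=1 this pattern replaces the first `_digits_` run it finds, which is not necessarily the second-to-last field. A minimal sketch in Python 3 syntax, using two sample lines from the question (the `_NEW_` replacement and the variable names are placeholders):

```python
import re

pattern = r'_(\d+)_'

# Original sample: the only _digits_ run is the target, so this works.
line1 = "'AMS_Investigation|txtt.co_BigtittedBlondOtherNight_1371078139195_+14155186442', {'cf:rv': '0'}"
out1 = re.sub(pattern, '_NEW_', line1, count=1)
print(out1)

# Sample from the question's update: 23456 comes first and gets replaced instead.
line2 = "'AMS_Investigation|txtt.co_23456_BigtittedBlondOtherNight_1371078139195_+14155186442', {'cf:rv': '0'}"
out2 = re.sub(pattern, '_NEW_', line2, count=1)
print(out2)
```

In the second case 1371078139195 is left untouched, so this approach appears to fit only inputs shaped like the first example.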

Answer 3 (score: 0)

import re

r = re.compile('([^,]*_)(\d+)(?=_[^_,]+,)(_.*)')

for line in ("'AMS_Investigation|txtt.co_BigtittedBlondOtherNight_1371078139195_+14155186442', {'cf:rv': '0'}",
             "'AMS_Investigation|txtt.co_23456_BigtittedBlondOtherNight_1371078139195_+14155186442', {'cf:rv': '0'}"):
    print line
    print r.sub('\\1ABCDEFG\\3',line)
    print r.sub('\g<1>1234567\\3',line)

Result

'AMS_Investigation|txtt.co_BigtittedBlondOtherNight_1371078139195_+14155186442', {'cf:rv': '0'}
'AMS_Investigation|txtt.co_BigtittedBlondOtherNight_ABCDEFG_+14155186442', {'cf:rv': '0'}
'AMS_Investigation|txtt.co_BigtittedBlondOtherNight_1234567_+14155186442', {'cf:rv': '0'}

'AMS_Investigation|txtt.co_23456_BigtittedBlondOtherNight_1371078139195_+14155186442', {'cf:rv': '0'}
'AMS_Investigation|txtt.co_23456_BigtittedBlondOtherNight_ABCDEFG_+14155186442', {'cf:rv': '0'}
'AMS_Investigation|txtt.co_23456_BigtittedBlondOtherNight_1234567_+14155186442', {'cf:rv': '0'}

\g<1> means "group 1". See the docs:

  

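In addition to character escapes and backreferences as described above, \g<name> will use the substring matched by the group named name, as defined by the (?P<name>...) syntax. \g<number> uses the corresponding group number; \g<2> is therefore equivalent to \2, but isn't ambiguous in a replacement such as \g<2>0. \20 would be interpreted as a reference to group 20, not a reference to group 2 followed by the literal character '0'. The backreference \g<0> substitutes in the entire substring matched by the RE.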
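The ambiguity the docs describe is easy to reproduce. A small sketch in Python 3 syntax, with a made-up pattern and input chosen only to show the difference:

```python
import re

# Illustrative pattern with exactly two groups.
pattern = re.compile(r'(\d+)_(\d+)')
text = '12_34'

# \g<2>0 is unambiguous: group 2 followed by a literal '0'.
good = pattern.sub(r'\g<2>0', text)
print(good)  # 340

# \20 is read as "group 20", which this pattern does not have.
raised = False
try:
    pattern.sub(r'\20', text)
except re.error as exc:
    raised = True
    print('error:', exc)
```

That is why the answer writes \g<1>1234567 rather than \11234567 when the replacement itself starts with a digit.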

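Answer 4 (score: 0)

Not as sophisticated as the regex solutions, but relatively simple to code, understand, debug, and change in the future. It makes no assumptions about which characters make up a "word", other than the delimiters.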

def replace_term(line, replacement):
    csep = line.split(',')
    usep = csep[0].split('_')
    return ','.join(['_'.join(usep[:-2] + [replacement] + usep[-1:])] + csep[1:])

lines = ["'AMS_Investigation|txtt.co_23456_BigtittedBlondOtherNight_1371078139195_+14155186442', {'cf:rv': '0'}",
         "'AMS_Investigation|txtt.co_23456_BigtittedBlondOtherNight_1371078139195_14155186442', {'cf:rv': '0'}",
         "'AMS_Investigation|txtt.co_1371078139195_BigtittedBlondOtherNight_1371078139195_1371078139195', {'cf:rv': '0'}"]

for line in lines:
    print replace_term(line, 'XXX')

Output:

'AMS_Investigation|txtt.co_23456_BigtittedBlondOtherNight_XXX_+14155186442', {'cf:rv': '0'}
'AMS_Investigation|txtt.co_23456_BigtittedBlondOtherNight_XXX_14155186442', {'cf:rv': '0'}
'AMS_Investigation|txtt.co_1371078139195_BigtittedBlondOtherNight_XXX_1371078139195', {'cf:rv': '0'}