Question

I have a dataset that looks like this:

Male    Name=Tony;  
Female  Name=Alice.1; 
Female  Name=Alice.2;
Male    Name=Ben; 
Male    Name=Shankar; 
Male    Name=Bala; 
Female  Name=Nina; 
###
Female  Name=Alex.1; 
Female  Name=Alex.2;
Male    Name=James; 
Male    Name=Graham; 
Female  Name=Smith;  
###
Female  Name=Xing;
Female  Name=Flora;
Male    Name=Steve.1;
Male    Name=Steve.2; 
Female  Name=Zac;  
###

I want to change the list so that it looks like this:

Male    Name=Class_1;
Female  Name=Class_1.1;
Female  Name=Class_1.2;
Male    Name=Class_1;
Male    Name=Class_1;
Male    Name=Class_1; 
Female  Name=Class_1;
###
Female  Name=Class_2.1; 
Female  Name=Class_2.2; 
Male    Name=Class_2; 
Male    Name=Class_2; 
Female  Name=Class_2;  
###
Female  Name=Class_3; 
Female  Name=Class_3; 
Male    Name=Class_3.1; 
Male    Name=Class_3.2; 
Female  Name=Class_3;
###

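Every name has to be changed to the class it belongs to. I noticed that in the dataset each new class in the list is marked by '###'. So I can split the dataset into blocks at '###' and count the occurrences of '###', then use a regular expression to find each name and replace it with the current '###' count.

My code looks like this: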

blocks = [b.strip() for b in open('/file', 'r').readlines()]
pattern = r'Name=(.*?)[;/]'
prefix = 'Class_'
triple_hash_count = 1

for line in blocks:
    match = re.findall(pattern, line)
    print match

for line in blocks:
    if line == '###':
        triple_hash_count += 1
        print line 
    else: 
        print(line.replace(match, prefix + str(triple_hash_count)))

This doesn't seem to work: nothing gets replaced.

Answer 1

When I ran the code you provided, I got the following traceback:

print(line.replace(match, prefix + str(triple_hash_count))) 
TypeError: Can't convert 'list' object to str implicitly

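The error occurs because type(match) evaluates to list. When I inspected this list in PDB, it was empty. The reason is that match is assigned in the first for loop, so by the time the second loop runs it only holds the result for the last line of the file, ###, which contains no Name= field. So let's combine the two loops: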

for line in blocks:
    match = re.findall(pattern, line)
    print(match)

    if line == '###':
        triple_hash_count += 1
        print(line) 
    else: 
        print(line.replace(match, prefix + str(triple_hash_count)))

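Now match actually holds something, but there is still a problem: re.findall returns a list of strings, whereas str.replace(...) expects a single string as its first argument.

You could cheat and change the offending line to print(line.replace(match[0], prefix + str(triple_hash_count))), but that assumes you are certain to find a regex match on every line that isn't ###. A more resilient approach is to check whether there is a match before trying to call str.replace().

The final code looks like this: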

for line in blocks:
    match = re.findall(pattern, line)
    print(match)

    if line == '###':
        triple_hash_count += 1
        print(line) 
    else:
        if match: 
            print(line.replace(match[0], prefix + str(triple_hash_count)))
        else:
            print(line)
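
One caveat, assuming the pattern from the question is kept: for a line such as Name=Alice.1; the captured text is Alice.1, so the .1 suffix gets replaced along with the name. Here is a minimal sketch that keeps the suffix by using re.sub with a second capture group. The function name relabel_lines and the pattern NAME_RX are illustrative and not part of the original code.

```python
import re

# Group 1 is the name, group 2 an optional ".<digits>" suffix that is kept.
NAME_RX = re.compile(r'Name=([^.;]+)(\.\d+)?;')

def relabel_lines(lines, prefix='Class_'):
    """Return a new list where each name is replaced by its class label."""
    count = 1
    out = []
    for line in lines:
        if line == '###':
            # A separator closes the current class; the next block gets count + 1.
            count += 1
            out.append(line)
        else:
            label = prefix + str(count)
            # Rebuild the field, re-attaching the suffix only if one was present.
            out.append(NAME_RX.sub(
                lambda m: 'Name=' + label + (m.group(2) or '') + ';', line))
    return out
```

Lines without a Name= field pass through unchanged, because re.sub returns the input when nothing matches.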

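Two more things:

On line 11 you mistyped the variable name. It is triple_hash_count, not hash_count.
This code does not actually change the text file that is read as input on line 1. You need to write the result of line.replace(match[0], prefix + str(triple_hash_count)) back to a file, not just print it.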
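
To illustrate the second point, here is a minimal sketch that writes the result to a file instead of printing it. The function name rewrite_file and both path arguments are hypothetical. It reuses the pattern from the question, so it replaces everything that pattern captures.

```python
import re

def rewrite_file(src_path, dst_path, prefix='Class_'):
    """Copy src_path to dst_path, replacing each name with its class label."""
    pattern = r'Name=(.*?)[;/]'  # pattern taken from the question
    count = 1
    with open(src_path) as src, open(dst_path, 'w') as dst:
        for raw in src:
            line = raw.strip()
            if line == '###':
                # Separator line: the following block belongs to the next class.
                count += 1
            else:
                match = re.findall(pattern, line)
                if match:
                    line = line.replace(match[0], prefix + str(count))
            dst.write(line + '\n')
```

Writing to a separate output path avoids truncating the input while it is still being read; the output can be moved over the original afterwards if that is what you want.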

Answer 2

The problem stems from using a second loop (and a misnamed variable). This will work:

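import re

# Read the file and strip surrounding whitespace from every line.
with open('/file', 'r') as f:
    blocks = [b.strip() for b in f]
# Capture the name up to the first dot, digit or semicolon, so a ".1" suffix is left alone.
pattern = r'Name=([^\.\d;]*)'
prefix = 'Class_'
triple_hash_count = 1

for line in blocks:

    if line == '###':
        # A separator line means the next block belongs to the next class.
        triple_hash_count += 1
        print(line)
    else:
        match = re.findall(pattern, line)
        print(line.replace(match[0], prefix + str(triple_hash_count)))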

Answer 3

Although you already have an answer, you can do it with regular expressions alone (it could even be a one-liner, but that isn't very readable):

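import re
hashrx = re.compile(r'^###$', re.MULTILINE)
namerx = re.compile(r'Name=\w+(\.\d+)?;')

# `string` is assumed to hold the full text of the input file.
# The replacement ends with ';' because the pattern consumes the semicolon.
new_string = '###'.join([namerx.sub(r"Name=Class_{}\1;".format(idx + 1), part) 
                for idx,part in enumerate(hashrx.split(string))])
print(new_string)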

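What it does:

First, it looks for ### alone on a line, anchored with ^ and $ in MULTILINE mode.
Second, it looks for an optional number after the name and captures it into group 1 (the group is optional because not every name has a number).
Third, it splits your string on ### and iterates over the parts with enumerate(), which provides the counter for the number to insert.
Finally, it joins the resulting list with ### again.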
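
The four steps can be tried on a small sample. This is only a sketch, assuming the whole input has been read into one string; the function name relabel_text and the sample text are made up for illustration. The replacement template ends with ';' because the pattern consumes the semicolon. The sketch also relies on Python 3.5 or later, where a backreference to a group that did not take part in the match is replaced with an empty string.

```python
import re

hashrx = re.compile(r'^###$', re.MULTILINE)
namerx = re.compile(r'Name=\w+(\.\d+)?;')

def relabel_text(text):
    """Replace every name with Class_<n>, where n is the 1-based block index."""
    # Step 3: split on the separator lines and number the parts.
    parts = hashrx.split(text)
    # Step 4: substitute inside each part, then join with the separator again.
    return '###'.join(
        namerx.sub(r'Name=Class_{}\1;'.format(idx + 1), part)
        for idx, part in enumerate(parts))

sample = (
    'Male    Name=Tony;\n'
    'Female  Name=Alice.1;\n'
    '###\n'
    'Female  Name=Alex.2;\n'
    '###\n'
)
print(relabel_text(sample))
```

Because the text is split and re-joined on the same separator, everything outside the Name= fields, including whitespace, comes back unchanged.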

As a one-liner (though not advisable):

new_string = '###'.join(
                [re.sub(r'Name=\w+(\.\d+)?;', r"Name=Class_{}\1;".format(idx + 1), part) 
                for idx, part in enumerate(re.split(r'^###$', string, flags=re.MULTILINE))])

How to replace a pattern using a regular expression in Python?