Question

I have a list of Chinese dictionary entries (based on cc-cedict), one per line, that mix Chinese and Latin characters in the following format:

(source.txt)

traditional_chars simplified_chars,pinyin,definition

山牆 山墙,shan1 qiang2,gable

B型超聲 B型超声,B xing2 chao1 sheng1,type-B ultrasound

I want to add a comma between the traditional and simplified characters:

(desired result)

山牆,山墙,shan1 qiang2,gable

B型超聲,B型超声,B xing2 chao1 sheng1,type-B ultrasound

After some experimenting on regex101, I came up with this pattern:

[A-z]*[\u4300-\u9fff]+(\s)[A-z]*[\u4300-\u9fff]+,

I tried to apply this pattern in Python with the following code:

import re
sourcepath = 'sourcefile.txt'
destpath = 'result.txt'
pattern = '[A-z]*[\u4300-\u9fff]+(\s)[A-z]*[\u4300-\u9fff]+,'

source = open(sourcepath, 'r').read()
dest = open(destpath, 'w')
result = re.sub(pattern, ',', source)
dest.write(result)
dest.close()

But when I open result.txt, the result is not what I expected:

,shan1 qiang2,gable

,B xing2 chao1 sheng1,type-B ultrasound

I also tried the regex module with this pattern:

[A-z]*\p{Han}(\s)[A-z]*\p{Han}

But the result was the same.

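I thought that putting the \s character in parentheses would make it a capture group, so that only that whitespace would be replaced. But it looks like the Chinese characters are replaced as well. Did I make a mistake in the regex, in the code, or in both? How should I change it to get the desired result?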

Answer 1

If you have an odd number of Chinese "words", your pattern needs to handle overlapping matches. Use a lookahead:

re.sub(r'(?i)[A-Z]*[\u4300-\u9fff]+(?=\s+[A-Z]*[\u4300-\u9fff]+)', r'\g<0>,', source)
                                   ^^^                         ^

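Or emulate an atomic group with a capturing positive lookahead combined with a backreference in the consuming part of the pattern, and check that a comma is not already present: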

re.sub(r'(?i)[A-Z]*(?=([\u4300-\u9fff]+))\1(?!,)', r'\g<0>,', source) 
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

See the regex demo (and demo 2). Ignore the \x{} notation there: I used the PHP option, so it is for demonstration only.

See the IDEONE Python 3 demo:

import re
p = re.compile(r'[A-Z]*(?=([\u4300-\u9fff]+))\1(?!,)', re.IGNORECASE | re.U)
test_str = "山牆 山墙,shan1 qiang2,gable\nB型超聲 B型超声, B xing2 chao1 sheng1,type-B ultrasound"
result = p.sub(r"\g<0>,", test_str)
print(result)
# => 山牆, 山墙,shan1 qiang2,gable
# => B型超聲, B型超声, B xing2 chao1 sheng1,type-B ultrasound
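To see why overlapping matches matter, here is a minimal sketch comparing a consuming pattern with a lookahead. The three-word sample string and the narrower range `\u4e00-\u9fff` (the basic CJK Unified Ideographs block) are assumptions made for illustration, not part of the original data.

```python
import re

# Assumption: the basic CJK Unified Ideographs block is enough for this data.
HAN = r'[\u4e00-\u9fff]+'

# Hypothetical sample with three whitespace-separated Han "words".
sample = '山 牆 墙'

# Consuming pattern: the second word is consumed by the first match,
# so it cannot start another match and the last gap is left untouched.
consuming = re.sub(rf'({HAN})\s+({HAN})', r'\1,\2', sample)

# Lookahead: the following word is only inspected, not consumed,
# so every word that is followed by another word gets its comma.
lookahead = re.sub(rf'({HAN})\s+(?={HAN})', r'\1,', sample)

print(consuming)  # 山,牆 墙
print(lookahead)  # 山,牆,墙
```

With only two words per entry, as in the question's data, both variants give the same result. The difference only shows up when a word has to take part in two adjacent matches.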

Answer 2

"I thought that putting the \s character in parentheses would make it a capture group, so that only that whitespace would be replaced."

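That is not how capture groups work. Everything that is matched still gets replaced, but capture groups let you refer to what was matched from within the replacement.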
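A minimal sketch of the difference, using the first sample line. The range `\u4e00-\u9fff` and the variable names are assumptions chosen for illustration.

```python
import re

line = '山牆 山墙,shan1 qiang2,gable'

# Assumption: the basic CJK Unified Ideographs block covers the headwords.
han = r'[\u4e00-\u9fff]+'

# The parentheses around \s only capture the whitespace; re.sub still
# replaces the entire match (both words, the space and the comma).
whole_replaced = re.sub(rf'{han}(\s){han},', ',', line)

# Capturing the two words and referring to them in the replacement
# puts them back, with a comma in place of the whitespace.
kept = re.sub(rf'({han})\s({han})', r'\g<1>,\g<2>', line)

print(whole_replaced)  # ,shan1 qiang2,gable
print(kept)            # 山牆,山墙,shan1 qiang2,gable
```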

I would change two lines of your script:

pattern = r'(?i)([a-z]*[\u4300-\u9fff]+)\s([a-z]*[\u4300-\u9fff]+)'

and

result = re.sub(pattern, r'\g<1>,\g<2>', source)

Answer 3

Tested with the sample code on:

Python 3.5

Regex explanation

# Keep the optional Latin prefix inside group 2 so it is not dropped from the output.
result = re.sub(r"([\u4e00-\u9fff]+)\s+([a-z]*[\u4e00-\u9fff]+)", r"\1,\2", subject, flags=re.IGNORECASE)
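For completeness, an end-to-end sketch of the whole conversion under a few assumptions: the files are UTF-8 encoded, the two headwords always sit at the start of a line, and they are separated by spaces or tabs. The names `add_commas` and `convert_file` are hypothetical helpers, not part of any library.

```python
import re
from pathlib import Path

# Assumptions: headwords start each line, may carry a Latin prefix such as "B",
# and fall in the basic CJK Unified Ideographs block (\u4e00-\u9fff).
HEADWORDS = re.compile(
    r'^([a-z]*[\u4e00-\u9fff]+)[ \t]+([a-z]*[\u4e00-\u9fff]+)',
    re.IGNORECASE | re.MULTILINE,
)


def add_commas(text: str) -> str:
    """Insert a comma between the traditional and simplified headwords."""
    return HEADWORDS.sub(r'\1,\2', text)


def convert_file(source_path: str, dest_path: str) -> None:
    """Read a UTF-8 dictionary file, fix each entry, and write the result."""
    text = Path(source_path).read_text(encoding='utf-8')
    Path(dest_path).write_text(add_commas(text), encoding='utf-8')


sample = (
    '山牆 山墙,shan1 qiang2,gable\n'
    'B型超聲 B型超声,B xing2 chao1 sheng1,type-B ultrasound'
)
print(add_commas(sample))
```

Anchoring the pattern at the start of each line is a design choice: it leaves any Chinese text that might appear inside a definition untouched. Calling `convert_file('sourcefile.txt', 'result.txt')` would then apply the same substitution to the files from the question.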

Python regex unexpectedly replaces Chinese characters