Question

I have a series of strings that are mostly English but contain some phrases in Chinese characters. Here are two examples:

s1 = "You say: 你好. I say: 再見"
s2 = "答案, my friend, 在風在吹"

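I'm trying to find each block of Chinese, apply a function that translates it (I already have a way to do the translation), and then put the translated text back into the string in place of the original block. So the output would look like this: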

o1 = "You say: hello. I say: goodbye"
o2 = "The answer, my friend, is blowing in the wind"

I can find the Chinese characters easily enough with this:

utf_line = s1.decode('utf-8') 
re.findall(ur'[\u4e00-\u9fff]+',utf_line)

...but I end up with a list of all the Chinese characters and no way of telling where each phrase begins and ends.

Answer 1

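You can use re.sub() in Python to replace regex matches in place. The replacement argument can be a function, which re.sub() calls with the match object for each match.

Try this:

print(re.sub(u'[\u4e00-\u9fff]+', lambda m: translate(m.group(0)), utf_line))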

Answer 2

You can't get indexes with re.findall(). Use re.finditer() instead, and refer to m.group(), m.start() and m.end().
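As a minimal sketch of the finditer() approach (Python 3 assumed; the helper name find_chinese_blocks is made up for illustration), each match object reports the text of a block together with where it starts and ends:

```python
import re

# Same character range as in the question (CJK Unified Ideographs)
CHINESE_RE = re.compile(r'[\u4e00-\u9fff]+')

def find_chinese_blocks(text):
    """Return (start, end, block) for every run of Chinese characters."""
    return [(m.start(), m.end(), m.group()) for m in CHINESE_RE.finditer(text)]

s1 = "You say: 你好. I say: 再見"
for start, end, block in find_chinese_blocks(s1):
    print(start, end, block)
```

For s1 this reports the blocks at 9..11 and 20..22, which is the information findall() throws away.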

However, for your particular case, it seems more practical to have re.sub() call a function.

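If repl is a function, it is called for every non-overlapping occurrence of pattern. The function takes a single match object argument, and returns the replacement string.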

Code:

import re

s = "You say: 你好. I say: 再見. 答案, my friend, 在風在吹"

# Stand-in lookup table; named `translations` so it does not shadow the built-in dict
translations = {"你好": "hello",
                "再見": "goodbye",
                "答案": "The answer",
                "在風在吹": "is blowing in the wind",
               }

def translate(m):
    # m is the match object for one run of Chinese characters
    block = m.group()
    # Do your translation here

    # this is just an example
    return translations.get(block, "{unknown}")


# Python 3 strings are already Unicode, so no decode()/encode() round trip is needed.
# Note that the fourth positional argument of re.sub() is count, not flags.
translated = re.sub(r'[\u4e00-\u9fff]+', translate, s)

print(translated)

Output:

You say: hello. I say: goodbye. The answer, my friend, is blowing in the wind


Answer 3

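One possible solution is to capture everything, but in separate capture groups, so that you can tell afterwards whether each piece is Chinese or not.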

ret = re.findall(u'([\u4e00-\u9fff]+)|([^\u4e00-\u9fff]+)', utf_line)
result = []
for match in ret:
    if match[0]:
        result.append(translate(match[0]))
    else:
        result.append(match[1])

print(''.join(result))
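A closely related variation, not the exact method above, is re.split() with a capturing group: the Chinese blocks are kept in the result, so the pieces alternate between other text and Chinese. This is only a sketch assuming Python 3; translate here is a hypothetical stand-in that wraps the block in angle brackets, and translate_blocks is an illustrative name.

```python
import re

# One capturing group, so re.split() keeps the matched blocks in its result
CHINESE_RE = re.compile(r'([\u4e00-\u9fff]+)')

def translate(chinese_text):
    """Hypothetical stand-in for a real translation function."""
    return '<' + chinese_text + '>'

def translate_blocks(text):
    # Even positions hold the other text, odd positions hold the Chinese blocks
    pieces = CHINESE_RE.split(text)
    for i in range(1, len(pieces), 2):
        pieces[i] = translate(pieces[i])
    return ''.join(pieces)

print(translate_blocks("答案, my friend, 在風在吹"))
```

Because the positions alone tell you which pieces are Chinese, there is no need to test each group the way the findall() loop does.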

Answer 4

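Regular expression Match objects give you the start and end index of each match. So instead of findall, do the search yourself and record the indexes. You can then translate each extent and replace it in the string using the known indexes of the phrase.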

import re

_scan_chinese_re = re.compile(r'[\u4e00-\u9fff]+')

s1 = "You say: 你好. I say: 再見"
s2 = "答案, my friend, 在風在吹"

def translator(chinese_text):
    """My no good translator"""
    return ' '.join('??' for _ in chinese_text)

def scanner(text):
    """Scan text string, translate chinese and return copy"""
    print('----> text:', text)

    # list of extents where chinese text is found
    chinese_inserts = [] # [start, end]

    # keep scanning text to end
    index = 0
    while index < len(text):
        m = _scan_chinese_re.search(text[index:])
        if not m:
            break
        # get extent from match object and add to list
        start = index + m.start()
        end = index + m.end()
        print('chinese at index', start, text[start:end])
        chinese_inserts.append([start, end])
        # end is already an absolute index, so resume scanning from it
        index = end

    # copy string and replace backwards so we don't mess up indexes
    copy = list(text)
    while chinese_inserts:
        start, end = chinese_inserts.pop()
        copy[start:end] = translator(text[start:end])
    text = ''.join(copy)
    print('final', text)
    return text

scanner(s1)
scanner(s2)

With my dubious translator, the result is

----> text: You say: 你好. I say: 再見
chinese at index 9 你好
chinese at index 20 再見
final You say: ?? ??. I say: ?? ??
----> text: 答案, my friend, 在風在吹
chinese at index 0 答案
chinese at index 15 在風在吹
final ?? ??, my friend, ?? ?? ?? ??
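The same index-based idea can be sketched more compactly with re.finditer(), which handles the scanning and advancing itself. This is an illustrative rewrite rather than the code above: replace_by_index is a hypothetical name, and translator is the same placeholder that emits one '??' per character.

```python
import re

CHINESE_RE = re.compile(r'[\u4e00-\u9fff]+')

def translator(chinese_text):
    """Placeholder translator: one '??' per Chinese character."""
    return ' '.join('??' for _ in chinese_text)

def replace_by_index(text):
    parts = []
    last = 0  # end index of the previous Chinese block
    for m in CHINESE_RE.finditer(text):
        parts.append(text[last:m.start()])   # untouched text before the block
        parts.append(translator(m.group()))  # translated block
        last = m.end()
    parts.append(text[last:])                # whatever follows the final block
    return ''.join(parts)

print(replace_by_index("You say: 你好. I say: 再見"))
```

Building the result front to back from slices avoids having to replace backwards to keep the indexes valid.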

Python: find runs of Chinese characters in a string and apply a function