Question

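The words in the "word list" and the text I am searching are both Cyrillic. The text is encoded as UTF-8 (set in Notepad++). I need Python to match a word in the text and get everything after the word up to a full stop followed by a new line.

EDIT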

with open('C:\....txt', 'rb') as f:
    wordslist = []
    for line in f:
        wordslist.append(line) 

wordslist = map(str.strip, wordslist)

/ EDIT

for i in wordslist:
    print i #so far, so good, I get Cyrillic
    wantedtext = re.findall(i+".*\.\r\n", open('C:\....txt', 'rb').read())
    wantedtext = str(wantedtext)
    print wantedtext

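wantedtext is displayed and saved as "\xd0\xb2" (etc.).

What I have tried:

This question is different, because no variable is involved: Convert bytes to a python string. Also, the solution from the selected answer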

wantedtext.decode('utf-8')

did not work; the result is the same. The solution from here did not help either.

EDIT: Revised code, which returns "[]".

with io.open('C:....txt', 'r', encoding='utf-8') as f:
    wordslist = f.read().splitlines() 

for i in wordslist:
    print i
    with io.open('C:....txt', 'r', encoding='utf-8') as my_file:
        my_file_test = my_file.read()
        print my_file_test #works, prints cyrillic characters, but...


        wantedtext = re.findall(i+".*\.\r\n", my_file_test)
        wantedtext = str(wantedtext)

        print wantedtext #returns []

(Added after the comments below: this code works if \r\n is removed from the regex.)

Answer 1

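Python 2.x only

Your find probably isn't working because you are mixing strs and Unicode strs, or strs that contain different encodings. If you don't know the difference between a Unicode str and a str, see: https://stackoverflow.com/a/35444608/1554386

Don't start decoding unless you know what you're doing. It isn't voodoo :)

You need to get all of your text into Unicode objects first.

Split your read out onto a separate line - it's easier to read
Decode your text file. Use io.open(), which supports Python 3 style decoding. I'm assuming your text file is UTF-8 (we'll soon find out if it isn't):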
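To make the str/Unicode difference concrete, here is a small sketch. The letter and the variable names are made up for illustration. It also hints at why the output in the question was full of "\xd0\xb2": calling str() on a list prints the repr of every item, escapes included, and decoding that string afterwards probably changes nothing because it only contains ASCII backslash sequences by then.

```python
# -*- coding: utf-8 -*-
text = u"в"                   # one Cyrillic letter as text (unicode in 2.x)
raw = text.encode("utf-8")    # the same letter as UTF-8 bytes

print("%d %d" % (len(text), len(raw)))   # one character, two bytes

matches = [raw]               # stand-in for a findall() result on a bytes string
as_string = str(matches)      # str() of a list uses repr() for each item
print(as_string)              # shows the escapes, not the letter

print(matches[0].decode("utf-8"))   # decode the item itself, not str(list)
```

So if you do end up with byte strings in a list, decode each item rather than the stringified list.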
```
with io.open('C:\....txt', 'r', encoding='utf-8') as my_file:
    my_file_test = my_file.read()
```
my_file_test is now a Unicode str

Now you can do this:
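One side effect worth knowing about: in text mode io.open() uses universal newlines by default, so a Windows "\r\n" comes back as a plain "\n". That is most likely why a pattern ending in \r\n finds nothing once the file is opened this way. A minimal sketch, using a throwaway temp file with made-up contents in place of your real file:

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

# Hypothetical sample file with Windows (\r\n) line endings.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(u"слово один.\r\nслово два.\r\n".encode("utf-8"))

with io.open(path, "rb") as f:
    as_bytes = f.read()     # raw bytes, \r\n is still there
with io.open(path, "r", encoding="utf-8") as f:
    as_text = f.read()      # decoded text, \r\n translated to \n
os.remove(path)

print(b"\r\n" in as_bytes)  # True
print(u"\r" in as_text)     # False
```

If you really need the original line endings, io.open() takes a newline='' argument that turns the translation off.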

# finds lines beginning with i, ending in .
# (.*? matches the rest of the line; re.M makes ^ and $ work per line)
regex = u'^{i}.*?\.$'.format(i=i)
wantedtext = re.findall(regex, my_file_test, re.M)
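Here is a sketch of that search on a small in-memory sample, so you can see what comes back. The sample text and the word are invented, and the pattern is written out as "the word, then the rest of the line up to a final full stop". The re.escape() call is my addition; it just keeps any regex metacharacters in the word from being interpreted.

```python
# -*- coding: utf-8 -*-
import re

# Made-up sample; text read via io.open() has "\n" line endings like these.
my_file_test = u"кот спит на диване.\nсобака лает\nкот ест рыбу.\n"
i = u"кот"

# ^ and $ anchor to each line because of re.M; .*? takes the rest of the line.
regex = u"^{i}.*?\\.$".format(i=re.escape(i))
wantedtext = re.findall(regex, my_file_test, re.M)

for line in wantedtext:     # print items one by one, not str(wantedtext)
    print(line)
```

Only the two lines that start with the word and end in a full stop are returned; the "собака лает" line is skipped.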

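Look at wordslist. You don't say what you do with it, but you need to make sure it is a Unicode str too. If you read it from a file, use the same io.open as above.

EDIT:

For wordslist, you can decode the file and read it into a list, stripping the newlines, all in one go: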

with io.open('C:\....txt', 'r', encoding='utf-8') as f:
    wordslist = f.read().splitlines()
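Putting the pieces together, something like the following should work. Treat it as a sketch under assumptions: the two temp files and their contents stand in for your real word list and text file, write_temp is a hypothetical helper that exists only for the demo, and re.escape() is my addition. Note the text file is read once, outside the loop.

```python
# -*- coding: utf-8 -*-
import io
import os
import re
import tempfile

def write_temp(text):
    """Demo helper (hypothetical): write text to a temp UTF-8 file, return its path."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    # newline="" keeps the \r\n in the sample exactly as written
    with io.open(fd, "w", encoding="utf-8", newline="") as f:
        f.write(text)
    return path

words_path = write_temp(u"кот\r\nсобака\r\n")
text_path = write_temp(u"кот спит.\r\nптица поёт.\r\nсобака лает.\r\n")

with io.open(words_path, "r", encoding="utf-8") as f:
    wordslist = f.read().splitlines()
with io.open(text_path, "r", encoding="utf-8") as f:
    my_file_test = f.read()          # read once, not once per word

found = {}
for i in wordslist:
    regex = u"^{i}.*?\\.$".format(i=re.escape(i))
    found[i] = re.findall(regex, my_file_test, re.M)

os.remove(words_path)
os.remove(text_path)

for word in wordslist:
    print(u"%s -> %s" % (word, u" | ".join(found[word])))
```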

Regex search with a variable in Python 2.7 returns bytes instead of decoded text
