Question

I have a lookup list generated from a UTF-8 file (the code that builds it is shown further down).

When I open the file, I can see the word 'الو' in it, so it is in the list. However, the list now looks like this: ['\xd8\xa7\xd9\x84\xd9\x88', '\xd8\xa3\xd9\x84\xd9\x88', '\xd8\xa7\xd9\x88\xd9\x83\xd9\x8a', '\xd8\xa7\xd9\x84', '\xd8\xa7\xd9\x87', '\xd8\xa3\xd9\x87', '\xd9\x87\xd9\x84\xd9\x88', '\xd8\xa3\xd9\x88\xd9\x83\xd9\x8a', '\xd9\x88']

I then want to check whether a specific word is in newStopWords1d. The word 'الو' is stored as '\xd8\xa7\xd9\x84\xd9\x88'. The list is built like this:

import itertools

with open('stop_word_Tiba.txt') as f:
    newStopWords = list(itertools.chain(line.split() for line in f))  # one list of words per line
newStopWords1d = list(itertools.chain(*newStopWords))  # convert 2d list to 1d list

The word is not found. I tried:

word='الو'
for w in newStopWords1d:
    if word == w.encode("utf-8"):
        print 'found'

But the word still isn't found. This looks like an encoding problem, but I can't solve it. Can you help me?

Answer 1

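It's worth noting that you are using Python 2.7, where strings read from a file with the built-in open are byte strings, so decode them before comparing: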

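word = u'الو'  # unicode literal; the source file needs a UTF-8 coding declaration
for w in newStopWords1d:
    if word == w.decode("utf-8"):  # decode the byte string read from the file
        print 'found'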
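As a small illustration of why decoding helps, the sketch below takes the first entry from the list in the question as raw bytes and compares it with the word in both directions. The variable names are made up for this example, and the word is written with Unicode escapes so the snippet does not depend on the source file's encoding.

```python
# Illustrative sketch: the names raw, word and decoded are hypothetical.
raw = b'\xd8\xa7\xd9\x84\xd9\x88'    # first list entry, as UTF-8 bytes
word = u'\u0627\u0644\u0648'          # the word 'الو' as a text (unicode) string

decoded = raw.decode('utf-8')         # bytes -> text
matches_after_decode = (decoded == word)
matches_after_encode = (raw == word.encode('utf-8'))  # text -> bytes
```

Either direction works as long as both sides of the comparison are the same type: both text, or both UTF-8 bytes.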

A better solution is to use the open function from the io module:

import io

with io.open('stop_word_Tiba.txt', encoding="utf-8") as f:
    ...

or the codecs module:

import codecs

with codecs.open('stop_word_Tiba.txt', encoding="utf-8") as f:
    ...

because the built-in open function in Python 2.7 does not support specifying an encoding.
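A minimal sketch of the io.open approach, assuming a whitespace-separated word file. The helper load_stop_words and the temporary sample file are hypothetical and exist only so the example is self-contained. In practice the path would be 'stop_word_Tiba.txt'.

```python
import io
import itertools
import os
import tempfile


def load_stop_words(path):
    """Return a flat list of whitespace-separated words, decoded as UTF-8."""
    with io.open(path, encoding="utf-8") as f:
        # chain.from_iterable flattens the per-line word lists in one step
        return list(itertools.chain.from_iterable(line.split() for line in f))


# Hypothetical sample file standing in for the real stop-word file.
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "stop_words_sample.txt")
with io.open(sample_path, "w", encoding="utf-8") as f:
    f.write(u"\u0627\u0644\u0648 \u0623\u0644\u0648\n\u0648\n")

stop_words = load_stop_words(sample_path)
found = u"\u0627\u0644\u0648" in stop_words  # text compared with text
```

Because the file is decoded while it is read, every list entry is already a text string and a plain `in` test is enough.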

Answer 2

I solved the problem by changing the file-open statement to:

import codecs
import itertools

with codecs.open("stop_word_Tiba.txt", "r", "utf-8") as f:
    newStopWords = list(itertools.chain(line.split() for line in f))  # one list of words per line
newStopWords1d = list(itertools.chain(*newStopWords))  # convert 2d list to 1d list
for w in newStopWords1d:
    # word must be a unicode string here, e.g. word = u'الو'
    if word.encode("utf-8") == w.encode("utf-8"):
        print 'found'

Thank you...
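If the search is meant to live inside a function that returns 'found', it could be sketched roughly as follows. The name find_word is hypothetical, and io.StringIO stands in for an already-decoded file object so the example runs without the real file.

```python
import io
import itertools


def find_word(word, lines):
    """Return 'found' if word is one of the whitespace-separated tokens in lines."""
    words = set(itertools.chain.from_iterable(line.split() for line in lines))
    return 'found' if word in words else None


# Stand-in for a file opened with codecs.open(..., "r", "utf-8").
fake_file = io.StringIO(u"\u0627\u0644\u0648 \u0623\u0644\u0648\n\u0648\n")
result = find_word(u"\u0627\u0644\u0648", fake_file)
```

Once the file is decoded on read, the entries can be compared directly as text, so encoding both sides again is not needed.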

Searching for a word in a UTF-8 list
