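Why is the first word printed but not found by the lookup in my_dic?
Can anyone tell me the reason, and how to get the first word included as well?
Here is my code: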
my_dic = {
    "a": "1",
    "b": "2",
    "c": "3",
    "d": "4",
    "e": "5",
}
with open('c:\\english_text_file.txt', encoding='utf8') as file:
    for line in file:
        for word in line.split():
            print('word from line.split: ', word)
            if word in my_dic.keys():
                print('word from if word in ...', word)
The content of the text file is:
a b c d e
The output of the code is:
word from line.split: a
word from line.split: b
word from if word in ... b
word from line.split: c
word from if word in ... c
word from line.split: d
word from if word in ... d
word from line.split: e
word from if word in ... e
Answer 0 (score: 2)
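This happens because of how some Windows editors, Notepad among them, can save UTF-8 text files: they add a BOM to the beginning of the file.

What is a BOM? BOM stands for byte-order mark. Its values are as follows:

Byte-order mark    Description
EF BB BF           UTF-8
FF FE              UTF-16/UCS-2, little endian
FE FF              UTF-16/UCS-2, big endian
FF FE 00 00        UTF-32/UCS-4, little endian
00 00 FE FF        UTF-32/UCS-4, big endian

Open your english_text_file.txt in any hex editor and you will see the following:

efbb bf61 2062 2063 2064 2065 0d0a

Here, efbb bf is the BOM, and 61 2062 2063 2064 2065 0d0a is the ASCII encoding of a b c d e\r\n.

Python's utf8 codec does not strip the BOM. It decodes those three bytes as the character U+FEFF, so the first word read from the file is '\ufeffa' rather than 'a'. It looks like a when printed, but it is not a key in my_dic.

So, for a UTF-8 file, we need to check whether there is a BOM at the beginning of the file and, if there is, remove it.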
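To see the effect without a hex editor, the following sketch decodes the same byte sequence in memory. The names raw, words and first are illustrative and not part of the question; the sample bytes are assumed to match the file described above.

```python
import codecs

# Bytes of the file as a BOM-writing editor would save "a b c d e\r\n"
# (assumed sample data, mirroring the hex dump).
raw = codecs.BOM_UTF8 + b"a b c d e\r\n"

my_dic = {"a": "1", "b": "2", "c": "3", "d": "4", "e": "5"}

# The plain utf-8 codec keeps the BOM as the character U+FEFF.
words = raw.decode("utf-8").split()
print(repr(words[0]))        # repr() makes the invisible character visible
print(words[0] in my_dic)    # the lookup for the first word fails

# Stripping U+FEFF from the first word restores the match.
first = words[0].lstrip("\ufeff")
print(first in my_dic)
```

Printing repr(word) instead of word in the original loop is a quick way to spot this kind of invisible character.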
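Sample code follows for your reference. If you don't mind changing the original file, you can overwrite the old input file directly; here I just write a new file that has no BOM. Once the BOM is gone, the first word a is matched as well.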
import codecs

my_dic = {
    "a": "1",
    "b": "2",
    "c": "3",
    "d": "4",
    "e": "5",
}

# Read the raw bytes so the BOM can be detected before decoding.
with open('./english_text_file.txt', 'rb') as src:
    content = src.read()

# Strip the 3-byte UTF-8 BOM (EF BB BF) if it is present.
if content.startswith(codecs.BOM_UTF8):
    content = content[len(codecs.BOM_UTF8):]

# Write the bytes to a new file, leaving the original untouched.
with open('./changed_english_text_file.txt', 'wb') as dst:
    dst.write(content)

with open('./changed_english_text_file.txt', encoding='utf8') as file:
    for line in file:
        for word in line.split():
            print('word from line.split: ', word)
            if word in my_dic:
                print('word from if word in ...', word)
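
As an alternative to rewriting the file, Python's standard utf-8-sig codec skips a leading BOM while decoding and otherwise behaves like utf-8. The sketch below is only an illustration: find_known_words is a hypothetical helper, and a temporary file stands in for english_text_file.txt.

```python
import codecs
import os
import tempfile

my_dic = {"a": "1", "b": "2", "c": "3", "d": "4", "e": "5"}


def find_known_words(path, known):
    """Return the words in the file at `path` that are keys of `known`."""
    found = []
    # utf-8-sig drops a leading BOM if there is one.
    with open(path, encoding="utf-8-sig") as file:
        for line in file:
            for word in line.split():
                if word in known:
                    found.append(word)
    return found


# Create a throwaway file that starts with a UTF-8 BOM
# (a stand-in for the real input file).
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(codecs.BOM_UTF8 + b"a b c d e\r\n")
    sample_path = tmp.name

result = find_known_words(sample_path, my_dic)
print(result)
os.remove(sample_path)
```

This avoids the second file entirely, which may be preferable when the input should not be copied or modified.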