Python - first word is not included in the search

Time: 2018-07-24 02:04:38

Tags: python python-3.6

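Why is the first word printed but not matched by the search in "dic"?

Can anyone tell me the reason, and how to make the search include the first word as well?

Here is my code: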

my_dic = {
"a":"1", 
"b":"2", 
"c":"3", 
"d":"4", 
"e":"5", 
}

with open('c:\\english_text_file.txt',encoding = 'utf8') as file :
  for line in file:
    for word in line.split():
      print('word from line.split: ',word)
      if word in my_dic.keys():
        print('word from if word in ...',word)

The contents of the text file are:

a b c d e

The output is:

word from line.split:  a
word from line.split:  b
word from if word in ... b
word from line.split:  c
word from if word in ... c
word from line.split:  d
word from if word in ... d
word from line.split:  e
word from if word in ... e

1 answer:

Answer 0 (score: 2)

This happens because of how Windows tools such as Notepad typically save UTF-8 txt files: they add a BOM to the start of the file.

What is a BOM

BOM stands for byte-order mark. The values are as follows:

Byte-order mark    Description
EF BB BF           UTF-8
FF FE              UTF-16 aka UCS-2, little endian
FE FF              UTF-16 aka UCS-2, big endian
FF FE 00 00        UTF-32 aka UCS-4, little endian
00 00 FE FF        UTF-32 aka UCS-4, big endian

Open your english_text_file.txt with any hex editor and you will see the following:

efbb bf61 2062 2063 2064 2065 0d0a

Here, ef bb bf is the BOM, and 61 20 62 20 63 20 64 20 65 0d 0a is the ASCII encoding of a b c d e\r\n.
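The byte sequences listed for each encoding can also be checked programmatically. The sketch below uses the BOM constants from Python's standard codecs module; the helper name detect_bom and the BOMS list are made up for this illustration, and the check is only a heuristic (a BOM-less file reports nothing).

```python
import codecs

# Candidate BOMs. The UTF-32 LE mark (FF FE 00 00) begins with the
# UTF-16 LE mark (FF FE), so the UTF-32 marks are tested first.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def detect_bom(data):
    """Return (encoding_name, bom_length) for a leading BOM, or (None, 0)."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0

# The bytes from the hex dump: UTF-8 BOM followed by "a b c d e\r\n".
sample = bytes.fromhex("efbbbf6120622063206420650d0a")
print(detect_bom(sample))  # ('utf-8', 3)
```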

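Because the file is opened with encoding = 'utf8', Python decodes the BOM bytes EF BB BF as the invisible character U+FEFF and keeps it attached to the first word. The first word is therefore '\ufeffa' rather than 'a': it looks the same when printed, but it does not match the key "a". So, for a UTF-8 file, we need to check whether there is a BOM at the beginning of the file and, if there is, remove it.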
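To confirm what the first word actually contains, the same bytes can be decoded in memory and inspected with repr. This is a minimal sketch that rebuilds the file content from the hex dump instead of reading the real file; the variable names raw, text and words are chosen here for illustration.

```python
my_dic = {"a": "1", "b": "2", "c": "3", "d": "4", "e": "5"}

# Same bytes as in the hex dump: UTF-8 BOM, then "a b c d e\r\n".
raw = bytes.fromhex("efbbbf6120622063206420650d0a")

# Plain 'utf8' decoding keeps the BOM as the character U+FEFF.
text = raw.decode("utf8")
words = text.split()

print(repr(words[0]))                        # repr makes the hidden '\ufeff' visible
print(words[0] in my_dic)                    # False: '\ufeffa' is not the key 'a'
print(words[0].lstrip("\ufeff") in my_dic)   # True once the BOM character is stripped
```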

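Sample code follows for your reference. If you do not mind changing the original file, you can overwrite the old input file directly; here I just write a new file that has no BOM.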

import codecs

my_dic = {
    "a":"1",
    "b":"2",
    "c":"3",
    "d":"4",
    "e":"5",
}

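# Read the raw bytes so the BOM, if any, can be inspected.
with open('./english_text_file.txt', 'rb') as src:
    content = src.read()

# Drop the three-byte UTF-8 BOM (EF BB BF) when it is present.
if content.startswith(codecs.BOM_UTF8):
    content = content[len(codecs.BOM_UTF8):]

# Write the BOM-free copy; the original file is left untouched.
with open('./changed_english_text_file.txt', 'wb') as dst:
    dst.write(content)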

with open('./changed_english_text_file.txt',encoding = 'utf8') as file :
    for line in file:
        for word in line.split():
            print('word from line.split: ',word)
            if word in my_dic.keys():
                print('word from if word in ...',word)
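
As an alternative that avoids writing a second file, Python's built-in 'utf-8-sig' codec skips a leading UTF-8 BOM when reading, if one is present, and otherwise decodes like 'utf-8'. The sketch below is not part of the code above; the temporary file is only a stand-in for english_text_file.txt so that the example runs on its own, and the matched list is introduced here just to collect the results.

```python
import codecs
import os
import tempfile

my_dic = {"a": "1", "b": "2", "c": "3", "d": "4", "e": "5"}

# Stand-in for english_text_file.txt: a temp file that starts with a UTF-8 BOM.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(codecs.BOM_UTF8 + b"a b c d e\r\n")
    path = tmp.name

matched = []
try:
    # 'utf-8-sig' drops the BOM while decoding, so no separate stripping pass is needed.
    with open(path, encoding="utf-8-sig") as file:
        for line in file:
            for word in line.split():
                if word in my_dic:
                    matched.append(word)
finally:
    os.remove(path)

print(matched)  # ['a', 'b', 'c', 'd', 'e']
```

With the real file, only the encoding argument of the original open call needs to change.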