Question

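I wrote a simple program that reads a file, asks the user for a word, and then reports how many times that word is used. I wanted to improve it so you don't have to type the exact directory every time. I imported Tkinter and used fileName = filedialog.askopenfilename(), which pops up a box that lets me choose the file. Every time I try to use it, though, I get the following error...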

Traceback (most recent call last):
  File "/Users/AshleyStallings/Documents/School Work/Computer Programming/Side Projects/How many? (Python).py", line 24, in <module>
    for line in fileScan.read().split():   #reads a line of the file and stores
  File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0x8e in position 12: ordinal not in range(128)

The only time I don't seem to get this error is when I open a .txt file. But I'd also like to open .docx files. Thanks for your help :)

# Name: Ashley Stallings
# Program decription: Asks user to input a word to search for in a specified
# file and then tells how many times it's used.
from tkinter import filedialog

print ("Hello! Welcome to the 'How Many' program.")
fileName= filedialog.askopenfilename()  #Gets file name


cont = "Yes"

while cont == "Yes":
    word=input("Please enter the word you would like to scan for. ") #Asks for word
    capitalized= word.capitalize()  
    lowercase= word.lower()
    accumulator = 0

    print ("\n")
    print ("\n")        #making it pretty
    print ("Searching...")

    fileScan= open(fileName, 'r')  #Opens file

    for line in fileScan.read().split():   #reads a line of the file and stores
        line=line.rstrip("\n")
        if line == capitalized or line == lowercase:
            accumulator += 1
    fileScan.close()

    print ("The word", word, "is in the file", accumulator, "times.")

    cont = input ('Type "Yes" to check for another word or \
"No" to quit. ')  #deciding next step
    cont = cont.capitalize()

    if cont != "No" and cont != "Yes":
        print ("Invalid input!")

print ("\n")
print ("Thanks for using How Many!")  #ending

P.S. Not sure if it matters, but I'm running OS X.

Answer 1

The only time I don't seem to get this error is when I open a .txt file. But I'd also like to open .docx files.

A docx file isn't just a text file; it's an Office Open XML file: a zipfile containing an XML document and any other supporting files. Trying to read it as a text file won't work.

For example, the first 4 bytes of the file will be:

b'PK\x03\x04'

You can't interpret that as UTF-8, ASCII, or anything else without getting a bunch of garbage. You certainly won't find your words in there.
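
To see the signature for yourself, here is a minimal sketch. It builds a tiny zip archive in memory as a stand-in for a real .docx (an assumption made so the snippet runs without any file on disk); looks_like_zip is a hypothetical helper name, while zipfile.is_zipfile is the standard-library check.

```python
import io
import zipfile


def looks_like_zip(data: bytes) -> bool:
    """Return True if the bytes start with the ZIP local-file-header signature."""
    return data[:4] == b"PK\x03\x04"


# Build a small zip archive in memory to stand in for a real .docx file.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as z:
    z.writestr("word/document.xml", "<doc/>")

raw = buffer.getvalue()
print(raw[:4])                               # b'PK\x03\x04'
print(looks_like_zip(raw))                   # True
print(looks_like_zip(b"plain text"))         # False
print(zipfile.is_zipfile(io.BytesIO(raw)))   # True
```

In the original program, a check like zipfile.is_zipfile(fileName) before open(fileName, 'r') would be one way to tell the two kinds of file apart.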

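You can do the processing yourself: use zipfile to get at document.xml inside the archive, then use an XML parser to find the text nodes, then join their contents back together so you can split them on whitespace. For example: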

import itertools
import zipfile
import xml.etree.ElementTree as ET

with zipfile.ZipFile('foo.docx') as z:
    document = z.open('word/document.xml')
    tree = ET.parse(document)

textnodes = tree.findall('.//{http://schemas.openxmlformats.org/wordprocessingml/2006/main}t')
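# node.text is None for an empty element, so fall back to an empty string
text = itertools.chain.from_iterable((node.text or '').split() for node in textnodes)
for word in text:
    ...  # process each word here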

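Of course it would be better to actually parse the xmlns declarations and register the w namespace properly, so you can just use 'w:t'. But if you know what that means, you already know how to do it, and if you don't, this isn't the place for a tutorial on XML namespaces and ElementTree.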
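
For what it's worth, a lighter alternative to registering anything globally is to pass a prefix-to-URI mapping to findall. The sketch below is an illustration only: the SAMPLE string is a hand-written fragment standing in for word/document.xml (an assumption; real documents carry far more markup), and the prefix name w is just a local convenience.

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI mapping passed to findall; "w" is only a local alias.
NS = {"w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main"}

# Hand-written stand-in for word/document.xml (assumed, heavily simplified).
SAMPLE = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    "<w:body><w:p>"
    "<w:r><w:t>Hello docx</w:t></w:r>"
    "<w:r><w:t>world</w:t></w:r>"
    "</w:p></w:body></w:document>"
)

root = ET.fromstring(SAMPLE)
words = []
for node in root.findall(".//w:t", NS):
    # node.text is None for an empty element, so fall back to ""
    words.extend((node.text or "").split())

print(words)  # ['Hello', 'docx', 'world']
```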

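So, how are you supposed to know that it's a zipfile full of stuff, that the actual text is in the file word/document.xml, that the text within that file is in .//w:t nodes, that the namespace w maps to http://schemas.openxmlformats.org/wordprocessingml/2006/main, and so on? Well, you can read all the relevant documentation, using some sample files and a bit of exploration to guide you, if you already know enough about this stuff. But if you don't, there's a substantial learning curve ahead of you.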
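
The "bit of exploration" could look something like the sketch below: list the archive members, then list the distinct element tags in one of them. The archive here is built in memory as a simplified stand-in for a real .docx (an assumption; a real file has many more members and elements), so with your own file you would pass its path to zipfile.ZipFile instead.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Simplified in-memory stand-in for a real .docx archive (assumed contents).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as z:
    z.writestr("[Content_Types].xml", "<Types/>")
    z.writestr(
        "word/document.xml",
        '<w:document xmlns:w="%s"><w:body><w:p><w:r>'
        "<w:t>hi</w:t></w:r></w:p></w:body></w:document>" % W,
    )
buffer.seek(0)

# Step 1: see which files the archive contains.
with zipfile.ZipFile(buffer) as z:
    names = z.namelist()
    # Step 2: parse one member and collect the element tags it uses.
    with z.open("word/document.xml") as f:
        root = ET.parse(f).getroot()

tags = sorted({element.tag for element in root.iter()})

print(names)
for tag in tags:
    print(tag)  # ElementTree spells tags as {namespace-uri}localname
```

The printed tags show why the earlier search string has the long {http://...}t form: ElementTree expands every prefix into its full namespace URI.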

Even if you do know what you're doing, you're probably better off searching PyPI for a docx parser module and just using that.

Unicode decode error when trying to use Tkinter (Python)
