I have been trying to read a .docx file and copy its text into a .txt file.
To do this, I first wrote the following script.
if extension == 'docx':
    document = Document(filepath)
    for para in document.paragraphs:
        with open("C:/Users/prasu/Desktop/PySumm-resource/CodeSamples/output.txt","w") as file:
            file.writelines(para.text)
It fails with the following error:
Traceback (most recent call last):
  File "input_script.py", line 27, in <module>
    file.writelines(para.text)
  File "C:\Python\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2265' in position 0: character maps to <undefined>
I tried printing para.text with print(), and that works. Now I want to write para.text to a .txt file.
Answer 0 (score: 0)
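You could try using codecs.
Judging by your error message, the character "≥" (U+2265) appears to be causing the problem: the cp1252 codec named in the traceback has no mapping for it. Writing the output as UTF-8 through codecs should hopefully solve it.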
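from docx import Document
import codecs

filepath = r"test.docx"
document = Document(filepath)
# Open the output once so the "utf-8-sig" BOM is written only at the start of the file.
with codecs.open('output.txt', 'w', "utf-8-sig") as o_file:
    for para in document.paragraphs:
        # para.text has no trailing newline, so add one per paragraph.
        o_file.write(para.text + "\n")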
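
As an alternative to codecs, the built-in open() also accepts an encoding argument, which is likely enough here. The following is a minimal sketch, not the asker's actual script: write_paragraphs is a hypothetical helper name, and it takes plain strings so it can be run without a .docx file on hand.

```python
import os
import tempfile


def write_paragraphs(paragraphs, out_path):
    """Write each paragraph string on its own line, encoded as UTF-8."""
    # Passing encoding explicitly avoids the platform default
    # (cp1252 in the traceback above), which cannot encode U+2265.
    with open(out_path, "w", encoding="utf-8") as out_file:
        for text in paragraphs:
            out_file.write(text + "\n")


# Sample input containing the character from the error message.
sample = ["\u2265 10 items", "second paragraph"]
demo_path = os.path.join(tempfile.mkdtemp(), "output.txt")
write_paragraphs(sample, demo_path)

# Read the file back with the same encoding to confirm the round trip.
with open(demo_path, encoding="utf-8") as in_file:
    round_trip = in_file.read().splitlines()
```

With python-docx, the same helper could presumably be called as write_paragraphs((p.text for p in Document(filepath).paragraphs), "output.txt"). Opening the file once, outside the loop, also avoids overwriting the output on every paragraph, which the "w" mode inside the loop in the question would do.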