Question

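I have over a hundred unstructured .txt files (articles) that I need to preprocess for NLP. Do I have to convert the .txt files to .csv first, or can I start cleaning the raw text files directly? If so, can someone help me do the batch file-type conversion with Python?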

Answer 1

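No. There is no need to convert the text files to CSV. If any of your articles are Word files, you can read them easily with python-docx. To do that, you first need to install python-docx. In Python 3: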

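!pip install python-docx  # install the python-docx package (notebook syntax; drop the "!" in a shell)

from docx import Document  # import the Document class

doc = open("TextFileName.docx", "rb")  # open the Word file in binary mode

document = Document(doc)  # create a Document object for reading the Word file

text = "\n".join(p.text for p in document.paragraphs)  # collect the paragraph text as one string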
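For plain .txt files, which is what the question asks about, no extra library should be needed. Below is a minimal sketch of reading every .txt file in a folder straight into memory and applying a first cleaning pass. The folder name "articles", the helper names load_text_files and basic_clean, and the UTF-8 encoding are all assumptions made for illustration; the cleaning step is only a placeholder for whatever preprocessing your NLP task actually needs.

```python
from pathlib import Path


def load_text_files(folder):
    """Return {filename: text} for every .txt file directly inside folder."""
    texts = {}
    for path in sorted(Path(folder).glob("*.txt")):
        # Assumes UTF-8; errors="replace" keeps the loop going on bad bytes.
        texts[path.name] = path.read_text(encoding="utf-8", errors="replace")
    return texts


def basic_clean(text):
    """Placeholder cleaning step: lowercase and collapse runs of whitespace."""
    return " ".join(text.lower().split())


if __name__ == "__main__":
    # "articles" is a hypothetical folder name; point it at your own files.
    articles = load_text_files("articles")
    cleaned = {name: basic_clean(text) for name, text in articles.items()}
    print(f"Loaded and cleaned {len(cleaned)} files")
```

Keeping the results in a dictionary keyed by filename lets you trace each cleaned text back to its source article. If you later want a tabular form, you can write that dictionary out with the standard csv module, but that is optional rather than a prerequisite for cleaning.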
