Question

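I'm writing a program that needs to loop through a bunch of files in a directory and convert the .docx and .pdf files to .txt. The difficulty is that the files also need to be renamed. I wrote a function that converts docx to text and then saves the result to the same folder. Once that works, I'll extend it to pdf. The function and the code run, but no .txt files are created, and without an error message I can't figure out why. Any help would be greatly appreciated!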

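*Note: the code does find a bunch of .docx files that need converting. It just doesn't seem to create the new .txt versions.

#import a module that converts word to text
import textract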

#define a function that takes a word doc, a title and converts it to text
def to_text(word_doc_path,title):
    tmp_text = textract.process(word_doc_path, extension = 'docx')
    new_file = open(title,'wb')
    new_file.write(tmp_text)
    new_file.close()

#import modules for paths and regex
import os
import re

#define paths
path = 'C:\\my\\path\\'

files = []
# r=root, d=directories, f = files
for r, d, f in os.walk(path):
    for file in f:
        if '.docx' in file:
            files.append(os.path.join(r, file))


for f in files:
    #regex to find name of pxx
    pxx = re.search(r'(?<=pxx )(.*)(?=\\)',f)
    #regex to find cxx
    cxx = re.search(r'[A-Z]{4} [0-9]{3}(?<![A-Z]{5})(?![A-Z])',f)
    #regex to find txx and year
    y_term = re.search(r'[0-9]{4} fxx|[0-9]{4} sxx',f)
    if y_term is not None and pxx is not None and cxx is not None:
        tmp_title = y_term.group(0)+'-'+pxx.group(0)+'_'+cxx.group(0)+'.txt'
        to_text(f,tmp_title)

Answer 1

Attempt 1: This looks like a regex problem. If any one of your searches returns None, no file is created.
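One way to check this is to report which of the three patterns fails for each path. The sketch below reuses the regexes from the question; the helper name missing_patterns and the sample paths are made up for illustration, so swap in your own paths.

```python
import re

# Patterns copied from the question. The helper name and any sample
# paths passed to it are hypothetical, for illustration only.
PATTERNS = {
    'pxx': r'(?<=pxx )(.*)(?=\\)',
    'cxx': r'[A-Z]{4} [0-9]{3}(?<![A-Z]{5})(?![A-Z])',
    'y_term': r'[0-9]{4} fxx|[0-9]{4} sxx',
}

def missing_patterns(file_path):
    """Return the names of the patterns that do not match file_path."""
    return [name for name, pattern in PATTERNS.items()
            if re.search(pattern, file_path) is None]

if __name__ == '__main__':
    # Hypothetical path; replace with entries from your own files list.
    sample = 'C:\\my\\path\\2019 fxx\\pxx Smith\\ABCD 123 notes.docx'
    print(sample, '->', missing_patterns(sample))
```

If the list is non-empty for every file, the if condition in your loop is never true and to_text is never called. Note that the pxx pattern requires a backslash after the name, so it would not match a path that uses forward slashes.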

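Attempt 2: tmp_title is a file name, not a full path. Have you checked whether the files were created in the folder the script runs from (the current working directory) instead of next to the source documents?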
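If that turns out to be the cause, a minimal sketch of the fix is to join the new title onto the folder of the source document before opening it. The helper names output_path_for and where_relative_name_lands are hypothetical, not part of your code or of textract.

```python
import os

def output_path_for(source_path, title):
    """Build a path that puts title in the same folder as source_path."""
    return os.path.join(os.path.dirname(source_path), title)

def where_relative_name_lands(title):
    """Show where open(title, 'wb') writes when title has no folder part."""
    return os.path.abspath(title)

if __name__ == '__main__':
    # Hypothetical file names, for illustration only.
    source = os.path.join('data', 'pxx Smith', 'report.docx')
    print(output_path_for(source, '2019 fxx-Smith_ABCD 123.txt'))
    print(where_relative_name_lands('2019 fxx-Smith_ABCD 123.txt'))
```

In your loop that would mean calling to_text(f, output_path_for(f, tmp_title)) instead of to_text(f, tmp_title).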
