I am trying to use the elements of a list as the names of some files that exist in a directory. So far I have written a function that retrieves the name of each file in a directory and collects the names in a list:

    def retrive(directory_path):
        path_names = []
        for filename in sorted(glob.glob(os.path.join(directory_path, '*.pdf'))):
            retrieved_files = filename.split('/')[-1]
            path_names.append(retrieved_files)
        print (path_names)

The function above gives me the name of each file in a list. I then write the files to another directory as follows:

    path = os.path.join(new_dir_path, "list%d.txt" % i)
    #This is the path of each new file:
    #print(path)
    with codecs.open(path, "w", encoding='utf8') as filename:
        for item in [a_list]:
            filename.write(item+"\n")

Finally, my question is: how can I use each element of path_names as the name of the corresponding file? Something like this line:

    path = os.path.join(new_dir_path, "list%d.txt" % i)

I also tried using the format() function, but I still cannot give each file the correct name.

The desired output of the full script would use the actual name of each processed file.
Answer 0 (score: 1)
You're almost there:

    for path_name in path_names:
        path = os.path.join(new_dir_path, "list%s.txt" % path_name)
        #This is the path of each new file:
        #print(path)
        with codecs.open(path, "w", encoding='utf8') as f:
            for item in [a_list]:
                f.write(item+"\n")

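To make the difference between the two format specifiers concrete, here is a minimal sketch. The file names are made up for illustration and stand in for whatever path_names actually holds. It shows that %d can only ever produce a counter, that %s carries the real name through, and that str.format() (which the question mentions trying) behaves the same way as %s here.

```python
# Hypothetical file names, standing in for the contents of path_names.
path_names = ["notes.pdf", "report.pdf"]

# %d formats an integer, so the resulting name can only be a counter.
numbered = ["list%d.txt" % i for i, _ in enumerate(path_names)]

# %s formats any object as a string, so the real name is carried through.
named = ["list%s.txt" % name for name in path_names]

# str.format() gives the same result as %s in this case.
formatted = ["list{}.txt".format(name) for name in path_names]

print(numbered)
print(named)
print(formatted)
```

Note that the original .pdf extension is still part of each name in the last two lists, so the new files end in .pdf.txt unless the extension is stripped first.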
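Update, based on the updated code sample. You are using separate loops here, which is not ideal unless you do some processing between the two loops. Since I am going to keep that structure, we have to make sure each block of content stays associated with its original file name. The best structure for that is a dict, and if order matters we use an OrderedDict. Then, when we loop over the (filename, content) pairs in the OrderedDict, we want to change the file's extension to match the new file type. Luckily, Python has some nice file and path manipulation utilities in the os.path module. os.path.basename can be used to strip the directory from a path, and os.path.splitext splits the extension off a file name. We use both to get the file name without its extension, then append .txt to designate the new file type. Putting it all together, we get: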

    def transform_directoy(input_directory, output_directory):
        import codecs, glob, os
        from collections import OrderedDict
        from tika import parser

        all_texts = OrderedDict()
        for filename in sorted(glob.glob(os.path.join(input_directory, '*.pdf'))):
            parsed = parser.from_file(filename)
            filename = os.path.basename(filename)
            texts = parsed['content']
            all_texts[filename] = texts

        for i, (original_filename, a_list) in enumerate(all_texts.items()):
            new_filename, _ = os.path.splitext(original_filename)
            new_filename += '.txt'
            new_dir_path = output_directory
            #print(new_dir_path)
            path = os.path.join(new_dir_path, new_filename)
            # Print out the name of the file we are processing
            print('Transforming %s => %s' % (original_filename, path,))
            with codecs.open(path, "w", encoding='utf8') as filename:
                for item in [a_list]:
                    filename.write(item+"\n")

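The two os.path helpers used above can be tried in isolation. This is a small sketch with a hypothetical input path (pdfs/report.final.pdf is not from the question). It also shows that os.path.splitext only splits off the last extension, and that os.path.basename uses the platform's path separator, unlike the split('/') in the question, which assumes POSIX-style paths.

```python
import os.path

# Hypothetical input path, used only for illustration.
source = os.path.join("pdfs", "report.final.pdf")

name = os.path.basename(source)      # drops the directory part
stem, ext = os.path.splitext(name)   # splits off only the last extension
new_name = stem + ".txt"             # same base name, new file type

print(name)
print(stem, ext)
print(new_name)
```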
Second update: the OP asked how I would write this code if this were all there is to it, so here it is:

    # move imports to top of file: PEP 8
    import codecs, glob, os
    from tika import parser

    def transform_directoy(input_directory, output_directory):
        for filename in sorted(glob.glob(os.path.join(input_directory, '*.pdf'))):
            parsed = parser.from_file(filename)
            parsed_content = parsed['content']
            original_filename = os.path.basename(filename)
            new_filename, _ = os.path.splitext(original_filename)
            new_filename += '.txt'
            path = os.path.join(output_directory, new_filename)
            # Print out the name of the file we are processing
            print('Transforming %s => %s' % (original_filename, path,))
            # no need for a second loop since we can piggy back off the first loop
            with codecs.open(path, "w", encoding='utf8') as filename:
                # No need for a for loop here since our list only has one item
                filename.write(parsed_content)
                filename.write("\n")

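If you want to check the naming logic without running Tika, one option is to pull the name mapping into its own function. The sketch below is only an illustration under a few assumptions: txt_path_for and fake_extract are hypothetical names that do not appear in the code above, fake_extract merely stands in for parser.from_file(...)['content'], and the input paths are invented and never opened. It uses the built-in open, which accepts an encoding argument in Python 3, instead of codecs.open.

```python
import os
import tempfile


def txt_path_for(pdf_path, output_directory):
    """Map a PDF path to a .txt path with the same base name (hypothetical helper)."""
    stem, _ = os.path.splitext(os.path.basename(pdf_path))
    return os.path.join(output_directory, stem + ".txt")


def fake_extract(pdf_path):
    # Stand-in for parser.from_file(pdf_path)['content']; this is not Tika output.
    return "text extracted from %s" % os.path.basename(pdf_path)


with tempfile.TemporaryDirectory() as out_dir:
    written = []
    for pdf_path in ["in/a.pdf", "in/b.pdf"]:  # hypothetical inputs, never opened
        target = txt_path_for(pdf_path, out_dir)
        with open(target, "w", encoding="utf8") as handle:
            handle.write(fake_extract(pdf_path) + "\n")
        written.append(os.path.basename(target))
    # List the output directory before it is cleaned up.
    on_disk = sorted(os.listdir(out_dir))

print(written)
print(on_disk)
```

Keeping the mapping in a separate function means the file naming can be verified on its own, and the extraction step can be swapped back to the real parser call afterwards.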