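I have a folder containing a few articles, and I want to map the text of each article into one common list so that I can use that list for a tf-idf transformation. For example: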
folder = [article1,article2,article3]
into the list
list = [' text_of_article1',' text_of_article2',' text_of_article3']
def multiple_file(arg): #arg is path to the folder with multiple files
    '''Function opens multiple files in a folder and maps each of them to a list
    as a string'''
    import glob, sys, errno
    path = arg
    files = glob.glob(path)
    list = [] #list where file string would be appended
    for item in files:
        try:
            with open(item) as f: # No need to specify 'r': this is the default.
                list.append(f.read())
        except IOError as exc:
            if exc.errno != errno.EISDIR: # Do not fail if a directory is found, just ignore it.
                raise # Propagate other kinds of IOError.
    return list
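When I set the path to the folder with my articles, I get an empty list. However, when I point it directly at a single article, that article shows up in the list. How can I get all of them mapped into my list? :S
Here is the code; I'm not sure this is what you had in mind: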
def multiple_files(arg): #arg is path to the folder with multiple files
    '''Function opens multiple files in a folder and maps each of them to a list
    as a string'''
    import glob, sys, errno, os
    path = arg
    files = os.listdir(path)
    list = [] #list where file string would be appended
    for item in files:
        try:
            with open(item) as f: # No need to specify 'r': this is the default.
                list.append(f.read())
        except IOError as exc:
            if exc.errno != errno.EISDIR: # Do not fail if a directory is found, just ignore it.
                raise # Propagate other kinds of IOError.
    return list
And this is the error:
Traceback (most recent call last):
  File "<ipython-input-7-13e1457699ff>", line 1, in <module>
    x = multiple_files(path)
  File "<ipython-input-5-6a8fab5c295f>", line 10, in multiple_files
    with open(item) as f: # No need to specify 'r': this is the default.
IOError: [Errno 2] No such file or directory: 'u02.txt'
Article 2 is actually the first item in the newly created list.
Answer 0 (score: 0)
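Suppose path == "/home/docs/guzdeh". If you just call glob.glob(path), you will only get [path] back, because nothing else matches that pattern. You want glob.glob(path + "/*") to get everything in that directory, or glob.glob(path + "/*.txt") to get all the txt files.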
Or you could use import os; os.listdir(path), which I think makes more sense.
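To see the difference for yourself, here is a small sketch, assuming Python 3. It builds a throwaway folder (the file names u01.txt and u02.txt are made up for illustration) and compares the three calls:

```python
import glob
import os
import tempfile

# Build a temporary folder holding two hypothetical article files.
folder = tempfile.mkdtemp()
for name in ("u01.txt", "u02.txt"):
    with open(os.path.join(folder, name), "w") as f:
        f.write("text of " + name)

# A pattern without a wildcard matches only the folder itself.
only_folder = glob.glob(folder)

# With a wildcard, glob returns the entries inside the folder,
# each one prefixed with the folder path.
txt_paths = sorted(glob.glob(os.path.join(folder, "*.txt")))

# os.listdir returns bare entry names, without the folder prefix.
names = sorted(os.listdir(folder))

print(only_folder)
print(txt_paths)
print(names)
```

The first call is why the original function returned an empty list: the only match was the directory, and opening it failed with the error the code deliberately ignores.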
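Update:
Regarding the new code, the problem is that os.listdir only returns names relative to the directory being listed, so open(item) looks for the file in the current working directory. You need to join the two so that the path points at the right place. Add:
item = os.path.join(path, item)
before you try open(item). You might also want to name your variables better; list, for instance, shadows the built-in type.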
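Putting it together, a minimal sketch of the whole function might look like this, assuming Python 3. The names read_articles, texts and full_path are my own, and I check os.path.isfile up front instead of catching the EISDIR error, which is a swap from your approach rather than a requirement. Sorting the names is also my choice, since os.listdir makes no promise about order:

```python
import os


def read_articles(folder):
    """Return the text of every regular file in `folder`, one string per file."""
    texts = []
    # os.listdir gives no ordering guarantee, so sort for a stable result.
    for name in sorted(os.listdir(folder)):
        # listdir returns bare names; join them with the folder to get a usable path.
        full_path = os.path.join(folder, name)
        if not os.path.isfile(full_path):
            continue  # skip subdirectories and anything else that is not a file
        with open(full_path) as f:
            texts.append(f.read())
    return texts
```

You would call it as texts = read_articles("/home/docs/guzdeh") and pass the result to your tf-idf step. If the articles are not in your platform's default encoding, you would need to pass encoding=... to open.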