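Mapping articles from a folder to a list

Time: 2016-05-20 15:52:39

Tags: python string list function

I have a folder with a few articles, and I want to map the text of each article into a common list so that I can use the list for a tf-idf transformation. For example: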

folder = [article1,article2,article3]

into the list

list = [' text_of_article1',' text_of_article2',' text_of_article3']

def multiple_file(arg):     #arg is path to the folder with multiple files
    '''Function opens multiple files in a folder and maps each of them to a list
    as a string'''
    import glob, sys, errno
    path = arg
    files = glob.glob(path)
    list = []               #list where file string would be appended
    for item in files:    
        try:
            with open(item) as f: # No need to specify 'r': this is the default.
                list.append(f.read())
        except IOError as exc:
            if exc.errno != errno.EISDIR: # Do not fail if a directory is found, just ignore it.
                raise # Propagate other kinds of IOError.
    return list

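When I set the path to the folder with my articles, I get an empty list. However, when I point it directly at a single article, that article appears in the list. How can I map all of them into my list? :S

Here is the code; not sure if this is what you had in mind: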

def multiple_files(arg):     #arg is path to the folder with multiple files
    '''Function opens multiple files in a folder and maps each of them to a list
    as a string'''
    import glob, sys, errno, os
    path = arg
    files = os.listdir(path)
    list = []               #list where file string would be appended
    for item in files:    
        try:
            with open(item) as f: # No need to specify 'r': this is the default.
                list.append(f.read())
        except IOError as exc:
            if exc.errno != errno.EISDIR: # Do not fail if a directory is found, just ignore it.
                raise # Propagate other kinds of IOError.
    return list

And this is the error:

Traceback (most recent call last):

  File "<ipython-input-7-13e1457699ff>", line 1, in <module>
    x = multiple_files(path)

  File "<ipython-input-5-6a8fab5c295f>", line 10, in multiple_files
    with open(item) as f: # No need to specify 'r': this is the default.

IOError: [Errno 2] No such file or directory: 'u02.txt'

Article 2 is actually the first item in the newly created list.

1 answer:

Answer 0: (score: 0)

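Suppose path == "/home/docs/guzdeh". If you just call glob.glob(path), you only get [path], because nothing else matches that pattern. You want glob.glob(path + "/*") to get everything in that directory, or glob.glob(path + "/*.txt") to get all the txt files.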
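
As a minimal sketch of the glob approach, assuming the articles are plain-text .txt files: the function name read_articles_glob, the demo folder, and the file contents below are hypothetical and only there for illustration.

```python
import glob
import os
import tempfile


def read_articles_glob(folder, pattern="*.txt"):
    """Return the text of every file in `folder` whose name matches `pattern`."""
    texts = []
    # glob.glob needs a wildcard pattern; a bare directory path matches only itself.
    for file_path in sorted(glob.glob(os.path.join(folder, pattern))):
        if os.path.isfile(file_path):  # skip sub-directories that happen to match
            with open(file_path) as f:
                texts.append(f.read())
    return texts


# Demo on a throwaway folder (hypothetical file names and contents).
demo_dir = tempfile.mkdtemp()
for name, text in [("u01.txt", "first article"), ("u02.txt", "second article")]:
    with open(os.path.join(demo_dir, name), "w") as f:
        f.write(text)

print(glob.glob(demo_dir))           # a list holding only the folder path itself
print(read_articles_glob(demo_dir))  # one string per article
```

Sorting the matches is a choice made here so the order of the list is predictable; glob itself does not promise any particular order.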

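Or you could use import os; os.listdir(path), which I think makes more sense.

Update:

Regarding the new code, the problem is that os.listdir only returns names relative to the directory being listed. So you need to join the two to know which file you are talking about. Add: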

item = os.path.join(path, item)
before you try open(item). You may also want to name your variables better; for instance, list shadows the built-in type.
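
Putting the update together, the os.listdir variant could look roughly like this. It is a sketch rather than a drop-in replacement: the names read_articles and texts are made up here, the demo files are hypothetical, and it skips directories with os.path.isfile instead of catching EISDIR as the question's code does.

```python
import os
import tempfile


def read_articles(folder):
    """Return a list with the text of every regular file in `folder`."""
    texts = []  # named `texts` so the built-in `list` is not shadowed
    for name in sorted(os.listdir(folder)):
        # os.listdir returns bare names, so join each one with the folder first.
        file_path = os.path.join(folder, name)
        if not os.path.isfile(file_path):
            continue  # ignore sub-directories
        with open(file_path) as f:
            texts.append(f.read())
    return texts


# Demo on a throwaway folder (hypothetical file names and contents).
demo_dir = tempfile.mkdtemp()
os.mkdir(os.path.join(demo_dir, "subfolder"))
for name, text in [("u01.txt", "text of article 1"), ("u02.txt", "text of article 2")]:
    with open(os.path.join(demo_dir, name), "w") as f:
        f.write(text)

articles = read_articles(demo_dir)
print(articles)
```

The returned list has the shape the question asks for, one string per article, so it can be handed to the tf-idf step directly.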