Question

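I have a folder with about 500 .txt files. I want to store their contents in a single CSV file with 2 columns: column 1 is the file name and column 2 is the file content as a string. So I would end up with a CSV file of 501 rows.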

I have been searching SO for similar questions and came up with the following code:

import pandas as pd
from pandas.io.common import EmptyDataError
import os


def Aggregate_txt_csv(path):
    for files in os.listdir(path):
            with open(files, 'r') as file:
                try: 
                    df = pd.read_csv(file, header=None, delim_whitespace=True)
                except EmptyDataError:
                    df = pd.DataFrame()
                
            return df.to_csv('file.csv', index=False)

However, it returns an empty .csv file. Am I doing something wrong?

Answer 1

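Your code has several problems. One of them is that pd.read_csv cannot open the file, because you are not passing it the path to that file: os.listdir returns bare file names, so each name has to be joined with the folder path. I think you should try this code: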

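import os
import pandas as pd
from pandas.errors import EmptyDataError  # public home of this exception in current pandas

def Aggregate_txt_csv(path):
    files = os.listdir(path)
    frames = []  # one DataFrame per file
    for file in files:
        try:
            # Join the folder and the file name so read_csv can find the file.
            # sep=r'\s+' splits on whitespace, like the deprecated delim_whitespace=True.
            d = pd.read_csv(os.path.join(path, file), header=None, sep=r'\s+')
            d["file"] = file
        except EmptyDataError:
            # Empty file: keep a row that records only the file name.
            d = pd.DataFrame({"file": [file]})
        frames.append(d)
    df = pd.concat(frames, ignore_index=True)
    df.to_csv('file.csv', index=False)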
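
The path problem described in this answer can be seen in isolation. The following is only a sketch, assuming a throwaway temporary folder and a made-up file name `a.txt`; `list_full_paths` is a hypothetical helper, not part of the answer. `os.listdir` returns bare names, which are resolved against the current working directory unless they are joined with the folder first.

```python
import os
import tempfile


def list_full_paths(folder):
    """Return full paths for the entries of `folder` (hypothetical helper)."""
    return [os.path.join(folder, name) for name in sorted(os.listdir(folder))]


# Assumed setup: a temporary folder holding one small text file.
with tempfile.TemporaryDirectory() as folder:
    with open(os.path.join(folder, "a.txt"), "w") as fh:
        fh.write("hello")

    bare_names = os.listdir(folder)       # names only, no directory part
    full_paths = list_full_paths(folder)  # names joined with the folder

    # Opening the joined path works from any working directory.
    with open(full_paths[0]) as fh:
        text = fh.read()
```

Passing a bare name to `open` or `pd.read_csv` only works when the script happens to run from inside that folder, which is probably why the original attempt failed.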

Answer 2

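Use pathlib
- Path.glob() finds all the files
- With Path objects, file.stem returns the file name from the path, without its extension.
Use pandas.concat to combine the dataframes in df_list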

from pathlib import Path
import pandas as pd

p = Path('e:/PythonProjects/stack_overflow')  # path to files
files = p.glob('*.txt')  # get all txt files

df_list = list()  # create an empty list for the dataframes
for file in files:  # iterate through each file
    with file.open('r') as f:
        text = '\n'.join([line.strip() for line in f.readlines()])  # join all rows in list as a single string separated with \n
        
    df_list.append(pd.DataFrame({'filename': [file.stem], 'contents': [text]}))  # create and append a dataframe


df_all = pd.concat(df_list)  # concat all the dataframes

df_all.to_csv('files.csv', index=False)  # save to csv
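
One way to check the result is to read the CSV back. This is a sketch under stated assumptions: a temporary folder, two made-up files, and `Path.read_text` in place of the line-by-line read above. Content that contains line breaks is quoted by `to_csv`, so the file can have more physical lines than records, yet `pd.read_csv` still returns one row per input file.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Assumed setup: a temporary folder with two small text files.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "a.txt").write_text("first line\nsecond line")
    (folder / "b.txt").write_text("single line")

    # One dict per file: name without extension, full content as a string.
    rows = [{"filename": f.stem, "contents": f.read_text()}
            for f in sorted(folder.glob("*.txt"))]

    out = folder / "files.csv"
    pd.DataFrame(rows).to_csv(out, index=False)

    # Physical lines in the CSV versus records recovered by pandas.
    physical_lines = len(out.read_text().splitlines())
    round_trip = pd.read_csv(out)
```

Comparing `physical_lines` with `len(round_trip)` shows why counting lines in a text editor is not a reliable check when the stored text is multi-line.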

Answer 3

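I see there is already an answer, but I got this working with relatively simple code. I only changed the file reading slightly, and the dataframe is output successfully.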


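import pandas as pd
from pandas.errors import EmptyDataError  # public home of this exception in current pandas
import os


def Aggregate_txt_csv(path):
    result = []  # one dict per file; converted to a DataFrame once at the end
    print(os.listdir(path))
    for files in os.listdir(path):
        fullpath = os.path.join(path, files)
        if not os.path.isfile(fullpath):
            continue  # skip sub-directories

        with open(fullpath, 'r', errors='replace') as file:
            try:
                # readlines() keeps each line's trailing '\n', so join on ''
                # (joining on '\n' would double every line break)
                content = ''.join(file.readlines())
                result.append({'title': files, 'body': content})
            except EmptyDataError:
                # not raised by plain file reads; an empty file just yields ''
                result.append({'title': files, 'body': None})

    df = pd.DataFrame(result)
    return df

df = Aggregate_txt_csv('files')
print(df)
df.to_csv('result.csv', index=False)  # index=False keeps the output at two columns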

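The most important point here is that I append to a list instead of calling pandas' concat function over and over, because repeated concatenation is very bad for performance. Also, reading the files does not need read_csv, since the files have no set format. Joining the lines from file.readlines() simply reads the file and gives you all of its lines as one string. Note that readlines() keeps each line's trailing newline, so joining on '\n' doubles the line breaks, whereas joining on '' (or calling file.read()) returns the text unchanged.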

Finally, I convert the list of dictionaries into the final dataframe and return the result.
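
The performance point can be sketched with made-up in-memory data standing in for the files (the `records` list and its values are assumptions for illustration). Both patterns produce the same rows, but concatenating inside the loop copies everything gathered so far on each iteration, while collecting plain dicts builds the DataFrame only once.

```python
import pandas as pd

# Made-up stand-in for (file name, file content) pairs.
records = [(f"file_{i}.txt", f"content {i}") for i in range(5)]

# Pattern to avoid: pd.concat inside the loop re-copies all earlier rows each time.
first_title, first_body = records[0]
slow = pd.DataFrame({"title": [first_title], "body": [first_body]})
for title, body in records[1:]:
    one_row = pd.DataFrame({"title": [title], "body": [body]})
    slow = pd.concat([slow, one_row], ignore_index=True)

# Pattern used in this answer: collect dicts, then build the DataFrame once.
rows = [{"title": title, "body": body} for title, body in records]
fast = pd.DataFrame(rows)
```

With a handful of files the difference is negligible; it starts to matter as the number of files grows, since the repeated copying scales roughly quadratically.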

Edit: for paths other than the current directory, I updated the code to prepend the path so that the files can be found. Sorry about that.

Pandas - trying to store multiple .txt files in a .csv
