How do I collect a set of files into a pandas DataFrame?

Asked: 2016-12-28 01:58:56

Tags: python python-3.x pandas dataframe

I have a directory of .txt files:

.
├── file.txt
├── file.txt
├── file.txt
...
├── file.txt
└── file.txt

How can I read all of these documents into a pandas DataFrame? In other words, my goal is to store the documents in a pandas DataFrame object like this (*):

    id  text_blob
0   file_name.txt   Lore lipsum dolor done
1   file_name.txt   Lore lipsum ...
2   file_name.txt   dolor ...
3   file_name.txt   lore lipsum lore ...
4   file_name.txt   dolor...

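So far I have tried the following code. However, it is not Pythonic, and the output has some formatting errors (for example, whitespace problems and stray ' and " characters):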

import glob, os, csv, codecs

def retrive(directory_path):
    # Yield one "name, text" string per .txt file, in sorted order.
    for filename in sorted(glob.glob(os.path.join(directory_path, '*.txt'))):
        with open(filename, 'r') as f:
            important_stuff = f.read().splitlines()
            oneline = [' '.join(important_stuff)]
            yield filename.split('/')[-1] + ', ' +str(oneline).strip('[]"')

def trans(directory,directory2):
    # Write the collected rows to a pipe-delimited CSV file.
    test = tuple(retrive(directory))
    with codecs.open(directory2,'w', encoding='utf8') as out:
        csv_out=csv.writer(out, delimiter='|')
        csv_out.writerow(['name','text_blob'])
        for row in test:
            csv_out.writerow(row.split(', ', 1))


input_d = '../in'
out_d = '../out'



trans(input_d,out_d)

1 answer:

Answer 0 (score: 1)

import glob, os
import pandas as pd

input_d = '../in'
filenames = []
blobs = []
for pathname in sorted(glob.glob(os.path.join(input_d, '*.txt'))):
    with open(pathname, 'r') as txtfile:
        filename = os.path.basename(pathname)
        filenames.append(filename)
        blob = ' '.join(txtfile.read().splitlines())
        blobs.append(blob)

df = pd.DataFrame({'id':filenames, 'text_blob':blobs})

A pandas DataFrame can be created in many ways. One of them is to pass a dict object, as in the code above.
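As an illustration of another of those ways, here is a minimal sketch that builds the same two-column frame from a list of per-file records instead of a dict of lists. The helper name read_txt_dir, the use of pathlib, and the UTF-8 encoding are assumptions made for this example and are not part of the original answer; the demo writes two throwaway files into a temporary directory so it can run anywhere.

```python
import tempfile
from pathlib import Path

import pandas as pd


def read_txt_dir(directory):
    """Return a DataFrame with one row per .txt file in `directory`.

    Hypothetical helper: the name and the UTF-8 encoding are assumptions.
    """
    records = [
        {
            "id": path.name,  # file name without the directory part
            # Join the file's lines into a single space-separated string.
            "text_blob": " ".join(path.read_text(encoding="utf-8").splitlines()),
        }
        for path in sorted(Path(directory).glob("*.txt"))
    ]
    # Passing `columns` keeps the column order fixed even if no files match.
    return pd.DataFrame(records, columns=["id", "text_blob"])


if __name__ == "__main__":
    # Demo on a temporary directory with made-up file contents.
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "a.txt").write_text("Lorem ipsum\ndolor", encoding="utf-8")
        Path(tmp, "b.txt").write_text("sit amet", encoding="utf-8")
        print(read_txt_dir(tmp))
```

Building one record per file keeps each file name next to its text, so the two columns cannot drift out of step the way two separately appended lists could. Because path.name is used instead of splitting on '/', the sketch should also behave the same on Windows paths.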