Merging data from different files

Time: 2017-05-29 13:49:30

Tags: python

I have multiple files; for the purposes of this question I show only two:

TXT1
id   value
1    4
2    4
4    5
TXT2
id   value
2    6
3    5
5    3

I have about 40 files in a directory and need to build a final table from them, so my final table will have these columns:

id   value_TXT1 value_TXT2........valueTXT40

Desired output: first collect all elements of the id column from all 40 files, and name each value column after its file (value_TXT1 for the file TXT1). If a file has no value for an id, enter 0.

    id   value_TXT1 value_TXT2
    1    4          0
    2    4          6
    3    0          5
    4    5          0
    5    0          3

Any pseudocode or tutorial would help. Apologies that I had not tried anything at first; I was confused about how to approach this.

Edit: here is what I have tried so far, pieced together from different sources and run on two files:

import glob  
import os
data_dict = {}
path = '/Users/a/Desktop/combine/*.txt' 
paths = '/Users/a/Desktop/combine/' 
files=glob.glob(path)  
filelist = os.listdir(paths) #Make a file list
file_names=[os.path.splitext(x)[0] for x in filelist] #header
print file_names
for file in files:     
    f=open(file, 'r') 
    f.readline()
    for i in f:
        (key, value) = i.split()
        data_dict[key]=value

print data_dict

output: 
['combine', 'combine2']
{'1': '4', '3': '5', '2': '4', '5': '3', '4': '5'}

3 Answers:

Answer 0: (score: 1)

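First parse all 40 files to build the dictionary data_dict.

(A sketch; i is the 0-based position of the file among the 40.)

N_FILES = 40
data_dict = {}

def parse_file(path, i):
    with open(path) as f:
        next(f)  # skip the "id value" header line
        for line in f:
            id_, value = line.split()
            if id_ not in data_dict:
                data_dict[id_] = [0] * N_FILES  # 40 zeros: the default value for each TXT file
            data_dict[id_][i] = value  # set the value for the ith TXT file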

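Then print the contents of data_dict in the format you want.

for id_ in data_dict:
    # one row per id: the id followed by its value from each file
    print(id_, *data_dict[id_], sep='\t')

Remember to handle the header as well (id value_TXT1 value_TXT2 ........ valueTXT40).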
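
The dictionary-of-lists idea above, including the header row, could be put together roughly as follows. This is a minimal sketch, not the answerer's code: the function name merge_files is hypothetical, and it assumes every file starts with an "id value" header line, is whitespace-separated, and holds integer ids and values.

```python
import os


def merge_files(paths):
    """Merge 'id value' files into (header, rows); 0 marks an id missing from a file."""
    data = {}  # id -> list with one slot per file, pre-filled with 0
    for i, path in enumerate(paths):
        with open(path) as f:
            next(f, None)  # skip the "id value" header line (assumed present)
            for line in f:
                parts = line.split()
                if len(parts) != 2:
                    continue  # ignore blank or malformed lines
                id_, value = int(parts[0]), int(parts[1])  # assumes integer data
                data.setdefault(id_, [0] * len(paths))[i] = value
    # Column names follow the question's value_<file name> convention.
    names = [os.path.splitext(os.path.basename(p))[0] for p in paths]
    header = ["id"] + ["value_" + n for n in names]
    rows = [[id_] + data[id_] for id_ in sorted(data)]
    return header, rows
```

Passing a sorted list of paths (for example from glob.glob) keeps the column order stable between runs; each returned row can then be printed or written out joined by tabs.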

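Answer 1: (score: 1)

Here is a solution based on the following assumptions:

1) The files are all either tab-separated or comma-separated

2) Commas appear only as separators

3) All the files you want to process are in the same folder

Here it is: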

# 1. Make a list of the files to process
import glob
folder = 'path_to_my_folder'
extension = '*.txt'  # it can be *.*
files = sorted(glob.glob(folder + '/' + extension))  # sorted for a stable column order

# 2. Initialize a dict mapping each id to its list of values (one per file)
data = {}

# 3. Read all the files and update the dict
for n, file in enumerate(files):
    with open(file, 'r') as f:
        for line in f:
            if line[:1].isdigit():  # only lines starting with a numeric id; skips the header
                separator = ',' if ',' in line else '\t'
                id_, value = line.strip().split(separator)
                if id_ not in data:
                    # fill with 0 for the previous files, where this id was not found
                    data[id_] = [0] * n
                data[id_].append(value)

    # fill with 0 the ids not found in this file
    for v in data.values():
        while len(v) < n + 1:  # after file n, every list must hold n + 1 items
            v.append(0)

# Print the result
# first line: the header
print('id', end='')
for file in files:
    print('\t{}'.format(file), end='')
# the rest: one row per id
for k, v in data.items():
    print('\n{}'.format(k), end='')
    for item in v:
        print('\t{}'.format(item), end='')
print()

# To write it to a file
with open('myfile.txt', 'w') as f:
    # write the header
    f.write('id')
    for file in files:
        f.write('\t{}'.format(file))

    for k, v in data.items():
        f.write('\n{}'.format(k))
        for item in v:
            f.write('\t{}'.format(item))
    f.write('\n')
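
As an alternative to padding the lists with zeros while reading, each file can be read into its own {id: value} dict and the zeros filled in only when the rows are written. The following is a minimal sketch of that variant under the same three assumptions; the names read_table and write_merged are hypothetical, and it additionally assumes the ids are integers so they can be sorted numerically.

```python
import csv
import os


def read_table(path):
    """Return {id: value} for one tab- or comma-separated file."""
    values = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line[:1].isdigit():
                continue  # skip the header and blank lines
            sep = "," if "," in line else "\t"
            id_, value = line.split(sep)
            values[id_] = value
    return values


def write_merged(paths, out_path):
    """Write one row per id and one column per input file; 0 marks a missing id."""
    tables = {os.path.basename(p): read_table(p) for p in paths}
    # Union of all ids, sorted numerically (assumes integer ids).
    all_ids = sorted({i for t in tables.values() for i in t}, key=int)
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out, delimiter="\t", lineterminator="\n")
        writer.writerow(["id"] + list(tables))
        for id_ in all_ids:
            # dict.get supplies the 0 for files that lack this id
            writer.writerow([id_] + [t.get(id_, 0) for t in tables.values()])
```

Because the default is applied at write time, no bookkeeping of list lengths is needed, at the cost of holding one dict per file in memory.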

Answer 2: (score: 1)

I assume that:

  • the files are in the same folder
  • they all start with "TXT"
  • the text is tab-separated

Requirement: pandas

Input:

TXT1
1    4
2    3
3    5
4    3
7    5

TXT2
1    4
2    4
4    5
6    3

Here is the code:

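    import glob
    import os

    import pandas as pd

    path = "/my/full/path/"
    file_list = sorted(glob.glob(os.path.join(path, "TXT*")))
    columns = []
    for filename in file_list:
        name = "values_" + os.path.basename(filename)
        # sep=r"\s+" splits on tabs or runs of spaces; the id column becomes the index
        df = pd.read_csv(filename, header=None, sep=r"\s+", index_col=0, names=[name])
        columns.append(df)
    # align on id; ids missing from a file become NaN, then 0
    res = pd.concat(columns, axis=1).fillna(0).sort_index()
    print(res.astype(int))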

Output:

       values_TXT1  values_TXT2
    1            4            4
    2            3            4
    3            5            0
    4            3            5
    6            0            3
    7            5            0

You can also export the result to CSV with: res.to_csv("export.csv", sep=",")
You can find more parameters in the documentation.
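
The files in the question start with an "id value" header line, unlike the headerless input assumed in this answer. A minimal sketch of the same pandas approach adapted to that layout might look like this; the function name merge_with_header is hypothetical, and it assumes whitespace-separated files with integer values.

```python
import os

import pandas as pd


def merge_with_header(paths):
    """Merge files that start with an 'id value' header line into one table."""
    columns = []
    for path in paths:
        name = "value_" + os.path.splitext(os.path.basename(path))[0]
        # sep=r"\s+" splits on any run of whitespace; the header row names the columns
        df = pd.read_csv(path, sep=r"\s+", index_col="id")
        columns.append(df["value"].rename(name))
    # Outer-align on id; ids missing from a file become NaN and are replaced by 0.
    return pd.concat(columns, axis=1).fillna(0).astype(int).sort_index()
```

Calling it with a sorted list of paths returns a DataFrame indexed by id with one value_<file name> column per file, which can be exported with to_csv as shown above.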