Can I group files by date and ID and diff them?

Time: 2017-01-17 18:11:35

Tags: python python-2.7 pandas set defaultdict

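I have a bunch of files in a directory, 698 to be exact. Each file name contains a date, a unique ID, and a name. Like this:

20151231_7801_Test_Maps.txt
20151231_7801_Test_Items.txt
20151231_7802_Test_Maps.txt
20151231_7802_Test_Items.txt

These are my imports:

import pandas as pd
from pandas import Series, DataFrame
from collections import defaultdict
import numpy as np
import csv
import os
import re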

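I want to group them by date and identifier, open each file (Maps and Items), and diff certain IDs between the two files. How can I do this?

This is the code I have so far, but I don't know how to iterate over the groups and open each file in each group: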

groups = defaultdict(list)
for filename in os.listdir(r'F:\Desktop'):
    date = filename[:8]  # leading YYYYMMDD
    # For the sample names above the ID sits at [9:13]; [10:14] would give e.g. '801_'
    identifier = filename[10:14]
    basename, extension = os.path.splitext(filename)
    groups[date, identifier].append(filename)

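My output prints some of the groups correctly, but not all of them. For example:

('20151231', '7801') ['20151231_7801_Test_Maps.txt', '20151231_7801_Test_Items.txt']

Some groups print only one file, even though there are two files for that date and identifier.

That isn't my main concern, though. Once the files are grouped, I want to load each file in a group into a DataFrame, like this: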

for key in groups:
    # file1 and file2 stand for the Maps and Items files of the current group
    maps = pd.read_csv(file1, sep='\t', usecols=['ID'], skipfooter=0, engine='python')
    items = pd.read_csv(file2, sep='\t', usecols=['ID'], skipfooter=0, engine='python')

    # IDs that appear in the Maps file but not in the Items file
    set(maps.ID).difference(items.ID)

Can someone help me group the files by date and ID, and then iterate over the groups to open the files? Thanks!
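On the groups that come out with only one file: one possible cause (the post does not confirm it) is that the fixed character offsets do not land on the ID for every file name. Here is a minimal sketch that splits on underscores instead, assuming every name follows the `date_id_...` layout of the samples. `group_key` and `group_files` are hypothetical helper names, and the sketch uses print() so that it also runs on Python 3:

```python
from collections import defaultdict


def group_key(filename):
    """Return (date, identifier) from a name like 20151231_7801_Test_Maps.txt."""
    # Assumption: the first two underscore-separated fields are date and ID.
    parts = filename.split('_')
    return parts[0], parts[1]


def group_files(filenames):
    """Map each (date, identifier) key to the list of file names that share it."""
    groups = defaultdict(list)
    for name in filenames:
        groups[group_key(name)].append(name)
    return dict(groups)


# In the real script this list would come from os.listdir(...)
names = [
    '20151231_7801_Test_Maps.txt',
    '20151231_7801_Test_Items.txt',
    '20151231_7802_Test_Maps.txt',
    '20151231_7802_Test_Items.txt',
]
groups = group_files(names)
for key in sorted(groups):
    print(key, groups[key])
```

Because the key comes from the fields themselves, an ID that is shorter or longer than four characters still ends up in the right group.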

2 Answers:

Answer 0: (score: 1)

I got some help from https://stackoverflow.com/a/20228113/6626530 and got it working:

import os
from collections import defaultdict

import pandas as pd

difference = pd.DataFrame(columns=('Filename1', 'Filename2', 'DiffID1', 'DiffID2'))

pathloc = 'C:\\Users\\shmathew\\Desktop\\Sample\\abc\\'
groups = defaultdict(list)
for filename in os.listdir(pathloc):
    date = filename[:8]
    identifier = filename[10:14]
    basename, extension = os.path.splitext(filename)
    groups[date, identifier].append(filename)

for key, filenames in groups.iteritems():
    # Assumes each group holds exactly two files, in the order [Items, Maps]
    maps = pd.read_csv(pathloc + filenames[1], sep='\t', usecols=['ID'], skipfooter=0, engine='python')
    Items = pd.read_csv(pathloc + filenames[0], sep='\t', usecols=['ID'], skipfooter=0, engine='python')
    # Stack both ID columns; an ID that occurs only once is in just one file
    # (assuming IDs are unique within each file)
    df = pd.concat([maps, Items])
    df = df.reset_index(drop=True)
    df_gpby = df.groupby(list(df.columns))
    idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1]

    ids = df.reindex(idx)
    row = list(filenames)
    row.extend(list(ids['ID']))

    print row
    # DataFrame.append returns a new object; the result is not assigned here,
    # so `difference` stays empty (see the output below)
    difference.append(row)
print difference

Output

['20151231_7802_Test_Items.txt', '20151231_7802_Test_Maps.txt', '00432931830TRNY1    ', '00432xx0TRNY1    ']
['20151231_7801_Test_Items.txt', '20151231_7801_Test_Maps.txt']
Empty DataFrame
Columns: [Filename1, Filename2, DiffID1, DiffID2]
Index: []
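The code above picks the Maps and Items files by position (`filenames[1]` and `filenames[0]`). `os.listdir` returns entries in arbitrary order, and a group with only one file would raise an IndexError, so selecting by name may be safer. A minimal sketch, assuming the `_Maps.txt` and `_Items.txt` suffixes from the question; `pick_pair` is a hypothetical helper, not part of the answer:

```python
def pick_pair(filenames):
    """Return (maps_file, items_file) for one group, or None if either is missing."""
    maps_file = None
    items_file = None
    for name in filenames:
        # Assumption: the file role is encoded in the name suffix.
        if name.endswith('_Maps.txt'):
            maps_file = name
        elif name.endswith('_Items.txt'):
            items_file = name
    if maps_file is None or items_file is None:
        return None
    return maps_file, items_file


pair = pick_pair(['20151231_7801_Test_Items.txt', '20151231_7801_Test_Maps.txt'])
print(pair)
```

Inside the loop over the groups, a `None` result can be used to skip or log incomplete groups before calling `pd.read_csv`.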

Answer 1: (score: 0)

Building on Shijo's answer, I found an approach that works well.

# os, pd, defaultdict and pathloc are set up as in Shijo's answer
groups = defaultdict(list)
output = []

for filename in os.listdir(pathloc):
    date = filename[:8]
    identifier = filename[14:18]  # offsets depend on the actual file-name layout
    basename, extension = os.path.splitext(filename)
    groups[date, identifier].append(filename)

for key, fnames in groups.iteritems():
    filedicts = {}
    print list(fnames)
    maps = pd.read_csv(pathloc + fnames[1], sep='\t', usecols=['ID'], skipfooter=0, engine='python')
    items = pd.read_csv(pathloc + fnames[0], sep='\t', usecols=['ID'], skipfooter=0, engine='python')

    # IDs present in exactly one of the two files
    diffs = set(maps.ID).symmetric_difference(items.ID)

    filedicts['FileIDKey'] = list(key)
    filedicts['Missing_IDs'] = list(diffs)
    filedicts['FileNames'] = fnames

    output.append(filedicts)

This lets me load the resulting list of dictionaries into a DataFrame:

new = pd.DataFrame(output)
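The question used `difference`, while this answer uses `symmetric_difference`. The two give different results, which the following sketch illustrates with made-up IDs (the values are placeholders, not data from the post):

```python
# Placeholder IDs standing in for maps.ID and items.ID
maps_ids = {'A1', 'B2', 'C3'}
items_ids = {'B2', 'C3', 'D4'}

# difference is one-directional: what the left set has that the right lacks
only_in_maps = maps_ids.difference(items_ids)
only_in_items = items_ids.difference(maps_ids)

# symmetric_difference covers both directions at once
in_exactly_one = maps_ids.symmetric_difference(items_ids)

print(sorted(only_in_maps))    # ['A1']
print(sorted(only_in_items))   # ['D4']
print(sorted(in_exactly_one))  # ['A1', 'D4']
```

If it matters which file an ID is missing from, keeping the two one-directional differences separate preserves that information; `symmetric_difference` merges them into one set.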