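I have a bunch of files in a directory, 698 to be exact. Each file name contains a date, a unique ID, and a name, like this:
    20151231_7801_Test_Maps.txt
    20151231_7801_Test_Items.txt
    20151231_7802_Test_Maps.txt
    20151231_7802_Test_Items.txt
I want to group them by date and identifier, open each file (Maps and Items), and run a difference check on certain IDs in the files. How can I do this?
So far I have the following code, but I don't know how to iterate over the groups and open each file in each group:
    import pandas as pd
    from pandas import Series, DataFrame
    import numpy as np
    import csv
    import os
    import re
    from collections import defaultdict
    groups = defaultdict(list)
    for filename in os.listdir(r'F:\Desktop'):
        date = filename[:8]
        identifier = filename[10:14]
        basename, extension = os.path.splitext(filename)
        groups[date, identifier].append(filename)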
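My output prints some of the groups correctly, but not all of them. For example:
    ('20151231', '7801') ['20151231_7801_Test_Maps.txt', '20151231_7801_Test_Items.txt']
Some groups print only one file, even though there are two files for that date and identifier.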
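That isn't my main concern, though. Once the files are grouped, I want to load each file in a group into a dataframe, like this:
    for key in groups:
        # file1 and file2 stand for the two files in the current group
        maps = pd.read_csv(file1, sep='\t', usecols=['ID'], skipfooter=0, engine='python')
        items = pd.read_csv(file2, sep='\t', usecols=['ID'], skipfooter=0, engine='python')
        # check the IDs in the two files and look for differences
        set(maps.ID).difference(items.ID)
Can someone help me group the files by date and ID, and then iterate over the groups to open the files? Thanks!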
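A note on the grouping key: with the sample names above, `filename[10:14]` evaluates to `'801_'` rather than `'7801'`, so fixed character offsets are fragile if the real names are laid out differently from the samples. A minimal sketch, assuming every name starts with `date_ID_`, that builds the key by splitting on underscores instead (`group_key` and `group_files` are hypothetical helpers, not part of the code above):

```python
from collections import defaultdict


def group_key(filename):
    """Return (date, identifier) from a name like 20151231_7801_Test_Maps.txt.

    Assumes the first two underscore-separated fields are the date and the ID.
    """
    date, identifier = filename.split('_')[:2]
    return date, identifier


def group_files(filenames):
    """Group file names by (date, identifier); each group is sorted by name."""
    groups = defaultdict(list)
    for name in sorted(filenames):
        groups[group_key(name)].append(name)
    return dict(groups)


# The sample names from the question stand in for os.listdir(...) here.
sample = [
    '20151231_7801_Test_Maps.txt',
    '20151231_7801_Test_Items.txt',
    '20151231_7802_Test_Maps.txt',
    '20151231_7802_Test_Items.txt',
]
groups = group_files(sample)
for key, names in groups.items():
    print(key, names)
```

Because each group is sorted, the Items file comes before the Maps file, so the position of each file within a group no longer depends on the order in which the directory listing returns them.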
Answer 0 (score: 1)
I got some help from https://stackoverflow.com/a/20228113/6626530 and got it working:
    import os
    from collections import defaultdict
    import pandas as pd
    difference = pd.DataFrame(columns=('Filename1', 'Filename2', 'DiffID1', 'DiffID2'))
    # Raw string so the backslashes (e.g. \U) are not read as escape sequences
    pathloc = r'C:\Users\shmathew\Desktop\Sample\abc'
    # Group the file names by (date, identifier)
    groups = defaultdict(list)
    for filename in os.listdir(pathloc):
        date = filename[:8]
        identifier = filename[10:14]
        basename, extension = os.path.splitext(filename)
        groups[date, identifier].append(filename)
    for key, filenames in groups.items():
        # Sort so the Items file comes first and the Maps file second
        filenames.sort()
        maps = pd.read_csv(os.path.join(pathloc, filenames[1]), sep='\t', usecols=['ID'], skipfooter=0, engine='python')
        items = pd.read_csv(os.path.join(pathloc, filenames[0]), sep='\t', usecols=['ID'], skipfooter=0, engine='python')
        # Stack both ID columns; an ID that occurs only once is missing from one file
        df = pd.concat([maps, items])
        df = df.reset_index(drop=True)
        df_gpby = df.groupby(list(df.columns))
        idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1]
        ids = df.reindex(idx)
        row = list(filenames)
        row.extend(list(ids['ID']))
        print(row)
        # DataFrame.append returned a new frame rather than changing `difference`
        # in place (and was removed in pandas 2.0), so the original
        # `difference.append(row)` call here left `difference` empty.
    print(difference)
Output:
    ['20151231_7802_Test_Items.txt', '20151231_7802_Test_Maps.txt', '00432931830TRNY1 ', '00432xx0TRNY1 ']
    ['20151231_7801_Test_Items.txt', '20151231_7801_Test_Maps.txt']
    Empty DataFrame
    Columns: [Filename1, Filename2, DiffID1, DiffID2]
    Index: []
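The `difference` frame prints as empty because `DataFrame.append` returned a new frame instead of changing `difference` in place (and the method was removed in pandas 2.0). A minimal sketch of one way around that, assuming it is acceptable to keep the differing IDs in a single list-valued column rather than separate `DiffID1`/`DiffID2` columns: collect one dict per group and build the frame once at the end. The rows below are copied from the printed output above; `DiffIDs`, `printed_rows` and `records` are hypothetical names used only for this illustration.

```python
import pandas as pd

# Rows as printed above: two file names followed by the IDs that differ.
printed_rows = [
    ['20151231_7802_Test_Items.txt', '20151231_7802_Test_Maps.txt',
     '00432931830TRNY1 ', '00432xx0TRNY1 '],
    ['20151231_7801_Test_Items.txt', '20151231_7801_Test_Maps.txt'],
]

records = []
for row in printed_rows:
    records.append({
        'Filename1': row[0],
        'Filename2': row[1],
        # Any number of differing IDs fits in one list-valued cell.
        'DiffIDs': row[2:],
    })

# Build the frame once, after the loop, instead of appending inside it.
difference = pd.DataFrame(records, columns=['Filename1', 'Filename2', 'DiffIDs'])
print(difference)
```

Keeping the IDs in one list per row avoids having to know in advance how many IDs can differ within a group.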
Answer 1 (score: 0)
Building on Shijo's answer, I found an approach that works very well:
    # Assumes os, pd, defaultdict and pathloc are set up as in Shijo's answer
    groups = defaultdict(list)
    output = []
    for filename in os.listdir(pathloc):
        date = filename[:8]
        identifier = filename[14:18]
        basename, extension = os.path.splitext(filename)
        groups[date, identifier].append(filename)
    for key, fnames in groups.items():
        filedicts = {}
        print(list(fnames))
        maps = pd.read_csv(os.path.join(pathloc, fnames[1]), sep='\t', usecols=['ID'], skipfooter=0, engine='python')
        items = pd.read_csv(os.path.join(pathloc, fnames[0]), sep='\t', usecols=['ID'], skipfooter=0, engine='python')
        # IDs that appear in one file but not in the other
        diffs = set(maps.ID).symmetric_difference(items.ID)
        filedicts['FileIDKey'] = list(key)
        filedicts['Missing_IDs'] = list(diffs)
        filedicts['FileNames'] = fnames
        output.append(filedicts)
This lets me load the resulting list of dictionaries into a dataframe:
    new = pd.DataFrame(output)
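Using `symmetric_difference` rather than `difference` in the loop above is what makes the check two-sided: `a.difference(b)` only reports IDs in `a` that are missing from `b`, while `a.symmetric_difference(b)` reports IDs missing from either side. A small sketch with made-up IDs (the values and the `maps_ids`/`items_ids` names are illustrative only, not taken from the real files):

```python
# Made-up IDs standing in for the ID columns of a Maps file and an Items file.
maps_ids = {'A1', 'B2', 'C3'}
items_ids = {'B2', 'C3', 'D4'}

only_in_maps = maps_ids.difference(items_ids)
only_in_items = items_ids.difference(maps_ids)
either_side = maps_ids.symmetric_difference(items_ids)

print(sorted(only_in_maps))   # IDs present in Maps but not in Items
print(sorted(only_in_items))  # IDs present in Items but not in Maps
print(sorted(either_side))    # IDs missing from one side or the other
```

If it matters which file an ID is missing from, the two one-sided differences could be stored separately instead of the combined set.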