I have multiple files that follow the naming convention below.
ENCSR000EQO_0_0.txt
ENCSR000DIA_0_0.txt
ENCSR000DIA_1_1.txt
ENCSR000DIA_2_1.txt
ENCSR000DIM_0_0.txt
ENCSR000DIM_1_1.txt
ENCSR000AIB_0_0.txt
ENCSR000AIB_1_1.txt
ENCSR000AIB_2_1.txt
ENCSR000AIB_3_1.txt
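I want to use pandas to merge them into DataFrames by file name, so I would end up with 4 DataFrames. For each of those 4, I then want to group by the gene (GeneName) column, since the same gene appears multiple times.
They all have the same columns in the same order. I can merge all 10 at once, but I can't figure out how to merge them by name.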
import os

import numpy as np
import pandas as pd

path = '/renamed/'
print(os.listdir(path))
df_merge = None
for fname in os.listdir(path):
    if fname.endswith('.txt'):
        df = pd.read_csv(path + fname, sep='\t', header=0)
        df.columns = ['ID ', 'Chr', 'Start', 'End', 'Strand', 'Peak Score', 'Focus Ratio/Region Size',
                      'Ann', 'DetAnn', 'Distance', 'PromoterID', 'EID',
                      'Unigene', 'Refseq', 'Ensembl', 'GeneName', 'GeneAlias',
                      'GeneDescription', 'GeneType']
        df = df.groupby('GeneName').agg(np.mean)
        print(df)
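Thanks for any input.
Answer 0 (score: 2)
I would do something more like this: use glob to get the file names, check each one against the ID, and then group the concatenated result.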
import glob
import os

import pandas as pd

path = 'renamed'
for fid in ('EQO', 'DIA', 'DIM', 'AIB'):
    df_ = pd.DataFrame()
    for fname in glob.glob(os.path.join(path, '*.txt')):
        if fid in fname:
            df = pd.read_csv(fname, sep='\t', header=0)
            df.columns = ['ID ', 'Chr', 'Start', 'End', 'Strand', 'Peak Score', 'Focus Ratio/Region Size',
                          'Ann', 'DetAnn', 'Distance', 'PromoterID', 'EID',
                          'Unigene', 'Refseq', 'Ensembl', 'GeneName', 'GeneAlias',
                          'GeneDescription', 'GeneType']
            # Stack this file's rows onto the rows already collected for this ID
            df_ = pd.concat((df_, df))
    # Average the numeric columns per gene; text columns are left out
    df_ = df_.groupby('GeneName').mean(numeric_only=True)
    print(df_)
Edit: extending the answer to be more automated.
Based on your file names, you can identify the IDs as follows:
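import os
import numpy as np
files = glob.glob(os.path.join(path, '*.txt'))
# Take the base name first so underscores in the directory cannot affect the split
fids = np.unique([os.path.basename(f).split('_')[0] for f in files])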
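To see what that split yields, here is a small sketch that runs it on the ten file names from the question without touching the disk. The helper name `experiment_id` is hypothetical, and taking the base name before splitting is an assumption added here so that the directory part cannot interfere.

```python
import os

# File names copied from the question; "renamed" is the directory used in this answer.
names = [
    "ENCSR000EQO_0_0.txt", "ENCSR000DIA_0_0.txt", "ENCSR000DIA_1_1.txt",
    "ENCSR000DIA_2_1.txt", "ENCSR000DIM_0_0.txt", "ENCSR000DIM_1_1.txt",
    "ENCSR000AIB_0_0.txt", "ENCSR000AIB_1_1.txt", "ENCSR000AIB_2_1.txt",
    "ENCSR000AIB_3_1.txt",
]
files = [os.path.join("renamed", name) for name in names]


def experiment_id(file_path):
    """Return the part of the file name before the first underscore (hypothetical helper)."""
    return os.path.basename(file_path).split("_")[0]


# A set removes the duplicates; sorting makes the order predictable.
fids = sorted({experiment_id(f) for f in files})
print(fids)
# ['ENCSR000AIB', 'ENCSR000DIA', 'ENCSR000DIM', 'ENCSR000EQO']
```

Ten files collapse to four IDs, which matches the four DataFrames you are after.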
Putting all the updated code together:
import glob
import os

import numpy as np
import pandas as pd

path = 'renamed'
files = glob.glob(os.path.join(path, '*.txt'))
# Take the base name first so underscores in the directory cannot affect the split
fids = np.unique([os.path.basename(f).split('_')[0] for f in files])
for fid in fids:
    df_ = pd.DataFrame()
    for fname in files:
        if fid in fname:
            df = pd.read_csv(fname, sep='\t', header=0)
            df.columns = ['ID ', 'Chr', 'Start', 'End', 'Strand', 'Peak Score', 'Focus Ratio/Region Size',
                          'Ann', 'DetAnn', 'Distance', 'PromoterID', 'EID',
                          'Unigene', 'Refseq', 'Ensembl', 'GeneName', 'GeneAlias',
                          'GeneDescription', 'GeneType']
            # Stack this file's rows onto the rows already collected for this ID
            df_ = pd.concat((df_, df))
    # Average the numeric columns per gene; text columns are left out
    df_ = df_.groupby('GeneName').mean(numeric_only=True)
    print(df_)
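If you would rather keep the four results around instead of printing them, one possible variation is to bucket the files by ID in a dictionary and read each file only once. This is only a sketch: the function name `merge_by_prefix` is hypothetical, and it assumes each file's header row already contains a `GeneName` column (otherwise assign `df.columns` as above before concatenating).

```python
import glob
import os
from collections import defaultdict

import pandas as pd


def merge_by_prefix(directory):
    """Return {ID: per-gene mean DataFrame} for the .txt files in directory.

    Hypothetical helper; assumes every file has a 'GeneName' header column.
    """
    groups = defaultdict(list)
    for fname in sorted(glob.glob(os.path.join(directory, '*.txt'))):
        # The ID is everything before the first underscore of the file name.
        prefix = os.path.basename(fname).split('_')[0]
        groups[prefix].append(pd.read_csv(fname, sep='\t', header=0))

    results = {}
    for prefix, frames in groups.items():
        merged = pd.concat(frames, ignore_index=True)
        # Average the numeric columns per gene; text columns are left out.
        results[prefix] = merged.groupby('GeneName').mean(numeric_only=True)
    return results
```

Calling `merge_by_prefix('renamed')` would then give a dictionary keyed by ID, so one experiment's table can be looked up by its ID.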
Answer 1 (score: 1)
Try adding the file name as a column, append all the DataFrames to a list and concatenate them, then group:
import os

import pandas as pd

# path is the directory from the question
df_merge = []
for fname in os.listdir(path):
    if fname.endswith('.txt'):
        df = pd.read_csv(os.path.join(path, fname), sep='\t', header=0)
        df.columns = ['ID ', 'Chr', 'Start', 'End', 'Strand', 'Peak Score', 'Focus Ratio/Region Size',
                      'Ann', 'DetAnn', 'Distance', 'PromoterID', 'EID',
                      'Unigene', 'Refseq', 'Ensembl', 'GeneName', 'GeneAlias',
                      'GeneDescription', 'GeneType']
        # A scalar is broadcast to every row, so no list comprehension is needed
        df['fname'] = fname.split('_')[0]
        df_merge.append(df)
df_all = pd.concat(df_merge)
for fn in df_all['fname'].unique():
    # Average the numeric columns per gene within this file group
    print(df_all[df_all['fname'] == fn].groupby('GeneName').mean(numeric_only=True))
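The final loop could probably be replaced by a single groupby on both keys. A minimal sketch, using a tiny made-up frame in place of `df_all` (the gene names and scores are illustrative only, not real data):

```python
import pandas as pd

# Made-up stand-in for df_all: two rows from one experiment, one from another.
df_all = pd.DataFrame({
    'fname': ['ENCSR000DIA', 'ENCSR000DIA', 'ENCSR000DIM'],
    'GeneName': ['geneA', 'geneA', 'geneA'],
    'Peak Score': [10.0, 20.0, 5.0],
})

# Group by experiment and gene at once; the result has a two-level index.
per_file_gene = df_all.groupby(['fname', 'GeneName']).mean(numeric_only=True)
print(per_file_gene)

# One experiment's per-gene table can be pulled out of the index with .loc
dia = per_file_gene.loc['ENCSR000DIA']
print(dia)
```

This keeps everything in one DataFrame, which may be handier than four separate ones if you later want to compare experiments gene by gene.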