I have 100 log files like the one below, and I'd like to load each dataset into two pandas DataFrames (or a DataFrame and a dict, or some other combination).
What is the most efficient way to parse such a file in Python?
Total Reads                   38948036
Total Tags                    49242267
Total Assigned Tags           44506208
=====================================================================
Group               Total_bases         Tag_count           Tags/Kb
CDS_Exons           34175771            24133928            706.17
5'UTR_Exons         6341914             1366084             215.41
3'UTR_Exons         24930397            8269466             331.70
Introns             929421174           8172570             8.79
TSS_up_1kb          19267668            1044739             54.22
TSS_up_5kb          87647060            1433110             16.35
TSS_up_10kb         159281339           1549571             9.73
TES_down_1kb        19416426            300476              15.48
TES_down_5kb        83322244            718139              8.62
TES_down_10kb       147880768           1014589             6.86
=====================================================================
Obviously, the first three lines hold parameter names/values, while the block below them holds Group / Total_bases / Tag_count / Tags per Kb for each group. In all of my datasets every field will always be present and numeric, so robust NA handling is not needed.
Currently I parse each file into a nested list (one per dataset, i.e. per file), strip the whitespace, and pull values out of the list by index. The challenge is that if the tool generating the files is upgraded and its output format changes slightly, for example by adding a new tag, I will have a very frustrating time debugging.
Answer 0 (score: 1)
import pandas as pd
import io
temp=u"""Total Reads                   38948036
Total Tags                    49242267
Total Assigned Tags           44506208
=====================================================================
Group               Total_bases         Tag_count           Tags/Kb
CDS_Exons           34175771            24133928            706.17
5'UTR_Exons         6341914             1366084             215.41
3'UTR_Exons         24930397            8269466             331.70
Introns             929421174           8172570             8.79
TSS_up_1kb          19267668            1044739             54.22
TSS_up_5kb          87647060            1433110             16.35
TSS_up_10kb         159281339           1549571             9.73
TES_down_1kb        19416426            300476              15.48
TES_down_5kb        83322244            718139              8.62
TES_down_10kb       147880768           1014589             6.86
====================================================================="""
#after testing replace io.StringIO(temp) with the filename
df1 = pd.read_fwf(io.StringIO(temp),
                  widths=[30, 8],    #widths of the two columns
                  nrows=3,           #read only the first 3 rows
                  index_col=[0],     #use the first column as the index
                  names=[None, 0])   #column names: None (no index name) and 0
print (df1)
                            0
Total Reads          38948036
Total Tags           49242267
Total Assigned Tags  44506208
print (df1.T)
   Total Reads  Total Tags  Total Assigned Tags
0     38948036    49242267             44506208
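Since the question also allowed a dict for the three summary values, the parsed one-column frame converts directly. A minimal sketch, reusing the answer's `read_csv` variant on just the summary lines:

```python
import io
import pandas as pd

temp = u"""Total Reads                   38948036
Total Tags                    49242267
Total Assigned Tags           44506208"""

# parse the three name/value lines, then turn the single column into a dict
df1 = pd.read_csv(io.StringIO(temp), nrows=3, sep=r"\s\s+", engine="python",
                  index_col=0, header=None, names=[None, 0])
params = df1[0].to_dict()
# {'Total Reads': 38948036, 'Total Tags': 49242267, 'Total Assigned Tags': 44506208}
```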
#after testing replace io.StringIO(temp) with the filename
df2 = pd.read_csv(io.StringIO(temp),
                  sep=r"\s+",   #separator is arbitrary whitespace
                  skiprows=4,   #skip the first 4 rows
                  comment='=')  #skip rows whose first char is =
print (df2)
           Group  Total_bases  Tag_count  Tags/Kb
0      CDS_Exons     34175771   24133928   706.17
1    5'UTR_Exons      6341914    1366084   215.41
2    3'UTR_Exons     24930397    8269466   331.70
3        Introns    929421174    8172570     8.79
4     TSS_up_1kb     19267668    1044739    54.22
5     TSS_up_5kb     87647060    1433110    16.35
6    TSS_up_10kb    159281339    1549571     9.73
7   TES_down_1kb     19416426     300476    15.48
8   TES_down_5kb     83322244     718139     8.62
9  TES_down_10kb    147880768    1014589     6.86
If the column widths are not always [30, 8], use instead:
#after testing replace io.StringIO(temp) with the filename
df1 = pd.read_csv(io.StringIO(temp),
                  nrows=3,          #read only the first 3 rows
                  sep=r"\s\s+",     #separator is 2 or more whitespace characters
                  engine="python",  #avoids a ParserWarning with a regex separator
                  index_col=0,      #use the first column as the index
                  header=None,      #file has no header row
                  names=[None, 0])  #column names: None (no index name) and 0
print (df1)
                            0
Total Reads          38948036
Total Tags           49242267
Total Assigned Tags  44506208
print (df1.T)
   Total Reads  Total Tags  Total Assigned Tags
0     38948036    49242267             44506208
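To scale the two reads above to all 100 files, they can be wrapped in a helper and the results concatenated. A sketch under the assumption that each file matches the sample layout; the file list is hypothetical (in practice something like `glob.glob("*.log")`, reading each file's text):

```python
import io
import pandas as pd

# Stand-in for the text of one real report file.
SAMPLE = """Total Reads                   38948036
Total Tags                    49242267
Total Assigned Tags           44506208
=====================================================================
Group               Total_bases         Tag_count           Tags/Kb
CDS_Exons           34175771            24133928            706.17
5'UTR_Exons         6341914             1366084             215.41
3'UTR_Exons         24930397            8269466             331.70
Introns             929421174           8172570             8.79
TSS_up_1kb          19267668            1044739             54.22
TSS_up_5kb          87647060            1433110             16.35
TSS_up_10kb         159281339           1549571             9.73
TES_down_1kb        19416426            300476              15.48
TES_down_5kb        83322244            718139              8.62
TES_down_10kb       147880768           1014589             6.86
====================================================================="""

def parse_report(text):
    """Split one report's text into (summary, groups) DataFrames."""
    summary = pd.read_csv(io.StringIO(text), nrows=3, sep=r"\s\s+",
                          engine="python", index_col=0, header=None,
                          names=[None, 0])
    groups = pd.read_csv(io.StringIO(text), sep=r"\s+", skiprows=4,
                         comment="=")
    return summary, groups

# Hypothetical input list; replace with the texts of the 100 real files.
texts = [SAMPLE, SAMPLE]
pairs = [parse_report(t) for t in texts]
# one row per file for the summary values, a keyed frame for the group table
summaries = pd.concat([s.T for s, _ in pairs], ignore_index=True)
groups = pd.concat([g for _, g in pairs], keys=range(len(pairs)),
                   names=["file", "row"])
```

Keeping the per-file results keyed by position (or by filename) makes it easy to trace any parsing surprise back to the offending file.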