Compare tables to create a presence/absence matrix, filling empty values and without decimals

Time: 2018-03-01 13:27:06

Tags: python pandas pandas-groupby

The files can be found on GitHub.

File 1:

https://raw.githubusercontent.com/felipelira/files_to_test/master/file1.txt

File 2:

https://raw.githubusercontent.com/felipelira/files_to_test/master/file2.txt

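Command line:     python teste2.py file1.txt file2.txt test

When converting the table files into a presence/absence matrix, I end up missing some data. Genomes that have no matching accession are left out.

My previous result looked like this (based on the script and example in the post "Convert tables to presence/absence matrix python - Solved"):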

genome  accession1  accession2  accession3  accession4  accession5
genome1           1           1           1           0           0
genome2           1           0           0           1           1

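But I need the other genomes in my downstream analysis. I tried to handle this by moving the block that defines df2 before the one that defines df1:
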
import sys
import pandas as pd

asmbly_dict = sys.argv[1]
blast_result = sys.argv[2]
outName = sys.argv[3] + '.txt'

# file2: gene -> accession
with open(blast_result, 'r') as file2:
    col_genes = ['gene', 'accession']
    df2 = pd.read_csv(file2, sep='\t', header=None, names=col_genes)
    print(df2)

with open(asmbly_dict, 'r') as file1:
    col_asmbly = ['gene', 'genome']
    df1 = pd.read_csv(file1, sep='\t', header=None, names=col_asmbly)
    df1['accession'] = df1['gene'].map(df2.set_index('gene')['accession'])
    #print df1
    g = df1.groupby('genome')['accession'].apply(list).reset_index()
    testdf = g.join(pd.get_dummies(g['accession'].apply(pd.Series).stack()).sum(level=0)).drop('accession', 1)
    #print testdf.to_string(index=False)
    testdf.to_csv(outName, sep='\t', header=True, index=False)

Printed df2:

    gene   accession
0  gene1  accession1
1  gene2  accession2
2  gene3  accession3
3  gene4  accession1
4  gene5  accession4
5  gene6  accession5

Printed df1:

    gene   genome   accession
0  gene1  genome1  accession1
1  gene2  genome1  accession2
2  gene3  genome1  accession3
3  gene4  genome2  accession1
4  gene5  genome2  accession4
5  gene6  genome2  accession5
6  gene7  genome3         NaN
7  gene8  genome3         NaN
8  gene9  genome4         NaN

Printed testdf:

genome  accession1  accession2  accession3  accession4  accession5
genome1         1.0         1.0         1.0         0.0         0.0
genome2         1.0         0.0         0.0         1.0         1.0
genome3         NaN         NaN         NaN         NaN         NaN
genome4         NaN         NaN         NaN         NaN         NaN

The .csv file:

genome  accession1  accession2  accession3  accession4  accession5
genome1         1.0         1.0         1.0         0.0         0.0
genome2         1.0         0.0         0.0         1.0         1.0
genome3
genome4

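The questions are:

How can I drop the decimals from the numbers (1.0 -> 1), and how can I fill the empty values with zeros, both when printing and when writing to the file?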

1 Answer:

Answer 0 (score: 2)

If you want to keep your original solution, add fillna and a cast to int:

testdf = g.join(pd.get_dummies(g['accession'].apply(pd.Series).stack()).sum(level=0)).drop('accession', 1)

# 'genome' holds strings, so keep it out of the integer cast
testdf = testdf.set_index('genome').fillna(0).astype(int).reset_index()
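
A minimal sketch of this step on a hand-built frame, for illustration only. The small `testdf` here is a hypothetical stand-in shaped like the one printed in the question (a string `genome` column plus float indicator columns). Because a plain `astype(int)` over the whole frame would also try to convert the `genome` strings, the sketch moves that column into the index before casting.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the joined frame from the question:
# genomes without any accession end up as NaN, which forces float columns.
testdf = pd.DataFrame({
    'genome': ['genome1', 'genome2', 'genome3'],
    'accession1': [1.0, 1.0, np.nan],
    'accession4': [0.0, 1.0, np.nan],
})

# Fill the gaps with 0, then cast only the indicator columns to int.
fixed = testdf.set_index('genome').fillna(0).astype(int).reset_index()

print(fixed.to_string(index=False))
```

After this step `genome` is a regular column again, so the original `to_csv(..., index=False)` call can stay as it is.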

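But a better solution is to use get_dummies and then take the max per index value and per column (probably not needed for the sample, but possibly for real data):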

df1['accession'] = df1['gene'].map(df2.set_index('gene')['accession'])

df1 = pd.get_dummies(df1.set_index('genome')['accession']).max(level=0).max(level=0, axis=1)
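
The same idea as a self-contained sketch on the sample data from the question. Two things here are assumptions rather than part of the original answer: `groupby(level=0).max()` is used in place of `max(level=0)`, since the `level` argument of `max` is no longer available in pandas 2.0 and later, and `dtype=int` is passed so that get_dummies returns 0/1 instead of booleans. The column-wise max is left out because get_dummies on a single column already produces one column per distinct accession.

```python
import pandas as pd

# Sample data rebuilt from the tables printed in the question.
df1 = pd.DataFrame({
    'gene': ['gene%d' % i for i in range(1, 10)],
    'genome': ['genome1'] * 3 + ['genome2'] * 3 + ['genome3'] * 2 + ['genome4'],
})
df2 = pd.DataFrame({
    'gene': ['gene%d' % i for i in range(1, 7)],
    'accession': ['accession1', 'accession2', 'accession3',
                  'accession1', 'accession4', 'accession5'],
})

# Genes without a match in df2 get NaN as accession.
df1['accession'] = df1['gene'].map(df2.set_index('gene')['accession'])

# One indicator column per accession; NaN rows become all-zero rows.
dummies = pd.get_dummies(df1.set_index('genome')['accession'], dtype=int)

# Collapse the per-gene rows into one row per genome.
matrix = dummies.groupby(level=0).max()
print(matrix)
```

Genomes whose genes all lack an accession survive as all-zero rows, so no separate fillna step is needed.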

Or use crosstab with clip_upper, and add the missing categories with reindex:

df1 = (pd.crosstab(df1['genome'], df1['accession'])
        .clip_upper(1)
        .reindex(df1['genome'].unique(), fill_value=0))

Or:

df1 = (df1.groupby(['genome', 'accession'])
         .size()
         .clip_upper(1)
         .unstack(fill_value=0)
         .reindex(df1['genome'].unique(), fill_value=0))
print (df1)
         accession1  accession2  accession3  accession4  accession5
genome                                                             
genome1           1           1           1           0           0
genome2           1           0           0           1           1
genome3           0           0           0           0           0
genome4           0           0           0           0           0
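
A sketch of the crosstab variant on the sample data, assuming a pandas version where `clip_upper` is no longer available (it was removed in pandas 1.0); `clip(upper=1)` is used as the equivalent call. The frame is built by hand here, with `None` standing for the genes that have no accession.

```python
import pandas as pd

# Sample data rebuilt from the df1 table printed in the question.
df1 = pd.DataFrame({
    'genome': ['genome1'] * 3 + ['genome2'] * 3 + ['genome3'] * 2 + ['genome4'],
    'accession': ['accession1', 'accession2', 'accession3',
                  'accession1', 'accession4', 'accession5',
                  None, None, None],
})

# crosstab counts genome/accession pairs and skips rows with a missing accession,
# so genome3 and genome4 are absent from the counts at this point.
counts = pd.crosstab(df1['genome'], df1['accession'])

# Cap the counts at 1 (presence/absence), then add the missing genomes as zero rows.
matrix = counts.clip(upper=1).reindex(df1['genome'].unique(), fill_value=0)
print(matrix)
```

The reindex step is what brings back the genomes that have no accession at all; without it only genome1 and genome2 would appear.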

And finally write to the file:

df1.to_csv(outName, sep='\t')
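
To check what ends up in the file, the sketch below writes a small hand-built matrix to an in-memory buffer instead of `outName` (the buffer and the two-row frame are illustrative assumptions). Since `genome` is now the index rather than a column, `index=False` is not passed; otherwise the genome names would be dropped from the output.

```python
import io
import pandas as pd

# Hypothetical two-row presence/absence matrix with 'genome' as the index.
matrix = pd.DataFrame(
    {'accession1': [1, 0], 'accession2': [1, 0]},
    index=pd.Index(['genome1', 'genome3'], name='genome'),
)

# Write tab-separated text to a buffer so the result can be inspected directly.
buffer = io.StringIO()
matrix.to_csv(buffer, sep='\t')

lines = buffer.getvalue().splitlines()
print(lines)
```

Because the columns are integers, the values are written as 1 and 0 without a decimal part.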