The input files can be found on GitHub.
File1:
https://raw.githubusercontent.com/felipelira/files_to_test/master/file1.txt
File2:
https://raw.githubusercontent.com/felipelira/files_to_test/master/file2.txt
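Command line: python teste2.py file1.txt file2.txt test
When converting the table files into a presence/absence matrix, I end up losing some data: genomes that have no matching accession are left out of the result.
My previous result looked like this (based on the script and example from the post "Convert tables to presence/absence matrix python - Solved"):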
genome accession1 accession2 accession3 accession4 accession5
genome1 1 1 1 0 0
genome2 1 0 0 1 1
But I need the other genomes in my downstream analysis. I tried to handle this by moving the block that defines df2 before the one that defines df1:
import sys
import pandas as pd

asmbly_dict = sys.argv[1]
blast_result = sys.argv[2]
outName = sys.argv[3] + '.txt'

# gene -> accession table
with open(blast_result, 'r') as file2:
    col_genes = ['gene', 'accession']
    df2 = pd.read_csv(file2, sep='\t', header=None, names=col_genes)
print(df2)

# gene -> genome table; accession stays NaN for genes missing from df2
with open(asmbly_dict, 'r') as file1:
    col_asmbly = ['gene', 'genome']
    df1 = pd.read_csv(file1, sep='\t', header=None, names=col_asmbly)
df1['accession'] = df1['gene'].map(df2.set_index('gene')['accession'])
#print(df1)

g = df1.groupby('genome')['accession'].apply(list).reset_index()
testdf = g.join(pd.get_dummies(g['accession'].apply(pd.Series).stack()).sum(level=0)).drop('accession', 1)
#print(testdf.to_string(index=False))
testdf.to_csv(outName, sep='\t', header=True, index=False)
Printing df2:
gene accession
0 gene1 accession1
1 gene2 accession2
2 gene3 accession3
3 gene4 accession1
4 gene5 accession4
5 gene6 accession5
Printing df1:
gene genome accession
0 gene1 genome1 accession1
1 gene2 genome1 accession2
2 gene3 genome1 accession3
3 gene4 genome2 accession1
4 gene5 genome2 accession4
5 gene6 genome2 accession5
6 gene7 genome3 NaN
7 gene8 genome3 NaN
8 gene9 genome4 NaN
Printing testdf:
genome accession1 accession2 accession3 accession4 accession5
genome1 1.0 1.0 1.0 0.0 0.0
genome2 1.0 0.0 0.0 1.0 1.0
genome3 NaN NaN NaN NaN NaN
genome4 NaN NaN NaN NaN NaN
The .csv file:
genome accession1 accession2 accession3 accession4 accession5
genome1 1.0 1.0 1.0 0.0 0.0
genome2 1.0 0.0 0.0 1.0 1.0
genome3
genome4
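The question is:
How can I output the numbers without decimals (1.0 -> 1), and how can I fill the empty values with zeros, both when printing and when writing to the file?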
Answer 0 (score: 2)
If you want to keep your original solution, add fillna and a cast to int:
testdf = g.join(pd.get_dummies(g['accession'].apply(pd.Series).stack()).sum(level=0)).drop('accession', 1)
testdf = testdf.fillna(0).astype(int)
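As a minimal sketch of that step, the snippet below rebuilds the testdf printed in the question by hand and applies fillna plus the integer cast. One assumption worth flagging: testdf still carries the string genome column, and casting that column to int is likely to raise a ValueError, so the sketch converts only the accession columns. The names acc_cols, counts and result are illustrative, not from the original script.

```python
import numpy as np
import pandas as pd

# Hand-built copy of the testdf shown in the question.
testdf = pd.DataFrame({
    "genome": ["genome1", "genome2", "genome3", "genome4"],
    "accession1": [1.0, 1.0, np.nan, np.nan],
    "accession2": [1.0, 0.0, np.nan, np.nan],
    "accession3": [1.0, 0.0, np.nan, np.nan],
    "accession4": [0.0, 1.0, np.nan, np.nan],
    "accession5": [0.0, 1.0, np.nan, np.nan],
})

# Only the accession columns are numeric; leave "genome" untouched.
acc_cols = [c for c in testdf.columns if c != "genome"]

# Replace NaN with 0, then drop the decimals by casting to int.
counts = testdf[acc_cols].fillna(0).astype(int)

# Put the genome labels back in front of the integer columns.
result = pd.concat([testdf[["genome"]], counts], axis=1)
print(result.to_string(index=False))
```

Writing result with to_csv(outName, sep='\t', index=False) should then give integer cells for every genome.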
However, a better solution is to use get_dummies and then take the max per index label and per column label (which may not be necessary for the sample data):
df1['accession'] = df1['gene'].map(df2.set_index('gene')['accession'])
df1 = pd.get_dummies(df1.set_index('genome')['accession']).max(level=0).max(level=0, axis=1)
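The level= argument of max was removed in pandas 2.0, so on a current install the same idea is usually written with groupby(level=0). Here is a sketch under that assumption, using the sample data from the question typed in by hand; dtype=int is passed because newer pandas returns boolean dummies by default. The per-column max is left out, since a single Series produces no duplicate column names.

```python
import pandas as pd

# Sample data from the question, typed in by hand.
df1 = pd.DataFrame({
    "gene": ["gene%d" % i for i in range(1, 10)],
    "genome": ["genome1"] * 3 + ["genome2"] * 3 + ["genome3"] * 2 + ["genome4"],
})
df2 = pd.DataFrame({
    "gene": ["gene%d" % i for i in range(1, 7)],
    "accession": ["accession1", "accession2", "accession3",
                  "accession1", "accession4", "accession5"],
})

# Genes missing from df2 get NaN as their accession.
df1["accession"] = df1["gene"].map(df2.set_index("gene")["accession"])

# One indicator column per accession; NaN rows become all-zero rows.
dummies = pd.get_dummies(df1.set_index("genome")["accession"], dtype=int)

# Collapse the rows of each genome: 1 if any of its genes hit the accession.
matrix = dummies.groupby(level=0).max()
print(matrix)
```

Because the NaN rows survive as all-zero rows, genome3 and genome4 stay in the matrix without any reindex step.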
Or use crosstab with clip_upper, then add the missing categories with reindex:
df1 = (pd.crosstab(df1['genome'], df1['accession'])
.clip_upper(1)
.reindex(df1['genome'].unique(), fill_value=0))
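Note that clip_upper was removed in pandas 1.0; clip(upper=1) is the replacement. The following is a sketch of the crosstab approach on the question's sample data (typed in by hand), assuming a recent pandas. The clip only matters when a genome hits the same accession more than once, which does not happen in this sample.

```python
import pandas as pd

# Sample data from the question, typed in by hand.
df1 = pd.DataFrame({
    "gene": ["gene%d" % i for i in range(1, 10)],
    "genome": ["genome1"] * 3 + ["genome2"] * 3 + ["genome3"] * 2 + ["genome4"],
})
df2 = pd.DataFrame({
    "gene": ["gene%d" % i for i in range(1, 7)],
    "accession": ["accession1", "accession2", "accession3",
                  "accession1", "accession4", "accession5"],
})
df1["accession"] = df1["gene"].map(df2.set_index("gene")["accession"])

matrix = (
    pd.crosstab(df1["genome"], df1["accession"])   # counts; NaN rows are dropped
    .clip(upper=1)                                  # counts -> presence/absence
    .reindex(df1["genome"].unique(), fill_value=0)  # restore genomes with no hit
)
print(matrix)
```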
Or:
df1 = (df1.groupby(['genome', 'accession'])
.size()
.clip_upper(1)
.unstack(fill_value=0)
.reindex(df1['genome'].unique(), fill_value=0))
print (df1)
accession1 accession2 accession3 accession4 accession5
genome
genome1 1 1 1 0 0
genome2 1 0 0 1 1
genome3 0 0 0 0 0
genome4 0 0 0 0 0
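The groupby/size/unstack variant can be sketched in the same way for a recent pandas, again swapping clip_upper for clip(upper=1), which is an assumption about the installed version. The data is the question's sample typed in by hand.

```python
import pandas as pd

# Sample data from the question, typed in by hand.
df1 = pd.DataFrame({
    "gene": ["gene%d" % i for i in range(1, 10)],
    "genome": ["genome1"] * 3 + ["genome2"] * 3 + ["genome3"] * 2 + ["genome4"],
})
df2 = pd.DataFrame({
    "gene": ["gene%d" % i for i in range(1, 7)],
    "accession": ["accession1", "accession2", "accession3",
                  "accession1", "accession4", "accession5"],
})
df1["accession"] = df1["gene"].map(df2.set_index("gene")["accession"])

matrix = (
    df1.groupby(["genome", "accession"])  # NaN accessions are skipped by default
    .size()                               # hits per (genome, accession) pair
    .clip(upper=1)                        # counts -> presence/absence
    .unstack(fill_value=0)                # accessions become columns
    .reindex(df1["genome"].unique(), fill_value=0)  # restore genomes with no hit
)
print(matrix)
```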
And finally, write to the file:
df1.to_csv(outName, sep='\t')
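Unlike the script in the question, this call does not pass index=False, because genome is now the index and has to be written as the first column. As a small sketch, the snippet below writes a hand-built copy of the final matrix to an in-memory buffer (standing in for outName) to show the resulting layout.

```python
import io
import pandas as pd

accessions = ["accession1", "accession2", "accession3", "accession4", "accession5"]

# Hand-built copy of the final matrix, with genome as a named index.
matrix = pd.DataFrame(
    [[1, 1, 1, 0, 0],
     [1, 0, 0, 1, 1],
     [0, 0, 0, 0, 0],
     [0, 0, 0, 0, 0]],
    index=pd.Index(["genome1", "genome2", "genome3", "genome4"], name="genome"),
    columns=accessions,
)

# StringIO stands in for the output file so nothing is written to disk.
buffer = io.StringIO()
matrix.to_csv(buffer, sep="\t")
lines = buffer.getvalue().splitlines()
print("\n".join(lines))
```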