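I have created a Python pandas DataFrame object and am now trying to write it to CSV. The command below works on Windows, but when I run the same code on exactly the same data on Unix, the column headers no longer line up with the columns in the written CSV. When I run it from the command line, df looks fine inside Python (for example, df["somecolumn"] shows me what I expect). Any ideas?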
df.to_csv(str(outfile), sep=",", header=True, index=False, na_rep="NA", cols=firstcols)
Edit: here are my inputs, which I would like to merge on the first column, "VAR":
> more infile*
::::::::::::::
infile1.tsv
::::::::::::::
VAR CHROM POS REF ALT P IPU IRF IUC IGN
1:12892:TGG:T 1 12892 TGG T 0.1383 . intergenic ncRNA NONE(dist=NONE)
1:14397:CTGT:C 1 14397 CTGT C 0.5863 . ncRNA ncRNA WASH7P
1:17084:GGT:G 1 17084 GGT G 0.2337 . ncRNA ncRNA WASH7P
1:17421:ATG:A 1 17421 ATG A 0.1089 . ncRNA ncRNA WASH7P
::::::::::::::
infile2.tsv
::::::::::::::
VAR CHROM POS REF ALT P IPU IRF IUC IGN
1:14567:G:GAT 1 14567 G GAT 0.1299 . ncRNA ncRNA WASH7P
1:14670:TG:T 1 14670 TG T 0.1319 . ncRNA ncRNA WASH7P
1:14745:GGC:G 1 14745 GGC G 0.1462 . ncRNA ncRNA WASH7P
1:14905:GA:G 1 14905 GA G 0.1307 . ncRNA ncRNA WASH7P
::::::::::::::
infile3.tsv
::::::::::::::
VAR CHROM POS REF ALT AC AF PU RFG GI
21:10862612:G:A 21 10862612 G A 3 0.00 intergenic NONE(dist=NONE),NONE(dist=NONE),TEKT4P2(dist=894019),TPTE(dist=44131),AK311573(dist=265170),TPTE(dist=44131),ENSG00000169861
21:10862618:T:C 21 10862618 T C 14183 0.65 intergenic NONE(dist=NONE),NONE(dist=NONE),TEKT4P2(dist=894025),TPTE(dist=44125),AK311573(dist=265176),TPTE(dist=44125),ENSG00000169861
21:10862623:T:C 21 10862623 T C 1 0.00 intergenic NONE(dist=NONE),NONE(dist=NONE),TEKT4P2(dist=894030),TPTE(dist=44120),AK311573(dist=265181),TPTE(dist=44120),ENSG00000169861:ENST0000030209
Here is my Python script:
import csv
import pandas
import glob
from glob import iglob
inpath = '*.tsv'
outfile ="merged.out"
merged = pandas.concat([pandas.read_csv(f, sep='\t', parse_dates=False) for f in glob.iglob(inpath)], axis=0)
dfcols = merged.columns.tolist()
firstcols = ['VAR', 'CHROM', 'POS', 'REF', 'ALT'] #preserve order of first 5 columns only
for d in set(dfcols):
    if d not in firstcols:
        firstcols.append(d)
merged.to_csv(str(outfile), sep="\t", header=True, index=False, na_rep="NA", cols=firstcols)
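For comparison, here is a minimal sketch of the column ordering the script is aiming for. It rests on several assumptions: the data is a tiny in-memory stand-in for the real .tsv files, pandas is recent enough that `concat` accepts `sort=False`, and the frame is reordered by selecting columns on the DataFrame itself instead of passing `cols` to `to_csv`. The names `a`, `b`, `ordered` and `buf` are only illustrative, and this is not a confirmed explanation of the Windows/Unix difference.

```python
import io
import pandas as pd

# Hypothetical stand-ins for two of the input files (tab-separated).
a = pd.read_csv(io.StringIO(
    "VAR\tCHROM\tPOS\tREF\tALT\tP\n"
    "1:12892:TGG:T\t1\t12892\tTGG\tT\t0.1383\n"), sep="\t")
b = pd.read_csv(io.StringIO(
    "VAR\tCHROM\tPOS\tREF\tALT\tAC\n"
    "21:10862612:G:A\t21\t10862612\tG\tA\t3\n"), sep="\t")

# Stack the frames; columns missing from one input become NaN.
merged = pd.concat([a, b], axis=0, sort=False)

# Keep the first five columns in a fixed order, then append the rest
# in the order they appear in the frame (no set(), so the order is stable).
first = ["VAR", "CHROM", "POS", "REF", "ALT"]
ordered = first + [c for c in merged.columns if c not in first]

# Reorder the frame itself, so header and data come from the same selection.
buf = io.StringIO()
merged[ordered].to_csv(buf, sep="\t", header=True, index=False, na_rep="NA")
lines = buf.getvalue().splitlines()
```

Because the header and the rows are both produced from `merged[ordered]`, the first field of every data row should be the VAR value, and the columns absent from a given input should show up as NA in that row.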
Here is some truncated output that shows how the headers end up mismatched:
> more merged.out
VAR CHROM POS REF ALT AC AF IUC PU P RFG IRF IPU IGN GI
NA NA T 1 NA NONE(dist=NONE) . intergenic ncRNA 0.1383 12892 NA TGG NA 1:12892:TGG:T
NA NA C 1 NA WASH7P . ncRNA ncRNA 0.5863 14397 NA CTGT NA 1:14397:CTGT:C
Yet the columns appear to be correctly labeled inside the Python session. I am stumped.
>>> merged['VAR']
0 1:12892:TGG:T
1 1:14397:CTGT:C
2 1:17084:GGT:G
3 1:17421:ATG:A
0 1:14567:G:GAT
1 1:14670:TG:T
2 1:14745:GGC:G
3 1:14905:GA:G
0 21:10862612:G:A
1 21:10862618:T:C
2 21:10862623:T:C
3 21:10862624:G:T