Column mismatch when writing a Python pandas DataFrame to CSV on Unix vs. Windows

Asked: 2014-04-14 23:32:48

Tags: python unix csv pandas

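I have created a Python pandas DataFrame object and am now trying to write it to a CSV file. The command below works on Windows, but when I run the same code on exactly the same data on Unix, the column headers no longer line up with the columns in the written CSV. The df looks fine in Python when I inspect it from the command line (e.g., df["somecolumn"] shows me what I expect). Any ideas?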

df.to_csv(str(outfile), sep=",", header=True, index=False, na_rep="NA", cols=firstcols)

Edit: here is my input; I would like to merge on the first column, "VAR":

> more infile*
::::::::::::::
infile1.tsv
::::::::::::::
VAR     CHROM   POS     REF     ALT     P       IPU     IRF     IUC     IGN
1:12892:TGG:T   1       12892   TGG     T       0.1383  .       intergenic      ncRNA   NONE(dist=NONE)
1:14397:CTGT:C  1       14397   CTGT    C       0.5863  .       ncRNA   ncRNA   WASH7P
1:17084:GGT:G   1       17084   GGT     G       0.2337  .       ncRNA   ncRNA   WASH7P
1:17421:ATG:A   1       17421   ATG     A       0.1089  .       ncRNA   ncRNA   WASH7P
::::::::::::::
infile2.tsv
::::::::::::::
VAR     CHROM   POS     REF     ALT     P       IPU     IRF     IUC     IGN
1:14567:G:GAT   1       14567   G       GAT     0.1299  .       ncRNA   ncRNA   WASH7P
1:14670:TG:T    1       14670   TG      T       0.1319  .       ncRNA   ncRNA   WASH7P
1:14745:GGC:G   1       14745   GGC     G       0.1462  .       ncRNA   ncRNA   WASH7P
1:14905:GA:G    1       14905   GA      G       0.1307  .       ncRNA   ncRNA   WASH7P
::::::::::::::
infile3.tsv
::::::::::::::
VAR     CHROM   POS     REF     ALT     AC      AF      PU      RFG     GI
21:10862612:G:A 21      10862612 G      A       3       0.00            intergenic      NONE(dist=NONE),NONE(dist=NONE),TEKT4P2(dist=894019),TPTE(dist=44131),AK311573(dist=265170),TPTE(dist=44131),ENSG00000169861
21:10862618:T:C 21      10862618 T      C       14183   0.65            intergenic      NONE(dist=NONE),NONE(dist=NONE),TEKT4P2(dist=894025),TPTE(dist=44125),AK311573(dist=265176),TPTE(dist=44125),ENSG00000169861
21:10862623:T:C 21      10862623 T      C       1       0.00            intergenic      NONE(dist=NONE),NONE(dist=NONE),TEKT4P2(dist=894030),TPTE(dist=44120),AK311573(dist=265181),TPTE(dist=44120),ENSG00000169861:ENST0000030209

Here is my Python script:

import glob
import pandas

inpath = '*.tsv'
outfile = "merged.out"

merged = pandas.concat([pandas.read_csv(f, sep='\t', parse_dates=False) for f in glob.iglob(inpath)], axis=0)

dfcols = merged.columns.tolist()
firstcols = ['VAR', 'CHROM', 'POS', 'REF', 'ALT'] #preserve order of first 5 columns only

for d in set(dfcols):
    if d not in firstcols:
        firstcols.append(d)

merged.to_csv(str(outfile), sep="\t", header=True, index=False, na_rep="NA", cols=firstcols)

Here is some truncated output that shows how the headers fail to match the data:

> more merged.out 
VAR     CHROM   POS     REF     ALT     AC      AF      IUC     PU      P       RFG     IRF     IPU     IGN     GI
NA      NA      T       1       NA      NONE(dist=NONE) .       intergenic      ncRNA   0.1383  12892   NA      TGG     NA      1:12892:TGG:T
NA      NA      C       1       NA      WASH7P  .       ncRNA   ncRNA   0.5863  14397   NA      CTGT    NA      1:14397:CTGT:C
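
The mismatch shown above can also be checked programmatically rather than by eye. The following is a minimal sketch, not part of the original post: the helper name `column_matches_file` and the two-row sample frame are hypothetical, and the `bad` text only simulates a header written in a different order from the data rows.

```python
import io

import pandas as pd


def column_matches_file(df, csv_text, column, sep="\t"):
    """Return True if `column` read back from csv_text equals df[column].

    Everything is compared as strings, so this is only meant for
    text-like columns such as VAR (an assumption of this sketch).
    """
    reread = pd.read_csv(
        io.StringIO(csv_text), sep=sep, dtype=str, keep_default_na=False
    )
    return reread[column].tolist() == df[column].astype(str).tolist()


# Hypothetical two-row sample, trimmed from the input shown in the question.
df = pd.DataFrame(
    {"VAR": ["1:12892:TGG:T", "1:14397:CTGT:C"], "REF": ["TGG", "CTGT"]}
)

# A correctly written file: header and data share one column order.
good = df.to_csv(sep="\t", index=False)

# Simulate the symptom: swap the header while leaving the data rows alone.
bad = "REF\tVAR\n" + "\n".join(good.splitlines()[1:]) + "\n"

print(column_matches_file(df, good, "VAR"))
print(column_matches_file(df, bad, "VAR"))
```

Running the same check against merged.out on both machines would show whether the file, rather than the in-memory frame, is where the two environments diverge.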

Yet the columns appear to be labeled correctly inside the Python environment? I'm stumped.

>>> merged['VAR']
0      1:12892:TGG:T
1     1:14397:CTGT:C
2      1:17084:GGT:G
3      1:17421:ATG:A
0      1:14567:G:GAT
1       1:14670:TG:T
2      1:14745:GGC:G
3       1:14905:GA:G
0    21:10862612:G:A
1    21:10862618:T:C
2    21:10862623:T:C
3    21:10862624:G:T
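
A possible workaround, offered as a sketch rather than a confirmed diagnosis: in the truncated output, the data rows appear to follow alphabetical column order (AC, AF, ALT, CHROM, ...) while the header follows firstcols. That pattern would be consistent with the column-selection keyword being applied to the header but not to the data on that installation, so comparing `pandas.__version__` on the two machines may be worthwhile. Reordering the frame itself before writing avoids the keyword altogether. The helper `order_columns` and the trimmed inline inputs below are hypothetical; note also that later pandas releases use `columns=` in place of `cols=`.

```python
import io

import pandas as pd


def order_columns(columns, first_cols):
    """Return first_cols followed by the remaining columns in their existing order."""
    rest = [c for c in columns if c not in first_cols]
    return list(first_cols) + rest


# Hypothetical stand-ins for the TSV inputs, trimmed from the samples in the question.
tsv_a = "VAR\tCHROM\tPOS\tREF\tALT\tP\n1:12892:TGG:T\t1\t12892\tTGG\tT\t0.1383\n"
tsv_b = "VAR\tCHROM\tPOS\tREF\tALT\tAC\n21:10862612:G:A\t21\t10862612\tG\tA\t3\n"

frames = [pd.read_csv(io.StringIO(text), sep="\t") for text in (tsv_a, tsv_b)]
merged = pd.concat(frames, axis=0)

first_cols = ["VAR", "CHROM", "POS", "REF", "ALT"]
ordered = order_columns(merged.columns.tolist(), first_cols)

# Select the columns on the frame itself, so the header and the data rows
# are both written from the same, already reordered, column list.
buf = io.StringIO()
merged[ordered].to_csv(buf, sep="\t", header=True, index=False, na_rep="NA")
lines = buf.getvalue().splitlines()

print(lines[0])
print(lines[1])
```

Building the remaining columns with a list comprehension instead of iterating over `set(dfcols)` also keeps the output order stable between runs, since set iteration order is not guaranteed.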

0 answers:

No answers