Reformatting a CSV file with Python and Pandas (AWK)?

Time: 2015-08-17 03:57:34

Tags: python-2.7 csv pandas awk

I have a CSV file that looks like this:

Names, Size, State, time1,   time2,       
S1,    22,   MD  , 0.022, ,  523.324
S2,    22,   MD  , 4.32,  , 342.54 
S3,    22,   MD  , 3.54,  ,   0.32
S4,    22,   MD  , 4.32,  ,  0.54  
S1,    33,   MD  , 5.32,  ,  0.43
S2,    33,   MD  , 11.54, ,  0.65
S3,    33,   MD  , 22.5,  ,  0.324
S4,    33,   MD  , 45.89  ,  0.32
S1,    44,  MD  , 3.53   ,  3.32
S2,    44,  MD  ,  4.5   ,  0.322
S3,    44,  MD  , 43.65  ,   45.78
S4,    44,   MD,   43.54 , 0.321

I don't care about

My output file needs to look like this:

Size,  S1,       S2,       S3,       S4
22,    0.022,    4.32,     3.54,     4.32
33,    5.32,     11.54,    22.5,     45.89
44,    3.53,     4.5,      43.65,    43.54
       3 values, 3 values, 3 values, 3 values

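As you can see, the output file has different headers, which are values taken from the first CSV file. The CSV file is sorted by the Size column. In other words, I want to know which time is associated with each file (S1, S2, S3, S4) at each size. The column order also changes: Size is now the first column of the output file. The last row gives the total number of values in each column.

My code so far: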

import pandas as pd
import numpy as np
import csv

df=pd.read_csv(r'C:\Users\testuser\Desktop\file.csv',usecols=[0,1,2,3,4])
df.columns=pd.MultiIndex.from_tuples(zip(['Names','FileSize','x','y','z'],df.columns))  # add column headers... (this did not do it correctly)
df_out=df.groupby('Names','FileSize').count().reset_index()  # supposed to print distinct values
df_out.to_csv('processed_data_out.csv', columns['Names','FileSize','x','y','z'], header=False,index=False)

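I know I am not using the last column, time2, because I don't know how to add it so that the user can tell which times (time1 and time2) are associated with each size.

3 Answers:

Answer 0: (score: 2)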
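The reshaping described in the question is essentially a pivot: one row per Size, one column per name, with a time value in each cell. Below is a minimal sketch of that idea, assuming Python 3 and a reasonably recent pandas. The inline `RAW` sample (eight rows copied from the question), the `load_rows` helper, and the rule that time2 is the last non-empty field on each line are assumptions made for illustration, not something stated in the original post.

```python
import csv
import io

import pandas as pd

# Eight rows copied from the question; a real script would open the file instead.
RAW = """Names, Size, State, time1,   time2,
S1,    22,   MD  , 0.022, ,  523.324
S2,    22,   MD  , 4.32,  , 342.54
S3,    22,   MD  , 3.54,  ,   0.32
S4,    22,   MD  , 4.32,  ,  0.54
S1,    33,   MD  , 5.32,  ,  0.43
S2,    33,   MD  , 11.54, ,  0.65
S3,    33,   MD  , 22.5,  ,  0.324
S4,    33,   MD  , 45.89  ,  0.32
"""


def load_rows(text):
    """Parse the ragged CSV by hand, stripping the padding around each field."""
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the header line
    records = []
    for row in reader:
        fields = [f.strip() for f in row]
        non_empty = [f for f in fields if f]
        records.append({
            "Names": fields[0],
            "Size": int(fields[1]),
            "time1": float(fields[3]),
            # Assumption: time2 is the last non-empty field on the line,
            # because some rows carry an extra empty field before it.
            "time2": float(non_empty[-1]),
        })
    return pd.DataFrame(records)


df = load_rows(RAW)

# One row per Size, one column per name, time1 in the cells.
time1_table = df.pivot(index="Size", columns="Names", values="time1")
# Same layout for time2, so both times can be looked up by size and name.
time2_table = df.pivot(index="Size", columns="Names", values="time2")
# Number of non-missing values per column (the "3 values" row in the question).
counts = time1_table.count()

print(time1_table.to_csv())
print(counts.to_dict())
```

Keeping time1 and time2 in two tables with the same shape is only one possible layout; it makes a lookup such as `time2_table.loc[33, "S4"]` straightforward.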

There is no need for awk here; since you are already using Python, I would stay with Python:

convert.py:

    import csv
    import sys

    filename = sys.argv[1]
    # 'rb' is the mode the Python 2 csv module expects
    with open(filename, 'rb') as csvfile:
        reader = csv.reader(csvfile)
        data = {}
        next(reader, None)  # skip the headers
        for row in reader:
            size = int(row[1])
            time1 = float(row[3])
            if size not in data:
                data[size] = []
            data[size].append(time1)

    writer = csv.writer(sys.stdout)
    writer.writerow(["Size","S1","S2","S3","S4"])
    for item in data:
        row = [item]
        row.extend(data[item])
        writer.writerow(row)

Call it like this:

    python convert.py C:\Users\testuser\Desktop\file.csv

Output:

    Size,S1,S2,S3,S4
    33,5.32,11.54,22.5,45.89
    44,3.53,4.5,43.65,43.54
    22,0.022,4.32,3.54,4.32

By the way, an awk solution might look like this:

    awk -F'[, ]*' ' NR>1{ a[$2]=a[$2]","$4 } END{ for(i in a){ print i""a[i] } }' input.csv

Answer 1: (score: 0)

awk to the rescue:

awk -F, -f table.awk

where

$ cat table.awk

    NR == 1 {
            h = $1           # save header
            next
    }

    NR == 2 {
            p = $2           # to match blocks
            v = $2           # value accumulator
    }

    p == $2 {                # we're in the same block
            v = v FS $4      # start accumulate values
            if (h != "") {   # if we're not done with header
                    h = h FS $1    # accumulate header values
            }
    }

    p != $2 {                # we're in a new block
            if (h != "") {   # if not printed yet, print header
                    print h
                    h = ""   # and reset
            }
            print v          # print values
            p = $2           # set new block indicator
            v = $2 FS $4     # accumulate values
    }

    END {
            print v          # for the final block print values
    }

Test

awk -F, -f table.awk << !
> Names, Size, State, time1,   time2,
> S1,    22,   MD  , 0.022, ,  523.324
> S2,    22,   MD  , 4.32,  , 342.54
> S3,    22,   MD  , 3.54,  ,   0.32
> S4,    22,   MD  , 4.32,  ,  0.54
> S1,    33,   MD  , 5.32,  ,  0.43
> S2,    33,   MD  , 11.54, ,  0.65
> S3,    33,   MD  , 22.5,  ,  0.324
> S4,    33,   MD  , 45.89  ,  0.32
> S1,    44,  MD  , 3.53   ,  3.32
> S2,    44,  MD  ,  4.5   ,  0.322
> S3,    44,  MD  , 43.65  ,   45.78
> S4,    44,   MD,   43.54 , 0.321
> !
Names,S1,S2,S3,S4
22, 0.022, 4.32, 3.54, 4.32
33, 5.32, 11.54, 22.5, 45.89
44, 3.53   ,  4.5   , 43.65  ,   43.54
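The awk script above depends on rows with the same Size being adjacent, and it starts a new output line whenever the Size changes. The same block-based idea can be sketched in Python with `itertools.groupby`, which also groups only consecutive items. This is an illustrative sketch assuming Python 3; the `SAMPLE` data (four rows from the question) and the `pivot_adjacent_blocks` name are hypothetical, and the header label is taken from the Size column rather than the Names column.

```python
import csv
import io
from itertools import groupby

# Four rows copied from the question, already ordered by Size.
SAMPLE = """Names, Size, State, time1,   time2,
S1,    22,   MD  , 0.022, ,  523.324
S2,    22,   MD  , 4.32,  , 342.54
S1,    33,   MD  , 5.32,  ,  0.43
S2,    33,   MD  , 11.54, ,  0.65
"""


def pivot_adjacent_blocks(text):
    """Turn each run of rows sharing a Size into one output row."""
    reader = csv.reader(io.StringIO(text))
    header = [h.strip() for h in next(reader)]
    rows = [[f.strip() for f in row] for row in reader]

    names = []
    lines = []
    # groupby only merges consecutive rows, like the p == $2 test in table.awk.
    for size, block in groupby(rows, key=lambda r: r[1]):
        block = list(block)
        if not names:
            # Column names come from the first block only.
            names = [r[0] for r in block]
        lines.append([size] + [r[3] for r in block])
    return [[header[1]] + names] + lines


table = pivot_adjacent_blocks(SAMPLE)
for line in table:
    print(",".join(line))
```

As with the awk version, unsorted input would produce one output row per run of equal sizes, so the data should be sorted by Size first.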

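Answer 2: (score: 0)

I like the idea behind both awk solutions, but for those who prefer an intermediate style of awk that is less terse and looks more like other scripting solutions, consider the following: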

BEGIN { 
  while ("cat data1" | getline) {
    if ($0 ~ /S[1-4]/) {
      split($0,temp,/[ ,]+/)
      oline[temp[2]] = oline[temp[2]] " ,  " temp[4]
    }
  }
  print "Size ,   S1 ,    S2  ,   S3  ,   S4"
  for (i in oline) print i oline[i]
}



OUTPUT:
Size ,   S1 ,    S2  ,   S3  ,   S4
22 ,  0.022 ,  4.32 ,  3.54 ,  4.32
33 ,  5.32 ,  11.54 ,  22.5 ,  45.89
44 ,  3.53 ,  4.5 ,  43.65 ,  43.54

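If the rows of the data are not in a convenient order, you can use "sort -nk2 -k1" instead of "cat" to make the script robust to row reordering. It still assumes the S1-S4 row naming.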
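Another way to become independent of row order, without pre-sorting, is to key each value by both Size and name and only decide the column order at output time. The following is a rough Python 3 sketch of that approach; the shuffled `SHUFFLED` sample and the `pivot_any_order` helper are made up for illustration, and missing (Size, name) combinations are assumed to be acceptable as empty cells.

```python
import csv
import io

# Rows from the question, deliberately shuffled.
SHUFFLED = """Names, Size, State, time1,   time2,
S2,    33,   MD  , 11.54, ,  0.65
S1,    22,   MD  , 0.022, ,  523.324
S1,    33,   MD  , 5.32,  ,  0.43
S2,    22,   MD  , 4.32,  , 342.54
"""


def pivot_any_order(text):
    """Build the Size x name table regardless of the input row order."""
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the header line
    cells = {}    # size -> {name: time1 as text}
    names = set()
    for row in reader:
        name = row[0].strip()
        size = int(row[1])
        cells.setdefault(size, {})[name] = row[3].strip()
        names.add(name)

    columns = sorted(names)  # plain string sort; fine for S1-S4
    table = [["Size"] + columns]
    for size in sorted(cells):
        # A missing (size, name) pair becomes an empty cell.
        table.append([str(size)] + [cells[size].get(n, "") for n in columns])
    return table


table = pivot_any_order(SHUFFLED)
for line in table:
    print(",".join(line))
```

Unlike the awk version, this does not hard-code the S1-S4 names, although the plain string sort would place a name such as S10 before S2.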