I have a CSV file that looks like this:
Names, Size, State, time1, time2,
S1, 22, MD , 0.022, , 523.324
S2, 22, MD , 4.32, , 342.54
S3, 22, MD , 3.54, , 0.32
S4, 22, MD , 4.32, , 0.54
S1, 33, MD , 5.32, , 0.43
S2, 33, MD , 11.54, , 0.65
S3, 33, MD , 22.5, , 0.324
S4, 33, MD , 45.89 , 0.32
S1, 44, MD , 3.53 , 3.32
S2, 44, MD , 4.5 , 0.322
S3, 44, MD , 43.65 , 45.78
S4, 44, MD, 43.54 , 0.321
I don't care about the State column.
My output file needs to look like this:
Size , S1 , S2 , S3 , S4
22 , 0.022 , 4.32 , 3.54 , 4.32
33 , 5.32, 11.54 , 22.5, 45.89
44 , 3.53, 4.5, 43.65, 43.54
3 values, 3 values, 3 values, 3 values
As you can see, the output file has different headers, which are values from the first CSV file, and it is sorted by the Size column. In other words, I want to know which time is associated with each size for each name (S1, S2, S3, S4). The column order also changes: the Size column is now the first column in the output file. The last row also gives the total number of values in each column.
My code so far:
import pandas as pd
import numpy as np
import csv

df = pd.read_csv(r'C:\Users\testuser\Desktop\file.csv', usecols=[0, 1, 2, 3, 4])
df.columns = ['Names', 'FileSize', 'x', 'y', 'z']  # add column headers... (this did not do it correctly)
df_out = df.groupby(['Names', 'FileSize']).count().reset_index()  # supposed to print distinct values
df_out.to_csv('processed_data_out.csv', columns=['Names', 'FileSize', 'x', 'y', 'z'], header=False, index=False)

I know I'm not using the last column, time2, because I don't know how to add it so that the user can tell which time (time1 or time2) is associated with each size.
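For reference, the reshaping described above is a pivot: Size becomes the index, the Names values become columns, and time1 fills the cells. A minimal pandas sketch, using an in-memory copy of the sample data as a stand-in for the real file path, and a hypothetical output filename:

```python
import io
import pandas as pd

# In-memory stand-in for file.csv (extra empty column from the sample dropped).
csv_text = """Names,Size,State,time1,time2
S1,22,MD,0.022,523.324
S2,22,MD,4.32,342.54
S3,22,MD,3.54,0.32
S4,22,MD,4.32,0.54
S1,33,MD,5.32,0.43
S2,33,MD,11.54,0.65
S3,33,MD,22.5,0.324
S4,33,MD,45.89,0.32
S1,44,MD,3.53,3.32
S2,44,MD,4.5,0.322
S3,44,MD,43.65,45.78
S4,44,MD,43.54,0.321
"""

df = pd.read_csv(io.StringIO(csv_text))

# Pivot: one row per Size, one column per name, time1 values in the cells.
out = df.pivot(index="Size", columns="Names", values="time1").reset_index()
out.columns.name = None  # drop the leftover "Names" axis label

# Final row: the number of non-null values in each name column.
tail = {"Size": "", **{c: f"{out[c].count()} values" for c in ["S1", "S2", "S3", "S4"]}}
out_with_tail = pd.concat([out, pd.DataFrame([tail])], ignore_index=True)

out_with_tail.to_csv("processed_data_out.csv", index=False)
```

To also carry time2, one option is `df.pivot(index="Size", columns="Names", values=["time1", "time2"])`, which yields two-level columns that can be flattened into names like `S1_time1`, `S1_time2` before writing.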
Answer 0 (score: 2)

Since you're already using Python, I'd stay with Python.

convert.py:

import csv
import sys

filename = sys.argv[1]
with open(filename, 'r', newline='') as csvfile:
    reader = csv.reader(csvfile)
    data = {}
    next(reader, None)  # skip the header row
    for row in reader:
        size = int(row[1])
        time1 = float(row[3])
        if size not in data:
            data[size] = []
        data[size].append(time1)

writer = csv.writer(sys.stdout)
writer.writerow(["Size", "S1", "S2", "S3", "S4"])
for item in sorted(data):
    row = [item]
    row.extend(data[item])
    writer.writerow(row)

Call it like this:

python convert.py C:\Users\testuser\Desktop\file.csv

Output:

Size,S1,S2,S3,S4
22,0.022,4.32,3.54,4.32
33,5.32,11.54,22.5,45.89
44,3.53,4.5,43.65,43.54
By the way, an awk solution could look like this:

awk -F'[, ]*' '
NR>1{
    a[$2]=a[$2]","$4
}
END{
    for(i in a){
        print i""a[i]
    }
}' input.csv
Answer 1 (score: 0)

awk to the rescue!

awk -F, -f table.awk

where

$ cat table.awk
NR == 1 {
    h = $1            # save header
    next
}
NR == 2 {
    p = $2            # to match blocks
    v = $2            # value accumulator
}
p == $2 {             # we're in the same block
    v = v FS $4       # keep accumulating values
    if (h != "") {    # if we're not done with the header
        h = h FS $1   # accumulate header values
    }
}
p != $2 {             # we're in a new block
    if (h != "") {    # if not printed yet, print the header
        print h
        h = ""        # and reset
    }
    print v           # print values
    p = $2            # set new block indicator
    v = $2 FS $4      # restart value accumulation
}
END {
    print v           # print values for the final block
}
Test:
awk -F, -f table.awk << !
> Names, Size, State, time1, time2,
> S1, 22, MD , 0.022, , 523.324
> S2, 22, MD , 4.32, , 342.54
> S3, 22, MD , 3.54, , 0.32
> S4, 22, MD , 4.32, , 0.54
> S1, 33, MD , 5.32, , 0.43
> S2, 33, MD , 11.54, , 0.65
> S3, 33, MD , 22.5, , 0.324
> S4, 33, MD , 45.89 , 0.32
> S1, 44, MD , 3.53 , 3.32
> S2, 44, MD , 4.5 , 0.322
> S3, 44, MD , 43.65 , 45.78
> S4, 44, MD, 43.54 , 0.321
> !
Names,S1,S2,S3,S4
22, 0.022, 4.32, 3.54, 4.32
33, 5.32, 11.54, 22.5, 45.89
44, 3.53 , 4.5 , 43.65 , 43.54
Answer 2 (score: 0)

I like the idea behind both awk solutions, but for anyone who wants a middle-ground awk style that is less terse and reads more like the other scripting solutions, consider the following:
BEGIN {
    while ("cat data1" | getline) {
        if ($0 ~ /S[1-4]/) {
            split($0, temp, /[ ,]+/)
            oline[temp[2]] = oline[temp[2]] " , " temp[4]
        }
    }
    print "Size , S1 , S2 , S3 , S4"
    for (i in oline) print i oline[i]
}
OUTPUT:
Size , S1 , S2 , S3 , S4
22 , 0.022 , 4.32 , 3.54 , 4.32
33 , 5.32 , 11.54 , 22.5 , 45.89
44 , 3.53 , 4.5 , 43.65 , 43.54
If the data's row order isn't this tidy, you can use "sort -nk2 -k1" in place of "cat" to make it robust to row reordering. It still assumes the S1-S4 row naming.
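A quick check of that substitution, using a small out-of-order sample as a hypothetical stand-in for the real data1 file; swapping "sort -nk2 -k1" for "cat" means the rows reach awk grouped by Size and, within a Size, in S1..S4 order:

```shell
# Out-of-order sample rows (stand-in for data1).
printf '%s\n' \
  'S2, 22, MD , 4.32, 342.54' \
  'S1, 33, MD , 5.32, 0.43' \
  'S1, 22, MD , 0.022, 523.324' > data1

# Same BEGIN-block reader as above, with sort in place of cat.
awk 'BEGIN {
  while ("sort -nk2 -k1 data1" | getline) {
    if ($0 ~ /S[1-4]/) {
      split($0, temp, /[ ,]+/)
      oline[temp[2]] = oline[temp[2]] " , " temp[4]
    }
  }
  for (i in oline) print i oline[i]
}' > out.txt
cat out.txt
```

Note that "for (i in oline)" iterates in an unspecified order in awk, so the sort fixes the order of values within each row, not the order of the output rows themselves.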