I have a table organised like this (curves.csv) ("disorganised" might be a better description):
CL,D,PD,CL,D,PD,CL,D,PD,CL,D,PD,CL,D,PD
A,1,a,B,1,b,C,1,c,D,1,d,E,1,e
A,2,f,B,3,g,C,2,h,D,4,i,E,2,j
A,5,k,B,6,l,C,5,m,D,8,n,E,5,o
I would like to convert this table to:
,A,B,C,D,E
1,a,b,c,d,e
2,f,,h,,j
3,,g,,,
4,,,,i,
5,k,,m,,o
6,,l,,,
8,,,,n,
I currently have this:
celllines=["A","B","C","D","E"]
sorted_days=["1","2","3","4","5","8"]
for d in sorted_days:
    curves=open("curves.csv","rU")
    for line in curves:
        line=line.rstrip().rsplit(",")
        if line[0]!="CL":#removes header
            for x in range(0,len(line),3):
                if line[x] in celllines:
                    if line[x+1] == d:
                        print d,line[x],line[x+2]
                    else:
                        print d, line[x],""
    curves.close()
I just feel like I'm getting further from the answer rather than closer! Any pointers would be appreciated, as always.
Answer 0 (score: 2)
Using the csv module:
import csv

# make a dictionary to store the data
data = {}

# first, read it in
with open("curves.csv", "rb") as fp:
    # make a csv reader object
    reader = csv.reader(fp)
    # skip initial line
    next(reader)
    for row in reader:
        # for each triplet, store it in the dictionary
        for i in range(len(row)//3):
            CL, D, PD = row[3*i:3*i+3]
            data[D, CL] = PD

# see what we've got
print data

with open("newcurves.csv", "wb") as fp:
    # get the labels in order
    row_labels = sorted(set(k[0] for k in data), key=int)
    col_labels = sorted(set(k[1] for k in data))

    writer = csv.writer(fp)
    # write header
    writer.writerow([''] + col_labels)
    # write data rows
    for row_label in row_labels:
        # start with the label
        row = [row_label]
        # then extend a list of the data in order, using the empty string '' if
        # there's no such value
        row.extend([data.get((row_label, col_label), '') for col_label in col_labels])
        # dump it out
        writer.writerow(row)
This gives us a dictionary like
{('1', 'D'): 'd', ('1', 'E'): 'e', ('5', 'C'): 'm', ('1', 'B'): 'b', ('2', 'E'): 'j', ('1', 'C'): 'c', ('5', 'A'): 'k', ('6', 'B'): 'l', ('2', 'C'): 'h', ('1', 'A'): 'a', ('4', 'D'): 'i', ('8', 'D'): 'n', ('2', 'A'): 'f', ('3', 'B'): 'g', ('5', 'E'): 'o'}
and an output file like
~/coding$ cat newcurves.csv
,A,B,C,D,E
1,a,b,c,d,e
2,f,,h,,j
3,,g,,,
4,,,,i,
5,k,,m,,o
6,,l,,,
8,,,,n,
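The code above is written for Python 2 (binary file modes and the print statement). As a rough sketch of the same idea on Python 3, the version below reads the question's sample data from an in-memory string instead of curves.csv. The helper name pivot and the SAMPLE constant are made up for illustration; they are not part of the original answer.

```python
import csv
import io

# Sample input copied from the question; io.StringIO stands in for curves.csv.
SAMPLE = """CL,D,PD,CL,D,PD,CL,D,PD,CL,D,PD,CL,D,PD
A,1,a,B,1,b,C,1,c,D,1,d,E,1,e
A,2,f,B,3,g,C,2,h,D,4,i,E,2,j
A,5,k,B,6,l,C,5,m,D,8,n,E,5,o
"""


def pivot(infile, outfile):
    """Hypothetical helper: read CL,D,PD triplets and write a D-by-CL table."""
    reader = csv.reader(infile)
    next(reader)  # skip the header line
    data = {}
    for row in reader:
        # walk the row three fields at a time
        for i in range(0, len(row) - 2, 3):
            cl, d, pd = row[i:i + 3]
            data[d, cl] = pd
    row_labels = sorted({d for d, _ in data}, key=int)
    col_labels = sorted({cl for _, cl in data})
    writer = csv.writer(outfile, lineterminator="\n")
    writer.writerow([""] + col_labels)
    for d in row_labels:
        # missing (day, cell line) pairs become empty fields
        writer.writerow([d] + [data.get((d, cl), "") for cl in col_labels])


out = io.StringIO()
pivot(io.StringIO(SAMPLE), out)
result = out.getvalue()
print(result)
```

With real files, the same function could presumably be called with open("curves.csv", newline="") and open("newcurves.csv", "w", newline="") in place of the StringIO objects.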
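Answer 1 (score: 2)
I find the best way to approach this kind of problem is to separate taking the old format apart from building the new one. Break the old format down into a sane data structure that makes the data easy to work with in Python, then build the new format from that nice, malleable structure.
Whenever we are working with comma-separated values, we can use the csv module from the standard library, which greatly simplifies this kind of work.
This solution also makes heavy use of the list comprehension (and its various cousins), so if you aren't familiar with them, I suggest reading up on them (the link above was to my short video explanation).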
import csv
import itertools

def grouper(n, iterable, fillvalue=None):
    args = [iter(iterable)] * n
    return itertools.zip_longest(fillvalue=fillvalue, *args)

with open("curves.csv") as file:
    data = csv.reader(file)
    next(data) #Ignore header row.
    parsed = {(column, row): value for line in data
              for column, row, value in grouper(3, line)}

rows = sorted({row for (_, row) in parsed})
columns = sorted({column for (column, _) in parsed})

with open("output.csv", "w") as file:
    writer = csv.writer(file)
    writer.writerow([None] + columns)
    writer.writerows([[row]+[parsed.get((column, row))
                             for column in columns]
                      for row in rows])
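We first open the file using the with statement (best practice for making sure files get closed), then we skip the header row and parse the data. To do this, we take each line in the data and group it into chunks of length three (using the grouper() function, which is an itertools recipe). This gives us the column, row and value, which we then use as the key and value for our dictionary.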
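As a small illustration of the chunking step, here is the grouper() recipe applied to a shortened row from the question. The variable names line, chunks and ragged are only for this sketch.

```python
import itertools


def grouper(n, iterable, fillvalue=None):
    # n references to the SAME iterator, so zip_longest pulls n items per tuple
    args = [iter(iterable)] * n
    return itertools.zip_longest(*args, fillvalue=fillvalue)


line = ["A", "1", "a", "B", "1", "b", "C", "1", "c"]
chunks = list(grouper(3, line))
print(chunks)  # [('A', '1', 'a'), ('B', '1', 'b'), ('C', '1', 'c')]

# A row whose length is not a multiple of three is padded with fillvalue.
ragged = list(grouper(3, ["A", "1", "a", "B"]))
print(ragged)  # [('A', '1', 'a'), ('B', None, None)]
```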
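This gives us a dictionary like {("A", "1"): "a", ...} (the csv reader yields strings, so the row numbers are strings too). This is a nice format to work with, so now we build the file back up in the desired format.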
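First we need to know which rows and columns we need. We simply take the rows (and likewise the columns) from the parsed data, make a set of them (since sets can't contain duplicates), and finally sort them back into a list so we have the right order.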
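One thing worth keeping in mind with this step: the row labels are strings, so a plain sorted() orders them as text. That is fine for the single-digit days in the question, but a two-digit day would sort out of place. A minimal sketch, using a made-up day "10" that is not in the original data:

```python
# Hypothetical parsed data; the day "10" is invented to show the ordering issue.
parsed = {("A", "1"): "a", ("B", "10"): "b", ("A", "2"): "f"}

# Text ordering puts "10" before "2".
rows_text = sorted({row for (_, row) in parsed})
print(rows_text)  # ['1', '10', '2']

# Sorting with key=int compares the labels as numbers instead.
rows_numeric = sorted({row for (_, row) in parsed}, key=int)
print(rows_numeric)  # ['1', '2', '10']

columns = sorted({column for (column, _) in parsed})
print(columns)  # ['A', 'B']
```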
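Then we open the output file and write the column headers to it (remembering to add None for the row-header column), then write out our data. For each row, we write the row number, then use dict.get() to fetch the value for each column from our parsed data, so if there's no value we get None, which csv.writer writes as an empty field. This gives the wanted output.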
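To make the None handling concrete, here is a small sketch that writes into an in-memory buffer rather than output.csv. The two-entry parsed dictionary is a cut-down stand-in for the real data.

```python
import csv
import io

# Cut-down stand-in for the parsed dictionary, keyed by (column, row).
parsed = {("A", "1"): "a", ("B", "3"): "g"}
columns = ["A", "B"]

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow([None] + columns)  # None becomes an empty corner cell
for row in ["1", "3"]:
    # dict.get() returns None for missing keys; csv.writer emits '' for None
    writer.writerow([row] + [parsed.get((column, row)) for column in columns])

text = buffer.getvalue()
print(text)
```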
As a note: it looks like you're using Python 2.x in your question, and my answer is written for 3.x. The only difference should be that itertools.zip_longest() is itertools.izip_longest() in 2.x.
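If the same script needs to run on both versions, one common pattern is to try the 3.x name first and fall back to the 2.x one. This is only a sketch of that pattern, not something the answer above requires:

```python
try:
    from itertools import zip_longest  # Python 3 name
except ImportError:
    from itertools import izip_longest as zip_longest  # Python 2 name


def grouper(n, iterable, fillvalue=None):
    # same recipe as above, written against the version-neutral alias
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)


pairs = list(grouper(2, "abc"))
print(pairs)  # [('a', 'b'), ('c', None)]
```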
Answer 2 (score: 2)
Just to show (a bit late) that it can also be done in R:
curves <- read.csv("curves.csv", as.is = TRUE)
stack <- data.frame(CL = unlist(curves[, c(TRUE, FALSE, FALSE)]),
                    D = unlist(curves[, c(FALSE, TRUE, FALSE)]),
                    PD = unlist(curves[, c(FALSE, FALSE, TRUE)]),
                    stringsAsFactors = FALSE)
library(reshape2)
output <- acast(stack, D ~ CL, value.var = "PD", fill = "")
write.csv(output, "new_curves.csv", quote = FALSE)
If you'd rather not use a third-party package, you can do it all in base R:
curves <- read.csv("curves.csv", as.is = TRUE)
rownames <- sort(unique(unlist(curves[, c(FALSE, TRUE, FALSE)])))
colnames <- sort(unique(unlist(curves[, c(TRUE, FALSE, FALSE)])))
output <- matrix("", nrow = length(rownames), ncol = length(colnames),
                 dimnames = list(rownames, colnames))
fill.i <- match(unlist(curves[, c(FALSE, TRUE, FALSE)]), rownames)
fill.j <- match(unlist(curves[, c(TRUE, FALSE, FALSE)]), colnames)
fill.x <- unlist(curves[, c(FALSE, FALSE, TRUE)])
output[cbind(fill.i, fill.j)] <- fill.x
write.csv(output, "new_curves.csv", quote = FALSE)
Answer 3 (score: 1)
Without using the csv module:
celllines=["","A","B","C","D","E"]
days=["1","2","3","4","5","6","7","8"]
curves = sum([line.split(',') for line in open("curves.csv","rU").read().split()[1:]], [])
group = {(d,cl): pd for (cl,d,pd) in [curves[i:i+3] for i in range(0,len(curves),3)]}
table = [[d if not x else '' for x in celllines] for d in days]
for (d,cl),pd in group.items():
    table[days.index(d)][celllines.index(cl)] = pd
with open("curves2.csv", "w") as f:
    f.write('\n'.join(','.join(line) for line in [celllines]+table))
Answer 4 (score: 1)
An R solution with tapply, using the concatenation function c as the aggregating function.
crvs <- read.table(text="CL,D,PD,CL,D,PD,CL,D,PD,CL,D,PD,CL,D,PD
A,1,a,B,1,b,C,1,c,D,1,d,E,1,e
A,2,f,B,3,g,C,2,h,D,4,i,E,2,j
A,5,k,B,6,l,C,5,m,D,8,n,E,5,o", header=TRUE, sep=",", check.names=FALSE)
# note: columns 13:15 (cell line E) are not stacked here, so E is absent from the output below
long <- rbind(crvs[, 1:3], crvs[, 4:6], crvs[, 7:9], crvs[, 10:12])
out <- with( long, tapply(PD, list(D, CL), FUN=c) )
#-----------------
write.table(out, quote=FALSE, sep=",", na="")
A,B,C,D
1,a,b,c,d
2,f,,h,
3,,g,,
4,,,,i
5,k,,m,
6,,l,,
8,,,,n