Cleaning up a table with missing data using Python (or R)

Date: 2013-01-10 23:05:58

Tags: python, r

I have a table (curves.csv) organized like this ("disorganized" would be a better description):

CL,D,PD,CL,D,PD,CL,D,PD,CL,D,PD,CL,D,PD
A,1,a,B,1,b,C,1,c,D,1,d,E,1,e
A,2,f,B,3,g,C,2,h,D,4,i,E,2,j
A,5,k,B,6,l,C,5,m,D,8,n,E,5,o

I would like to convert this table into:

,A,B,C,D,E
1,a,b,c,d,e
2,f,,h,,j
3,,g,,,
4,,,,i,
5,k,,m,,o
6,,l,,,
8,,,,n,

This is what I currently have:

celllines=["A","B","C","D","E"]
sorted_days=["1","2","3","4","5","8"]
for d in sorted_days:
    curves=open("curves.csv","rU")
    for line in curves:
        line=line.rstrip().rsplit(",")
        if line[0]!="CL":#removes header
            for x in range(0,len(line),3):
                if line[x] in celllines:
                    if line[x+1] == d:
                        print d,line[x],line[x+2]
                    else:
                        print d, line[x],""

    curves.close()

I just feel like I'm getting further from an answer rather than closer! Any pointers would, as always, be greatly appreciated.

5 answers:

Answer 0 (score: 2)

Using the csv module:

How about something like this?
import csv

# make a dictionary to store the data
data = {}

# first, read it in
with open("curves.csv", "rb") as fp:

    # make a csv reader object
    reader = csv.reader(fp)

    # skip initial line
    next(reader)

    for row in reader:
        # for each triplet, store it in the dictionary
        for i in range(len(row)//3):
            CL, D, PD = row[3*i:3*i+3]
            data[D, CL] = PD

# see what we've got
print data

with open("newcurves.csv", "wb") as fp:
    # get the labels in order
    row_labels = sorted(set(k[0] for k in data), key=int)
    col_labels = sorted(set(k[1] for k in data))

    writer = csv.writer(fp)
    # write header
    writer.writerow([''] + col_labels)

    # write data rows
    for row_label in row_labels:
        # start with the label
        row = [row_label]

        # then extend a list of the data in order, using the empty string '' if
        # there's no such value
        row.extend([data.get((row_label, col_label), '') for col_label in col_labels])

        # dump it out
        writer.writerow(row)

This gives us a dictionary like:
{('1', 'D'): 'd', ('1', 'E'): 'e', ('5', 'C'): 'm', ('1', 'B'): 'b', ('2', 'E'): 'j', ('1', 'C'): 'c', ('5', 'A'): 'k', ('6', 'B'): 'l', ('2', 'C'): 'h', ('1', 'A'): 'a', ('4', 'D'): 'i', ('8', 'D'): 'n', ('2', 'A'): 'f', ('3', 'B'): 'g', ('5', 'E'): 'o'}

and an output file like this:
~/coding$ cat newcurves.csv 
,A,B,C,D,E
1,a,b,c,d,e
2,f,,h,,j
3,,g,,,
4,,,,i,
5,k,,m,,o
6,,l,,,
8,,,,n,

Answer 1 (score: 2)

I find the best way to tackle this kind of problem is to separate breaking down the old format from building up the new one. Rather than doing both at once, break the old format down into a sane data structure that makes the data easy to work with in Python, and then build the new format from that nice, malleable structure.

Whenever we're working with comma-separated values, we can use the csv module from the standard library, which makes this kind of job much simpler.

This solution also makes heavy use of list comprehensions (and their various cousins), so if you're not familiar with them, I suggest reading up on them (the earlier link is to a short video explanation of mine).
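In case comprehensions are new to you, a quick sketch of the forms used below:

squares = [x * x for x in range(5)]       # list comprehension: [0, 1, 4, 9, 16]
cubes   = {x: x ** 3 for x in range(3)}   # dict comprehension: {0: 0, 1: 1, 2: 8}
letters = {c for c in "abca"}             # set comprehension:  {'a', 'b', 'c'}

The solution itself uses the dictionary and set versions of the same idea.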

import csv
import itertools

def grouper(n, iterable, fillvalue=None):
    # Collect data into fixed-length chunks (an itertools recipe).
    args = [iter(iterable)] * n
    return itertools.zip_longest(fillvalue=fillvalue, *args)

with open("curves.csv") as file:
    data = csv.reader(file)
    next(data) # Ignore header row.
    # Key each PD value by its (cell line, day) pair.
    parsed = {(column, row): value for line in data
              for column, row, value in grouper(3, line)}

rows = sorted({row for (_, row) in parsed})
columns = sorted({column for (column, _) in parsed})

with open("output.csv", "w") as file:
    writer = csv.writer(file)
    writer.writerow([None] + columns)
    writer.writerows([[row]+[parsed.get((column, row))
                             for column in columns]
                      for row in rows])

We start by opening the file with a with statement (best practice for making sure the file gets closed), then we skip the header row and parse the data. To do that, we take each line of the data and group it into chunks of length 3 (using the grouper() function, an itertools recipe). That gives us the column, row, and value, which we then use as the keys and values of a dictionary.
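To make the grouping step concrete, here is a small sketch (reusing the grouper() defined above on the first data row of curves.csv) of what grouper(3, line) yields:

line = ["A", "1", "a", "B", "1", "b", "C", "1", "c", "D", "1", "d", "E", "1", "e"]
print(list(grouper(3, line)))
# [('A', '1', 'a'), ('B', '1', 'b'), ('C', '1', 'c'), ('D', '1', 'd'), ('E', '1', 'e')]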

这为我们提供了{("A", 1): "a", ...}的字典。这是一种很好的格式,所以现在我们将文件构建回所需的格式。

First we need to know which rows and columns we need. We simply pull the rows (and likewise the columns) out of the parsed data, make a set of them (since sets can't contain duplicates), and finally sort them back into a list so they are in the right order.
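As a tiny illustration of that step (with a hand-made parsed dictionary, just for brevity):

parsed = {("A", "1"): "a", ("A", "2"): "f", ("B", "3"): "g"}
rows = sorted({row for (_, row) in parsed})           # ['1', '2', '3']
columns = sorted({column for (column, _) in parsed})  # ['A', 'B']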

Then we open the output file and write the column labels to it (remembering to add None for the row-label column), and then write out our data. For each row we write the row label, then use dict.get() to fetch the value for each column from our parsed data, so that where there is no value we get None. That gives the desired output.
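The key detail is that dict.get() returns None instead of raising KeyError for a missing (column, row) pair, and csv.writer writes None out as an empty field. A minimal sketch:

parsed = {("A", "1"): "a"}
print(parsed.get(("A", "1")))   # prints a
print(parsed.get(("B", "7")))   # prints None; csv.writer writes None as an empty cell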

One note: it looks like you're using Python 2.x in the question, while my answer is written for 3.x. The only difference should be that itertools.zip_longest() is called itertools.izip_longest() in 2.x.
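If you do need to run it on 2.x, a minimal compatibility sketch for that one difference could look like this (the rest of the code stays the same):

import itertools

try:
    zip_longest = itertools.zip_longest    # Python 3
except AttributeError:
    zip_longest = itertools.izip_longest   # Python 2

def grouper(n, iterable, fillvalue=None):
    args = [iter(iterable)] * n
    return zip_longest(fillvalue=fillvalue, *args)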

Answer 2 (score: 2)

Just to show (a bit late) that it can also be done in R:
curves <- read.csv("curves.csv", as.is = TRUE)
# The recycled logical indices pick every third column, stacking the five
# (CL, D, PD) triplets into one long three-column data frame.
stack  <- data.frame(CL = unlist(curves[, c(TRUE, FALSE, FALSE)]),
                     D  = unlist(curves[, c(FALSE, TRUE, FALSE)]),
                     PD = unlist(curves[, c(FALSE, FALSE, TRUE)]),
                     stringsAsFactors = FALSE)
library(reshape2)
# Cast into a D-by-CL table, filling missing combinations with "".
output <- acast(stack, D ~ CL, value.var = "PD", fill = "")
write.csv(output, "new_curves.csv", quote = FALSE)

If you don't want to use a third-party package, you can do it all with base R:
curves   <- read.csv("curves.csv", as.is = TRUE)
# Unique, sorted day and cell-line labels become the dimnames of an empty matrix.
rownames <- sort(unique(unlist(curves[, c(FALSE, TRUE, FALSE)])))
colnames <- sort(unique(unlist(curves[, c(TRUE, FALSE, FALSE)])))
output   <- matrix("", nrow = length(rownames), ncol = length(colnames),
                       dimnames = list(rownames, colnames))
# Map each (D, CL) pair to its (row, column) position and fill in the PD values.
fill.i   <- match(unlist(curves[, c(FALSE, TRUE, FALSE)]), rownames)
fill.j   <- match(unlist(curves[, c(TRUE, FALSE, FALSE)]), colnames)
fill.x   <- unlist(curves[, c(FALSE, FALSE, TRUE)])
output[cbind(fill.i, fill.j)] <- fill.x
write.csv(output, "new_curves.csv", quote = FALSE)

Answer 3 (score: 1)

Without using the csv module:

celllines=["","A","B","C","D","E"]   # leading "" is the row-label column
days=["1","2","3","4","5","6","7","8"]

# flatten the file (minus the header line) into one long list of fields
curves = sum([line.split(',') for line in open("curves.csv","rU").read().split()[1:]], [])

# walk the flat list in steps of 3, keying each PD value by (day, cell line)
group = {(d,cl): pd for (cl,d,pd) in [curves[i:i+3] for i in range(0,len(curves),3)]}
# pre-build the table: the day label in the first column, '' everywhere else
table = [[d if not x else '' for x in celllines] for d in days]

for (d,cl),pd in group.items():
    table[days.index(d)][celllines.index(cl)] = pd

with open("curves2.csv", "w") as f:
    f.write('\n'.join(','.join(line) for line in [celllines]+table))

Answer 4 (score: 1)

An R solution with tapply, using the concatenation function c:

crvs <- read.table(text="CL,D,PD,CL,D,PD,CL,D,PD,CL,D,PD,CL,D,PD
 A,1,a,B,1,b,C,1,c,D,1,d,E,1,e
 A,2,f,B,3,g,C,2,h,D,4,i,E,2,j
 A,5,k,B,6,l,C,5,m,D,8,n,E,5,o", header=TRUE, sep=",", check.names=FALSE)

# stack all five (CL, D, PD) column triplets, then cross-tabulate PD by D and CL
long <- rbind(crvs[, 1:3], crvs[, 4:6], crvs[, 7:9], crvs[, 10:12], crvs[, 13:15])
out <- with( long, tapply(PD, list(D, CL), FUN=c) )
#-----------------
 write.table(out, quote=FALSE, sep=",", na="")
A,B,C,D,E
1,a,b,c,d,e
2,f,,h,,j
3,,g,,,
4,,,,i,
5,k,,m,,o
6,,l,,,
8,,,,n,