Question

The problem is as follows: we have a CSV file that contains data in an unusual format. R is huge, and I am clearly missing some short solution.

Given a file, we read it and obtain a data frame of the following form:

# id, file, topic, proportion, [topic, proportion]*
0,file1.txt,0,0.01
1,file2.txt,0,0.01,1,0.03

Is there a short way to convert it into this data frame?

id      file topic proportion
 0 file1.txt     0       0.01
 1 file2.txt     0       0.01
 1 file2.txt     1       0.03

That is, one with a constant number of columns. The number of topic-proportion pairs is not fixed and may be very large. Thanks!

Answer 1

Here is one way to proceed. I assume data holds the path to the file saved as .csv:

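library(plyr)

# Skip the "#" header line; fill = TRUE (the read.csv default) pads short rows with NA
df        = read.csv(data, header = FALSE, comment.char = "#")
names     = c("id","file","topic","proportion")
# Take id, file and the topic/proportion pair that starts at column u
extractDF = function(u) setNames(df[,c(1,2,u,u+1)], names)

# One piece per pair (columns 3-4, 5-6, ...), stacked row-wise
newDF = ldply(seq(3,length(df)-1,by=2), extractDF)

# Drop the padding rows that hold NA
newDF[complete.cases(newDF),]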

#  id      file topic proportion
#1  0 file1.txt     0       0.01
#2  1 file2.txt     0       0.01
#4  1 file2.txt     1       0.03

The data is as follows, saved in CSV format:

# id, file, topic, proportion, [topic, proportion]* 
0,file1.txt,0,0.01 
1,file2.txt,0,0.01,1,0.03
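
The same column-pair idea can also be sketched in base R, without plyr. This is a minimal sketch under stated assumptions, not the answerer's own method: the temporary file, the V1...Vn column names, and the variable names (path, n_cols, wide, long) are made up for illustration. Since read.csv decides the number of columns from the first five lines of input unless col.names is longer, the sketch first calls count.fields on the whole file, which may matter when the widest row appears late in a large file.

```r
# Write the sample data to a temporary file (a stand-in for the real CSV path).
path <- tempfile(fileext = ".csv")
writeLines(c(
  "# id, file, topic, proportion, [topic, proportion]*",
  "0,file1.txt,0,0.01",
  "1,file2.txt,0,0.01,1,0.03"
), path)

# Scan every line for its field count so the widest row sets the column count.
n_cols <- max(count.fields(path, sep = ",", comment.char = "#"), na.rm = TRUE)

# Read without a header, skipping the "#" line; short rows are padded with NA.
wide <- read.csv(path, header = FALSE, comment.char = "#",
                 col.names = paste0("V", seq_len(n_cols)))

# Stack each (topic, proportion) pair under the two identifying columns.
pair_starts <- seq(3, n_cols - 1, by = 2)
pieces <- lapply(pair_starts, function(u) {
  setNames(wide[, c(1, 2, u, u + 1)], c("id", "file", "topic", "proportion"))
})
long <- do.call(rbind, pieces)

# Drop the NA padding and sort so each file's topics sit together.
long <- long[complete.cases(long), ]
long <- long[order(long$id, long$topic), ]
rownames(long) <- NULL
print(long)
```

For a real file, only the path assignment would change; the rest depends on the layout staying "id, file, then topic/proportion pairs".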

Answer 2

You can try merged.stack from my "splitstackshape" package.

Assuming this is your starting data...

mydf <- read.table(
  text = "id, file, topic, proportion, topic, proportion
0,file1.txt,0,0.01
1,file2.txt,0,0.01,1,0.03", 
  header = TRUE, sep = ",", fill = TRUE) 
mydf
#   id      file topic proportion topic.1 proportion.1
# 1  0 file1.txt     0       0.01      NA           NA
# 2  1 file2.txt     0       0.01       1         0.03

You just need to do...

library(splitstackshape)
merged.stack(mydf, var.stubs = c("topic", "proportion"), 
             sep = "var.stubs")[, .time_1 := NULL][]
#    id      file topic proportion
# 1:  0 file1.txt     0       0.01
# 2:  0 file1.txt    NA         NA
# 3:  1 file2.txt     0       0.01
# 4:  1 file2.txt     1       0.03

If you don't want the rows with NA values, wrap the whole thing in na.omit.

na.omit(
  merged.stack(mydf, var.stubs = c("topic", "proportion"), 
               sep = "var.stubs")[, .time_1 := NULL])
#    id      file topic proportion
# 1:  0 file1.txt     0       0.01
# 2:  1 file2.txt     0       0.01
# 3:  1 file2.txt     1       0.03
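
If installing a package is not an option, a similar wide-to-long step can probably be done with reshape from base R. The following is only a sketch of that alternative, not part of splitstackshape: the helper names topic_cols and proportion_cols and the "pair" index column are assumptions chosen for illustration, and the grep patterns assume the stub names appear only at the start of the repeated columns.

```r
# Same starting data as in the answer above.
mydf <- read.table(
  text = "id, file, topic, proportion, topic, proportion
0,file1.txt,0,0.01
1,file2.txt,0,0.01,1,0.03",
  header = TRUE, sep = ",", fill = TRUE)

# Group the wide columns by stub: topic, topic.1, ... and proportion, proportion.1, ...
topic_cols      <- grep("^topic", names(mydf), value = TRUE)
proportion_cols <- grep("^proportion", names(mydf), value = TRUE)

long <- reshape(mydf,
                direction = "long",
                varying   = list(topic_cols, proportion_cols),
                v.names   = c("topic", "proportion"),
                timevar   = "pair",           # hypothetical name for the pair index
                idvar     = c("id", "file"))

# Remove the NA padding rows, drop the helper column, and tidy the row order.
long <- na.omit(long)
long <- long[order(long$id, long$topic), c("id", "file", "topic", "proportion")]
rownames(long) <- NULL
print(long)
```

The grep calls pick up however many topic/proportion pairs the header declares, so the reshape call itself does not need to change when more pairs are added.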

Is there an idiomatic R way to normalize a data frame?
