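The problem is as follows: we have a csv file that contains data in an unusual format. R is huge, and I am surely missing some short solution.
Given such a file, we read it and get a data frame of the following form: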
# id, file, topic, proportion, [topic, proportion]*
0,file1.txt,0,0.01
1,file2.txt,0,0.01,1,0.03
Is there a short way to convert it into this data frame:
id file topic proportion
0 file1.txt 0 0.01
1 file2.txt 0 0.01
1 file2.txt 1 0.03
Note that the number of columns is not constant: the number of topic-proportion pairs is not defined and may be very large. Thanks!
Answer 0 (score: 1)
Here is one way to proceed. I assume the variable data holds the path to the file saved as .csv:
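library(plyr)
df <- read.csv(data)
col_names <- c("id", "file", "topic", "proportion")
# pair the id/file columns with the topic/proportion pair starting at column u
extractDF <- function(u) setNames(df[, c(1, 2, u, u + 1)], col_names)
# stack every pair (columns 3-4, 5-6, ...) into one long data frame
newDF <- ldply(seq(3, length(df) - 1, by = 2), extractDF)
# drop the rows padded with NA for files that have fewer pairs
newDF[complete.cases(newDF), ]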
# id file topic proportion
#1 0 file1.txt 0 0.01
#2 1 file2.txt 0 0.01
#4 1 file2.txt 1 0.03
The data, saved in csv format, is as follows:
# id, file, topic, proportion, [topic, proportion]*
0,file1.txt,0,0.01
1,file2.txt,0,0.01,1,0.03
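Since the question says the number of topic-proportion pairs is not fixed, here is a minimal base R sketch of the same stacking idea that first works out how wide the file is. It assumes the file has no header line (the line starting with # above is treated as a description only), and the temporary file, the numbered column names, and the variables path, wide and long are made up for illustration.

```r
# Hypothetical sample written to a temporary file; it stands in for `data`.
path <- tempfile(fileext = ".csv")
writeLines(c("0,file1.txt,0,0.01",
             "1,file2.txt,0,0.01,1,0.03"), path)

# The widest row decides how many columns have to be read.
n_col <- max(count.fields(path, sep = ","))
n_pairs <- (n_col - 2) / 2

# id, file, then topic1, proportion1, topic2, proportion2, ...
col_names <- c("id", "file",
               paste0(c("topic", "proportion"), rep(seq_len(n_pairs), each = 2)))
wide <- read.csv(path, header = FALSE, col.names = col_names, fill = TRUE)

# Stack every (topic, proportion) pair under the same four column names.
pieces <- lapply(seq(3, n_col - 1, by = 2), function(u) {
  setNames(wide[, c(1, 2, u, u + 1)], c("id", "file", "topic", "proportion"))
})
long <- do.call(rbind, pieces)

# Drop the NA padding of shorter rows and restore the original row order.
long <- long[complete.cases(long), ]
long <- long[order(long$id, long$topic), ]
rownames(long) <- NULL
long
```

Passing col.names explicitly keeps read.csv from guessing the column count from the first few lines only, which matters when the longest row appears late in the file.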
Answer 1 (score: 0)
You can try merged.stack from my "splitstackshape" package.
Assuming this is your starting data...
mydf <- read.table(
text = "id, file, topic, proportion, topic, proportion
0,file1.txt,0,0.01
1,file2.txt,0,0.01,1,0.03",
header = TRUE, sep = ",", fill = TRUE)
mydf
# id file topic proportion topic.1 proportion.1
# 1 0 file1.txt 0 0.01 NA NA
# 2 1 file2.txt 0 0.01 1 0.03
Then you just need to do...
library(splitstackshape)
merged.stack(mydf, var.stubs = c("topic", "proportion"),
sep = "var.stubs")[, .time_1 := NULL][]
# id file topic proportion
# 1: 0 file1.txt 0 0.01
# 2: 0 file1.txt NA NA
# 3: 1 file2.txt 0 0.01
# 4: 1 file2.txt 1 0.03
If you don't want the rows with NA values, wrap the whole thing in na.omit.
na.omit(
merged.stack(mydf, var.stubs = c("topic", "proportion"),
sep = "var.stubs")[, .time_1 := NULL])
# id file topic proportion
# 1: 0 file1.txt 0 0.01
# 2: 1 file2.txt 0 0.01
# 3: 1 file2.txt 1 0.03
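For comparison, roughly the same wide-to-long step can be sketched with base R's reshape, without any extra package. This is not part of the answer above, only an illustration: the data frame wide is rebuilt by hand to mirror mydf, and the pairing of columns in varying is written out explicitly, so it would have to be generated for files with more pairs.

```r
# Hand-built copy of the wide data shown above (mirrors mydf).
wide <- data.frame(
  id = c(0, 1),
  file = c("file1.txt", "file2.txt"),
  topic = c(0, 0),
  proportion = c(0.01, 0.01),
  topic.1 = c(NA, 1),
  proportion.1 = c(NA, 0.03)
)

# Each element of `varying` lists the wide columns that collapse into
# the long column named at the same position in `v.names`.
long <- reshape(
  wide,
  direction = "long",
  varying = list(c("topic", "topic.1"), c("proportion", "proportion.1")),
  v.names = c("topic", "proportion"),
  idvar = c("id", "file")
)

# Keep only the four wanted columns, drop NA padding, tidy the row order.
long <- na.omit(long[, c("id", "file", "topic", "proportion")])
long <- long[order(long$id, long$topic), ]
rownames(long) <- NULL
long
```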