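I am still working out how best to manipulate data in R.
If possible, I would like to do this with functions from the tidyr or dplyr packages.
I have some data like this: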
cost <- tibble(
v1=c('some text1','group1','name','date',c(runif(3),NA,NA,NA)),
v2=c('some text1','group1','name','value',c(runif(3),NA,NA,NA)),
v3=c('some text2','group1','name2','date',c(runif(4),NA,NA)),
v4=c('some text2','group1','name2','value',c(runif(4),NA,NA)),
v5=c('some text3','group2','name3','date',runif(6)),
v6=c('some text3','group2','name3','value',runif(6))
)
cost[] <- lapply( cost, factor)
> glimpse(cost)
Observations: 10
Variables: 6
$ v1 <fctr> some text1, group1, name, date, 0.924267573514953, 0.203127129469067, 0.0484973937273026, NA, NA, NA
$ v2 <fctr> some text1, group1, name, value, 0.712983385194093, 0.994925277773291, 0.0975768479984254, NA, NA, NA
$ v3 <fctr> some text2, group1, name2, date, 0.188781834673136, 0.859566977713257, 0.739685433451086, 0.2719707184...
$ v4 <fctr> some text2, group1, name2, value, 0.416961463401094, 0.558401603251696, 0.334375116974115, 0.195782373...
$ v5 <fctr> some text3, group2, name3, date, 0.857840840239078, 0.545017473166808, 0.209725016728044, 0.5044016360...
$ v6 <fctr> some text3, group2, name3, value, 0.551554219797254, 0.529705551918596, 0.258927160175517, 0.517376250...
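I want to make it longer, and wider: that is, the first 3 rows moved into 3 columns, with their values repeated alongside the data for that name.
I also want to remove the missing values. They may not actually be NA; they appear to just be empty (the data were read from a CSV file with read.csv).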
> cost <- tibble(
name=c('name', 'name','name'),
desc=c('some text1', 'some text1', 'some text1'),
group=c('group2', 'group2', 'group2'),
date=c('dd-mm-yy', 'dd-mm-yy', 'dd-mm-yy'),
value=c(runif(1), runif(1), runif(1))
)
> cost
# A tibble: 3 x 5
name desc group date value
<chr> <chr> <chr> <chr> <dbl>
1 name some text1 group2 dd-mm-yy 0.04565986
2 name some text1 group2 dd-mm-yy 0.82689013
3 name some text1 group2 dd-mm-yy 0.67433167
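Answer 0 (score: 1)
I think I understand the result you want. If I am wrong, please mock up what you actually want to get from these data.
For this, I used several functions from tidyr.
First, I create new column labels by collapsing the four rows of header information you have. Basically, I take the first four rows, collapse each column into a single string (separated by three underscores, which is unlikely to occur in your actual data), and then convert the data.frame to a character vector.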
myColNames <-
cost[1:4,] %>%
summarise_all(paste, collapse = "___") %>%
c %>%
unlist
This produces:
v1 v2 v3
"some text1___group1___name___date" "some text1___group1___name___value" "some text2___group1___name2___date"
v4 v5 v6
"some text2___group1___name2___value" "some text3___group2___name3___date" "some text3___group2___name3___value"
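The same kind of labels can also be built in base R. This is only a minimal sketch, assuming the header always occupies rows 1 to 4; `header_demo` and `myColNames_base` are made-up names for illustration and are not part of the question's data.

```r
# Hypothetical stand-in for the first four rows of `cost`
header_demo <- data.frame(
  v1 = c("some text1", "group1", "name", "date"),
  v2 = c("some text1", "group1", "name", "value"),
  stringsAsFactors = TRUE  # mimic columns that were imported as factors
)

# Collapse each column's four header rows into a single string.
# as.character() makes sure factor labels, not integer codes, are pasted.
myColNames_base <- vapply(
  header_demo[1:4, ],
  function(col) paste(as.character(col), collapse = "___"),
  character(1)
)

myColNames_base
```

The result is a named character vector with one label per original column, which can be passed to setNames in the same way as myColNames.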
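Next, I drop the rows that were used to build the column names and set those names on the columns (using setNames). I then add an index to tie together the data on each row (when it comes from the same header). Then I can gather the data set into long format and separate the column headers I created into their components. Finally, I can spread the date and value entries into separate columns (matched onto the same row thanks to rowIdx) and filter out the missing observations.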
cost[-(1:4), ] %>%
## This step is only necessary if the data
## were imported as factors instead of as character
mutate_all(as.character) %>%
setNames(myColNames) %>%
mutate(rowIdx = 1:n()) %>%
gather(key, tempVal, -rowIdx) %>%
separate(key, c("Text", "Group", "Name", "toSpread"), sep = "___") %>%
spread(toSpread, tempVal) %>%
filter(!is.na(date))
which returns:
rowIdx Text Group Name date value
<int> <chr> <chr> <chr> <chr> <chr>
1 1 some text1 group1 name 0.601032865699381 0.320803644834086
2 1 some text2 group1 name2 0.755003974540159 0.724728998960927
3 1 some text3 group2 name3 0.782037091907114 0.642663416918367
4 2 some text1 group1 name 0.0365895153954625 0.131514045642689
5 2 some text2 group1 name2 0.0913304232526571 0.198074621148407
6 2 some text3 group2 name3 0.690302846953273 0.915490478742868
7 3 some text1 group1 name 0.912119234679267 0.474282702198252
8 3 some text2 group1 name2 0.909885906847194 0.125321796629578
9 3 some text3 group2 name3 0.883244396885857 0.850464047864079
10 4 some text2 group1 name2 0.894993636989966 0.443535323021933
11 4 some text3 group2 name3 0.674304561689496 0.823389955097809
12 5 some text3 group2 name3 0.700140621513128 0.458009321708232
13 6 some text3 group2 name3 0.19869831786491 0.00457167089916766
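The question mentions that the missing cells may be empty strings rather than NA, and the date and value columns above come back as character. The following is a rough sketch of one way to handle both, assuming tidyr 1.0 or later and dplyr 1.0 or later, where pivot_longer with a `.value` name component can stand in for the gather, separate and spread steps. `cost_demo`, `demoColNames` and `tidy_demo` are hypothetical names, and the tiny data set is invented for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical miniature of `cost`, read in as character,
# with empty strings standing in for missing cells
cost_demo <- tibble(
  v1 = c("some text1", "group1", "name",  "date",  "0.1", "0.2"),
  v2 = c("some text1", "group1", "name",  "value", "0.3", "0.4"),
  v3 = c("some text2", "group1", "name2", "date",  "0.5", ""),
  v4 = c("some text2", "group1", "name2", "value", "0.6", "")
)

# Collapse the four header rows into one label per column
demoColNames <- unname(
  vapply(cost_demo[1:4, ], paste, character(1), collapse = "___")
)

tidy_demo <- cost_demo[-(1:4), ] %>%
  setNames(demoColNames) %>%
  mutate(rowIdx = row_number()) %>%
  # ".value" turns the last label component (date / value) into columns
  pivot_longer(
    -rowIdx,
    names_to = c("Text", "Group", "Name", ".value"),
    names_sep = "___"
  ) %>%
  # Treat empty strings as missing, then make `value` numeric
  mutate(
    across(c(date, value), ~ na_if(.x, "")),
    value = as.numeric(value)
  ) %>%
  filter(!is.na(date))

tidy_demo
```

Here date is left as character because the real date format is not shown in the question; it would need its own parsing step once that format is known.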