Tidying wide data into long format

Time: 2017-12-20 05:26:28

Tags: r dplyr tidyr

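I'm still working out how best to manipulate data with R.

If possible, I'd like to do this with functions from the tidyr or dplyr packages.

I have some data like the following: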

cost <- tibble(
  v1=c('some text1','group1','name','date',c(runif(3),NA,NA,NA)), 
  v2=c('some text1','group1','name','value',c(runif(3),NA,NA,NA)), 
  v3=c('some text2','group1','name2','date',c(runif(4),NA,NA)), 
  v4=c('some text2','group1','name2','value',c(runif(4),NA,NA)),
  v5=c('some text3','group2','name3','date',runif(6)), 
  v6=c('some text3','group2','name3','value',runif(6))
  )
cost[] <- lapply( cost, factor)
> glimpse(cost)
Observations: 10
Variables: 6
$ v1 <fctr> some text1, group1, name, date, 0.924267573514953, 0.203127129469067, 0.0484973937273026, NA, NA, NA
$ v2 <fctr> some text1, group1, name, value, 0.712983385194093, 0.994925277773291, 0.0975768479984254, NA, NA, NA
$ v3 <fctr> some text2, group1, name2, date, 0.188781834673136, 0.859566977713257, 0.739685433451086, 0.2719707184...
$ v4 <fctr> some text2, group1, name2, value, 0.416961463401094, 0.558401603251696, 0.334375116974115, 0.195782373...
$ v5 <fctr> some text3, group2, name3, date, 0.857840840239078, 0.545017473166808, 0.209725016728044, 0.5044016360...
$ v6 <fctr> some text3, group2, name3, value, 0.551554219797254, 0.529705551918596, 0.258927160175517, 0.517376250... 

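I want to make it long rather than wide. That is, the first 3 rows should move into 3 columns, with their values repeated alongside the data for that name. I would also like to remove the missing values; they might not be NA, and look like they are just empty (the data were read from a CSV file with read.csv).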

> cost <- tibble(
    name=c('name', 'name','name'),
    desc=c('some text1', 'some text1', 'some text1'),
    group=c('group2', 'group2', 'group2'),
    date=c('dd-mm-yy', 'dd-mm-yy', 'dd-mm-yy'),
    value=c(runif(1), runif(1), runif(1)) 
   )
> cost
# A tibble: 3 x 5
   name       desc  group     date      value
  <chr>      <chr>  <chr>    <chr>      <dbl>
1  name some text1 group2 dd-mm-yy 0.04565986
2  name some text1 group2 dd-mm-yy 0.82689013
3  name some text1 group2 dd-mm-yy 0.67433167

1 Answer:

Answer 0: (score: 1)

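I think I understand the result you want. If I'm wrong, though, please mock up what you actually want to get from this data.

For this, I used a few functions from tidyr.

First, I create new column labels by collapsing the four rows of header information you have. Basically, I take the first four rows, collapse each column into a single string (separated by three underscores, which are unlikely to occur in your actual data), and then convert the data.frame into a character vector.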

myColNames <-
  cost[1:4,] %>%
  summarise_all(paste, collapse = "___") %>%
  c %>%
  unlist

This produces

                                   v1                                    v2                                    v3 
  "some text1___group1___name___date"  "some text1___group1___name___value"  "some text2___group1___name2___date" 
                                   v4                                    v5                                    v6 
"some text2___group1___name2___value"  "some text3___group2___name3___date" "some text3___group2___name3___value" 
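
Because the later separate step relies on "___" never appearing inside a header cell, it may be worth checking that before collapsing. The following is only a sketch in base R: the `header` data frame is a hypothetical stand-in for `cost[1:4, ]`, and `sep_clash`, `labels` and `parts` are names introduced here for illustration.

```r
# Hypothetical header rows standing in for cost[1:4, ]
header <- data.frame(
  v1 = c("some text1", "group1", "name", "date"),
  v2 = c("some text1", "group1", "name", "value"),
  stringsAsFactors = FALSE
)

sep <- "___"

# TRUE if any header cell already contains the separator,
# in which case a different separator should be chosen
sep_clash <- any(grepl(sep, unlist(header), fixed = TRUE))

# Collapse each column into one label (same idea as summarise_all + paste)
labels <- vapply(header, paste, character(1), collapse = sep)

# Splitting a label on the separator recovers the four components
parts <- strsplit(labels, sep, fixed = TRUE)
```

If `sep_clash` is TRUE, splitting would yield more than four pieces for some columns, so the column labels could not be separated cleanly.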

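Next, I remove the rows used to create the column names and set the new labels as the column names (using setNames). Then I add an index to link together the data on the same row (when they come from the same header). After that, I can gather the data set into long format and separate the column header I created into its components. Finally, I spread the date and value entries into separate columns (matched onto the same row thanks to rowIdx) and filter out the missing observations.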

cost[-(1:4), ] %>%
  ## This step is only necessary if the data
  ## were imported as factors instead of as character
  mutate_all(as.character) %>%
  setNames(myColNames) %>%
  mutate(rowIdx = 1:n()) %>%
  gather(key, tempVal, -rowIdx) %>%
  separate(key, c("Text", "Group", "Name", "toSpread"), sep = "___") %>%
  spread(toSpread, tempVal) %>%
  filter(!is.na(date))

This returns

   rowIdx       Text  Group  Name               date               value
    <int>      <chr>  <chr> <chr>              <chr>               <chr>
 1      1 some text1 group1  name  0.601032865699381   0.320803644834086
 2      1 some text2 group1 name2  0.755003974540159   0.724728998960927
 3      1 some text3 group2 name3  0.782037091907114   0.642663416918367
 4      2 some text1 group1  name 0.0365895153954625   0.131514045642689
 5      2 some text2 group1 name2 0.0913304232526571   0.198074621148407
 6      2 some text3 group2 name3  0.690302846953273   0.915490478742868
 7      3 some text1 group1  name  0.912119234679267   0.474282702198252
 8      3 some text2 group1 name2  0.909885906847194   0.125321796629578
 9      3 some text3 group2 name3  0.883244396885857   0.850464047864079
10      4 some text2 group1 name2  0.894993636989966   0.443535323021933
11      4 some text3 group2 name3  0.674304561689496   0.823389955097809
12      5 some text3 group2 name3  0.700140621513128   0.458009321708232
13      6 some text3 group2 name3   0.19869831786491 0.00457167089916766
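
The question mentions that the missing cells may be empty strings rather than NA, and the result above leaves date and value as character columns. The sketch below is one possible way to handle both; it is not part of the original answer. It assumes tidyr 1.0.0 or later (for `pivot_longer`), and the `raw` tibble with its values is made up to mimic the shape of `cost`.

```r
library(dplyr)
library(tidyr)

# Toy input in the same shape as `cost`: four header rows, then data.
# Values are invented; "" mimics blank cells coming from a CSV import.
raw <- tibble(
  v1 = c("some text1", "group1", "name",  "date",  "01-01-17", "02-01-17", ""),
  v2 = c("some text1", "group1", "name",  "value", "0.5",      "1.5",      ""),
  v3 = c("some text2", "group1", "name2", "date",  "01-01-17", "02-01-17", "03-01-17"),
  v4 = c("some text2", "group1", "name2", "value", "2.5",      "3.5",      "4.5")
)

# Collapse the four header rows of each column into one label
col_names <- vapply(raw[1:4, ], paste, character(1), collapse = "___")

tidy <- raw[-(1:4), ] %>%
  setNames(col_names) %>%
  mutate(rowIdx = row_number()) %>%
  # ".value" turns the last label component (date / value) into columns,
  # which replaces the gather + separate + spread sequence
  pivot_longer(
    -rowIdx,
    names_to = c("Text", "Group", "Name", ".value"),
    names_sep = "___"
  ) %>%
  # Treat blank strings as missing, then drop those rows
  mutate(date = na_if(date, "")) %>%
  filter(!is.na(date)) %>%
  # Convert the measurement back to a number
  mutate(value = as.numeric(value))
```

On the toy input this keeps five rows, dropping the blank third observation for "name". If the blanks come from read.csv, passing `na.strings = ""` at import time might make the `na_if` step unnecessary.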