Handling hierarchical data in R

Time: 2018-04-23 03:02:04

Tags: r tidyr

I have this data frame,

DF

Area Areacode Value
Region1 NA 23
Area1   1 2
Area2   2 1
Area3   3 20
Region2 NA 14
Area1   1 10
Area4   4 4

How can we flatten the relationship in the Area column, so that the output looks like this:

Area AreaCode Region Value
Area1 1 Region1 2
Area2 2 Region1 1
Area3 3 Region1 20
Area1 1 Region2 10
Area4 4 Region2 4

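Sorry, I forgot to mention: some area names will contain the text "region". To tell them apart from regions, note that region rows will not have an area code.

Thanks.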

3 answers:

Answer 0: (score: 2)

How about this?

library(tidyverse);
df %>%
    mutate_if(is.factor, as.character) %>%
    mutate(Region = ifelse(str_detect(Area, "Region"), Area, NA)) %>%
    fill(Region) %>%
    filter(!str_detect(Area, "Region"))
#   Area Value  Region
#1 Area1     2 Region1
#2 Area2     1 Region1
#3 Area3    20 Region1
#4 Area1    10 Region2
#5 Area4     4 Region2

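Explanation: Create a new column Region holding the entries of Area that match "Region" (and NA everywhere else). Use tidyr::fill to replace each NA with the previous non-NA entry, then remove the rows whose Area column matches "Region".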
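To make the three steps easier to follow, here is a minimal sketch that runs them one at a time and keeps each intermediate result. The data frame `demo` and the names `step1`, `step2` and `result` are made up for illustration, and base `grepl` stands in for `str_detect` so that only dplyr and tidyr are needed.

```r
library(dplyr)
library(tidyr)

# Small illustrative data frame (hypothetical, same shape as the sample data)
demo <- data.frame(
    Area  = c("Region1", "Area1", "Area2", "Region2", "Area4"),
    Value = c(23, 2, 1, 14, 4),
    stringsAsFactors = FALSE
)

# Step 1: copy the region name into a new column, NA on all other rows
step1 <- demo %>%
    mutate(Region = ifelse(grepl("Region", Area), Area, NA))

# Step 2: fill() carries the last non-NA Region value downwards
step2 <- step1 %>%
    fill(Region)

# Step 3: drop the region header rows, keeping only the area rows
result <- step2 %>%
    filter(!grepl("Region", Area))
```

Printing `step1` and `step2` side by side shows how the NA entries are replaced by the region name from the nearest header row above them.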

Sample data

df <- read.table(text =
    "Area Value
Region1 23
Area1   2
Area2   1
Area3   20
Region2 14
Area1   10
Area4   4", header = T)

Update

Based on your revised sample data, we can do:

df <- read.table(text =
    "Area Areacode Value
Region1 NA 23
Area1 1 2
'Area region2' 2 1
Area3 3 20
Region2 NA 14
'Area region1' 1 10
Area4 4 4", header = T)

df %>%
    mutate_if(is.factor, as.character) %>%
    mutate(Region = ifelse(is.na(Areacode), Area, NA)) %>%
    fill(Region) %>%
    filter(!is.na(Areacode));
#          Area Areacode Value  Region
#1        Area1        1     2 Region1
#2 Area region2        2     1 Region1
#3        Area3        3    20 Region1
#4 Area region1        1    10 Region2
#5        Area4        4     4 Region2

Note that this assumes that

  1. rows containing a region always have Areacode = NA
  2. there is always a Region row before the subsequent Area rows
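If the input might not satisfy the second assumption, a small guard can be run before flattening. The sketch below is only an illustration: `check_layout` is a hypothetical helper, not part of any package, and it checks just that the data starts with a region row (one whose Areacode is NA), so that every area row has a region above it.

```r
# Hypothetical helper: stop early if the data does not start with a region row
check_layout <- function(df) {
    is_region <- is.na(df$Areacode)   # assumption 1: region rows have no code
    stopifnot(
        nrow(df) > 0,
        is_region[1]                  # assumption 2: first row is a region
    )
    invisible(TRUE)
}

# Illustrative data that satisfies both assumptions
ok_df <- data.frame(
    Area     = c("Region1", "Area1", "Area region2"),
    Areacode = c(NA, 1, 2),
    Value    = c(23, 2, 1),
    stringsAsFactors = FALSE
)

check_layout(ok_df)
```

A data frame whose first row has a non-NA Areacode would make `check_layout` raise an error instead of silently producing area rows with a missing Region.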

Answer 1: (score: 1)

You can group by the cumulative sum of the "Region" matches in Area:

library(dplyr)

df <- tibble(Area = c("Region1", "Area1", "Area2", "Area3", "Region2", "Area1", "Area4"), 
             Value = c(23L, 2L, 1L, 20L, 14L, 10L, 4L))

df2 <- df %>% 
    # group by cumulative number of "Region" matches
    group_by(region_number = cumsum(grepl('Region', Area))) %>% 
    mutate(Region = Area[1]) %>%    # add Region name for each group
    slice(-1) %>%    # drop Region rows
    ungroup() %>% select(Area, Region, Value)    # drop index and rearrange

df2
#> # A tibble: 5 x 3
#>   Area  Region  Value
#>   <chr> <chr>   <int>
#> 1 Area1 Region1     2
#> 2 Area2 Region1     1
#> 3 Area3 Region1    20
#> 4 Area1 Region2    10
#> 5 Area4 Region2     4
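The question notes that some area names contain the text "region" and that only region rows lack an area code. Under that assumption, the same grouping idea can key on the missing Areacode instead of a text match. This is a sketch of that variant, not the answer's original code; `df_codes` and `flat` are illustrative names.

```r
library(dplyr)

# Illustrative data with an Areacode column; region rows have Areacode = NA
df_codes <- tibble(
    Area     = c("Region1", "Area1", "Area region2", "Region2", "Area4"),
    Areacode = c(NA, 1L, 2L, NA, 4L),
    Value    = c(23L, 2L, 1L, 14L, 4L)
)

flat <- df_codes %>%
    # a new group starts at every row with a missing Areacode
    group_by(region_number = cumsum(is.na(Areacode))) %>%
    mutate(Region = first(Area)) %>%   # first row of each group is the region
    slice(-1) %>%                      # drop the region row itself
    ungroup() %>%
    select(Area, Areacode, Region, Value)
```

Because the grouping no longer depends on the text of Area, a row such as "Area region2" stays an ordinary area row.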

Answer 2: (score: 1)

A base R solution:

do.call(rbind,by(df,cumsum(is.na(df$Areacode)),function(x)cbind(Region=x[1,1],x[-1,])))
#      Region  Area Areacode Value
# 1.2 Region1 Area1        1     2
# 1.3 Region1 Area2        2     1
# 1.4 Region1 Area3        3    20
# 2.6 Region2 Area1        1    10
# 2.7 Region2 Area4        4     4
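The same cumulative-sum idea can also be written without `by` and `rbind`, by using the running count of region rows as an index into the vector of region names. This is a sketch of an alternative, not the answer's code; it assumes the first row is a region row, and `df_base`, `is_region`, `region_names` and `flat_base` are illustrative names.

```r
# Illustrative copy of the question's data
df_base <- data.frame(
    Area     = c("Region1", "Area1", "Area2", "Area3", "Region2", "Area1", "Area4"),
    Areacode = c(NA, 1, 2, 3, NA, 1, 4),
    Value    = c(23, 2, 1, 20, 14, 10, 4),
    stringsAsFactors = FALSE
)

is_region    <- is.na(df_base$Areacode)     # TRUE on region header rows
region_names <- df_base$Area[is_region]     # "Region1", "Region2"

# cumsum(is_region) gives 1 for rows under the first region, 2 under the second
df_base$Region <- region_names[cumsum(is_region)]

# keep only the area rows and put the columns in the requested order
flat_base <- df_base[!is_region, c("Area", "Areacode", "Region", "Value")]
rownames(flat_base) <- NULL
```

This avoids splitting and re-binding the data frame, and it leaves plain sequential row names rather than names such as 1.2 or 2.6.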