I have this data frame,
DF
Area Areacode Value
Region1 NA 23
Area1 1 2
Area2 2 1
Area3 3 20
Region2 NA 14
Area1 1 10
Area4 4 4
How can we flatten the hierarchy in the Area column, so that the output looks like this,
Area AreaCode Region Value
Area1 1 Region1 2
Area2 2 Region1 1
Area3 3 Region1 20
Area1 1 Region2 10
Area4 4 Region2 4
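Sorry, I forgot to mention that some area names will contain the text "region". To tell them apart from regions: a region never has an area code.
Thanks.
Answer 0 (score: 2)
How about this?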
library(tidyverse)
df %>%
mutate_if(is.factor, as.character) %>%
mutate(Region = ifelse(str_detect(Area, "Region"), Area, NA)) %>%
fill(Region) %>%
filter(!str_detect(Area, "Region"))
# Area Value Region
#1 Area1 2 Region1
#2 Area2 1 Region1
#3 Area3 20 Region1
#4 Area1 10 Region2
#5 Area4 4 Region2
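Explanation: create a new column Region from the entries of Area that match "Region". Use tidyr::fill to replace each NA with the previous non-NA entry, then remove the rows where Area matches "Region".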
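The fill step can be looked at in isolation. Below is a minimal sketch on a toy column; the data frame `toy` and its values are made up for illustration and are not part of the question's data. By default `fill()` carries the last non-NA value downward.

```r
library(tidyr)

# Hypothetical toy data: a Region name only on the first row of each block
toy <- data.frame(
  Region = c("Region1", NA, NA, "Region2", NA),
  stringsAsFactors = FALSE
)

# fill() replaces each NA with the most recent non-NA value above it
filled <- fill(toy, Region)
filled$Region
```

Every row ends up carrying the name of the region block it belongs to, which is why the region rows themselves can be dropped afterwards.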
Sample data:
df <- read.table(text =
"Area Value
Region1 23
Area1 2
Area2 1
Area3 20
Region2 14
Area1 10
Area4 4", header = TRUE)
Based on your revised sample data, we can do:
df <- read.table(text =
"Area Areacode Value
Region1 NA 23
Area1 1 2
'Area region2' 2 1
Area3 3 20
Region2 NA 14
'Area region1' 1 10
Area4 4 4", header = TRUE)
df %>%
  mutate_if(is.factor, as.character) %>%
  mutate(Region = ifelse(is.na(Areacode), Area, NA)) %>%
  fill(Region) %>%
  filter(!is.na(Areacode))
# Area Areacode Value Region
#1 Area1 1 2 Region1
#2 Area region2 2 1 Region1
#3 Area3 3 20 Region1
#4 Area region1 1 10 Region2
#5 Area4 4 4 Region2
Note that this assumes that Region rows are exactly the rows with Areacode = NA, and that a Region row always precedes its Area rows.
Answer 1 (score: 1)
You can group by the cumulative sum of the "Region" matches in Area:
library(dplyr)
# tibble() replaces the deprecated data_frame()
df <- tibble(Area = c("Region1", "Area1", "Area2", "Area3", "Region2", "Area1", "Area4"),
             Value = c(23L, 2L, 1L, 20L, 14L, 10L, 4L))
df2 <- df %>%
# group by cumulative number of "Region" matches
group_by(region_number = cumsum(grepl('Region', Area))) %>%
mutate(Region = Area[1]) %>% # add Region name for each group
slice(-1) %>% # drop Region rows
ungroup() %>% select(Area, Region, Value) # drop index and rearrange
df2
#> # A tibble: 5 x 3
#> Area Region Value
#> <chr> <chr> <int>
#> 1 Area1 Region1 2
#> 2 Area2 Region1 1
#> 3 Area3 Region1 20
#> 4 Area1 Region2 10
#> 5 Area4 Region2 4
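To see why grouping by the cumulative sum works, the grouping key can be computed on its own. This is a minimal base R sketch using the same Area values as above; the variable names `area`, `is_region` and `group_id` are illustrative only.

```r
area <- c("Region1", "Area1", "Area2", "Area3", "Region2", "Area1", "Area4")

# TRUE on the rows that start a new region block
is_region <- grepl("Region", area)

# The running count of TRUEs increases by one at each Region row,
# so every row gets the index of the block it belongs to
group_id <- cumsum(is_region)
group_id
#> [1] 1 1 1 1 2 2 2
```

Since each group starts with its Region row, `Area[1]` within a group is the region name and `slice(-1)` removes that header row. With the revised data, where some area names contain "region", `is.na(Areacode)` would be the safer test than `grepl()`.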
Answer 2 (score: 1)
A base R solution:
# Split on the running count of NA area codes; row 1 of each chunk is its Region
do.call(rbind,
        by(df, cumsum(is.na(df$Areacode)),
           function(x) cbind(Region = x[1, 1], x[-1, ])))
Region Area Areacode Value
1.2 Region1 Area1 1 2
1.3 Region1 Area2 2 1
1.4 Region1 Area3 3 20
2.6 Region2 Area1 1 10
2.7 Region2 Area4 4 4