Question

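I have a dataset covering different countries and sub-national regions. The variable country identifies the country (a, b, c), and the variables region_country_X hold the values for that country's sub-regions (and NA for cases that belong to another country). The code below generates the data frame: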

set.seed(6543)
df <- data.frame(country = sample(c("a", "b", "c"), 1000, replace = TRUE),
         region_country_a = sample(c(0, 1, 2, 3, 4, 5, 6, 7), 1000, replace = TRUE),
         region_country_b = sample(c(0, 1, 2, 3, 4, 5, 6, 7, 8), 1000, replace = TRUE),
         region_country_c = sample(c(0, 1, 2, 3), 1000, replace = TRUE))
df$region_country_a <- ifelse(df$country != "a", NA, df$region_country_a)
df$region_country_b <- ifelse(df$country != "b", NA, df$region_country_b)
df$region_country_c <- ifelse(df$country != "c", NA, df$region_country_c)

The head of the data frame looks like this:

> head(df, 5)
  country region_country_a region_country_b region_country_c
1       c                NA                NA                 1
2       b                NA                 3                NA
3       a                 2                NA                NA
4       c                NA                NA                 1
5       b                NA                 2                NA

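I now want to add a new variable that holds all regions in a single column, but I can't figure out the best way to do it.

I would like R to do the following:

Add a new column regions
Go through the columns country and region_country_a, ..._b, ..._c and assign a new value to each combination (counting from 0 for country a, region 0, upward, and giving each new country-region combination the next highest number).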

The resulting data frame would look like this:

  country regions_country_a regions_country_b regions_country_c regions
1       c                NA                NA                 1      18 #counting with a/0 = 0 etc., a7 = 7, b0 = 8 etc.
2       b                NA                 3                NA      11
3       a                 2                NA                NA       2
4       c                NA                NA                 1      18
5       b                NA                 2                NA      10

I'm not sure how best to approach this, since I'm new to R. Can anyone point me in the right direction?

Answer 1

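If I understand correctly, you are trying to encode each combination of the four columns with a number. If so, you can take the unique combinations, derive an ID from the row number, and join that ID back to the original data frame.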

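library(dplyr)

# One row per distinct combination, sorted by country and then by region
df_un <- unique(df) %>%
  arrange(country, region_country_a, region_country_b, region_country_c) %>%
  mutate(region = row_number() - 1)  # subtract 1 so numbering starts at 0

# Attach the ID to every row of the original data frame
df <- left_join(df, df_un, by = c("country", "region_country_a", "region_country_b", "region_country_c"))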
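As a cross-check on the lookup-table idea, here is a minimal sketch that first collapses the three region columns into one with dplyr::coalesce and then numbers the sorted country/region pairs from 0. The names region_in_country, lookup and result are made up for illustration and do not appear in the question. The data frame is rebuilt from the question's code so the block runs on its own.

```r
library(dplyr)

# Rebuild the question's data frame so the example is self-contained
set.seed(6543)
df <- data.frame(country = sample(c("a", "b", "c"), 1000, replace = TRUE),
                 region_country_a = sample(c(0, 1, 2, 3, 4, 5, 6, 7), 1000, replace = TRUE),
                 region_country_b = sample(c(0, 1, 2, 3, 4, 5, 6, 7, 8), 1000, replace = TRUE),
                 region_country_c = sample(c(0, 1, 2, 3), 1000, replace = TRUE))
df$region_country_a <- ifelse(df$country != "a", NA, df$region_country_a)
df$region_country_b <- ifelse(df$country != "b", NA, df$region_country_b)
df$region_country_c <- ifelse(df$country != "c", NA, df$region_country_c)

# Exactly one of the three region columns is non-NA per row,
# so coalesce() picks that value (hypothetical helper column)
df2 <- df %>%
  mutate(region_in_country = coalesce(region_country_a,
                                      region_country_b,
                                      region_country_c))

# Lookup table: one row per country/region pair, sorted, numbered from 0
lookup <- df2 %>%
  distinct(country, region_in_country) %>%
  arrange(country, region_in_country) %>%
  mutate(regions = row_number() - 1)

# Attach the ID to every row
result <- left_join(df2, lookup, by = c("country", "region_in_country"))
```

Joining on two columns instead of four keeps the lookup table easy to inspect, and the numbering only matches the question's scheme (a/0 = 0, b/0 = 8, c/1 = 18) when every country/region pair actually occurs in the data.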

Answer 2

You can use dplyr::group_indices if you just subtract 1:

library(dplyr)
df %>%
  mutate(id = group_indices(., country, region_country_a, region_country_b, region_country_c)-1) %>%
  head(5)

#   country region_country_a region_country_b region_country_c id
# 1       c                0                0                1 18
# 2       b                0                3                0 11
# 3       a                2                0                0  2
# 4       c                0                0                1 18
# 5       b                0                2                0 10
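As far as I know, passing grouping columns directly to group_indices() is deprecated from dplyr 1.0.0 onward. A minimal sketch of the same idea with group_by() and cur_group_id(), assuming dplyr 1.0.0 or later, could look like the following. The small data frame toy is a made-up stand-in with one row per country/region pair, not the question's data.

```r
library(dplyr)

# Hypothetical stand-in data: one row per country/region pair;
# bind_rows() fills the columns a piece lacks with NA
toy <- bind_rows(
  data.frame(country = "a", region_country_a = 0:7),
  data.frame(country = "b", region_country_b = 0:8),
  data.frame(country = "c", region_country_c = 0:3)
)

out <- toy %>%
  group_by(country, region_country_a, region_country_b, region_country_c) %>%
  mutate(id = cur_group_id() - 1) %>%  # group IDs start at 1, so shift to 0
  ungroup()
```

Groups are numbered in sorted key order with NA placed last, which is why country a's regions receive 0 to 7, country b's 8 to 16, and country c's 17 to 20.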

New column values based on combinations of changing input columns