Question

I have a data frame like this:

   indx country year  death value
1     1   Italy 2000    hiv     1
2     1   Italy 2001    hiv     2
3     1   Italy 2005    hiv     3
4     1   Italy 2000 cancer     4
5     1   Italy 2001 cancer     5
6     1   Italy 2002 cancer     6
7     1   Italy 2003 cancer     7
8     1   Italy 2004 cancer     8
9     1   Italy 2005 cancer     9
10    4  France 2000    hiv    10
11    4  France 2004    hiv    11
12    4  France 2005    hiv    12
13    4  France 2001 cancer    13
14    4  France 2002 cancer    14
15    4  France 2003 cancer    15
16    4  France 2004 cancer    16
17    2   Spain 2000    hiv    17
18    2   Spain 2001    hiv    18
19    2   Spain 2002    hiv    19
20    2   Spain 2003    hiv    20
21    2   Spain 2004    hiv    21
22    2   Spain 2005    hiv    22
23    2   Spain  ...    ...    ...

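indx is a value tied to country (same country = same indx).

In this example I used only 3 countries (country) and 2 causes of death (death); the original data frame has many more.

I would like one row per year from 2000 to 2005 for every cause of death in every country.

What I want to get is: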

    indx  country  year  death  value
1      1    Italy  2000    hiv      1
2      1    Italy  2001    hiv      2
3      1    Italy  2002    hiv     NA
4      1    Italy  2003    hiv     NA
5      1    Italy  2004    hiv     NA
6      1    Italy  2005    hiv      3
7      1    Italy  2000 cancer      4
8      1    Italy  2001 cancer      5
9      1    Italy  2002 cancer      6
10     1    Italy  2003 cancer      7
11     1    Italy  2004 cancer      8
12     1    Italy  2005 cancer      9
13     4   France  2000    hiv     10
14     4   France  2001    hiv     NA
15     4   France  2002    hiv     NA
16     4   France  2003    hiv     NA
17     4   France  2004    hiv     11
18     4   France  2005    hiv     12
19     4   France  2000 cancer     NA
20     4   France  2001 cancer     13
21     4   France  2002 cancer     14
22     4   France  2003 cancer     15
23     4   France  2004 cancer     16
24     4   France  2005 cancer     NA
25     2    Spain  2000    hiv     17
26     2    Spain  2001    hiv     18
27     2    Spain  2002    hiv     19
28     2    Spain  2003    hiv     20
29     2    Spain  2004    hiv     21
30     2    Spain  2005    hiv     22
31     2    Spain  ...     ...     ...

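That is, I want to add rows with value = NA for the missing years of each cause of death in each country.

For example, the HIV data for Italy are missing for 2002 through 2004, so I added rows for those years with value = NA.

How can I do this?

Reproducible example: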

indx <- c(rep(1, times=9), rep(4, times=7), rep(2, times=6))
country <- c(rep("Italy", times=9), rep("France", times=7), rep("Spain", times=6))
year <- c(2000, 2001, 2005, 2000:2005, 2000, 2004, 2005, 2001:2004, 2000:2005)
death <- c(rep("hiv", times=3), rep("cancer", times=6), rep("hiv", times=3), rep("cancer", times=4), rep("hiv", times=6))
value <- c(1:22)
dfl <- data.frame(indx, country, year, death, value)

Answer 1

With base R, you can do:

# setDF(dfl) # run this first if you have a data.table
merge(expand.grid(lapply(dfl[c("country", "death", "year")], unique)), dfl, all.x = TRUE)

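This first creates all combinations of the unique values of country, death and year, and then merges that grid with the original data to bring in the values. Where a combination is not present in the original data, merge fills the remaining columns with NA.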
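
Because indx is not one of the grid columns, the merge above leaves indx as NA in the added rows, whereas the desired output in the question has it filled in. One possible way to restore it is sketched below, assuming each country maps to exactly one indx (as the question states). The names idx_key, grid and out are illustrative, not part of the original answer.

```r
# Rebuild the example data from the question
indx <- c(rep(1, times=9), rep(4, times=7), rep(2, times=6))
country <- c(rep("Italy", times=9), rep("France", times=7), rep("Spain", times=6))
year <- c(2000, 2001, 2005, 2000:2005, 2000, 2004, 2005, 2001:2004, 2000:2005)
death <- c(rep("hiv", times=3), rep("cancer", times=6), rep("hiv", times=3),
           rep("cancer", times=4), rep("hiv", times=6))
value <- c(1:22)
dfl <- data.frame(indx, country, year, death, value)

# Lookup table (assumed one-to-one): one row per country with its indx
idx_key <- unique(dfl[c("indx", "country")])

# Every combination of the observed country, death and year values
grid <- expand.grid(lapply(dfl[c("country", "death", "year")], unique),
                    stringsAsFactors = FALSE)

# Attach indx to every grid row, then left-join the observed values;
# the second merge matches on all shared columns (indx, country, death, year)
out <- merge(merge(idx_key, grid, by = "country"), dfl, all.x = TRUE)
out <- out[order(out$indx, out$death, out$year), ]
```

With this variant only value is NA in the added rows; indx comes from the lookup table.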

The tidyr package has a dedicated function that does this for you in a single command:

library(tidyr)
complete(dfl, country, year, death)
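
As written, complete() treats indx as just another column to fill, so the added rows get indx = NA. A hedged variant is sketched below: tidyr's nesting() restricts indx and country to the pairs that actually occur, and full_seq() spans the full year range even if some year never appears in the data. The variable name out is illustrative.

```r
library(tidyr)

# Rebuild the example data from the question
indx <- c(rep(1, times=9), rep(4, times=7), rep(2, times=6))
country <- c(rep("Italy", times=9), rep("France", times=7), rep("Spain", times=6))
year <- c(2000, 2001, 2005, 2000:2005, 2000, 2004, 2005, 2001:2004, 2000:2005)
death <- c(rep("hiv", times=3), rep("cancer", times=6), rep("hiv", times=3),
           rep("cancer", times=4), rep("hiv", times=6))
value <- c(1:22)
dfl <- data.frame(indx, country, year, death, value)

# nesting() keeps only the observed (indx, country) pairs;
# full_seq(year, 1) expands year to every integer step between min and max
out <- complete(dfl, nesting(indx, country), death, year = full_seq(year, 1))
```

Only value should be NA in the added rows here, since indx travels with country.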

Answer 2

Here is a longer base R approach. You create two new data.frames: one containing every combination of country, year and death, and one containing the index key.

# get data.frame with every combination of country, year, and death
# (dfl is the data frame from the question)
dfNew <- with(dfl, expand.grid("country"=unique(country), "year"=unique(year),
                               "death"=unique(death)))

# get index key
indexKey <- unique(dfl[, c("indx", "country")])

# merge these together so every combination carries its indx
dfNew <- merge(indexKey, dfNew, by="country")

# merge onto original data set; combinations without data get value = NA
dfNew <- merge(dfl, dfNew, by=c("indx", "country", "year", "death"), all=TRUE)

which returns

dfNew
   indx country year  death value
1     1   Italy 2000 cancer     4
2     1   Italy 2000    hiv     1
3     1   Italy 2001 cancer     5
4     1   Italy 2001    hiv     2
5     1   Italy 2002 cancer     6
6     1   Italy 2002    hiv    NA
7     1   Italy 2003 cancer     7
8     1   Italy 2003    hiv    NA
9     1   Italy 2004 cancer     8
10    1   Italy 2004    hiv    NA
11    1   Italy 2005 cancer     9
12    1   Italy 2005    hiv     3
13    2   Spain 2000 cancer    NA
14    2   Spain 2000    hiv    17
15    2   Spain 2001 cancer    NA
...

If dfl is a data.table, the corresponding lines of code are:

# CJ is a cross-join
setkey(dfl, country, year, death)
dfNew <- dfl[CJ(country, year, death, unique=TRUE),
             .(country, year, death, value)]

indexKey <- unique(dfl[, .(indx, country)])

# dfNew already carries value from the join above, so attaching indx
# completes the result; merging onto dfl again would duplicate value
dfNew <- merge(indexKey, dfNew, by="country")

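Note that instead of CJ you can also use expand.grid, as in the data.frame version:

# dfNew has no value column here, so finish with both merges
# from the data.frame version
dfNew <- dfl[, expand.grid("country"=unique(country), "year"=unique(year),
                           "death"=unique(death))]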

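For comparison, the data.table route can also be written without setting a key, using on= joins. This is only a sketch under the assumption that each country has a single indx; dt, idx_key, grid and out are illustrative names, and the data are rebuilt from the question's example.

```r
library(data.table)

# Rebuild the example data from the question
indx <- c(rep(1, times=9), rep(4, times=7), rep(2, times=6))
country <- c(rep("Italy", times=9), rep("France", times=7), rep("Spain", times=6))
year <- c(2000, 2001, 2005, 2000:2005, 2000, 2004, 2005, 2001:2004, 2000:2005)
death <- c(rep("hiv", times=3), rep("cancer", times=6), rep("hiv", times=3),
           rep("cancer", times=4), rep("hiv", times=6))
value <- c(1:22)
dt <- as.data.table(data.frame(indx, country, year, death, value))

# Lookup table (assumed one-to-one): one row per country with its indx
idx_key <- unique(dt[, .(indx, country)])

# Cross-join of every country, observed year and cause of death
grid <- CJ(country = idx_key$country,
           year = unique(dt$year),
           death = unique(dt$death))

# Right join onto the grid: unmatched combinations get NA in indx and value
out <- dt[grid, on = .(country, year, death)]

# Update join: fill indx for every row from the lookup table
out[idx_key, indx := i.indx, on = "country"]
```

The update join modifies out by reference, so no further merge is needed.
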
Answer 3

tidyr::complete creates every combination of the variables you pass it, but if two columns identify the same thing (as indx and country do here), it will over-expand or put NAs where you don't want them. As a workaround, you can group with dplyr (dfl %>% group_by(indx, country) %>% complete(death, year)) or simply merge the two columns into one:

library(tidyr)

dfl %>%
    # merge indx and country into a single column so they won't over-expand
    unite(indx_country, indx, country) %>%
    # fill in missing combinations of new column, death, and year
    complete(indx_country, death, year) %>%
    # separate indx and country back to how they were
    separate(indx_country, c('indx', 'country'))

# Source: local data frame [36 x 5]
# 
#     indx country  death  year value
#    (chr)   (chr) (fctr) (int) (int)
# 1      1   Italy cancer  2000     4
# 2      1   Italy cancer  2001     5
# 3      1   Italy cancer  2002     6
# 4      1   Italy cancer  2003     7
# 5      1   Italy cancer  2004     8
# 6      1   Italy cancer  2005     9
# 7      1   Italy    hiv  2000     1
# 8      1   Italy    hiv  2001     2
# 9      1   Italy    hiv  2002    NA
# 10     1   Italy    hiv  2003    NA
# ..   ...     ...    ...   ...   ...
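
The two workarounds are not necessarily interchangeable. As far as I understand, complete() on a grouped data frame only expands within each group, so a country with a single observed cause of death (Spain in the question's example) would get no rows for the other cause. The sketch below compares the two on the question's data; grouped and united are illustrative names, and convert = TRUE is an optional extra that turns indx back into a number after separate().

```r
library(dplyr)
library(tidyr)

# Rebuild the example data from the question
indx <- c(rep(1, times=9), rep(4, times=7), rep(2, times=6))
country <- c(rep("Italy", times=9), rep("France", times=7), rep("Spain", times=6))
year <- c(2000, 2001, 2005, 2000:2005, 2000, 2004, 2005, 2001:2004, 2000:2005)
death <- c(rep("hiv", times=3), rep("cancer", times=6), rep("hiv", times=3),
           rep("cancer", times=4), rep("hiv", times=6))
value <- c(1:22)
dfl <- data.frame(indx, country, year, death, value)

# Grouped variant: combinations are completed inside each (indx, country) group
grouped <- dfl %>%
    group_by(indx, country) %>%
    complete(death, year) %>%
    ungroup()

# unite/separate variant: combinations are completed across all countries
united <- dfl %>%
    unite(indx_country, indx, country) %>%
    complete(indx_country, death, year) %>%
    separate(indx_country, c("indx", "country"), convert = TRUE)

# Compare how many rows each variant produces
nrow(grouped)
nrow(united)
```

If every country should end up with every cause of death, the unite/separate variant is the one that matches the output requested in the question.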
