I have a dataset containing multiple countries, and I want to create a dummy variable for each continent.
Right now my dataset looks like this:
+---------------+-----------+-----+-----+-----+
| Country | Period | X | Y | Z |
+---------------+-----------+-----+-----+-----+
| Argentina | 1991-1995 | ... | ... | ... |
| Argentina | 1996-2000 | ... | ... | ... |
| Bolivia | 1991-1995 | ... | ... | ... |
| Bolivia | 1996-2000 | ... | ... | ... |
| Brazil | 1991-1995 | ... | ... | ... |
| Brazil | 1996-2000 | ... | ... | ... |
| Canada | 1991-1995 | ... | ... | ... |
| Canada | 1996-2000 | ... | ... | ... |
| United States | 1991-1995 | ... | ... | ... |
| United States | 1996-2000 | ... | ... | ... |
+---------------+-----------+-----+-----+-----+
My desired output is:
+---------------+-----------+-----+-----+-----+---------+---------+
| Country | Period | X | Y | Z | dummySA | dummyNA |
+---------------+-----------+-----+-----+-----+---------+---------+
| Argentina | 1991-1995 | ... | ... | ... | 1 | 0 |
| Argentina | 1996-2000 | ... | ... | ... | 1 | 0 |
| Bolivia | 1991-1995 | ... | ... | ... | 1 | 0 |
| Bolivia | 1996-2000 | ... | ... | ... | 1 | 0 |
| Brazil | 1991-1995 | ... | ... | ... | 1 | 0 |
| Brazil | 1996-2000 | ... | ... | ... | 1 | 0 |
| Canada | 1991-1995 | ... | ... | ... | 0 | 1 |
| Canada | 1996-2000 | ... | ... | ... | 0 | 1 |
| United States | 1991-1995 | ... | ... | ... | 0 | 1 |
| United States | 1996-2000 | ... | ... | ... | 0 | 1 |
+---------------+-----------+-----+-----+-----+---------+---------+
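So I want one dummy for all South American countries and one dummy for all North American countries. I know how to create a dummy for a single country or region, but not for multiple values.
Answer 0 (score: 2)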
If there are only a few countries, use %in%
library(dplyr)
df1 %>%
  mutate(dummySA = as.integer(Country %in%
                     c("Argentina", "Bolivia", "Brazil")),
         # every country not in the South America list is flagged as North America
         dummyNA = as.integer(!dummySA))
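The `!dummySA` shortcut only holds while the data contains exactly two continents. As a minimal sketch, assuming a toy `df1` that mirrors the question's layout (the X, Y, Z columns are omitted, and the vectors `south_america` and `north_america` are hypothetical names introduced here), each dummy can be tested against its own list instead:

```r
library(dplyr)

# Hypothetical sample data shaped like the question's table (X, Y, Z left out)
df1 <- data.frame(
  Country = rep(c("Argentina", "Bolivia", "Brazil", "Canada", "United States"),
                each = 2),
  Period  = rep(c("1991-1995", "1996-2000"), times = 5),
  stringsAsFactors = FALSE
)

# Assumed continent membership lists; extend these for more countries
south_america <- c("Argentina", "Bolivia", "Brazil")
north_america <- c("Canada", "United States")

out <- df1 %>%
  mutate(dummySA = as.integer(Country %in% south_america),
         dummyNA = as.integer(Country %in% north_america))
```

With this variant, a country that appears in neither list gets 0 in both columns rather than being counted as North American.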
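Otherwise, create a key/value dataset that maps each Country to its geographic region, merge/join it onto the data, and create the dummy columns with spread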
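library(tidyr)
df1 %>%
  left_join(keyvaldat) %>%      # keyvaldat: one row per Country, region label in `value`
  mutate(n = 1) %>%
  spread(value, n, fill = 0)    # one 0/1 column per distinct region label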
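The answer does not show what `keyvaldat` looks like, so the following is only a sketch under assumptions: a hypothetical lookup table with a `Country` column and a `value` column holding the desired dummy name, joined onto a toy `df1`. It uses `pivot_wider`, which the tidyr documentation recommends over the older `spread`; the join key is spelled out with `by = "Country"`.

```r
library(dplyr)
library(tidyr)

# Hypothetical sample data shaped like the question's table (X, Y, Z left out)
df1 <- data.frame(
  Country = rep(c("Argentina", "Bolivia", "Brazil", "Canada", "United States"),
                each = 2),
  Period  = rep(c("1991-1995", "1996-2000"), times = 5),
  stringsAsFactors = FALSE
)

# Assumed lookup table: one row per country, `value` is the dummy column name
keyvaldat <- data.frame(
  Country = c("Argentina", "Bolivia", "Brazil", "Canada", "United States"),
  value   = c("dummySA", "dummySA", "dummySA", "dummyNA", "dummyNA"),
  stringsAsFactors = FALSE
)

out2 <- df1 %>%
  left_join(keyvaldat, by = "Country") %>%   # attach the region label to each row
  mutate(n = 1L) %>%                         # marker that becomes the dummy value
  pivot_wider(names_from = value,            # one column per region label
              values_from = n,
              values_fill = list(n = 0L))    # rows outside a region get 0
```

Adding another continent then only requires new rows in the lookup table. A country missing from `keyvaldat` would come out of the join with a missing `value`, so it is worth checking for unmatched rows before widening.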