I have a data frame in long format:
mydf <- data.frame(
  date=c("2016-01-01","2016-02-01","2016-03-01","2016-04-01","2016-05-01", "2016-02-01", "2016-03-01", "2016-04-01", "2016-05-01", "2016-06-01"),
  value=c(1,2,3,4,5,1,2,3,4,5),
  country=c("US", "US", "US", "US", "US", "US", "US", "US", "US", "US"),
  indicator=c("gdp", "gdp", "gdp", "gdp", "gdp", "population", "population", "population", "population", "population"))
date value country indicator
1 2016-01-01 1 US gdp
2 2016-02-01 2 US gdp
3 2016-03-01 3 US gdp
4 2016-04-01 4 US gdp
5 2016-05-01 5 US gdp
6 2016-02-01 1 US population
7 2016-03-01 2 US population
8 2016-04-01 3 US population
9 2016-05-01 4 US population
10 2016-06-01 5 US population
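I want to create specific new indicators derived from ratios of the existing ones, for example: gdp / population * 1000
The result should look like this; each value must be computed from the two indicators at the same date: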
mydf <- data.frame(
  date=c("2016-01-01","2016-02-01","2016-03-01","2016-04-01","2016-05-01", "2016-02-01", "2016-03-01", "2016-04-01", "2016-05-01", "2016-06-01", "2016-02-01", "2016-03-01", "2016-04-01", "2016-05-01"),
  value=c(1,2,3,4,5,1,2,3,4,5,2,1.5,1.33,1.25),
  country=c("US", "US", "US", "US", "US", "US", "US", "US", "US", "US", "US", "US", "US", "US"),
  indicator=c("gdp", "gdp", "gdp", "gdp", "gdp", "population", "population", "population", "population", "population", "gdp per capita", "gdp per capita", "gdp per capita", "gdp per capita"))
date value country indicator
1 2016-01-01 1.00 US gdp
2 2016-02-01 2.00 US gdp
3 2016-03-01 3.00 US gdp
4 2016-04-01 4.00 US gdp
5 2016-05-01 5.00 US gdp
6 2016-02-01 1.00 US population
7 2016-03-01 2.00 US population
8 2016-04-01 3.00 US population
9 2016-05-01 4.00 US population
10 2016-06-01 5.00 US population
11 2016-02-01 2.00 US gdp per capita
12 2016-03-01 1.50 US gdp per capita
13 2016-04-01 1.33 US gdp per capita
14 2016-05-01 1.25 US gdp per capita
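Is there a simple way to do this in R?
Answer 0: (score: 2)
Personally, I find the reshape package easier to use, and it automatically handles multiple countries, or however many kinds of labels you have.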
library(reshape)
mydf <- data.frame(
date=c("2016-01-01","2016-02-01","2016-03-01","2016-04-01","2016-05-01", "2016-02-01", "2016-03-01", "2016-04-01", "2016-05-01",
"2016-06-01", "2016-02-01", "2016-03-01", "2016-04-01", "2016-05-01","2016-05-01"),
value=c(1,2,3,4,5,1,2,3,4,5,2,1.5,1.33,1.2, 2),
country=c("US", "US", "US", "US", "US", "US", "US", "US", "US", "US", "US", "US", "US", "US", 'AU'),
indicator=c("gdp", "gdp", "gdp", "gdp", "gdp", "population", "population", "population",
"population", "population", "gdp per capita", "gdp per capita", "gdp per capita", "gdp per capita", 'gdp'))
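To get the new indicator, first put the data into wide format so that the relevant columns sit next to each other. You can then do simple column-wise operations.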
df_wide = cast(mydf, date+country~indicator, sum)
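You want country and date as the columns that uniquely identify a row (the left-hand side of the formula), and the different indicators as columns (the right-hand side of the formula):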
date country gdp gdp per capita population
1 2016-01-01 US 1 0.00 0
2 2016-02-01 US 2 2.00 1
3 2016-03-01 US 3 1.50 2
4 2016-04-01 US 4 1.33 3
5 2016-05-01 AU 2 0.00 0
6 2016-05-01 US 5 1.20 4
7 2016-06-01 US 0 0.00 5
Now create a new column and set it to whatever you want:
df_wide['g_p_ratio'] = df_wide['gdp'] / df_wide['population']
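One caveat with the division above: because cast was called with sum, date/country pairs that have no rows for an indicator show up as 0 in the wide table rather than as NA (the sum of an empty set is 0), so the ratio can come out as Inf, NaN, or a misleading 0. Here is a minimal base R sketch of one way to guard against that. It assumes that a 0 in either column really means "no data", which would not hold if genuine zeros can occur. The vectors gdp and population are small illustrative stand-ins, not the data frame from this answer.

```r
# Illustrative vectors mirroring rows of the wide table (0 = filled-in gap)
gdp        <- c(1, 2, 0, 2)
population <- c(0, 1, 5, 0)

# Dividing directly gives Inf for 1/0 and a misleading 0 for 0/5
raw_ratio <- gdp / population

# Treat zeros as missing before dividing (assumption: 0 means "no data")
gdp[gdp == 0] <- NA
population[population == 0] <- NA
safe_ratio <- gdp / population
```

With the zeros turned into NA, only pairs where both indicators are present produce a number, and the NA rows can be dropped after melting.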
Then use melt to put it back into long format, keeping both date and country as id columns:
df_new = melt(df_wide, id=c('date', 'country'))
Voilà!
date country value indicator
gdp 2016-01-01 US 1.00 gdp
gdp.1 2016-02-01 US 2.00 gdp
gdp.2 2016-03-01 US 3.00 gdp
gdp.3 2016-04-01 US 4.00 gdp
gdp.4 2016-05-01 AU 2.00 gdp
gdp.5 2016-05-01 US 5.00 gdp
gdp.6 2016-06-01 US 0.00 gdp
gdp.per.capita 2016-01-01 US 0.00 gdp per capita
gdp.per.capita.1 2016-02-01 US 2.00 gdp per capita
gdp.per.capita.2 2016-03-01 US 1.50 gdp per capita
gdp.per.capita.3 2016-04-01 US 1.33 gdp per capita
gdp.per.capita.4 2016-05-01 AU 0.00 gdp per capita
gdp.per.capita.5 2016-05-01 US 1.20 gdp per capita
gdp.per.capita.6 2016-06-01 US 0.00 gdp per capita
population 2016-01-01 US 0.00 population
population.1 2016-02-01 US 1.00 population
population.2 2016-03-01 US 2.00 population
population.3 2016-04-01 US 3.00 population
population.4 2016-05-01 AU 0.00 population
population.5 2016-05-01 US 4.00 population
population.6 2016-06-01 US 5.00 population
You may or may not want the new row labels, but they are easy to reset:
rownames(df_new) <- 1:nrow(df_new)
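For comparison, the same date matching can be sketched without any packages by joining the gdp rows to the population rows with base R's merge. This is a different technique from the cast/melt route above (a join rather than a reshape), shown on the question's original ten-row data. The names gdp, pop, both, ratio and result are just illustrative.

```r
# The question's original data, all US
mydf <- data.frame(
  date = c("2016-01-01", "2016-02-01", "2016-03-01", "2016-04-01", "2016-05-01",
           "2016-02-01", "2016-03-01", "2016-04-01", "2016-05-01", "2016-06-01"),
  value = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5),
  country = "US",
  indicator = rep(c("gdp", "population"), each = 5),
  stringsAsFactors = FALSE
)

# Split the two indicators into separate tables
gdp <- mydf[mydf$indicator == "gdp", c("date", "country", "value")]
pop <- mydf[mydf$indicator == "population", c("date", "country", "value")]

# Inner join: keeps only date/country pairs present in both indicators
both <- merge(gdp, pop, by = c("date", "country"), suffixes = c("_gdp", "_pop"))

# Build the new indicator rows in the same long layout
ratio <- data.frame(
  date = both$date,
  value = both$value_gdp / both$value_pop,
  country = both$country,
  indicator = "gdp per capita",
  stringsAsFactors = FALSE
)

# Append the new rows to the original long data
result <- rbind(mydf, ratio)
```

Because merge does an inner join by default, dates that exist for only one indicator (2016-01-01 and 2016-06-01 here) simply produce no ratio row, so there are no filled-in zeros to clean up.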
Answer 1: (score: 1)
Yes, I think it is easier to make this change with the tidy approach, using tidyr and dplyr.
library(dplyr)
library(tidyr)
df <- tribble(
~date, ~value, ~country, ~indicator,
"2016-01-01", 1, "US", "gdp",
"2016-02-01", 2, "US", "gdp",
"2016-03-01", 3, "AU", "gdp",
"2016-04-01", 4, "US", "gdp",
"2016-05-01", 5, "US", "gdp",
"2016-02-01", 1, "US", "population",
"2016-03-01", 2, "AU", "population",
"2016-04-01", 3, "US", "population",
"2016-05-01", 4, "US", "population",
"2016-06-01", 5, "US", "population"
)
df %>%
group_by(country) %>%
spread(indicator, value) %>%
mutate(`gdp per capita` = gdp / population) %>%
gather(indicator, value, -c(date, country)) %>%
drop_na(value)
# # A tibble: 14 x 4
# # Groups: country [2]
# date country indicator value
# <chr> <chr> <chr> <dbl>
# 1 2016-01-01 US gdp 1.000000
# 2 2016-02-01 US gdp 2.000000
# 3 2016-03-01 AU gdp 3.000000
# 4 2016-04-01 US gdp 4.000000
# 5 2016-05-01 US gdp 5.000000
# 6 2016-02-01 US population 1.000000
# 7 2016-03-01 AU population 2.000000
# 8 2016-04-01 US population 3.000000
# 9 2016-05-01 US population 4.000000
# 10 2016-06-01 US population 5.000000
# 11 2016-02-01 US gdp per capita 2.000000
# 12 2016-03-01 AU gdp per capita 1.500000
# 13 2016-04-01 US gdp per capita 1.333333
# 14 2016-05-01 US gdp per capita 1.250000
N.B. I modified the data and added the group_by statement to demonstrate the solution with multiple values of country.
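As of tidyr 1.0.0, spread and gather are superseded by pivot_wider and pivot_longer. A sketch of the same pipeline with the newer verbs, assuming tidyr 1.0.0 or later is installed and using the same modified data as above, might look like this (result is just an illustrative name):

```r
library(dplyr)
library(tidyr)

df <- tribble(
  ~date,        ~value, ~country, ~indicator,
  "2016-01-01", 1,      "US",     "gdp",
  "2016-02-01", 2,      "US",     "gdp",
  "2016-03-01", 3,      "AU",     "gdp",
  "2016-04-01", 4,      "US",     "gdp",
  "2016-05-01", 5,      "US",     "gdp",
  "2016-02-01", 1,      "US",     "population",
  "2016-03-01", 2,      "AU",     "population",
  "2016-04-01", 3,      "US",     "population",
  "2016-05-01", 4,      "US",     "population",
  "2016-06-01", 5,      "US",     "population"
)

result <- df %>%
  # One row per date/country, one column per indicator
  pivot_wider(names_from = indicator, values_from = value) %>%
  # NA where either indicator is missing for that date/country
  mutate(`gdp per capita` = gdp / population) %>%
  # Back to long format, dropping the NA combinations
  pivot_longer(cols = -c(date, country),
               names_to = "indicator",
               values_to = "value",
               values_drop_na = TRUE)
```

No group_by is needed here, since date and country together already identify each wide row. Note that pivot_longer returns the rows grouped by date/country rather than by indicator, so the row order differs from the gather output shown above.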