Creating new ratio indicators in long-format data

Time: 2017-12-06 04:54:30

Tags: r apply long-integer

I have a long-format data frame:

mydf <- data.frame(
    date=c("2016-01-01","2016-02-01","2016-03-01","2016-04-01","2016-05-01", "2016-02-01", "2016-03-01", "2016-04-01", "2016-05-01", "2016-06-01"),
    value=c(1,2,3,4,5,1,2,3,4,5),
    country=c("US", "US", "US", "US", "US", "US", "US", "US", "US", "US"),
    indicator=c("gdp", "gdp", "gdp", "gdp", "gdp", "population", "population", "population", "population", "population"))

         date value country  indicator
1  2016-01-01     1      US        gdp
2  2016-02-01     2      US        gdp
3  2016-03-01     3      US        gdp
4  2016-04-01     4      US        gdp
5  2016-05-01     5      US        gdp
6  2016-02-01     1      US population
7  2016-03-01     2      US population
8  2016-04-01     3      US population
9  2016-05-01     4      US population
10 2016-06-01     5      US population

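I want to create specific new indicators derived from ratios, for example: GDP / population * 1000

The result would look like this. Each ratio must be computed from the two indicators' values for the same date: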

mydf <- data.frame(
    date=c("2016-01-01","2016-02-01","2016-03-01","2016-04-01","2016-05-01", "2016-02-01", "2016-03-01", "2016-04-01", "2016-05-01", "2016-06-01", "2016-02-01", "2016-03-01", "2016-04-01", "2016-05-01"),
    value=c(1,2,3,4,5,1,2,3,4,5,2,1.5,1.33,1.25),
    country=c("US", "US", "US", "US", "US", "US", "US", "US", "US", "US", "US", "US", "US", "US"),
    indicator=c("gdp", "gdp", "gdp", "gdp", "gdp", "population", "population", "population", "population", "population", "gdp per capita", "gdp per capita", "gdp per capita", "gdp per capita"))

         date value country      indicator
1  2016-01-01  1.00      US            gdp
2  2016-02-01  2.00      US            gdp
3  2016-03-01  3.00      US            gdp
4  2016-04-01  4.00      US            gdp
5  2016-05-01  5.00      US            gdp
6  2016-02-01  1.00      US     population
7  2016-03-01  2.00      US     population
8  2016-04-01  3.00      US     population
9  2016-05-01  4.00      US     population
10 2016-06-01  5.00      US     population
11 2016-02-01  2.00      US gdp per capita
12 2016-03-01  1.50      US gdp per capita
13 2016-04-01  1.33      US gdp per capita
14 2016-05-01  1.25      US gdp per capita

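Is there a simple way to do this in R?

2 answers:

Answer 0 (score: 2)

Personally, I find the reshape package easier to use, and it automatically handles multiple countries, however many kinds of labels or data types you have.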

library(reshape)
mydf <- data.frame(
date=c("2016-01-01","2016-02-01","2016-03-01","2016-04-01","2016-05-01", "2016-02-01", "2016-03-01", "2016-04-01", "2016-05-01", 
       "2016-06-01", "2016-02-01", "2016-03-01", "2016-04-01", "2016-05-01","2016-05-01"),
value=c(1,2,3,4,5,1,2,3,4,5,2,1.5,1.33,1.2, 2),
country=c("US", "US", "US", "US", "US", "US", "US", "US", "US", "US", "US", "US", "US", "US", 'AU'),
indicator=c("gdp", "gdp", "gdp", "gdp", "gdp", "population", "population", "population",
            "population", "population", "gdp per capita", "gdp per capita", "gdp per capita", "gdp per capita", 'gdp'))

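To get the new indicator, first put the data into wide format so that the relevant columns sit next to each other. That lets you do simple column-wise operations: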

df_wide = cast(mydf, date+country~indicator, sum)

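You want country and date as the columns that uniquely identify a row (the left-hand side of the formula), and the different indicators as columns (the right-hand side of the formula):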

        date country gdp gdp per capita population
1 2016-01-01      US   1           0.00          0
2 2016-02-01      US   2           2.00          1
3 2016-03-01      US   3           1.50          2
4 2016-04-01      US   4           1.33          3
5 2016-05-01      AU   2           0.00          0
6 2016-05-01      US   5           1.20          4
7 2016-06-01      US   0           0.00          5

Now create a new column and set it to whatever you want:

df_wide['g_p_ratio'] = df_wide['gdp'] / df_wide['population'] 
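Because the wide table was aggregated with sum, date/country combinations that have no data show up as 0 rather than as missing, so a plain division can produce Inf or NaN for those rows. A minimal base R sketch of one way to guard against that follows. The small df_wide here is a hand-built stand-in with three rows copied from the table above, and it assumes that a genuine value of 0 never occurs in the data:

```r
# Hand-built stand-in for three rows of the wide table above.
df_wide <- data.frame(
  date = c("2016-01-01", "2016-02-01", "2016-06-01"),
  country = "US",
  gdp = c(1, 2, 0),
  population = c(0, 1, 5)
)

# sum() of an empty vector is 0, which is where the zeros come from.
sum(numeric(0))

# Assumption: a 0 always means "no data", so turn it into NA before dividing.
gdp <- ifelse(df_wide$gdp == 0, NA_real_, df_wide$gdp)
population <- ifelse(df_wide$population == 0, NA_real_, df_wide$population)

# Rows with a missing numerator or denominator now give NA instead of Inf/NaN.
df_wide$g_p_ratio <- gdp / population
```

Rows whose ratio is NA can then be dropped after melting, if only matched dates should appear in the long result.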

Then use melt to put it back into long format:

df_new = melt(df_wide, id=c('date'))

Voila!

                       date country value      indicator
gdp              2016-01-01      US  1.00            gdp
gdp.1            2016-02-01      US  2.00            gdp
gdp.2            2016-03-01      US  3.00            gdp
gdp.3            2016-04-01      US  4.00            gdp
gdp.4            2016-05-01      AU  2.00            gdp
gdp.5            2016-05-01      US  5.00            gdp
gdp.6            2016-06-01      US  0.00            gdp
gdp.per.capita   2016-01-01      US  0.00 gdp per capita
gdp.per.capita.1 2016-02-01      US  2.00 gdp per capita
gdp.per.capita.2 2016-03-01      US  1.50 gdp per capita
gdp.per.capita.3 2016-04-01      US  1.33 gdp per capita
gdp.per.capita.4 2016-05-01      AU  0.00 gdp per capita
gdp.per.capita.5 2016-05-01      US  1.20 gdp per capita
gdp.per.capita.6 2016-06-01      US  0.00 gdp per capita
population       2016-01-01      US  0.00     population
population.1     2016-02-01      US  1.00     population
population.2     2016-03-01      US  2.00     population
population.3     2016-04-01      US  3.00     population
population.4     2016-05-01      AU  0.00     population
population.5     2016-05-01      US  4.00     population
population.6     2016-06-01      US  5.00     population

You may or may not want the new row labels, but they can be fixed:

rownames(df_new) <- 1:nrow(df_new)
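For comparison, the date matching can also be done without reshaping at all. The following is a sketch in base R only, not part of the approach above: it joins the gdp rows to the population rows on date and country, divides, and appends the result. It uses the original 10-row data from the question; the names gdp, pop, both, ratio and result are illustrative.

```r
# The question's original long data (US only, 10 rows).
mydf <- data.frame(
  date = c("2016-01-01", "2016-02-01", "2016-03-01", "2016-04-01", "2016-05-01",
           "2016-02-01", "2016-03-01", "2016-04-01", "2016-05-01", "2016-06-01"),
  value = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5),
  country = "US",
  indicator = rep(c("gdp", "population"), each = 5),
  stringsAsFactors = FALSE
)

# Split the two indicators into separate tables.
gdp <- mydf[mydf$indicator == "gdp", c("date", "country", "value")]
pop <- mydf[mydf$indicator == "population", c("date", "country", "value")]

# Inner join: only date/country pairs present in both tables survive.
both <- merge(gdp, pop, by = c("date", "country"), suffixes = c("_gdp", "_pop"))

# Build the new indicator rows with the same column layout as mydf.
ratio <- data.frame(
  date = both$date,
  value = both$value_gdp / both$value_pop,
  country = both$country,
  indicator = "gdp per capita",
  stringsAsFactors = FALSE
)

# Append the new rows to the original long data.
result <- rbind(mydf, ratio)
```

Since merge() performs an inner join by default, dates that exist for only one indicator (2016-01-01 and 2016-06-01 here) produce no ratio row, which matches the output requested in the question.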

Answer 1 (score: 1)

Yes, I think this is easier to do with a tidy approach using tidyr and dplyr.

library(dplyr)
library(tidyr)

df <- tribble(
         ~date, ~value, ~country,   ~indicator,
  "2016-01-01",      1,     "US",        "gdp",
  "2016-02-01",      2,     "US",        "gdp",
  "2016-03-01",      3,     "AU",        "gdp",
  "2016-04-01",      4,     "US",        "gdp",
  "2016-05-01",      5,     "US",        "gdp",
  "2016-02-01",      1,     "US", "population",
  "2016-03-01",      2,     "AU", "population",
  "2016-04-01",      3,     "US", "population",
  "2016-05-01",      4,     "US", "population",
  "2016-06-01",      5,     "US", "population"
)

df %>%
  group_by(country) %>%
  spread(indicator, value) %>%
  mutate(`gdp per capita` = gdp / population) %>%
  gather(indicator, value, -c(date, country)) %>%
  drop_na(value)

# # A tibble: 14 x 4
# # Groups:   country [2]
#          date country      indicator    value
#         <chr>   <chr>          <chr>    <dbl>
#  1 2016-01-01      US            gdp 1.000000
#  2 2016-02-01      US            gdp 2.000000
#  3 2016-03-01      AU            gdp 3.000000
#  4 2016-04-01      US            gdp 4.000000
#  5 2016-05-01      US            gdp 5.000000
#  6 2016-02-01      US     population 1.000000
#  7 2016-03-01      AU     population 2.000000
#  8 2016-04-01      US     population 3.000000
#  9 2016-05-01      US     population 4.000000
# 10 2016-06-01      US     population 5.000000
# 11 2016-02-01      US gdp per capita 2.000000
# 12 2016-03-01      AU gdp per capita 1.500000
# 13 2016-04-01      US gdp per capita 1.333333
# 14 2016-05-01      US gdp per capita 1.250000
  

N.B. I modified the data and added the group_by statement to demonstrate the solution with multiple values of country.
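In tidyr 1.0.0 and later, spread() and gather() are superseded by pivot_wider() and pivot_longer(). A sketch of the same pipeline with the newer functions follows, assuming tidyr >= 1.0.0 is installed. The group_by step is left out here because date and country already identify the rows of the wide table; the name result is illustrative.

```r
library(dplyr)
library(tidyr)

# Same example data as above.
df <- tribble(
         ~date, ~value, ~country,   ~indicator,
  "2016-01-01",      1,     "US",        "gdp",
  "2016-02-01",      2,     "US",        "gdp",
  "2016-03-01",      3,     "AU",        "gdp",
  "2016-04-01",      4,     "US",        "gdp",
  "2016-05-01",      5,     "US",        "gdp",
  "2016-02-01",      1,     "US", "population",
  "2016-03-01",      2,     "AU", "population",
  "2016-04-01",      3,     "US", "population",
  "2016-05-01",      4,     "US", "population",
  "2016-06-01",      5,     "US", "population"
)

result <- df %>%
  # One column per indicator; missing combinations become NA.
  pivot_wider(names_from = indicator, values_from = value) %>%
  # The ratio is NA wherever either input is NA.
  mutate(`gdp per capita` = gdp / population) %>%
  # Back to long format, dropping the NA rows in the same step.
  pivot_longer(
    cols = -c(date, country),
    names_to = "indicator",
    values_to = "value",
    values_drop_na = TRUE
  )
```

The rows come back in a different order than with gather(), since pivot_longer() walks the wide table row by row; sort with arrange() if the order matters.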