How to subtract each country's values by year

Time: 2019-07-05 03:44:55

Tags: r

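I have happiness data for each country (https://www.kaggle.com/unsdsn/world-happiness), with data reported for each year. Right now I don't know how to subtract each year's values from one another. For example, how did the happiness rank change from 2015 to 2017, or from 2016 to 2017? I'd like to make a new df for each comparison.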

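I was able to bind the tables together (with rbind) and started working on removing any countries that don't have data for all three years. I'm not sure whether I'm going down an overly complicated path.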

keepcols <- c("Country","Happiness.Rank","Economy..GDP.per.Capita.","Family","Health..Life.Expectancy.","Freedom","Trust..Government.Corruption.","Generosity","Dystopia.Residual","Year")
mydata2015 = read.csv("C:\\Users\\mmcgown\\Downloads\\2015.csv")
mydata2015$Year <- "2015"
data2015 <- subset(mydata2015, select = keepcols )
mydata2016 = read.csv("C:\\Users\\mmcgown\\Downloads\\2016.csv")
mydata2016$Year <- "2016"
data2016 <- subset(mydata2016, select = keepcols ) 
mydata2017 = read.csv("C:\\Users\\mmcgown\\Downloads\\2017.csv")
mydata2017$Year <- "2017"
data2017 <- subset(mydata2017, select = keepcols ) 
df <- rbind(data2015,data2016,data2017)
head(df, n=10)
tail(df, n=10)

df15 <- df[df['Year']=='2015',]
df16 <- df[df['Year']=='2016',]
df17 <- df[df['Year']=='2017',]
nocon <- rbind(setdiff(unique(df16['Country']),unique(df17['Country'])),setdiff(unique(df15['Country']),unique(df16['Country'])))

I don't see a clear path to what I want, but the result would look something like this:

df16_to_17
Country   Happiness.Rank  ...(other columns)
Yemen     (Yemen[Happiness Rank in 2017] - Yemen[Happiness Rank in 2016])
USA       (USA[Happiness Rank in 2017] - USA[Happiness Rank in 2016])
(other countries)

df15_to_16
Country   Happiness.Rank  ...(other columns)
Yemen     (Yemen[Happiness Rank in 2016] - Yemen[Happiness Rank in 2015])
USA       (USA[Happiness Rank in 2016] - USA[Happiness Rank in 2015])
(other countries)

2 answers:

Answer 0 (score: 0)

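Assuming you have three datasets named data2015, data2016, and data2017 in your environment, we can add a Year column to each one and keep only the columns listed in the keepcols vector. Then arrange the data by Country and Year, group_by Country, keep only the countries that are present in all 3 years, and subtract the previous row's values using lag or diff.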

library(dplyr)

data2015$Year <- 2015
data2016$Year <- 2016
data2017$Year <- 2017
df <- bind_rows(data2015, data2016, data2017)
data <- df[keepcols]

data %>%
  arrange(Country, Year) %>% 
  group_by(Country) %>%
  filter(n() == 3) %>%
  mutate_at(-1, ~. - lag(.)) #OR
  #mutate_at(-1, ~c(NA, diff(.)))

# A tibble: 438 x 10
# Groups:   Country [146]
#   Country Happiness.Rank Economy..GDP.pe… Family Health..Life.Ex… Freedom
#   <chr>            <int>            <dbl>  <dbl>            <dbl>   <dbl>
# 1 Afghan…             NA          NA      NA            NA        NA     
# 2 Afghan…              1           0.0624 -0.192        -0.130    -0.0698
# 3 Afghan…            -13           0.0192  0.471         0.00731  -0.0581
# 4 Albania             NA          NA      NA            NA        NA     
# 5 Albania             14           0.0766 -0.303        -0.0832   -0.0387
# 6 Albania              0           0.0409  0.302         0.00109   0.0628
# 7 Algeria             NA          NA      NA            NA        NA     
# 8 Algeria            -30           0.113  -0.245         0.00038  -0.0757
# 9 Algeria             15           0.0392  0.313        -0.000455  0.0233
#10 Angola              NA          NA      NA            NA        NA     
# … with 428 more rows, and 4 more variables: Trust..Government.Corruption. <dbl>,
#   Generosity <dbl>, Dystopia.Residual <dbl>, Year <dbl> 

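The first row for each Country (its earliest year) will always be NA; every other row holds its value minus the value from the previous row.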
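As a hedged variation on the pipeline above: in dplyr 1.0 and later the scoped verbs such as mutate_at are superseded by across(), which skips grouping columns and lets you exclude Year by name so it stays usable as a label. The sketch below runs on a small made-up data frame (toy and its values are assumptions, not the Kaggle data) and then splits the result into the per-transition tables the question asks for.

```r
library(dplyr)

# Toy stand-in for the combined 2015-2017 data (all values are made up)
toy <- data.frame(
  Country        = c("A", "A", "A", "B", "B", "B"),
  Year           = c(2015, 2016, 2017, 2015, 2016, 2017),
  Happiness.Rank = c(10L, 12L, 9L, 50L, 45L, 47L),
  Family         = c(1.0, 1.1, 1.3, 0.5, 0.4, 0.6)
)

toy_diff <- toy %>%
  arrange(Country, Year) %>%
  group_by(Country) %>%
  filter(n() == 3) %>%
  # Difference every non-grouping column except Year
  mutate(across(-Year, ~ .x - lag(.x))) %>%
  ungroup()

# Rows labelled 2016 hold the 2015 -> 2016 change, rows labelled 2017 the 2016 -> 2017 change
df15_to_16 <- filter(toy_diff, Year == 2016)
df16_to_17 <- filter(toy_diff, Year == 2017)
```

Because Year is left untouched, each difference row is labelled with the later year of the pair, which makes the two filters at the end unambiguous.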

Answer 1 (score: 0)

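This is straightforward with dplyr: group by country, then use base R's diff to find the differences between consecutive values. Just make sure to use df rather than df15, etc.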

library(dplyr)

rank_diff_df <- df %>% 
    group_by(Country) %>% 
    mutate(Rank.Diff = c(NA, diff(Happiness.Rank)))

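The above assumes the data are ordered by year, which they are in your case because of the way you combined the data frames. Otherwise, call arrange(Year) before calling mutate. You don't need to filter out countries that are missing a year of data, but if you want to, add filter(n() == 3) after group_by().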
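To illustrate why the ordering matters, here is a minimal sketch on made-up data (toy, its values, and country "C" are assumptions for illustration). The rows are deliberately shuffled and one country is missing 2017, so both the arrange step and the optional filter have a visible effect.

```r
library(dplyr)

# Made-up ranks, deliberately out of year order; country "C" has no 2017 row
toy <- data.frame(
  Country        = c("A", "B", "A", "C", "B", "A", "C", "B"),
  Year           = c("2017", "2015", "2015", "2016", "2017", "2016", "2015", "2016"),
  Happiness.Rank = c(9L, 50L, 10L, 30L, 47L, 12L, 31L, 45L)
)

rank_diff_toy <- toy %>%
  arrange(Country, Year) %>%   # diff() relies on rows being in year order
  group_by(Country) %>%
  filter(n() == 3) %>%         # optional: keep only countries with all three years
  mutate(Rank.Diff = c(NA, diff(Happiness.Rank))) %>%
  ungroup()
```

Without the arrange call, diff would subtract rows in whatever order they happen to appear, so the differences for "A" would not be year-over-year changes.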

If you want to inspect the differences, you can drop some variables and rearrange the data:

rank_diff_df %>% 
    select(Year, Country, Happiness.Rank, Rank.Diff) %>% 
    arrange(Country)

Which returns:

# A tibble: 470 x 4
# Groups:   Country [166]
   Year  Country     Happiness.Rank Rank.Diff
   <chr> <fct>                <int>     <int>
 1 2015  Afghanistan            153        NA
 2 2016  Afghanistan            154         1
 3 2017  Afghanistan            141       -13
 4 2015  Albania                 95        NA
 5 2016  Albania                109        14
 6 2017  Albania                109         0
 7 2015  Algeria                 68        NA
 8 2016  Algeria                 38       -30
 9 2017  Algeria                 53        15
10 2015  Angola                 137        NA
# … with 460 more rows

If you plan to plot the results, the data frame above will work well with ggplot2.

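If you're not comfortable with dplyr, you can use base R's merge to combine the data frames, then create a new data frame with the differences as columns: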

df_wide <- merge(merge(df15, df16, by = "Country"), df17, by = "Country")

rank_diff_df <- data.frame(Country = df_wide$Country,
                           Y2015.2016 = df_wide$Happiness.Rank.y -
                               df_wide$Happiness.Rank.x,
                           Y2016.2017 = df_wide$Happiness.Rank -
                               df_wide$Happiness.Rank.y
                           )

Which returns:

head(rank_diff_df, 10)

       Country Y2015.2016 Y2016.2017
1  Afghanistan          1        -13
2      Albania         14          0
3      Algeria        -30         15
4       Angola          4         -1
5    Argentina         -4         -2
6      Armenia         -6          0
7    Australia         -1          1
8      Austria         -1          1
9   Azerbaijan          1          4
10     Bahrain         -7         -1
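The .x and .y suffixes in the merge approach come from merge disambiguating identically named columns, which can be hard to read back later. One possible variation, sketched here on made-up single-year tables (the values and the Rank.2015-style column names are assumptions), is to rename each year's rank column before merging so the subtraction reads directly.

```r
# Made-up single-year tables standing in for df15, df16, and df17
df15 <- data.frame(Country = c("A", "B", "C"), Happiness.Rank = c(10L, 50L, 31L))
df16 <- data.frame(Country = c("A", "B", "C"), Happiness.Rank = c(12L, 45L, 30L))
df17 <- data.frame(Country = c("B", "A"),      Happiness.Rank = c(47L, 9L))

# Give each year's rank its own name so merge() has nothing to disambiguate
names(df15)[2] <- "Rank.2015"
names(df16)[2] <- "Rank.2016"
names(df17)[2] <- "Rank.2017"

# Reduce() chains the pairwise merges; the default inner join drops "C",
# which has no 2017 row
df_wide <- Reduce(function(x, y) merge(x, y, by = "Country"),
                  list(df15, df16, df17))

rank_diff_toy <- data.frame(
  Country    = df_wide$Country,
  Y2015.2016 = df_wide$Rank.2016 - df_wide$Rank.2015,
  Y2016.2017 = df_wide$Rank.2017 - df_wide$Rank.2016
)
```

Since merge performs an inner join by default, countries missing from any one year drop out automatically, which matches the goal of keeping only countries with data for all three years.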