Data transformation in R

Date: 2017-09-04 08:55:50

Tags: r dataframe dplyr data.table

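I have a (directed) dyadic dataset that looks like the one below. What I want to do now is keep only one observation per year. In this case that means one observation for 1992 (AFG 1992) and one for 1993 (AFG 1993), with all other observations dropped. It does not matter which observation within a given year is kept (I am not interested in country2).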

 country1   country2    year    X   X1
Afghanistan Colombia    1992    1   0.44
Afghanistan Venezuela   1992    1   0.45
Afghanistan Peru        1992    1   0.46
Afghanistan Brazil      1992    1   0.47
Afghanistan Bolivia     1992    1   0.48
Afghanistan Chile       1992    1   0.49
Afghanistan Argentina   1992    1   0.50
Afghanistan Uruguay     1993    0   0.51
Afghanistan USA         1993    0   0.52
Afghanistan Canada      1993    0   0.53
Afghanistan UK          1993    0   0.54
Afghanistan Netherlands 1993    0   0.55
Afghanistan Belgium     1993    0   0.56
Afghanistan Luxembourg  1993    0   0.57
Afghanistan France      1993    0   0.58

My attempt:

newdata<- data %>% 
  group_by(country1,year) %>%
  summarise() %>%
  select(unique.x=country1, unique.y=year)

This works, but how can I keep all the other variables from "data" in "newdata"? I cannot think of any way to do this (ideally a more practical one). Any help?

Desired result:

    country1     year   X
    Afghanistan 1992   1
    Afghanistan 1993   0
  

dput(data)
structure(list(country1 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "Afghanistan", class = "factor"), 
    country2 = structure(c(8L, 33L, 24L, 5L, 4L, 7L, 1L, 32L, 
    31L, 6L, 30L, 21L, 3L, 19L, 14L, 29L, 27L, 26L, 15L, 25L, 
    2L, 17L, 10L, 18L, 13L, 28L, 23L, 11L, 9L, 16L, 12L, 20L, 
    22L), .Label = c("Argentina", "Austria", "Belgium", "Bolivia, Plurinational State of", 
    "Brazil", "Canada", "Chile", "Colombia", "Cuba", "Czech Republic", 
    "Denmark", "Dominican Republic", "Finland", "France", "Germany", 
    "Guinea-Bissau", "Hungary", "Italy", "Luxembourg", "Mauritania", 
    "Netherlands", "Niger", "Norway", "Peru", "Poland", "Portugal", 
    "Spain", "Sweden", "Switzerland", "United Kingdom", "United States", 
    "Uruguay", "Venezuela, Bolivarian Republic of"), class = "factor"), 
    year = c(1992L, 1992L, 1992L, 1992L, 1992L, 1992L, 1992L, 
    1993L, 1993L, 1993L, 1993L, 1993L, 1993L, 1993L, 1993L, 1994L, 
    1994L, 1994L, 1994L, 1994L, 1994L, 1994L, 1994L, 1995L, 1995L, 
    1995L, 1995L, 1995L, 1995L, 1995L, 1995L, 1995L, 1995L), 
    X = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 
    0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L), X1 = c(0.44, 0.45, 0.46, 0.47, 0.48, 
    0.49, 0.5, 0.51, 0.52, 0.53, 0.54, 0.55, 0.56, 0.57, 0.58, 
    0.59, 0.6, 0.61, 0.62, 0.63, 0.64, 0.65, 0.66, 0.67, 0.68, 
    0.69, 0.7, 0.71, 0.72, 0.73, 0.74, 0.75, 0.76)), .Names = c("country1", 
"country2", "year", "X", "X1"), class = "data.frame", row.names = c(NA, 
-33L))

3 answers:

Answer 0: (score: 1)

newdata <- olddata[!duplicated(olddata$year),]

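answers the question as asked (one row per year). If the data contain more than one country1, deduplicate on the country1/year pair instead: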

newdata <- olddata[!duplicated(paste(olddata$country1, olddata$year)),]

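This gives you what you want: the first row for each combination of country1 and year, with all other columns kept.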
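
A variant of the same idea, sketched here as an assumption rather than as part of the original answer: duplicated() also accepts a data frame, so the key columns can be passed directly instead of being pasted into one string. The small olddata frame below is only a stand-in built from a few rows of the question's table.

```r
# Stand-in data: four rows taken from the table in the question.
olddata <- data.frame(
  country1 = "Afghanistan",
  country2 = c("Colombia", "Venezuela", "Uruguay", "USA"),
  year     = c(1992L, 1992L, 1993L, 1993L),
  X        = c(1L, 1L, 0L, 0L),
  X1       = c(0.44, 0.45, 0.51, 0.52)
)

# duplicated() on a data frame compares the selected columns row by row,
# so the first row of every country1/year pair is kept and no paste()
# separator can accidentally merge two different keys.
newdata <- olddata[!duplicated(olddata[c("country1", "year")]), ]
print(newdata)
```

All columns of olddata survive, because only the row index is filtered.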

Answer 1: (score: 0)

I do not fully understand your question, but to get the desired output you can use:

data %>% 
  group_by(country1, year) %>%
  summarise(X = mean(X))

Note that when you apply this to the whole data.frame, the code returns the mean of all X values for each unique combination of country1 and year.
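
If the aim is to keep the other columns untouched rather than average them, one possible alternative (a sketch, not what this answer proposed) is dplyr's distinct() with .keep_all = TRUE, which keeps the first row of each country1/year combination. The data frame below is a small stand-in for the question's data.

```r
library(dplyr)

# Stand-in data: four rows taken from the table in the question.
data <- data.frame(
  country1 = "Afghanistan",
  country2 = c("Colombia", "Venezuela", "Uruguay", "USA"),
  year     = c(1992L, 1992L, 1993L, 1993L),
  X        = c(1L, 1L, 0L, 0L),
  X1       = c(0.44, 0.45, 0.51, 0.52)
)

# Keep one row per country1/year; .keep_all = TRUE retains every other
# column (country2, X, X1) from the first row of each combination.
newdata <- data %>%
  distinct(country1, year, .keep_all = TRUE)
print(newdata)
```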

Answer 2: (score: 0)

You can try:

data %>%
    group_by(year) %>%
    top_n(1) %>%
    select(country1, X)
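
One thing to keep in mind with the code above: top_n(1) without a wt argument ranks by the last column of the data (X1 here) and returns every tied row. A hedged alternative, assuming any single row per country1/year is acceptable, is to take the first row of each group with slice(1). The data frame is again a small stand-in for the question's data.

```r
library(dplyr)

# Stand-in data: four rows taken from the table in the question.
data <- data.frame(
  country1 = "Afghanistan",
  country2 = c("Colombia", "Venezuela", "Uruguay", "USA"),
  year     = c(1992L, 1992L, 1993L, 1993L),
  X        = c(1L, 1L, 0L, 0L),
  X1       = c(0.44, 0.45, 0.51, 0.52)
)

# slice(1) picks exactly one row (the first) within each group,
# regardless of ties, and keeps all columns.
newdata <- data %>%
  group_by(country1, year) %>%
  slice(1) %>%
  ungroup()
print(newdata)
```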