Merging two rows in R based on a rule

Date: 2016-05-31 14:01:31

Tags: r dplyr

I merged two data frames using bind_rows. I now have a situation where two rows of data look like this:

Page Path                           Page Title             Byline      Pageviews 
/facilities/when-lighting-strikes      NA                    NA           668
/facilities/when-lighting-strikes   When Lighting Strikes  Tom Jones       NA

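When I have duplicate page paths like these, I'd like to merge the rows that share a page path: eliminate the two NAs in the first row, keep the page title (When Lighting Strikes) and byline (Tom Jones), and retain the 668 pageviews from the first row. It seems I need to: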

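  1. Identify the duplicate page paths
  2. Check whether the titles and bylines differ; drop the NAs
  3. Keep the row that has the pageviews result; drop the NA row

Is it possible to do this in R with dplyr? Or is there a better way?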
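For step 1, a minimal base R sketch of how the duplicated page paths could be spotted before merging anything. The `pages` data frame and its values are made up for illustration; only the column names follow the question.

```r
# Hypothetical data: "/a" appears twice, "/b" once
pages <- data.frame(
  PagePath  = c("/a", "/a", "/b"),
  Pageviews = c(668L, NA, 5L),
  stringsAsFactors = FALSE
)

# duplicated() flags the second and later occurrences of each value,
# so this yields every path that occurs more than once
dup_paths <- unique(pages$PagePath[duplicated(pages$PagePath)])

# All rows belonging to a duplicated path
dup_rows <- pages[pages$PagePath %in% dup_paths, ]
```

Rows in `dup_rows` are the candidates for merging; paths that occur only once can be left alone.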

5 answers:

Answer 0 (score: 3)

A simple solution:

library(dplyr)

df %>% group_by(PagePath) %>% summarise_each(funs(na.omit))
# Source: local data frame [1 x 4]
# 
#                            PagePath             PageTitle    Byline Pageviews
#                              (fctr)                (fctr)    (fctr)     (int)
# 1 /facilities/when-lighting-strikes When Lighting Strikes Tom Jones       668

If your data is more complicated, you may need a more robust approach.
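One possible shape for a more robust version, as a sketch rather than the answerer's method: `na.omit` can return zero or several values per group, which does not fit a one-row-per-group summary, so this variant takes the first non-NA value per column and falls back to NA when a group has none. It assumes dplyr 1.0.0 or later for `across()`; the `pages` data and the helper `first_non_na()` are hypothetical names introduced here.

```r
library(dplyr)

# Hypothetical data: two duplicated paths plus one path that has no byline at all
pages <- data.frame(
  PagePath  = c("/a", "/a", "/b", "/b", "/c"),
  PageTitle = c(NA, "Title A", "Title B", NA, "Title C"),
  Byline    = c(NA, "Tom Jones", NA, "Ann Lee", NA),
  Pageviews = c(668L, NA, NA, 42L, 7L),
  stringsAsFactors = FALSE
)

# Helper defined here (not part of dplyr): first non-NA value, or NA if none exists
first_non_na <- function(x) x[!is.na(x)][1]

merged <- pages %>%
  group_by(PagePath) %>%
  summarise(across(everything(), first_non_na), .groups = "drop")
```

Taking the first non-NA value silently ignores conflicts, so if two rows for the same path carry different titles you would want to check for that separately.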

Data

df <- structure(list(PagePath = structure(c(1L, 1L), .Label = "/facilities/when-lighting-strikes", class = "factor"), 
        PageTitle = structure(c(NA, 1L), .Label = "When Lighting Strikes", class = "factor"), 
        Byline = structure(c(NA, 1L), .Label = "Tom Jones", class = "factor"), 
        Pageviews = c(668L, NA)), .Names = c("PagePath", "PageTitle", 
    "Byline", "Pageviews"), class = "data.frame", row.names = c(NA, 
    -2L))

Answer 1 (score: 1)

Use the replace function in a for loop:

for (i in unique(df$Page_Path)) {
  in_group <- df$Page_Path == i
  views <- df$Pageviews[in_group]
  # Fill NA pageviews in this group with the group's first non-NA value
  df$Pageviews[in_group] <- replace(views, is.na(views), views[!is.na(views)][1])
}

df <- subset(df, !is.na(Page_Title))

print(df)

                          Page_Path            Page_Title    Byline Pageviews
2 /facilities/when-lighting-strikes When Lighting Strikes Tom Jones       668

Answer 2 (score: 0)

Another approach (similar to the earlier dplyr solution) would be:

# Note: paste() returns character, so Pageviews comes back as "668" rather than 668L
df %>% group_by(PagePath) %>% 
  dplyr::summarize(PageTitle = paste(na.omit(PageTitle)),
                   Byline = paste(na.omit(Byline)),
                   Pageviews = paste(na.omit(Pageviews)))

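Answer 3 (score: 0)

Here is an option using data.table and complete.cases. We convert the 'data.frame' to a 'data.table' (setDT(df)), group by 'PagePath', loop over the columns of the dataset (lapply(.SD, ..)), and use complete.cases to remove the NA elements. complete.cases returns a logical vector that can be used for subsetting. Reportedly, complete.cases is much faster than na.omit, and paired with data.table this should make the operation efficient.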

library(data.table)
setDT(df)[, lapply(.SD, function(x) x[complete.cases(x)]), by = PagePath]
#                     PagePath             PageTitle    Byline Pageviews
#1: /facilities/when-lighting-strikes When Lighting Strikes Tom Jones       668

Data

df <- structure(list(PagePath = structure(c(1L, 1L), 
 .Label = "/facilities/when-lighting-strikes", class = "factor"),   
    PageTitle = structure(c(NA, 1L), .Label = "When Lighting Strikes", class = "factor"), 
    Byline = structure(c(NA, 1L), .Label = "Tom Jones", class = "factor"), 
    Pageviews = c(668L, NA)), .Names = c("PagePath", "PageTitle", 
"Byline", "Pageviews"), class = "data.frame", row.names = c(NA, 
-2L))

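Answer 4 (score: 0)

Another approach uses fill. With tidyverse 1.3.0+ and dplyr 0.8.5+, you can use fill (a tidyr function) to fill in the missing values.

For more information, see https://tidyr.tidyverse.org/reference/fill.html

Data (thanks to Alistair)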

df <- structure(list(PagePath = structure(c(1L, 1L), .Label = "/facilities/when-lighting-strikes", class = "factor"), 
        PageTitle = structure(c(NA, 1L), .Label = "When Lighting Strikes", class = "factor"), 
        Byline = structure(c(NA, 1L), .Label = "Tom Jones", class = "factor"), 
        Pageviews = c(668L, NA)), .Names = c("PagePath", "PageTitle", 
    "Byline", "Pageviews"), class = "data.frame", row.names = c(NA, 
    -2L))

# A tibble: 2 x 4
# Groups:   PagePath [1]
  PagePath                          PageTitle             Byline    Pageviews
  <fct>                             <fct>                 <fct>         <int>
1 /facilities/when-lighting-strikes NA                    NA              668
2 /facilities/when-lighting-strikes When Lighting Strikes Tom Jones        NA

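Code

I did this only for PageTitle, but you can repeat fill for the other columns. (A dplyr expert may know a smarter way to handle all three columns at once.) If your data is ordered, for example by date, you can set .direction to "down" only (e.g. to look only at past data).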

library(dplyr); library(tidyr)  # fill() comes from tidyr
df.new <- df %>% group_by(PagePath) %>%
  fill(PageTitle, .direction = "updown")

This gives you:

# A tibble: 2 x 4
# Groups:   PagePath [1]
  PagePath                          PageTitle             Byline    Pageviews
  <fct>                             <fct>                 <fct>         <int>
1 /facilities/when-lighting-strikes When Lighting Strikes NA              668
2 /facilities/when-lighting-strikes When Lighting Strikes Tom Jones        NA

Once all the NAs have been filled in, you can use distinct or rank to get the final summarized data frame.
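As a rough sketch of that last step, fill accepts several columns in one call, so all three can be filled together and the now-identical rows collapsed with distinct. This assumes a tidyr version that supports .direction = "updown"; the `pages` data frame below is a hypothetical copy of the question's data with character columns instead of factors.

```r
library(dplyr)
library(tidyr)

# Hypothetical data mirroring the question, using character columns
pages <- data.frame(
  PagePath  = rep("/facilities/when-lighting-strikes", 2),
  PageTitle = c(NA, "When Lighting Strikes"),
  Byline    = c(NA, "Tom Jones"),
  Pageviews = c(668L, NA),
  stringsAsFactors = FALSE
)

collapsed <- pages %>%
  group_by(PagePath) %>%
  # Fill every column in both directions within each page path
  fill(PageTitle, Byline, Pageviews, .direction = "updown") %>%
  # After filling, the duplicate rows are identical, so distinct() keeps one
  distinct() %>%
  ungroup()
```

If a page path carries two different non-NA values in the same column, the rows will not become identical and distinct will keep both, which at least makes the conflict visible.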