Question

I merged two data frames with bind_rows. In some cases I end up with two rows for the same page, like this:

Page Path                           Page Title             Byline      Pageviews
/facilities/when-lighting-strikes   NA                     NA          668
/facilities/when-lighting-strikes   When Lighting Strikes  Tom Jones   NA

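When I have duplicate page paths like these, I want to merge the rows that share a page path: drop the two NAs from the first row, keep the page title (When Lighting Strikes) and the byline (Tom Jones), and keep the pageviews value of 668 from the first row. It seems I need to:

Identify the duplicate page paths
Check whether the titles and bylines differ; drop the NAs
Keep the row that has the pageviews result; drop the NA row

Is there a way to do this in R with dplyr? Or is there a better approach?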

Answer 1

A simple solution:

library(dplyr)

df %>% group_by(PagePath) %>% summarise_each(funs(na.omit))
# Source: local data frame [1 x 4]
# 
#                            PagePath             PageTitle    Byline Pageviews
#                              (fctr)                (fctr)    (fctr)     (int)
# 1 /facilities/when-lighting-strikes When Lighting Strikes Tom Jones       668

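If your data is more complicated (for example, several non-NA values per column within one page path), you may need a more robust approach.

Data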

df <- structure(list(PagePath = structure(c(1L, 1L), .Label = "/facilities/when-lighting-strikes", class = "factor"), 
        PageTitle = structure(c(NA, 1L), .Label = "When Lighting Strikes", class = "factor"), 
        Byline = structure(c(NA, 1L), .Label = "Tom Jones", class = "factor"), 
        Pageviews = c(668L, NA)), .Names = c("PagePath", "PageTitle", 
    "Byline", "Pageviews"), class = "data.frame", row.names = c(NA, 
    -2L))
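
The summarise_each() and funs() calls used above are deprecated in current dplyr releases. A minimal sketch of the same idea with across(), assuming dplyr 1.0.0 or later and assuming that the first non-NA value in each column is the one to keep. The data frame is rebuilt here with character columns instead of factors purely to keep the example short:

```r
library(dplyr)

# Same two rows as the example data, with character columns (an assumption
# made for brevity; the original data uses factors).
df <- data.frame(
  PagePath  = c("/facilities/when-lighting-strikes",
                "/facilities/when-lighting-strikes"),
  PageTitle = c(NA, "When Lighting Strikes"),
  Byline    = c(NA, "Tom Jones"),
  Pageviews = c(668L, NA),
  stringsAsFactors = FALSE
)

# For every non-grouping column, keep the first non-NA value in the group.
# If a column is entirely NA within a group, the result is NA.
merged <- df %>%
  group_by(PagePath) %>%
  summarise(across(everything(), ~ .x[!is.na(.x)][1]), .groups = "drop")

print(merged)
```

Unlike na.omit inside summarise, this always returns exactly one row per page path, even when a group holds several non-NA values in a column.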

Answer 2

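Use the replace function in a for loop:

for (i in unique(df$Page_Path)) {
  idx   <- df$Page_Path == i      # rows that belong to this page path
  views <- df$Pageviews[idx]      # pageviews for those rows only
  # fill missing pageviews with the first non-missing value from the same group
  df$Pageviews[idx] <- replace(views, is.na(views), views[!is.na(views)][1])
}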

df <- subset(df, !is.na(Page_Title))

print(df)

                          Page_Path            Page_Title    Byline Pageviews
2 /facilities/when-lighting-strikes When Lighting Strikes Tom Jones       668

Answer 3

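Another approach (similar to the earlier dplyr solution) would be the following. Note that paste() returns character, so Pageviews comes back as a string rather than an integer: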

  df %>% group_by(PagePath) %>% 
  dplyr::summarize(PageTitle = paste(na.omit(PageTitle)),
                   Byline = paste(na.omit(Byline)),
                   Pageviews =paste(na.omit(Pageviews)))
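
When a page path has more than one non-NA value in a column, paste() without collapse returns several values per group. A hedged sketch of one way to handle that case, assuming you want distinct titles joined into a single string and pageviews summed. The "/a" data, the "; " separator, and the choice of sum() are illustrative assumptions, not part of the original question:

```r
library(dplyr)

# Hypothetical data: one page path with two different non-NA titles.
df_multi <- data.frame(
  PagePath  = c("/a", "/a", "/a"),
  PageTitle = c(NA, "Title One", "Title Two"),
  Pageviews = c(10L, NA, NA),
  stringsAsFactors = FALSE
)

collapsed <- df_multi %>%
  group_by(PagePath) %>%
  summarise(
    # join the distinct non-NA titles into one string per group
    PageTitle = paste(unique(na.omit(PageTitle)), collapse = "; "),
    # keep Pageviews numeric by summing and ignoring NAs
    Pageviews = sum(Pageviews, na.rm = TRUE),
    .groups = "drop"
  )

print(collapsed)
```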

Answer 4

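Here is an option using data.table and complete.cases. We convert the 'data.frame' to a 'data.table' (setDT(df)), group by 'PagePath', loop over the columns of the dataset (lapply(.SD, ..)), and use complete.cases to remove the NA elements. complete.cases returns a logical vector that can be used for subsetting. complete.cases is reportedly much faster than na.omit, and combined with data.table this should be efficient.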

library(data.table)
setDT(df)[, lapply(.SD, function(x) x[complete.cases(x)]), by = PagePath]
#                     PagePath             PageTitle    Byline Pageviews
#1: /facilities/when-lighting-strikes When Lighting Strikes Tom Jones       668

Data

df <- structure(list(PagePath = structure(c(1L, 1L), 
 .Label = "/facilities/when-lighting-strikes", class = "factor"),   
    PageTitle = structure(c(NA, 1L), .Label = "When Lighting Strikes", class = "factor"), 
    Byline = structure(c(NA, 1L), .Label = "Tom Jones", class = "factor"), 
    Pageviews = c(668L, NA)), .Names = c("PagePath", "PageTitle", 
"Byline", "Pageviews"), class = "data.frame", row.names = c(NA, 
-2L))

Answer 5

Another approach uses fill. With tidyverse 1.3.0+ and dplyr 0.8.5+, you can use fill (from tidyr) to fill in missing values.

For more information, see https://tidyr.tidyverse.org/reference/fill.html

Data (thanks to Alistaire)

df <- structure(list(PagePath = structure(c(1L, 1L), .Label = "/facilities/when-lighting-strikes", class = "factor"), 
        PageTitle = structure(c(NA, 1L), .Label = "When Lighting Strikes", class = "factor"), 
        Byline = structure(c(NA, 1L), .Label = "Tom Jones", class = "factor"), 
        Pageviews = c(668L, NA)), .Names = c("PagePath", "PageTitle", 
    "Byline", "Pageviews"), class = "data.frame", row.names = c(NA, 
    -2L))

# A tibble: 2 x 4
# Groups:   PagePath [1]
  PagePath                          PageTitle             Byline    Pageviews
  <fct>                             <fct>                 <fct>         <int>
1 /facilities/when-lighting-strikes NA                    NA              668
2 /facilities/when-lighting-strikes When Lighting Strikes Tom Jones        NA

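Code

I only did this for PageTitle, but you can repeat fill for the other columns. (A dplyr expert may know a smarter way to do all 3 columns at once.) If your data is ordered, for example by date, you can set .direction to "down" only (for example, to look only at past data).

library(dplyr); library(tidyr)  # fill() comes from tidyr
df.new <- df %>%
  group_by(PagePath) %>%
  fill(PageTitle, .direction = "updown")

This gives you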

# A tibble: 2 x 4
# Groups:   PagePath [1]
  PagePath                          PageTitle             Byline    Pageviews
  <fct>                             <fct>                 <fct>         <int>
1 /facilities/when-lighting-strikes When Lighting Strikes NA              668
2 /facilities/when-lighting-strikes When Lighting Strikes Tom Jones        NA

Once all the NAs have been cleaned up, you can use distinct or rank to get the final summarised data frame.
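
A minimal sketch of that last step, assuming tidyr 1.0.0 or later (for .direction = "updown") and that filling all three columns leaves the rows within each page path identical. The data frame is rebuilt with character columns for brevity:

```r
library(dplyr)
library(tidyr)

# Same two rows as the example data, with character columns (an assumption
# made for brevity; the original data uses factors).
df <- data.frame(
  PagePath  = c("/facilities/when-lighting-strikes",
                "/facilities/when-lighting-strikes"),
  PageTitle = c(NA, "When Lighting Strikes"),
  Byline    = c(NA, "Tom Jones"),
  Pageviews = c(668L, NA),
  stringsAsFactors = FALSE
)

df.new <- df %>%
  group_by(PagePath) %>%
  # fill each column in both directions within its page path
  fill(PageTitle, Byline, Pageviews, .direction = "updown") %>%
  ungroup() %>%
  # after filling, the rows are identical, so distinct() keeps one of them
  distinct()

print(df.new)
```

If a page path carries conflicting non-NA values, the filled rows will not be identical and distinct() will keep more than one row for it.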

Merge two rows of data in R according to rules
