I combined two data frames using bind_rows. I ended up with two rows of data like this:
Page Path                          Page Title             Byline     Pageviews
/facilities/when-lighting-strikes  NA                     NA         668
/facilities/when-lighting-strikes  When Lighting Strikes  Tom Jones  NA
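When I have duplicate page paths like these, I want to merge the rows that share a page path: drop the two NAs from the first row, keep the page title (When Lighting Strikes) and the byline (Tom Jones), and keep the pageviews value of 668 from the first row. Somehow, it seems like I need
Is it possible to do this in R with dplyr? Or is there a better way?
Answer 0 (score: 3)
A simple solution: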
library(dplyr)
# Note: summarise_each() and funs() are deprecated in current dplyr releases.
df %>% group_by(PagePath) %>% summarise_each(funs(na.omit))
# Source: local data frame [1 x 4]
#
# PagePath PageTitle Byline Pageviews
# (fctr) (fctr) (fctr) (int)
# 1 /facilities/when-lighting-strikes When Lighting Strikes Tom Jones 668
If your data is more complex, you may need a more robust approach. The data used above:
df <- structure(list(PagePath = structure(c(1L, 1L), .Label = "/facilities/when-lighting-strikes", class = "factor"),
PageTitle = structure(c(NA, 1L), .Label = "When Lighting Strikes", class = "factor"),
Byline = structure(c(NA, 1L), .Label = "Tom Jones", class = "factor"),
Pageviews = c(668L, NA)), .Names = c("PagePath", "PageTitle",
"Byline", "Pageviews"), class = "data.frame", row.names = c(NA,
-2L))
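For newer dplyr versions, the same idea can be sketched with across(). This is only an illustration, assuming dplyr 1.0.0 or later; the helper first_non_na is a hypothetical name that does not appear in the answer above, and the data frame is rebuilt with plain character columns instead of factors to keep the example short.

```r
library(dplyr)

# Same two rows as in the question, with character columns instead of factors
df <- data.frame(
  PagePath  = c("/facilities/when-lighting-strikes",
                "/facilities/when-lighting-strikes"),
  PageTitle = c(NA, "When Lighting Strikes"),
  Byline    = c(NA, "Tom Jones"),
  Pageviews = c(668L, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: first non-missing value of a vector (NA if there is none)
first_non_na <- function(x) {
  x[!is.na(x)][1]
}

# One row per PagePath; each remaining column keeps its first non-NA value
merged <- df %>%
  group_by(PagePath) %>%
  summarise(across(everything(), first_non_na), .groups = "drop")

print(merged)
```

Taking only the first non-missing value guarantees exactly one row per page path, even if a group contains several non-NA values in a column. Whether "first" is the right choice depends on your data.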
Answer 1 (score: 1)
Use the replace function in a for loop:
for (i in unique(df$Page_Path)) {
  idx <- df$Page_Path == i      # rows belonging to this page path
  views <- df$Pageviews[idx]
  # Fill missing pageviews with the first non-missing value in the group
  df$Pageviews[idx] <- replace(views, is.na(views), views[!is.na(views)][1])
}
# Keep only the rows that carry a page title
df <- subset(df, !is.na(Page_Title))
print(df)
Page_Path Page_Title Byline Pageviews
2 /facilities/when-lighting-strikes When Lighting Strikes Tom Jones 668
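The per-group replacement can also be written without an explicit loop. The following is a rough sketch using base R's ave(), not part of the answer above; the paths "/a" and "/b", the titles, and the helper name fill_first are made up for illustration.

```r
# Hypothetical example data with two page paths
df <- data.frame(
  Page_Path  = c("/a", "/a", "/b", "/b"),
  Page_Title = c(NA, "Title A", "Title B", NA),
  Pageviews  = c(668L, NA, NA, 42L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: replace NAs with the first non-missing value
fill_first <- function(x) {
  replace(x, is.na(x), x[!is.na(x)][1])
}

# ave() applies the helper within each Page_Path group
df$Pageviews <- ave(df$Pageviews, df$Page_Path, FUN = fill_first)

# Keep only the rows that carry a page title
df <- subset(df, !is.na(Page_Title))
print(df)
```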
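Answer 2 (score: 0)
Another approach (similar to the earlier dplyr solution) would be:
# Note: paste() returns character, so Pageviews ends up as a character column
df %>% group_by(PagePath) %>%
  dplyr::summarize(PageTitle = paste(na.omit(PageTitle)),
                   Byline = paste(na.omit(Byline)),
                   Pageviews = paste(na.omit(Pageviews)))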
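If a group can hold more than one non-missing value in a column, passing collapse to paste() keeps the result at one row per group. A minimal sketch, assuming dplyr 1.0.0 or later; the data, the second byline "Ann Lee", and the helper collapse_non_na are invented for this example.

```r
library(dplyr)

# Hypothetical data: one page path with two different bylines
df <- data.frame(
  PagePath  = c("/a", "/a", "/a"),
  Byline    = c(NA, "Tom Jones", "Ann Lee"),
  Pageviews = c(668L, NA, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: join all non-missing values into a single string
collapse_non_na <- function(x) {
  paste(na.omit(x), collapse = "; ")
}

merged <- df %>%
  group_by(PagePath) %>%
  summarise(
    Byline    = collapse_non_na(Byline),
    # paste() yields character, so convert the count back to integer
    Pageviews = as.integer(collapse_non_na(Pageviews)),
    .groups   = "drop"
  )

print(merged)
```

The as.integer() conversion only works here because the group has a single non-missing pageviews value. With several values you would need a real aggregation such as sum() instead.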
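Answer 3 (score: 0)
Here is an option using data.table and complete.cases. We convert the 'data.frame' to a 'data.table' (setDT(df)), group by 'PagePath', loop over the columns of the dataset (lapply(.SD, ..), and use complete.cases to remove the NA elements. complete.cases returns a logical vector that can be used for subsetting. According to {{3}}, complete.cases is much faster than na.omit, and combined with data.table it should be more efficient.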
library(data.table)
setDT(df)[, lapply(.SD, function(x) x[complete.cases(x)]), by = PagePath]
# PagePath PageTitle Byline Pageviews
#1: /facilities/when-lighting-strikes When Lighting Strikes Tom Jones 668
df <- structure(list(PagePath = structure(c(1L, 1L),
.Label = "/facilities/when-lighting-strikes", class = "factor"),
PageTitle = structure(c(NA, 1L), .Label = "When Lighting Strikes", class = "factor"),
Byline = structure(c(NA, 1L), .Label = "Tom Jones", class = "factor"),
Pageviews = c(668L, NA)), .Names = c("PagePath", "PageTitle",
"Byline", "Pageviews"), class = "data.frame", row.names = c(NA,
-2L))
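To make the subsetting step concrete, here is a small base R sketch of what complete.cases does on a single column. The vector x is just the PageTitle column from the example, written as a character vector.

```r
# The PageTitle column from the example, as a plain character vector
x <- c(NA, "When Lighting Strikes")

# complete.cases() returns a logical vector: TRUE where the value is not missing
keep <- complete.cases(x)
print(keep)

# Using that logical vector as an index drops the NA element
non_missing <- x[keep]
print(non_missing)
```

For a single atomic vector this gives the same logical vector as !is.na(x). Inside lapply(.SD, ...) the function is applied to one column at a time.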
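Answer 4 (score: 0)
Another approach uses fill. With tidyverse 1.3.0+ and dplyr 0.8.5+, you can use fill (from the tidyr package) to fill in missing values.
For more information, see https://tidyr.tidyverse.org/reference/fill.html
Data (thanks to alistaire)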
df <- structure(list(PagePath = structure(c(1L, 1L), .Label = "/facilities/when-lighting-strikes", class = "factor"),
PageTitle = structure(c(NA, 1L), .Label = "When Lighting Strikes", class = "factor"),
Byline = structure(c(NA, 1L), .Label = "Tom Jones", class = "factor"),
Pageviews = c(668L, NA)), .Names = c("PagePath", "PageTitle",
"Byline", "Pageviews"), class = "data.frame", row.names = c(NA,
-2L))
# A tibble: 2 x 4
# Groups: PagePath [1]
PagePath PageTitle Byline Pageviews
<fct> <fct> <fct> <int>
1 /facilities/when-lighting-strikes NA NA 668
2 /facilities/when-lighting-strikes When Lighting Strikes Tom Jones NA
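Code
I only did this for PageTitle, but you can repeat fill for the other columns. (A dplyr expert may know a smarter way to handle all 3 columns at once.) If your data is ordered, for example by date, you can set .direction to fill downward only (e.g. to look only at past data).
library(dplyr)
library(tidyr)
# The pipe must end the line; a line that starts with %>% is a syntax error in R
df.new <- df %>% group_by(PagePath) %>%
  fill(PageTitle, .direction = "updown")
This gives you: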
# A tibble: 2 x 4
# Groups: PagePath [1]
PagePath PageTitle Byline Pageviews
<fct> <fct> <fct> <int>
1 /facilities/when-lighting-strikes When Lighting Strikes NA 668
2 /facilities/when-lighting-strikes When Lighting Strikes Tom Jones NA
Once all the NAs have been cleaned up, you can use distinct or rank to get the final summarised data frame.
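Putting the two steps together, filling all three columns and then de-duplicating could look like the sketch below. It assumes tidyr 1.0.0 or later (for .direction = "updown") and rebuilds the example data with character columns instead of factors.

```r
library(dplyr)
library(tidyr)

# Same two rows as in the question, with character columns instead of factors
df <- data.frame(
  PagePath  = c("/facilities/when-lighting-strikes",
                "/facilities/when-lighting-strikes"),
  PageTitle = c(NA, "When Lighting Strikes"),
  Byline    = c(NA, "Tom Jones"),
  Pageviews = c(668L, NA),
  stringsAsFactors = FALSE
)

merged <- df %>%
  group_by(PagePath) %>%
  # fill() accepts several columns at once; "updown" fills down, then up
  fill(PageTitle, Byline, Pageviews, .direction = "updown") %>%
  ungroup() %>%
  # After filling, the rows within a group are identical, so keep one of each
  distinct()

print(merged)
```

Note that distinct() only collapses a group to one row when every column agrees after filling. If a group holds conflicting non-missing values, several rows will remain.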