Here is my sample dataset:
Name Course Category
1: Jason ML PT
2: Jason ML DI
3: Jason ML GT
4: Jason ML SY
5: Jason DS SY
6: Jason DS DI
7: Nancy ML PT
8: Nancy ML SY
9: Nancy DS DI
10: Nancy DS GT
11: James ML SY
12: John DS GT
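I want to remove duplicate rows so that the data frame contains only unique rows. Which duplicate gets dropped depends on the value in the Category column. The values in Category are preferred in this order: {'PT', 'DI', 'GT', 'SY'}.
My output data frame should look like this:
Name Course Category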
1: Jason ML PT
2: Jason DS DI
3: Nancy ML PT
4: Nancy DS DI
5: James ML SY
6: John DS GT
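Currently I am using a combination of a for loop and if conditions. Because the input data frame is large (10 million rows), it takes forever. Is there a better, more efficient way to do the same thing?
Answer 0 (score: 10)
Here is a code snippet that does what you need: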
df$Category <- factor(df$Category, levels = c("PT", "DI", "GT", "SY"))
df <- df[order(df$Category),]
df[!duplicated(df[,c('Name', 'Course')]),]
Output:
Name Course Category
Jason ML PT
Nancy ML PT
Jason DS DI
Nancy DS DI
John DS GT
James ML SY
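The idea is to sort the rows by the priority order and then drop duplicates. duplicated flags every row after the first occurrence of each (Name, Course) pair, so negating it keeps the highest-priority row for each pair, which is the result we want.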
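Note that the output above comes back sorted by Category rather than in the original row order. If the original order matters, the same sort-then-deduplicate idea can be applied to row indices instead of the data frame itself. This is a minimal sketch, not part of the answer above; the names `priority`, `idx`, `keep` and `result` are made up for illustration, and the data is retyped from the question.

```r
# Sample data from the question (column names as used in the answers)
df <- data.frame(
  Name     = c("Jason", "Jason", "Jason", "Jason", "Jason", "Jason",
               "Nancy", "Nancy", "Nancy", "Nancy", "James", "John"),
  Course   = c("ML", "ML", "ML", "ML", "DS", "DS",
               "ML", "ML", "DS", "DS", "ML", "DS"),
  Category = c("PT", "DI", "GT", "SY", "SY", "DI",
               "PT", "SY", "DI", "GT", "SY", "GT"),
  stringsAsFactors = FALSE
)

# Encode the preference order as factor levels: PT < DI < GT < SY
priority <- factor(df$Category, levels = c("PT", "DI", "GT", "SY"))

# Row indices sorted by priority (ties stay in their input order)
idx <- order(priority)

# Walking the rows in priority order, keep the first row seen
# for each (Name, Course) pair
keep <- idx[!duplicated(df[idx, c("Name", "Course")])]

# Sorting the kept indices restores the original row order
result <- df[sort(keep), ]
print(result)
```

With the sample data this keeps rows 1, 6, 7, 9, 11 and 12, which matches the row order of the desired output in the question.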
Answer 1 (score: 2)
Since you mentioned that you have 10 million rows, here is a data.table solution:
library(data.table)
setDT(df)[, .SD[which.min(factor(Category, levels = c("PT","DI","GT","SY")))], by=.(Name, Course)]
Result:
Name Course Category
1: Jason ML PT
2: Jason DS DI
3: Nancy ML PT
4: Nancy DS DI
5: James ML SY
6: John DS GT
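A common variant of the same idea is to collect the winning row numbers with `.I` and subset the table once, instead of subsetting `.SD` inside every group. The sketch below is an illustration under that assumption, not the answer's own code. The names `dt`, `rows` and `result` are hypothetical, and no timing claim is made because this variant is not benchmarked here.

```r
library(data.table)

# Sample data from the question
dt <- data.table(
  Name     = c("Jason", "Jason", "Jason", "Jason", "Jason", "Jason",
               "Nancy", "Nancy", "Nancy", "Nancy", "James", "John"),
  Course   = c("ML", "ML", "ML", "ML", "DS", "DS",
               "ML", "ML", "DS", "DS", "ML", "DS"),
  Category = c("PT", "DI", "GT", "SY", "SY", "DI",
               "PT", "SY", "DI", "GT", "SY", "GT")
)

# Encode the preference order as factor levels: PT < DI < GT < SY
dt[, Category := factor(Category, levels = c("PT", "DI", "GT", "SY"))]

# Within each (Name, Course) group, .I holds the group's row numbers in dt;
# which.min() on the integer codes picks the highest-priority row
rows <- dt[, .I[which.min(as.integer(Category))], by = .(Name, Course)]$V1

# Subset the table once using the collected row numbers
result <- dt[rows]
print(result)
```

Because `which.min` returns only the first minimum, each group contributes exactly one row even if the same Category appears twice in a group.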
Benchmark:
# Random resampling of `df` to generate 10 million rows
set.seed(123)
df_large = data.frame(lapply(df, sample, 1e7, replace = TRUE))
# Data prep Base R
df1 <- df_large
df1$Category <- factor(df1$Category, levels = c("PT", "DI", "GT", "SY"))
df1 <- df1[order(df1$Category), ]
# Data prep data.table
df2 <- df_large
df2$Category <- factor(df2$Category, levels = c("PT", "DI", "GT", "SY"))
setDT(df2)
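Results:
library(microbenchmark)
# Caveat: which.min(df2$Category) indexes the whole column rather than each
# group's values. The solution above uses which.min(Category), so these
# timings may not reflect that exact expression.
microbenchmark(df1[!duplicated(df1[, c('Name', 'Course')]), ],
               df2[, .SD[which.min(df2$Category)], by = .(Name, Course)])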
Unit: milliseconds
                                                      expr       min        lq      mean    median        uq       max neval
            df1[!duplicated(df1[, c("Name", "Course")]), ] 1696.7585 1719.4932 1788.5821 1774.3131 1803.7565 2085.9722   100
 df2[, .SD[which.min(df2$Category)], by = .(Name, Course)]  387.8435  409.9365  436.4381  427.6739  451.1776  558.2749   100
Data:
df = structure(list(Name = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 4L,
4L, 4L, 4L, 1L, 3L), .Label = c("James", "Jason", "John", "Nancy"
), class = "factor"), Course = structure(c(2L, 2L, 2L, 2L, 1L,
1L, 2L, 2L, 1L, 1L, 2L, 1L), .Label = c("DS", "ML"), class = "factor"),
Category = structure(c(3L, 1L, 2L, 4L, 4L, 1L, 3L, 4L, 1L,
2L, 4L, 2L), .Label = c("DI", "GT", "PT", "SY"), class = "factor")), .Names = c("Name",
"Course", "Category"), class = "data.frame", row.names = c("1:",
"2:", "3:", "4:", "5:", "6:", "7:", "8:", "9:", "10:", "11:",
"12:"))
Answer 2 (score: 0)
I suggest using the dplyr package. See below:
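library(dplyr)
# `df` is the data frame from the question
df %>%
  # Map each category to its priority rank (PT = 1, DI = 2, GT = 3, SY = 4)
  mutate(
    Category_factored = as.numeric(factor(Category, levels = c('PT', 'DI', 'GT', 'SY')))
  ) %>%
  group_by(Name, Course) %>%
  # Keep the row(s) with the lowest rank in each (Name, Course) group
  filter(
    Category_factored == min(Category_factored)
  )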
If you are new to R, install the package first with install.packages('dplyr').
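One thing to watch with the filter approach: if a group contains the same best Category more than once, every tied row is kept. A possible alternative, sketched here on the assumption that dplyr 1.0.0 or later is installed (for `slice_min`), returns exactly one row per group. The helper column `prio` and the name `result` are made up for this example.

```r
library(dplyr)

# Sample data from the question
df <- data.frame(
  Name     = c("Jason", "Jason", "Jason", "Jason", "Jason", "Jason",
               "Nancy", "Nancy", "Nancy", "Nancy", "James", "John"),
  Course   = c("ML", "ML", "ML", "ML", "DS", "DS",
               "ML", "ML", "DS", "DS", "ML", "DS"),
  Category = c("PT", "DI", "GT", "SY", "SY", "DI",
               "PT", "SY", "DI", "GT", "SY", "GT"),
  stringsAsFactors = FALSE
)

result <- df %>%
  # Position in the preference vector serves as the priority (1 = best)
  mutate(prio = match(Category, c("PT", "DI", "GT", "SY"))) %>%
  group_by(Name, Course) %>%
  # Take the single best row per group; with_ties = FALSE drops tied rows
  slice_min(prio, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  # Drop the helper column again
  select(-prio)

print(result)
```

The result has one row per (Name, Course) pair and only the original three columns, since the helper column is dropped at the end.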
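Answer 3 (score: 0)
I may be late, but I believe this is the simplest solution. Since you mentioned 10 million rows, I suggest the unique function, which is very easy to understand: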
library(data.table)
df <- data.table("Name" = c("Jason", "Jason", "Jason", "Jason", "Jason", "Jason", "Nancy", "Nancy", "Nancy", "Nancy", "James", "John"), "Course" = c("ML", "ML", "ML", "ML", "DS", "DS", "ML", "ML", "DS", "DS", "ML", "DS"), "category" = c("PT", "DI", "GT", "SY", "SY", "DI", "PT", "SY", "DI", "GT", "SY", "GT"))
# Set the priority as factor levels, sort by it, then keep the first row per (Name, Course)
unique(df[, category := factor(category, levels = c("PT", "DI", "GT", "SY"))][order(category)], by = c("Name", "Course"))
Name Course category
1: Jason ML PT
2: Nancy ML PT
3: Jason DS DI
4: Nancy DS DI
5: John DS GT
6: James ML SY