I have this data.
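I want to remove duplicate rows.
The rule for a duplicate row is:
Among rows with the same outcome (the label after the closing backtick in Quiz_answers, e.g. Positive), rows are duplicates if their UserID and Quiz_Date values are also the same.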
UserID Quiz_answers Quiz_Date
1 `a1,a2,a3`Positive 26-01-2017
1 `a1,a4,a3`Positive 26-01-2017
1 `a1,a2,a4`Negative 28-02-2017
1 `a1,a2,a3`Neutral 30-10-2017
1 `a1,a2,a4`Positive 30-11-2017
1 `a1,a2,a4`Negative 28-02-2018
2 `a1,a2,a3`Negative 27-01-2017
2 `a1,a7,a3`Neutral 28-08-2017
2 `a1,a2,a5`Negative 28-01-2017
I wrote the following code. The data frame is built like this:
UserID<-c(1,1,1,1,1,1,2,2,2)
Quiz_answers<-c("`a1,a2,a3`Positive","`a1,a4,a3`Positive","`a1,a2,a4`Negative","`a1,a2,a3`Neutral","`a1,a2,a4`Positive","`a1,a2,a4`Negative","`a1,a2,a3`Negative","`a1,a7,a3`Neutral","`a1,a2,a5`Negative")
Quiz_Date<-as.Date(c("26-01-2017","26-01-2017","28-02-2017","30-10-2017","30-11-2017","28-02-2018","27-01-2017","28-08-2017","28-01-2017"),'%d-%m-%Y')
data<-data.frame(UserID,Quiz_answers,Quiz_Date)
This is the function I expected to produce the deduplicated output:
data.removeDuplicates<- function(frames)
{
apply(frames[ ,c(grep("UserID", colnames(data)),grep("Quiz_answers", colnames(data)),grep("Quiz_Date", colnames(data)))],1,function(slice){
Outcome<-paste0("`",tail(strsplit(slice[2],split="`")[[1]],1))
cat("\n\n Searching for records: ",slice[1],Outcome,slice[3])
data<<-data[!( data$UserID == slice[1] & paste0("`",sapply(strsplit(as.character(data[,2]),'`'), tail, 1)) == c(Outcome) & data[,3]==c(slice[3])), ]
})
print(frames)
}
data.removeDuplicates(data)
print(data)
Instead, the resulting data frame is empty:
[1] UserID Quiz_answers Quiz_Date
<0 rows> (or 0-length row.names)
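According to the rule, only the second row should be removed from the data frame, since it is the only row that meets the duplicate condition. What am I doing wrong?
Answer 0: (score: 1)
Give this a try.
Your data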
df <- read.table(text="UserID Quiz_answers Quiz_Date
1 `a1,a2,a3`Positive 26-01-2017
1 `a1,a4,a3`Positive 26-01-2017
1 `a1,a2,a4`Negative 28-02-2017
1 `a1,a2,a3`Neutral 30-10-2017
1 `a1,a2,a4`Positive 30-11-2017
1 `a1,a2,a4`Negative 28-02-2018
2 `a1,a2,a3`Negative 27-01-2017
2 `a1,a7,a3`Neutral 28-08-2017
2 `a1,a2,a5`Negative 28-01-2017", header = TRUE, stringsAsFactors=FALSE)
Solution & output
library(dplyr)
ans <- df %>%
mutate(grp = sub(".*`(\\D+)$", "\\1", Quiz_answers)) %>%
group_by(grp, UserID, Quiz_Date) %>%
slice(1) %>%
ungroup() %>%
select(-grp) %>%
arrange(UserID, Quiz_Date)
# A tibble: 8 x 3
# UserID Quiz_answers Quiz_Date
# <int> <chr> <chr>
# 1 1 `a1,a2,a3`Positive 26-01-2017
# 2 1 `a1,a2,a4`Negative 28-02-2017
# 3 1 `a1,a2,a4`Negative 28-02-2018
# 4 1 `a1,a2,a3`Neutral 30-10-2017
# 5 1 `a1,a2,a4`Positive 30-11-2017
# 6 2 `a1,a2,a3`Negative 27-01-2017
# 7 2 `a1,a2,a5`Negative 28-01-2017
# 8 2 `a1,a7,a3`Neutral 28-08-2017
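The grouping key in the pipeline above comes from the `sub()` call, so it may help to look at that step in isolation. The following is a minimal sketch in base R; the vector name `answers` and the variable `outcome` are illustrative and do not appear in the answer itself.

```r
# Sample values in the same shape as Quiz_answers (copied from the question).
answers <- c("`a1,a2,a3`Positive", "`a1,a4,a3`Positive", "`a1,a2,a4`Negative")

# Keep only the trailing run of non-digit characters that follows a backtick,
# i.e. the outcome label used as the grouping key.
outcome <- sub(".*`(\\D+)$", "\\1", answers)
print(outcome)
```

The first two strings yield the same label, which is why they land in the same group once UserID and Quiz_Date also match, and `slice(1)` then keeps only the first of them.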
Answer 1: (score: 0)
You can use the sqldf package. First, find the groups for Positive, Negative and Neutral. Then use GROUP BY:
require("sqldf")
result <- sqldf("SELECT * FROM df WHERE Quiz_answers LIKE '%`Positive' GROUP BY UserID, Quiz_Date
UNION
SELECT * FROM df WHERE Quiz_answers LIKE '%`Negative' GROUP BY UserID, Quiz_Date
UNION
SELECT * FROM df WHERE Quiz_answers LIKE '%`Neutral' GROUP BY UserID, Quiz_Date")
After running it, result is:
UserID Quiz_answers Quiz_Date
1 1 `a1,a2,a3`Neutral 30-10-2017
2 1 `a1,a2,a4`Negative 28-02-2017
3 1 `a1,a2,a4`Negative 28-02-2018
4 1 `a1,a2,a4`Positive 30-11-2017
5 1 `a1,a4,a3`Positive 26-01-2017
6 2 `a1,a2,a3`Negative 27-01-2017
7 2 `a1,a2,a5`Negative 28-01-2017
8 2 `a1,a7,a3`Neutral 28-08-2017
Answer 2: (score: 0)
Here is a two-line solution using only base R:
data[,"group"] <- with(data, sub(".*`", "", Quiz_answers))
data <- data[as.integer(rownames(unique(data[, !(names(data) %in% "Quiz_answers") ]))), !(names(data) %in% "group")]
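A related way to express the same idea in base R is `duplicated()`, which avoids going through row names. This is only a sketch under a few assumptions: the data frame is rebuilt here under the illustrative name `df`, the dates are kept as plain strings, and `outcome`, `key` and `deduped` are names introduced for this example.

```r
# Rebuild the question's data (dates kept as plain strings for simplicity).
df <- data.frame(
  UserID = c(1, 1, 1, 1, 1, 1, 2, 2, 2),
  Quiz_answers = c("`a1,a2,a3`Positive", "`a1,a4,a3`Positive",
                   "`a1,a2,a4`Negative", "`a1,a2,a3`Neutral",
                   "`a1,a2,a4`Positive", "`a1,a2,a4`Negative",
                   "`a1,a2,a3`Negative", "`a1,a7,a3`Neutral",
                   "`a1,a2,a5`Negative"),
  Quiz_Date = c("26-01-2017", "26-01-2017", "28-02-2017", "30-10-2017",
                "30-11-2017", "28-02-2018", "27-01-2017", "28-08-2017",
                "28-01-2017"),
  stringsAsFactors = FALSE
)

# Outcome label: everything after the last backtick.
outcome <- sub(".*`", "", df$Quiz_answers)

# Build the comparison key, then drop every row whose key already appeared
# in an earlier row. The first occurrence of each key is kept.
key <- data.frame(UserID = df$UserID, Quiz_Date = df$Quiz_Date, outcome = outcome)
deduped <- df[!duplicated(key), ]
print(deduped)
```

Because `duplicated()` only flags later occurrences, the first row of each key survives. That may also be what goes wrong in the question's function: it appears to remove every row matching each row's key, including the row itself, which would leave nothing behind.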