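I am trying to filter my dataset to get rid of duplicated rows. However, I want to filter on two different columns, treating two rows as duplicates when their values are reversed (origin-destination data). Here is a sample of the data: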
data2<-matrix(NA, nrow = 7, ncol=5)
colnames(data2)<-c("City.Pair", "Origin.City", "Destination.City", "Total.Passengers", "Total.Revenue")
data2[,1] <- c("LIS-BRU","LIS-LHR","LAD-LIS", "LIS-LAD", "FAO-MAN", "MAN-FAO","LIS-ORY")
data2[,2]<- c("LISBON", "LISBON", "LUANDA", "LISBON", "FARO", "MANCHESTER", "LISBON")
data2[,3] <- c("BRUSSELS","LONDON", "LISBON", "LUANDA", "MANCHESTER", "FARO", "PARIS" )
data2[,4] <- c(100, 5000, 200, 200, 4000, 4000, 4000)
data2[,5] <- c(100.66, 5000.25, 200.75, 200.75, 4000.10, 4000.10, 4000.05)
data2<-data.frame(data2)
City.Pair Origin.City Destination.City Total.Passengers Total.Revenue
1 LIS-BRU LISBON BRUSSELS 100 100.66
2 LIS-LHR LISBON LONDON 5000 5000.25
3 LAD-LIS LUANDA LISBON 200 200.75
4 LIS-LAD LISBON LUANDA 200 200.75
5 FAO-MAN FARO MANCHESTER 4000 4000.1
6 MAN-FAO MANCHESTER FARO 4000 4000.1
7 LIS-ORY LISBON PARIS 4000 4000.05
I used the dplyr library and distinct, which works fine as long as the passenger numbers and revenue match exactly, as in the code below:
library(dplyr)
data4 <- distinct(data2, Total.Passengers, Total.Revenue)
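However, my real dataset has millions of rows, and sometimes the passenger numbers for the same city pair are not exactly identical (small decimal differences). I still need to filter the data and keep only one record per pair, so that I do not count passengers and revenue twice.
So I am looking for a function that lets me filter based on Origin and Destination, or on City.Pair.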
As part of my trials, I tried the anti_join function on a duplicated copy of the dataset, but it kept all the rows. I also tried union, with the same result.
data3<- data2
data5<- anti_join(data2, data3, by=c("Origin.City" = "Destination.City", "Destination.City" = "Origin.City"))
My desired output is the following:
City.Pair Origin.City Destination.City Total.Passengers Total.Revenue
1 LIS-BRU LISBON BRUSSELS 100 100.66
2 LIS-LHR LISBON LONDON 5000 5000.25
3 LAD-LIS LUANDA LISBON 200 200.75
4 FAO-MAN FARO MANCHESTER 4000 4000.1
5 LIS-ORY LISBON PARIS 4000 4000.05
What is the best function for this task? Or what could I correct in my existing code?
Thanks!
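Edit
How can I change the code to include another condition in the filtering? Say each row carries a code, and I also want to subset/filter based on that column.
Here is the new data frame: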
data2<-matrix(NA, nrow = 10, ncol=6)
colnames(data2)<-c("City.Pair", "Origin.City", "Destination.City", "Total.Passengers", "Total.Revenue", "Code")
data2[,1] <- c("LIS-BRU","LIS-LHR","LAD-LIS", "LIS-LAD", "FAO-MAN", "MAN-FAO","LIS-ORY","LAD-LIS", "LAD-LIS", "LIS-LAD")
data2[,2]<- c("LISBON", "LISBON", "LUANDA", "LISBON", "FARO", "MANCHESTER", "LISBON","LUANDA", "LUANDA", "LISBON")
data2[,3] <- c("BRUSSELS","LONDON", "LISBON", "LUANDA", "MANCHESTER", "FARO", "PARIS","LISBON", "LISBON", "LUANDA")
data2[,4] <- c(100, 5000, 200, 200, 4000, 4000, 4000, 20, 40, 40)
data2[,5] <- c(100.66, 5000.25, 200.75, 200.75, 4000.10, 4000.10, 4000.05, 20.5, 40.8, 40.8)
data2[,6] <- c("F", "G","F", "F", "A", "A", "P", "H", "I", "I")
data2<-data.frame(data2)
data2
City.Pair Origin.City Destination.City Total.Passengers Total.Revenue Code
1 LIS-BRU LISBON BRUSSELS 100 100.66 F
2 LIS-LHR LISBON LONDON 5000 5000.25 G
3 LAD-LIS LUANDA LISBON 200 200.75 F
4 LIS-LAD LISBON LUANDA 200 200.75 F
5 FAO-MAN FARO MANCHESTER 4000 4000.1 A
6 MAN-FAO MANCHESTER FARO 4000 4000.1 A
7 LIS-ORY LISBON PARIS 4000 4000.05 P
8 LAD-LIS LUANDA LISBON 20 20.5 H
9 LAD-LIS LUANDA LISBON 40 40.8 I
10 LIS-LAD LISBON LUANDA 40 40.8 I
So the desired output should be as follows:
City.Pair Origin.City Destination.City Total.Passengers Total.Revenue Code
1 LIS-BRU LISBON BRUSSELS 100 100.66 F
2 LIS-LHR LISBON LONDON 5000 5000.25 G
3 LAD-LIS LUANDA LISBON 200 200.75 F
5 FAO-MAN FARO MANCHESTER 4000 4000.10 A
7 LIS-ORY LISBON PARIS 4000 4000.05 P
8 LAD-LIS LUANDA LISBON 20 20.50 H
9 LAD-LIS LUANDA LISBON 40 40.80 I
I have made several attempts, but I cannot filter on the two columns at the same time. This is my code:
dat1<-
data2 %>%
group_by(Code, City.Pair, Origin.City, Destination.City) %>%
filter(Origin.City!=Destination.City & Destination.City!=Origin.City) %>%
summarise(Passengers=sum(Total.Passengers),
Revenue=sum(Total.Revenue))
Answer 0 (score: 0)
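We can split 'City.Pair' by "-", sort the elements of each piece of the resulting list, paste them back together to give a vector, check for duplicates ('i1'), and use that logical vector to subset the rows of 'data2'.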
i1 <- !duplicated(apply(sapply(strsplit(as.character(data2$City.Pair), "-"),
sort), 2, paste, collapse="-"))
data2[i1,]
# City.Pair Origin.City Destination.City Total.Passengers Total.Revenue
#1 LIS-BRU LISBON BRUSSELS 100 100.66
#2 LIS-LHR LISBON LONDON 5000 5000.25
#3 LAD-LIS LUANDA LISBON 200 200.75
#5 FAO-MAN FARO MANCHESTER 4000 4000.1
#7 LIS-ORY LISBON PARIS 4000 4000.05
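As a variation on the same idea, the direction-independent key could also be built from Origin.City and Destination.City instead of splitting City.Pair. The following is only a minimal sketch: `routes` is a small stand-in for 'data2', and the names `route_key` and `deduped` are made up for illustration.

```r
# Small stand-in for data2 (character columns, as in the question)
routes <- data.frame(
  City.Pair        = c("LIS-BRU", "LAD-LIS", "LIS-LAD", "FAO-MAN", "MAN-FAO"),
  Origin.City      = c("LISBON", "LUANDA", "LISBON", "FARO", "MANCHESTER"),
  Destination.City = c("BRUSSELS", "LISBON", "LUANDA", "MANCHESTER", "FARO"),
  stringsAsFactors = FALSE
)

# Order-independent key: the alphabetically smaller city always comes first,
# so LUANDA -> LISBON and LISBON -> LUANDA produce the same string
route_key <- paste(pmin(routes$Origin.City, routes$Destination.City),
                   pmax(routes$Origin.City, routes$Destination.City),
                   sep = "|")

# Keep only the first row seen for each key
deduped <- routes[!duplicated(route_key), ]
deduped
```

Because only the first occurrence of each key is kept, which direction survives depends on the row order of the input. pmin/pmax need character columns here; if the city columns are factors, convert them with as.character first.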
Or use separate together with pmin/pmax:
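library(dplyr)
library(tidyr)
# Split City.Pair into its two codes, then build one order-independent key per row
separate(data2, City.Pair, into = c("City", "City2"), sep = "-", remove = FALSE) %>%
  # paste() is needed: the second argument of duplicated() is 'incomparables', not a second key
  filter(!duplicated(paste(pmin(City, City2), pmax(City, City2)))) %>%
  select(-City, -City2)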
# City.Pair Origin.City Destination.City Total.Passengers Total.Revenue
#1 LIS-BRU LISBON BRUSSELS 100 100.66
#2 LIS-LHR LISBON LONDON 5000 5000.25
#3 LAD-LIS LUANDA LISBON 200 200.75
#4 FAO-MAN FARO MANCHESTER 4000 4000.1
#5 LIS-ORY LISBON PARIS 4000 4000.05
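For the extra condition in the edit (the 'Code' column), one possible approach is to add 'Code' to the key, so that reversed pairs are only dropped when they share the same code. This is a hedged sketch rather than a tested solution for the real data: the helper columns `city_a` and `city_b` are hypothetical names, and the data frame is rebuilt directly with data.frame() so that the numeric columns stay numeric (filling a matrix first, as in the question, stores every value as character).

```r
library(dplyr)

# Data from the edit, built directly as a data frame
data2 <- data.frame(
  City.Pair = c("LIS-BRU", "LIS-LHR", "LAD-LIS", "LIS-LAD", "FAO-MAN",
                "MAN-FAO", "LIS-ORY", "LAD-LIS", "LAD-LIS", "LIS-LAD"),
  Origin.City = c("LISBON", "LISBON", "LUANDA", "LISBON", "FARO",
                  "MANCHESTER", "LISBON", "LUANDA", "LUANDA", "LISBON"),
  Destination.City = c("BRUSSELS", "LONDON", "LISBON", "LUANDA", "MANCHESTER",
                       "FARO", "PARIS", "LISBON", "LISBON", "LUANDA"),
  Total.Passengers = c(100, 5000, 200, 200, 4000, 4000, 4000, 20, 40, 40),
  Total.Revenue = c(100.66, 5000.25, 200.75, 200.75, 4000.10,
                    4000.10, 4000.05, 20.5, 40.8, 40.8),
  Code = c("F", "G", "F", "F", "A", "A", "P", "H", "I", "I"),
  stringsAsFactors = FALSE
)

result <- data2 %>%
  # Direction-independent pair: smaller city name first, larger second
  mutate(city_a = pmin(Origin.City, Destination.City),
         city_b = pmax(Origin.City, Destination.City)) %>%
  # One row per (pair, Code); .keep_all = TRUE keeps the remaining columns
  distinct(city_a, city_b, Code, .keep_all = TRUE) %>%
  select(-city_a, -city_b)

result
```

Passenger and revenue values are deliberately left out of the key, so small decimal differences between the two directions do not prevent a pair from being treated as a duplicate. The first row of each group is the one that is kept.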