I have a data frame df, as described below.
a <- c(1:6)
b <- c("Audi,BMW,Skoda, Rackets,Toy,Football",
"Suzuki,Kawasaki,Ducati,Aprilia,Baseball, Rugby",
"Mazda, Ford, chevrolet,Mercedes,Gloves,Helmet",
"Lemon,Yamaha,Table,Kawasaki,Chair,Fruits",
"Ford, chevrolet,Bread,Ducati,Tesla,Hyundai",
"Honey,Apple,Alcohol,cake,Sweets, Mango")
df <- data.frame(a,b)
I also have two lists containing the brand names of cars and motorbikes.
cars <- c("Audi","BMW","Ford","Skoda","Mazda","chevrolet","Mercedes","Volkswagen","Tesla","Hyundai","Lamborghini","Mini-Cooper","Lexus")
motorbike <- c("Yamaha","Suzuki","Kawasaki","Harley-Davidson","Ducati","Aprilia","KTM", "Triumph","Piaggio","Hyosung","Vespa","MV-Agusta")
I use grepl and ifelse to match the words from the two lists against df$b and assign each row a label for the list it matches.
df$c<-ifelse(grepl(paste(cars, collapse="|"), df$b), "cars",
ifelse(grepl(paste(motorbike, collapse="|"),df$b), "bikes","others"))
Now I want to add a condition: a value (cars or bikes) should be assigned in df$c only if 4 or more words in a row match. I want my df to look like this:
structure(list(a = 1:6, b = structure(c(1L, 6L, 5L, 4L, 2L, 3L
), .Label = c("Audi,BMW,Skoda, Rackets,Toy,Football", "Ford, chevrolet,Bread,Ducati,Tesla,Hyundai",
"Honey,Apple,Alcohol,cake,Sweets, Mango", "Lemon,Yamaha,Table,Kawasaki,Chair,Fruits",
"Mazda, Ford, chevrolet,Mercedes,Gloves,Helmet", "Suzuki,Kawasaki,Ducati,Aprilia,Baseball, Rugby"
), class = "factor"), c = c("others", "bikes", "cars", "others",
"cars", "others")), row.names = c(NA, 6L), class = "data.frame")
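Answer 0 (score: 2)
Does this help? You can of course drop the amountcars and amountmotors columns afterwards. Also, what should happen if a row contains both more than 3 cars and more than 3 motorbikes, or do you expect that never to occur? I have updated the answer following the comments.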
library(stringr)
df$amountcars <- str_count(df$b, paste(cars, collapse="|"))
df$amountmotors <- str_count(df$b, paste(motorbike, collapse="|"))
df$c <- ifelse(df$amountcars > 3 & df$amountcars > df$amountmotors, "cars",
        ifelse(df$amountmotors > 3 & df$amountmotors > df$amountcars, "bikes", "others"))
df
  a                                              b amountcars amountmotors      c
1 1           Audi,BMW,Skoda, Rackets,Toy,Football          3            0 others
2 2 Suzuki,Kawasaki,Ducati,Aprilia,Baseball, Rugby          0            4  bikes
3 3  Mazda, Ford, chevrolet,Mercedes,Gloves,Helmet          4            0   cars
4 4       Lemon,Yamaha,Table,Kawasaki,Chair,Fruits          0            2 others
5 5     Ford, chevrolet,Bread,Ducati,Tesla,Hyundai          4            1   cars
6 6         Honey,Apple,Alcohol,cake,Sweets, Mango          0            0 others
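One caveat with counting regular-expression matches is that the pattern matches substrings, so a brand name that appears inside a longer word would be counted too. If exact matches are wanted, a minimal base R sketch could split each string on commas and compare whole tokens instead. This assumes the entries are always comma-separated and that matching should be case-sensitive; count_tokens is a hypothetical helper name, not something from stringr or from the answer above.

```r
# Data copied from the question
b <- c("Audi,BMW,Skoda, Rackets,Toy,Football",
       "Suzuki,Kawasaki,Ducati,Aprilia,Baseball, Rugby",
       "Mazda, Ford, chevrolet,Mercedes,Gloves,Helmet",
       "Lemon,Yamaha,Table,Kawasaki,Chair,Fruits",
       "Ford, chevrolet,Bread,Ducati,Tesla,Hyundai",
       "Honey,Apple,Alcohol,cake,Sweets, Mango")
cars <- c("Audi","BMW","Ford","Skoda","Mazda","chevrolet","Mercedes","Volkswagen",
          "Tesla","Hyundai","Lamborghini","Mini-Cooper","Lexus")
motorbike <- c("Yamaha","Suzuki","Kawasaki","Harley-Davidson","Ducati","Aprilia",
               "KTM","Triumph","Piaggio","Hyosung","Vespa","MV-Agusta")

# Hypothetical helper: count whole comma-separated tokens that appear in `brands`
count_tokens <- function(x, brands) {
  tokens <- strsplit(as.character(x), ",", fixed = TRUE)
  # trimws() removes the stray spaces after some commas before comparing
  vapply(tokens, function(tok) sum(trimws(tok) %in% brands), integer(1))
}

n_cars  <- count_tokens(b, cars)
n_bikes <- count_tokens(b, motorbike)

# Same rule as in the answer: more than 3 matches and strictly more than the other list
label <- ifelse(n_cars > 3 & n_cars > n_bikes, "cars",
         ifelse(n_bikes > 3 & n_bikes > n_cars, "bikes", "others"))
```

On the data from the question this gives the same labels as the stringr version; the two approaches would only differ when a brand name occurs as part of another word.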
Following the comments, if you want to do this for 9 vectors of strings: first create all the vectors:
cars <- c("Audi","BMW","Ford","Skoda","Mazda","chevrolet","Mercedes","Volkswagen","Tesla","Hyundai","Lamborghini","Mini-Cooper","Lexus")
motorbike <- c("Yamaha","Suzuki","Kawasaki","Harley-Davidson","Ducati","Aprilia","KTM", "Triumph","Piaggio","Hyosung","Vespa","MV-Agusta")
Then put them in a list and add names:
list1 <- list(cars, motorbike)
names(list1) <- c("cars", "motorbike")
Finally, run the following code:
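# Count the matches per category once: one row per row of df, one column per vector in list1
counts <- sapply(list1, function(x) str_count(df$b, paste0(x, collapse = "|")))
# Use the name of the best-matching category if its count exceeds 3 (which.max takes the first on ties)
df$d <- ifelse(apply(counts, 1, max) > 3,
               apply(counts, 1, function(x) names(list1)[which.max(x)]),
               "others")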
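Basically, for each row it finds the largest match count across the vectors; if that count is greater than 3 it assigns the name of the corresponding vector, otherwise it assigns "others".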
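Note that which.max returns the first maximum, so a row with equally many matches in two vectors would get the name of whichever vector comes first in the list, whereas the two-column version earlier labels such a row "others". A possible sketch that makes the tie rule explicit, using only base R, is shown here. The second row of b is made-up data added to show a tie, and count_matches is a hypothetical helper meant to mirror what str_count does with this pattern.

```r
# First row is from the question; the second row is made-up data with 4 cars and 4 motorbikes
b <- c("Mazda, Ford, chevrolet,Mercedes,Gloves,Helmet",
       "Audi,BMW,Ford,Tesla,Yamaha,Suzuki,Ducati,Aprilia")

list1 <- list(
  cars = c("Audi","BMW","Ford","Skoda","Mazda","chevrolet","Mercedes","Volkswagen",
           "Tesla","Hyundai","Lamborghini","Mini-Cooper","Lexus"),
  motorbike = c("Yamaha","Suzuki","Kawasaki","Harley-Davidson","Ducati","Aprilia",
                "KTM","Triumph","Piaggio","Hyosung","Vespa","MV-Agusta")
)

# Hypothetical helper: number of regex matches per string, in base R
count_matches <- function(x, words) {
  lengths(regmatches(x, gregexpr(paste(words, collapse = "|"), x)))
}

# Matrix of counts: one row per string, one column per category
counts <- sapply(list1, function(w) count_matches(b, w))

top   <- apply(counts, 1, max)      # highest count in each row
n_top <- rowSums(counts == top)     # number of categories sharing that highest count

# Assign a category only when it is the unique maximum and exceeds 3
d <- ifelse(top > 3 & n_top == 1,
            names(list1)[max.col(counts, ties.method = "first")],
            "others")
```

Here the first row is labelled "cars" and the tied second row falls back to "others". Whether ties should go to "others" or to a preferred category is a design choice the question leaves open.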