My large dataset (Groceries) has a column of character data (Fruits). All of the text is lowercase and contains no punctuation.
It looks like this:
# Groceries Data Frame
Id Groceries$Fruits
1 apple orange banana lemon grapefruit
2 grapes tomato passion fruit
3 strawberry orange kiwi
4 lemon orange passion fruit grapefruit lime
5 lemon orange passion fruit grapefruit lime peach
...
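I am trying to select all rows (out of 3,320) whose Fruits column contains 5 specific fruits: orange, lime, lemon, grapefruit and passion fruit. Initially, I am only interested in rows that contain all 5 of these fruits and no others, so of the 5 rows shown, the only one that should be filtered/subset is row 4. The fruits within a row do not have to be in any particular order.
The data are actually answers to a test, so ultimately I am interested in determining who got 0/5 fruits, who got 1/5, 2/5, and so on.
So far I have tried 2 approaches, neither of which worked. First, I tried grep(), but the resulting data frame had no rows in it.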
# 1st attempt with grep()
Correct_fruits <- Groceries[grep("orange, lemon, lime, passion fruit, grapefruit", Groceries$Fruits), ]
Then I tried filter(), but instead of selecting only the rows with the 5 fruits I am looking for, it selected every row that contains any one of the 5.
# 2nd attempt with filter
library(dplyr)
library(stringr)
CorrectFruits <- c("lemon", "orange", "passion fruit", "grapefruit",
"lime")
filter <- Groceries %>%
select(Id, Fruits) %>%
filter(str_detect(tolower(Fruits), pattern = CorrectFruits))
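The result I am after first is a new data frame that has all the columns of the Groceries table, but only the rows for people who correctly selected all 5 fruits.
Next, it would be nice to select the opposite: everyone who did not get all 5 correct.
Finally, I would like to be able to group participants by how many they got correct, i.e. row 1 got 3 correct, row 2 got 1 correct, row 3 got 1 correct.
Any help would be greatly appreciated!
Here is a sample of some of the columns: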
# Groceries
Id Age Nationality Colour question Fruits question
1 26-35 Canadian Red apple orange banana lemon grapefruit
2 26-35 US Blue grapes tomato passion fruit
3 46-55 Canadian Red strawberry orange kiwi
4 55+ US Red lemon orange passion fruit grapefruit lime
5 36-45 British Green lemon orange passion fruit grapefruit lime peach
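Answer 0 (score: 1)
You may need to clarify what should happen to answers that have all 5 correct fruits plus some extras, but this should help you. I replaced every instance of "passion fruit" with "passionfruit" to make things easier: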
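library(stringr)  # provides str_count()

# Collapse the two-word fruit into one token so it counts as a single word
df$Fruits <- gsub("passion fruit", "passionfruit", df$Fruits)
CorrectFruits <- c("lemon", "orange", "passionfruit", "grapefruit", "lime")
# Count how many of the correct fruits appear in each row
df$Count <- str_count(df$Fruits, paste(CorrectFruits, collapse = '|'))
# Reset to 0 when all 5 are present but the row also lists extra fruits
df$Count <- ifelse((df$Count == 5 & str_count(df$Fruits, '\\w+') > 5), 0, df$Count)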
which gives:
ID Fruits Count
1 apple orange banana lemon grapefruit 3
2 grapes tomato passionfruit 1
3 strawberry orange kiwi 1
4 lemon orange passionfruit grapefruit lime 5
5 lemon orange passionfruit grapefruit lime peach 0
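The first line does the "passionfruit" replacement, then str_count counts the occurrences of the correct fruits in df$Fruits. Finally, if all 5 fruits are correct but the row also contains extra fruits, Count is reset to 0.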
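If it helps, here is a minimal base R sketch of the same idea that splits each answer into tokens and keeps the number of correct fruits separate from the number of extras, so that "all 5 plus a peach" is not forced to 0. The helper name score_answer and the hits/extras columns are made up for illustration and are not part of the answer above; it assumes fruits are separated by whitespace and that "passion fruit" is the only two-word fruit.

```r
# Hypothetical helper: count correct and extra fruits in one answer string.
correct <- c("lemon", "orange", "passionfruit", "grapefruit", "lime")

score_answer <- function(x, correct) {
  # Keep the two-word fruit together as a single token
  x <- gsub("passion fruit", "passionfruit", x, fixed = TRUE)
  # Split on whitespace and drop repeated fruits
  tokens <- unique(strsplit(trimws(x), "\\s+")[[1]])
  c(hits = sum(tokens %in% correct), extras = sum(!tokens %in% correct))
}

fruits <- c("apple orange banana lemon grapefruit",
            "grapes tomato passion fruit",
            "strawberry orange kiwi",
            "lemon orange passion fruit grapefruit lime",
            "lemon orange passion fruit grapefruit lime peach")

# One row per answer, with columns "hits" and "extras"
scores <- t(vapply(fruits, score_answer, integer(2),
                   correct = correct, USE.NAMES = FALSE))

# Rows with all 5 correct fruits and nothing else
exact <- scores[, "hits"] == length(correct) & scores[, "extras"] == 0
which(exact)
```

Because the tokens are compared as whole words with %in%, "grapes" is never mistaken for "grapefruit" and partial matches cannot inflate the count.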
Answer 1 (score: 1)
This answer comes after seeing the other clever solutions here.
ID <- c(1:5)
Age <- c("26-35", "26-35", "46-55", "55+", "56-45")
Nationality <- c("Canadian", "US", "Canadian", "US", "British")
Color <- c("Correct", "Incorrect", "Incorrect", "Correct", "Correect")
Fruits <- c("pineapple",
"apple",
"apple orange kiwi fifth",
"orange apple pineapple kiwi fifth",
"pineapple orange apple fifth kiwi"
)
df <- data.frame(ID, Age, Nationality, Color, Fruits)
df
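heds1's reply looks good. However, you want to be careful with string searches such as grepl, because they can match inside compound words. For example, consider the word pineapple; it contains pine and apple. Notice that searching for apple here also returns pineapple.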
filter(df, grepl("apple", Fruits))
ID Age Nationality Color Fruits
1 1 26-35 Canadian Correct pineapple
2 2 26-35 US Incorrect apple
3 3 46-55 Canadian Incorrect apple orange kiwi fifth
4 4 55+ US Correct orange apple pineapple kiwi fifth
5 5 56-45 British Correect pineapple orange apple fifth kiwi
The answer provided by sumshyftw is great, and I enjoyed learning something from it. To illustrate my point, though, an unbounded string search can throw off your count:
CorrectFruits <- c("apple")
df$Count <- str_count(df$Fruits, paste(CorrectFruits, collapse = '|'))
df$Count <- ifelse((df$Count == 5 & str_count(df$Fruits, '\\w+') > 5), 0, df$Count)
df
ID Age Nationality Color Fruits Count
1 1 26-35 Canadian Correct pineapple 1
2 2 26-35 US Incorrect apple 1
3 3 46-55 Canadian Incorrect apple orange kiwi fifth 1
4 4 55+ US Correct orange apple pineapple kiwi fifth 2
5 5 56-45 British Correect pineapple orange apple fifth kiwi 2
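Notice that even though the only correct fruit is apple, pineapple is still counted as a correct answer. To get around this, wrap your words in \\b (a word boundary).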
CorrectFruits <- c("\\bapple\\b")
df$Count <- str_count(df$Fruits, paste(CorrectFruits, collapse = '|'))
df$Count <- ifelse((df$Count == 5 & str_count(df$Fruits, '\\w+') > 5), 0, df$Count)
df
ID Age Nationality Color Fruits Count
1 1 26-35 Canadian Correct pineapple 0
2 2 26-35 US Incorrect apple 1
3 3 46-55 Canadian Incorrect apple orange kiwi fifth 1
4 4 55+ US Correct orange apple pineapple kiwi fifth 1
5 5 56-45 British Correect pineapple orange apple fifth kiwi 1
R no longer counts pineapple as apple.
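If you would rather not type \\b around every word, the pattern can be built from a plain vector. This is only a sketch in base R: count_matches is a hypothetical helper written for this example (it uses gregexpr rather than str_count), and it assumes the fruit names contain no regex metacharacters.

```r
fruits <- c("pineapple", "apple", "orange apple pineapple kiwi fifth")
correct <- c("apple", "kiwi")

# Build "\\b(apple|kiwi)\\b" so only whole words can match
pattern <- paste0("\\b(", paste(correct, collapse = "|"), ")\\b")

# Hypothetical helper: number of pattern matches in each string
count_matches <- function(x, pattern) {
  m <- gregexpr(pattern, x, perl = TRUE)
  # gregexpr returns -1 when there is no match, so count positive positions
  vapply(m, function(pos) sum(pos > 0), integer(1))
}

counts <- count_matches(fruits, pattern)
counts
```

Here "pineapple" on its own contributes nothing, and the third string counts apple and kiwi but not pineapple.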
But for the record, sumshyftw deserves the credit. Here is that approach applied to my example:
CorrectFruits <- c("\\bapple\\b", "\\borange\\b", "\\bpineapple\\b", "\\bfifth\\b", "\\bkiwi\\b")
df$Count <- str_count(df$Fruits, paste(CorrectFruits, collapse = '|'))
df$Count <- ifelse((df$Count == 5 & str_count(df$Fruits, '\\w+') > 5), 0, df$Count)
df
ID Age Nationality Color Fruits Count
1 1 26-35 Canadian Correct pineapple 1
2 2 26-35 US Incorrect apple 1
3 3 46-55 Canadian Incorrect apple orange kiwi fifth 4
4 4 55+ US Correct orange apple pineapple kiwi fifth 5
5 5 56-45 British Correect pineapple orange apple fifth kiwi 5
To show only the rows that contain all five fruits:
df2 <- filter(df, df$Count == 5)
df2
ID Age Nationality Color Fruits Count
1 4 55+ US Correct orange apple pineapple kiwi fifth 5
2 5 56-45 British Correect pineapple orange apple fifth kiwi 5
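The question also asks for the opposite selection and for a breakdown by score. Once a Count column exists, both can be sketched in base R as follows. The Count values below are copied from the output above, and the object names (not_all_correct, score_table, by_score) are just illustrative.

```r
# Count values taken from the example output above
df <- data.frame(ID = 1:5, Count = c(1L, 1L, 4L, 5L, 5L))

# Everyone who got all five, and everyone who did not
all_correct <- df[df$Count == 5, ]
not_all_correct <- df[df$Count != 5, ]

# How many participants got 0/5, 1/5, ..., 5/5
# (the factor levels keep scores that nobody achieved, with a count of 0)
score_table <- table(factor(df$Count, levels = 0:5))

# Which IDs fall into each score group
by_score <- split(df$ID, df$Count)
```

With dplyr loaded, filter(df, Count != 5) should give the same subset as not_all_correct.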
Answer 2 (score: 0)
Here is one way to do it using grepl together with a list of target keywords.
df <- structure(list(v1 = structure(1:4, .Label = c("row1", "row2",
"row3", "row4"), class = "factor"), v2 = structure(c(2L, 4L,
1L, 3L), .Label = c("another invalid row", "apple banana mandarin orange pear",
"banana apple mandarin pear orange", "not a valid row"), class = "factor")), class = "data.frame", row.names = c(NA,
-4L))
targets <- c("banana", "apple", "orange", "pear", "mandarin")
bool_df <- as.data.frame(sapply(targets, grepl, df$v2))
match_rows <- which(rowSums(bool_df) == 5)
df <- df[match_rows,]
You can then change the condition in the match_rows vector by changing the 5 to a 4, for example to match rows with four of the fruits, and so on.
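As a rough sketch of that variation, the row sums can be kept as a per-row score rather than used only in a single comparison. The sample vector v2 and the names hits and n_matched are made up for this example. Note that grepl matches substrings, so a target such as "apple" would also match "pineapple"; that is fine for these targets but worth checking on real data.

```r
# Made-up sample answers for illustration
v2 <- c("apple banana mandarin orange pear",
        "not a valid row",
        "another invalid row",
        "banana apple mandarin pear orange",
        "apple pear")
targets <- c("banana", "apple", "orange", "pear", "mandarin")

# Logical matrix: one row per answer, one column per target fruit
hits <- sapply(targets, grepl, v2, fixed = TRUE)

# Number of target fruits found in each answer
n_matched <- rowSums(hits)

# Rows with at least four of the five targets
at_least_four <- which(n_matched >= 4)
```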