Filtering for multiple strings in the same column in R

Time: 2019-06-08 05:14:56

Tags: r string filter subset

My large data set (Groceries) has a column (Fruits) containing character data. All of it is lowercase and contains no punctuation.

It looks like this:

# Groceries Data Frame
Id    Groceries$Fruits
1     apple orange banana lemon grapefruit
2     grapes tomato passion fruit
3     strawberry orange kiwi
4     lemon orange passion fruit grapefruit lime
5     lemon orange passion fruit grapefruit lime peach
  ...

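I am trying to select all rows (out of 3,320) whose Fruits column contains 5 specific fruits (orange, lime, lemon, grapefruit and passion fruit). Initially, I am only interested in rows that contain all 5 of these fruits and no others. So of the 5 rows shown, the only one that should be filtered/subset is row 4. The results do not need to be in any particular order.

The data are actually answers to a test, so ultimately I am interested in determining who got 0/5 fruits, who got 1/5, 2/5, and so on.

So far I have tried 2 approaches, both to no avail. First I tried grep(), but the resulting data frame contained no rows.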

# 1st attempt with grep()
Correct_fruits <- Groceries[grep("orange, lemon, lime, passion fruit, grapefruit", Groceries$Fruits), ]

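Then I tried filter(), but the selected rows were not limited to those containing all 5 fruits I am looking for. Instead, it selected every row containing any one of the 5 fruits.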

# 2nd attempt with filter
library(dplyr)
library(stringr)
CorrectFruits <- c("lemon", "orange", "passion fruit", "grapefruit", 
"lime")

filter <- Groceries %>%
  select(Id, Fruits) %>%
  filter(str_detect(tolower(Fruits), pattern = CorrectFruits))

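The result I am initially after is a new data frame with all the columns of the Groceries table, but only the rows of the people who correctly selected all 5 fruits.

Next, it would be cool to select the opposite: everyone who did not get all 5 correct.

Finally, I would like to be able to group participants by how many they got correct, i.e. row 1 got 3 correct, row 2 got 1 correct, row 3 got 1 correct.

Any help would be greatly appreciated!

Here is a sample of some of the columns: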

# Groceries
Id   Age      Nationality    Colour question   Fruits question
1    26-35    Canadian       Red               apple orange banana lemon grapefruit
2    26-35    US             Blue              grapes tomato passion fruit
3    46-55    Canadian       Red               strawberry orange kiwi
4    55+      US             Red               lemon orange passion fruit grapefruit lime
5    36-45    British        Green             lemon orange passion fruit grapefruit lime peach

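3 answers:

Answer 0: (score: 1)

Answers that contain all 5 fruits plus some extras may need further clarification, but this should help you along. I replaced all instances of "passion fruit" with "passionfruit" to make things easier: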

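library(stringr)
# Collapse the two-word name so that each fruit is a single word
df$Fruits <- gsub("passion fruit", "passionfruit", df$Fruits)
CorrectFruits <- c("lemon", "orange", "passionfruit", "grapefruit", 
                   "lime")
# Count how many matches of the correct fruits occur in each row
df$Count <- str_count(df$Fruits, paste(CorrectFruits, collapse = '|'))
# All 5 correct but more than 5 words in total: reset the count to 0
df$Count <- ifelse((df$Count == 5 & str_count(df$Fruits, '\\w+') > 5), 0, df$Count)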

which gives

ID                                          Fruits Count
1            apple orange banana lemon grapefruit     3
2                      grapes tomato passionfruit     1
3                          strawberry orange kiwi     1
4       lemon orange passionfruit grapefruit lime     5
5 lemon orange passionfruit grapefruit lime peach     0

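The first line does the passion fruit replacement, then str_count counts all occurrences of the correct fruits in df$Fruits. Finally, if all 5 fruits are correct but there are extra fruits, Count is reset to 0.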
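Once the Count column exists, the three things asked for in the question (everyone with all 5, everyone else, and a breakdown by score) follow from simple subsetting. The sketch below is only an illustration on the five sample rows. It assumes "passion fruit" has already been collapsed to "passionfruit", and it swaps str_count for a small base R helper (count_matches, a name made up here) so that it runs without any packages.

```r
# Sample rows from the question, with "passion fruit" already collapsed
df <- data.frame(
  ID = 1:5,
  Fruits = c("apple orange banana lemon grapefruit",
             "grapes tomato passionfruit",
             "strawberry orange kiwi",
             "lemon orange passionfruit grapefruit lime",
             "lemon orange passionfruit grapefruit lime peach"),
  stringsAsFactors = FALSE
)
correct_fruits <- c("lemon", "orange", "passionfruit", "grapefruit", "lime")
pattern <- paste(correct_fruits, collapse = "|")

# Hypothetical helper: base R stand-in for stringr::str_count()
count_matches <- function(x, pattern) {
  lengths(regmatches(x, gregexpr(pattern, x, perl = TRUE)))
}

df$Count <- count_matches(df$Fruits, pattern)
n_words  <- count_matches(df$Fruits, "\\w+")
# All 5 correct but extra words present: reset to 0, as in the answer
df$Count <- ifelse(df$Count == 5 & n_words > 5, 0, df$Count)

all_correct     <- df[df$Count == 5, ]  # got all 5 and nothing else
not_all_correct <- df[df$Count != 5, ]  # everyone else
score_table     <- table(df$Count)      # number of people per score
```

On the real data, the same three lines at the end would be applied to Groceries after adding its Count column.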

Answer 1: (score: 1)

This is an answer written after seeing the other, ingenious solutions.

library(dplyr)    # filter()
library(stringr)  # str_count()

# Small example data set for illustration
ID <- c(1:5)
Age <- c("26-35", "26-35", "46-55", "55+", "56-45")
Nationality <- c("Canadian", "US", "Canadian", "US", "British")
Color <- c("Correct", "Incorrect", "Incorrect", "Correct", "Correect")
Fruits <- c("pineapple", 
            "apple", 
            "apple orange kiwi fifth",
            "orange apple pineapple kiwi fifth",
            "pineapple orange apple fifth kiwi"
            )
df <- data.frame(ID, Age, Nationality, Color, Fruits)
df

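The reply from heds1 looks good. However, you want to be careful with string matching functions such as grepl, because they can also match inside compound words. For example, consider the word pineapple: it contains pine and apple. Note that searching for apple here also returns pineapple.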

filter(df, grepl("apple", Fruits))

  ID   Age Nationality     Color                            Fruits
1  1 26-35    Canadian   Correct                         pineapple
2  2 26-35          US Incorrect                             apple
3  3 46-55    Canadian Incorrect           apple orange kiwi fifth
4  4   55+          US   Correct orange apple pineapple kiwi fifth
5  5 56-45     British  Correect pineapple orange apple fifth kiwi

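The answer provided by sumshyftw is great, and I enjoyed learning something from it. However, to illustrate my point, an unrestricted string search can throw off your counts: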

CorrectFruits <- c("apple")
df$Count <- str_count(df$Fruits, paste(CorrectFruits, collapse = '|'))
df$Count <- ifelse((df$Count == 5 & str_count(df$Fruits, '\\w+') > 5), 0, df$Count)
df

  ID   Age Nationality     Color                            Fruits Count
1  1 26-35    Canadian   Correct                         pineapple     1
2  2 26-35          US Incorrect                             apple     1
3  3 46-55    Canadian Incorrect           apple orange kiwi fifth     1
4  4   55+          US   Correct orange apple pineapple kiwi fifth     2
5  5 56-45     British  Correect pineapple orange apple fifth kiwi     2

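Note that although the only correct fruit is apple, pineapple is still counted as a correct answer. To overcome this, wrap your words in \\b (the regular expression word boundary).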

CorrectFruits <- c("\\bapple\\b")
df$Count <- str_count(df$Fruits, paste(CorrectFruits, collapse = '|'))
df$Count <- ifelse((df$Count == 5 & str_count(df$Fruits, '\\w+') > 5), 0, df$Count)
df

  ID   Age Nationality     Color                            Fruits Count
1  1 26-35    Canadian   Correct                         pineapple     0
2  2 26-35          US Incorrect                             apple     1
3  3 46-55    Canadian Incorrect           apple orange kiwi fifth     1
4  4   55+          US   Correct orange apple pineapple kiwi fifth     1
5  5 56-45     British  Correect pineapple orange apple fifth kiwi     1

R no longer counts pineapple as apple.
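With a longer answer key, typing \\b around every word by hand gets tedious. One possible way to build the bounded pattern is with paste0, sketched below. The two-word answer key and the count_matches helper are made up for this illustration, and base R's gregexpr is used in place of str_count so the snippet has no package dependencies.

```r
fruits <- c("pineapple",
            "apple",
            "apple orange kiwi fifth",
            "orange apple pineapple kiwi fifth")
correct_fruits <- c("apple", "kiwi")  # hypothetical answer key

# Wrap each word in \\b ... \\b, then join the alternatives with "|"
bounded   <- paste0("\\b", correct_fruits, "\\b", collapse = "|")
unbounded <- paste(correct_fruits, collapse = "|")

# Hypothetical helper: base R stand-in for stringr::str_count()
count_matches <- function(x, pattern) {
  lengths(regmatches(x, gregexpr(pattern, x, perl = TRUE)))
}

count_matches(fruits, unbounded)  # 1 1 2 3: "pineapple" is counted as "apple"
count_matches(fruits, bounded)    # 0 1 2 2: whole words only
```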

But for the record, sumshyftw deserves the credit for the heavy lifting in my example:

CorrectFruits <- c("\\bapple\\b", "\\borange\\b", "\\bpineapple\\b", "\\bfifth\\b", "\\bkiwi\\b")
df$Count <- str_count(df$Fruits, paste(CorrectFruits, collapse = '|'))
df$Count <- ifelse((df$Count == 5 & str_count(df$Fruits, '\\w+') > 5), 0, df$Count)
df

  ID   Age Nationality     Color                            Fruits Count
1  1 26-35    Canadian   Correct                         pineapple     1
2  2 26-35          US Incorrect                             apple     1
3  3 46-55    Canadian Incorrect           apple orange kiwi fifth     4
4  4   55+          US   Correct orange apple pineapple kiwi fifth     5
5  5 56-45     British  Correect pineapple orange apple fifth kiwi     5

To show only the rows that contain all five fruits:

df2 <- filter(df, Count == 5)
df2

  ID   Age Nationality    Color                            Fruits Count
1  4   55+          US  Correct orange apple pineapple kiwi fifth     5
2  5 56-45     British Correect pineapple orange apple fifth kiwi     5
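If the goal is strictly "all five and nothing else", a different technique from the regex counting above is to split each answer into words and compare the set of words with the answer key. The following is a rough sketch under the assumption that every fruit is a single word (so "passion fruit" would first have to be collapsed to "passionfruit"); the third sample answer with an extra "peach" is invented for the illustration.

```r
fruits <- c("apple orange kiwi fifth",
            "orange apple pineapple kiwi fifth",
            "pineapple orange apple fifth kiwi peach")  # extra fruit added here
correct_fruits <- c("apple", "orange", "pineapple", "fifth", "kiwi")

# Split each answer on spaces, then test set equality with the answer key
tokens <- strsplit(fruits, " ", fixed = TRUE)
exact  <- vapply(tokens,
                 function(words) setequal(words, correct_fruits),
                 logical(1))
exact  # FALSE TRUE FALSE
```

Because whole words are compared, "pineapple" can never be mistaken for "apple" here. Note that setequal ignores duplicates, so an answer that repeats a correct fruit would still count as an exact match.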

Answer 2: (score: 0)

Here is one approach that uses grepl with a list of target keywords.

df <- structure(list(v1 = structure(1:4, .Label = c("row1", "row2", 
"row3", "row4"), class = "factor"), v2 = structure(c(2L, 4L, 
1L, 3L), .Label = c("another invalid row", "apple banana mandarin orange pear", 
"banana apple mandarin pear orange", "not a valid row"), class = "factor")), class = "data.frame", row.names = c(NA, 
-4L))

targets <- c("banana", "apple", "orange", "pear", "mandarin")
bool_df <- as.data.frame(sapply(targets, grepl, df$v2))
match_rows <- which(rowSums(bool_df) == 5)
df <- df[match_rows,]

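You can then change the condition used to build the match_rows vector, for example by changing 5 to 4 to match rows containing four of the fruits, and so on.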
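As a small sketch of that idea, the row sums can be kept as a per-row score and compared against whatever threshold is needed. The sample answers below are made up for the illustration, and fixed = TRUE is an added assumption that the targets are plain text rather than regular expressions.

```r
answers <- c("apple banana mandarin orange pear",
             "not a valid row",
             "banana apple pear orange",
             "banana apple mandarin pear orange")
targets <- c("banana", "apple", "orange", "pear", "mandarin")

# One logical column per target, one row per answer
hits  <- sapply(targets, grepl, x = answers, fixed = TRUE)
score <- rowSums(hits)  # number of targets found in each answer

which(score == 5)  # rows containing all five targets
which(score >= 4)  # relaxed condition: four or more targets
```

Like any plain grepl search, this matches substrings, so a target such as "apple" would also be found inside "pineapple".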