Question

I would like to ask a question about regular expressions in R. Here is the code:

string <- "BROCA DIN 338 4,00 MM"

string_list <- regmatches(x=string, gregexpr("[0-9]+\\s\\w+", text=string))

words <- sapply(string_list, toString)
words[is.na(string_list)] <- NA

words <- gsub(pattern = "[[:punct:]]+", replacement="", x=words)

regmatches(x=words, gregexpr("[0-9]+[[:space:]]+\\w+", text=words))

After that, the result is:

[1] "338 4" "00 MM"
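For reference, the intermediate values of the steps above can be traced as follows. This is only a sketch; the names step1, step2 and step3 are introduced here for illustration and are not part of the original code.

```r
string <- "BROCA DIN 338 4,00 MM"

# Step 1: digits, one whitespace character, then word characters.
# The comma is not a word character, so "4,00" is split around it.
step1 <- regmatches(string, gregexpr("[0-9]+\\s\\w+", string))

# Step 2: join the pieces into one string, then strip punctuation.
step2 <- gsub("[[:punct:]]+", "", sapply(step1, toString))

# Step 3: the same kind of pattern applied again splits the cleaned string.
step3 <- regmatches(step2, gregexpr("[0-9]+[[:space:]]+\\w+", step2))

print(step1[[1]])
print(step2)
print(step3[[1]])
```

Note that regmatches() returns a list with one element per input string, which is why the pieces are read with [[1]].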

My problem is that I have to use grepl in the following way:

dose_1KG <- subset(new_df_1, (grepl("338 4 MM",new_df_1$xprod,fixed=TRUE)==TRUE) |
                         (grepl("338 4MM",new_df_1$xprod,fixed=TRUE)==TRUE) |
                         (grepl("338 4 0 MM",new_df_1$xprod,fixed=TRUE)==TRUE) |
                         (grepl("338 4 0MM",new_df_1$xprod,fixed=TRUE)==TRUE) |
                         (grepl("338 4 00 MM",new_df_1$xprod,fixed=TRUE)==TRUE) |
                         (grepl("338 4 00MM",new_df_1$xprod,fixed=TRUE)==TRUE))

Is there a way, using a regular expression or some function in R, to do this automatically without typing out every combination of "338 4 00 MM"?

Thank you very much.

Best regards!

Answer 1

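The operation may be an attempt to select the rows where the xprod variable matches (exactly) one of the strings given in the pattern arguments. If so, you can do that economically with: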

dose_1KG <- subset(new_df_1, xprod %in%
                                 c("338 4 MM","338 4MM","338 4 0 MM","338 4 0MM","338 4 00 MM","338 4 00MM"))
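To make the "exact match" behaviour concrete, here is a small sketch. The data frame below is a hypothetical stand-in for new_df_1 (the real data was not posted), and the name targets is introduced only for this example.

```r
# Hypothetical stand-in for new_df_1; the real data was not shown
new_df_1 <- data.frame(
  xprod = c("338 4 MM", "BROCA DIN 338 4 MM", "338 4 00MM", "338 5 MM"),
  stringsAsFactors = FALSE
)

targets <- c("338 4 MM", "338 4MM", "338 4 0 MM",
             "338 4 0MM", "338 4 00 MM", "338 4 00MM")

# %in% keeps only rows whose entire value equals one of the targets
exact <- subset(new_df_1, xprod %in% targets)
print(exact$xprod)
```

A value such as "BROCA DIN 338 4 MM" is not kept here because it merely contains a target rather than being equal to one. That case needs partial matching.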

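There is no need to use "$" to pull a variable from the same data frame into the second argument of subset. It works, but the whole point of subset is to let you avoid that. If the question is how to identify rows where those expressions might be partial matches, then grepl may be needed, but it can still be simplified with a paste0 call that joins the strings with the "|" (alternation) operator, again without using "$":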

dose_1KG <- subset( new_df_1,
                     grepl( paste0( c("338 4 MM","338 4MM","338 4 0 MM","338 4 0MM","338 4 00 MM","338 4 00MM"),
                                    collapse="|"),  # no fixed = TRUE: "|" must be read as regex alternation
                            xprod)
                   )

Caveat: untested, since no MCVE was provided.
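As a possible alternative to listing every variant, a single regular expression might cover them all. This is only a sketch. It assumes the variants differ solely in an optional " 0" or " 00" after "338 4" and an optional space before "MM". The sample data frame is hypothetical and stands in for the real new_df_1.

```r
# Hypothetical sample data; the real new_df_1 was not shown
new_df_1 <- data.frame(
  xprod = c("BROCA DIN 338 4 MM", "BROCA DIN 338 4MM",
            "BROCA DIN 338 4 0 MM", "BROCA DIN 338 4 00MM",
            "BROCA DIN 338 5 MM", "BROCA DIN 338 4 000 MM"),
  stringsAsFactors = FALSE
)

# "338 4", optionally a space plus one or two zeros,
# then an optional space, then "MM"
pattern <- "338 4( 0{1,2})? ?MM"

# Partial (regex) match against xprod, without "$" inside subset
dose_1KG <- subset(new_df_1, grepl(pattern, xprod))
print(dose_1KG$xprod)
```

If other spellings occur in the data (for example a comma as in "4,00", or lower-case "mm"), the pattern would need to be widened accordingly.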

Regular expressions in R (multiple combinations of a string)
