Question

Here is my regular expression: https://regex101.com/r/UjWanf/1

(^\d+?\.?\d{0,2})([A-Za-z]+|\s[A-Za-z]+)

Escaped for R:

"(^\\d+?\\.?\\d{0,2})([A-Za-z]+|\\s[A-Za-z]+)"

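A side note on the escaping step: in an ordinary R string literal every regex backslash has to be doubled, while a raw string literal (available since R 4.0.0) lets you write it once. The sketch below only illustrates that the two spellings hold the same characters; the variable names are made up for the example.

```r
# Doubled backslashes: the usual way to write the regex as an R string.
pattern_escaped <- "(^\\d+?\\.?\\d{0,2})([A-Za-z]+|\\s[A-Za-z]+)"

# Raw string literal (R >= 4.0.0): each backslash is written only once.
pattern_raw <- r"((^\d+?\.?\d{0,2})([A-Za-z]+|\s[A-Za-z]+))"

# cat() prints the string as the regex engine receives it.
cat(pattern_escaped, "\n")

# Both literals produce the same string.
print(identical(pattern_escaped, pattern_raw))
```
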
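Everything seems to work fine in regex101, but when I apply the same pattern in R with the strapplyc function, it does not capture the whole string.

Example strings: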

50ml tomato sauce
5g chillies
5 Units tartar sauce
0.25 Units pasta sauce

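I want to get 50ml, 5g, 5 Units and 0.25 Units respectively.

In R, when I apply the pattern from the regex link above using strapplyc from the gsubfn library, my output is 50m, 5g, 5 U, 0.25 U. Here is my sample code:

a = c("Ingredient 1", "Ingredient 2", "Ingredient 3", "Ingredient 4")
b = c("50ml tomato sauce", "5g chillies", "5 Units tartar sauce", "0.25 Units pasta sauce")
consolidated <- data.frame(a, b)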

library(gsubfn)
pattern_reg2 <- "(^\\d+?\\.?\\d{0,2})(\\s?[A-Za-z]+)"
consolidated$c <- strapplyc(consolidated$b, pattern_reg2) 
#c column with the desired results

Any suggestions?

Answer 1

I'm not familiar with strapplyc, but it looks like it isn't working correctly here. Have you tried R's base regex functions?

library(RCurl)
#Load this webpage into a string so I can match the patterns you listed
test_file <- getURL("https://stackoverflow.com/questions/48798279/regex-working-in-regex101-not-in-r")
rgx = "(\\d+?\\.?\\d{0,2})([A-Za-z]+|\\s[A-Za-z]+)" #removed the ^ to allow whole string matching
rgx_result <- gregexpr(rgx,test_file)
result <- regmatches(test_file, rgx_result)
result[[1]][317:321] #only the answers from the strings you were asking to match

Returns:

[1] "50ml"     "5g"       "5 Units"  "25 Units" "50ml"

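This mostly works as expected, although the fourth match comes back as "25 Units" rather than the desired "0.25 Units". Is there a reason you need to use strapplyc?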
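If strapplyc is a requirement, here is one thing that may be worth trying. It is an assumption, not something confirmed in this thread: gsubfn can use the Tcl regex engine by default, and in Tcl's regex flavor a leading non-greedy quantifier such as \d+? makes the whole branch prefer the shortest match. That would be consistent with the 50m, 5g, 5 U, 0.25 U output. The sketch below swaps in a greedy \d+ and checks the result with base R; the name pattern_greedy is illustrative.

```r
# Example strings from the question.
b <- c("50ml tomato sauce", "5g chillies",
       "5 Units tartar sauce", "0.25 Units pasta sauce")

# Assumption: same idea as the question's pattern, but with a greedy \\d+
# instead of the lazy \\d+?, and one outer group around the whole quantity.
pattern_greedy <- "^(\\d+\\.?\\d{0,2}\\s?[A-Za-z]+)"

# Base R check: regexpr() locates the first match in each string and
# regmatches() extracts it (perl = TRUE selects the PCRE engine).
quantities <- regmatches(b, regexpr(pattern_greedy, b, perl = TRUE))
print(quantities)

# With gsubfn installed, the corresponding call to try would be:
#   gsubfn::strapplyc(b, pattern_greedy, simplify = c)
```

If the greedy pattern gives the full quantities with strapplyc as well, the lazy \d+? was the likely culprit rather than strapplyc itself.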

Added an example that works on a list:

test_list <- list('50ml tomato sauce','5g chillies',
           '5 Units tartar sauce',
           '0.25 Units pasta sauce')
for(i in seq_along(test_list)) {
    rgx_result <- gregexpr(rgx,test_list[[i]])
    print(regmatches(test_list[[i]], rgx_result))
}

I'm sure this could be done more cleanly with the apply functions, but I'm not very good with those.
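An apply-style version of the loop could look roughly like this. It is a sketch, not the answerer's code: the helper name extract_quantity is hypothetical, and it uses a greedy variant of the pattern with perl = TRUE instead of the rgx defined earlier.

```r
test_list <- list("50ml tomato sauce", "5g chillies",
                  "5 Units tartar sauce", "0.25 Units pasta sauce")

# Assumption: greedy variant of the pattern, without the ^ anchor.
rgx_greedy <- "\\d+\\.?\\d{0,2}\\s?[A-Za-z]+"

# Hypothetical helper: return the first match in s, or NA if there is none.
extract_quantity <- function(s) {
  m <- regmatches(s, regexpr(rgx_greedy, s, perl = TRUE))
  if (length(m) == 0) NA_character_ else m
}

# vapply() replaces the explicit for loop and guarantees a character vector.
result <- vapply(test_list, extract_quantity, character(1), USE.NAMES = FALSE)
print(result)
```

Using vapply rather than sapply keeps the return type fixed, so a string without a quantity shows up as NA instead of changing the shape of the result.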

Regex working in regex101 but not in R