Question

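I have a vector of strings. I want to extract the number that follows "# of Stalls: " in each one; the number sits either in the middle or at the end of the string.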

x <- c("1345 W. Pacific Coast Highway<br/>Wilmington 90710<br/><br/>County: Los Angeles<br/>Date Updated: 6/25/2013<br/>Latitude:-118.28079400<br/>Longitude:33.79077900<br/># of Stalls: 244<br/>Cost: Free", "20601 La Puente Ave<br/>Walnut 91789<br/>County: Los Angeles<br/>Date Updated: 6/18/2007<br/>Latitude: -117.859972<br/>Longitude: 34.017513<br/>Owner: Church<br/>Operator: Caltrans<br/># of Stalls: 40")

Here is my attempt, but it is not enough: it only removes the text before the number. Thanks for your help.

gsub(".*\\# of Stalls: ", "", x)

Answer 1

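Since this is HTML, you can use rvest or another HTML parser to extract the nodes you want first, which makes extracting the numbers trivial. For this kind of work, XPath selectors and functions are more flexible than CSS ones.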

library(rvest)

x %>% paste(collapse = '<br/>') %>% 
    read_html() %>% 
    html_nodes(xpath = '//text()[contains(., "# of Stalls:")]') %>% 
    html_text() %>% 
    readr::parse_number()
#> [1] 244  40
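
If loading an HTML parser is not an option, the same "isolate the field first, then parse the number" idea can be sketched in base R by splitting on the <br/> separators. This is a swapped-in technique, not the rvest approach above. It assumes the fields are always separated by a literal <br/> and that the label is exactly "# of Stalls:". The helper name extract_stalls and the shortened input are made up for illustration.

```r
# Shortened stand-in for the question's vector (assumption: same <br/> layout).
x <- c("1345 W. Pacific Coast Highway<br/>Wilmington 90710<br/># of Stalls: 244<br/>Cost: Free",
       "20601 La Puente Ave<br/>Walnut 91789<br/>Operator: Caltrans<br/># of Stalls: 40")

# Hypothetical helper: split one string into fields, keep the stalls field,
# then drop every character that is not a digit.
extract_stalls <- function(s) {
  fields <- strsplit(s, "<br/>", fixed = TRUE)[[1]]
  hit <- fields[startsWith(fields, "# of Stalls:")]
  if (length(hit) == 0) return(NA_integer_)  # no stalls field in this string
  as.integer(gsub("[^0-9]", "", hit[1]))
}

stalls <- vapply(x, extract_stalls, integer(1), USE.NAMES = FALSE)
stalls
```

Because each field is inspected on its own, digits elsewhere in the string (ZIP code, coordinates) cannot leak into the result.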

Answer 2

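Starting from the beginning of the string (^), we match one or more characters that are not # ([^#]+), followed by #, followed by zero or more characters that are not digits ([^0-9]*), followed by one or more digits ([0-9]+) captured as a group ((...)), followed by the remaining characters (.*). We then replace the whole match with the backreference (\\1) to the captured group.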

as.integer(sub("^[^#]+#[^0-9]*([0-9]+).*", "\\1", x))
#[1] 244  40

If the label is known more precisely, we can spell it out in the pattern:

as.integer(sub("^[^#]+# of Stalls:\\s+([0-9]+).*", "\\1", x))
#[1] 244  40
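
One caveat with the sub() approach: when the pattern does not match, sub() returns the input unchanged, so as.integer() then yields NA along with a coercion warning. A minimal sketch of a variant that returns NA quietly, using base R's regexec() and regmatches(), follows. The vector y and the helper name stalls_or_na are made up for illustration.

```r
# Hypothetical input: the second element has no "# of Stalls:" field.
y <- c("Owner: Church<br/># of Stalls: 40", "Owner: Church<br/>Cost: Free")

stalls_or_na <- function(s) {
  # regexec() reports the full match plus each capture group per element
  m <- regmatches(s, regexec("# of Stalls:\\s*([0-9]+)", s))
  # a non-matching element yields character(0); map that to NA
  vapply(m,
         function(g) if (length(g) >= 2) as.integer(g[2]) else NA_integer_,
         integer(1))
}

res <- stalls_or_na(y)
res
```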

Answer 3

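There are a number of ways to solve this; I will use the stringr package. The first str_extract fetches the values [1] "# of Stalls: 244" "# of Stalls: 40", and the second str_extract then pulls out the only numeric part of each of those strings.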

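However, it is not clear to me whether you want to extract the number or replace the matched text. If you want to extract, the code below will work for you. If you want to replace, you need to use str_replace.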

library(stringr)
as.integer(str_extract(str_extract(x,"#\\D*\\d{1,}"),"\\d{1,}"))

If you want to replace the matched text with the number, do this:

str_replace(x,"#\\D*(\\d{1,})","\\1")

Output:

Output of the extraction:

 > as.integer(str_extract(str_extract(x,"#\\D*\\d{1,}"),"\\d{1,}"))
    [1] 244  40

Output of the replacement:

> str_replace(x,"#\\D*(\\d{1,})","\\1")
[1] "1345 W. Pacific Coast Highway<br/>Wilmington 90710<br/><br/>County: Los Angeles<br/>Date Updated: 6/25/2013<br/>Latitude:-118.28079400<br/>Longitude:33.79077900<br/>244<br/>Cost: Free"    
[2] "20601 La Puente Ave<br/>Walnut 91789<br/>County: Los Angeles<br/>Date Updated: 6/18/2007<br/>Latitude: -117.859972<br/>Longitude: 34.017513<br/>Owner: Church<br/>Operator: Caltrans<br/>40"
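
To make the two-step extraction described above easier to follow, the sketch below stores each intermediate result. It also shows str_match, which returns capture groups directly, as a possible single-call alternative; that variant is not part of the original answer. The input is a shortened stand-in for the question's vector, and the variable names are made up for illustration.

```r
library(stringr)

# Shortened stand-in for the question's vector (assumption: same layout).
x <- c("Wilmington 90710<br/># of Stalls: 244<br/>Cost: Free",
       "Walnut 91789<br/>Operator: Caltrans<br/># of Stalls: 40")

step1 <- str_extract(x, "#\\D*\\d+")   # label plus number, e.g. "# of Stalls: 244"
step2 <- str_extract(step1, "\\d+")    # digits only, still character
stalls <- as.integer(step2)

# Alternative sketch: column 1 of str_match() is the full match,
# column 2 is the first capture group.
stalls_alt <- as.integer(str_match(x, "# of Stalls:\\s*(\\d+)")[, 2])
```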

Answer 4

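Here are some solutions. (1) and (1a) are variations of the code in the question. (2) and (2a) take the opposite approach: rather than removing what we don't want, they match what we do want.

1) gsub The code in the question removes the part before the number but not the part after it. We can modify it to do both at once. The |\\D.*$ part that we added does that. Note that "\\D" matches any non-digit.

as.integer(gsub(".*# of Stalls: |\\D.*$", "", xx))
## [1] 244  40

1a) sub Do this in two separate calls to sub. The inner sub is from the question, and the outer sub removes everything from the first non-digit after the number onward.

as.integer(sub("\\D.*$", "", sub(".*# of Stalls: ", "", xx)))
## [1] 244  40

2) strcapture With this approach, available in the development version of R, we can simplify the regular expression considerably. We specify a match with a capture group (the part in parentheses). strcapture returns the part corresponding to the capture group and creates a data.frame from it. The third argument is a prototype structure that tells it to return integers. Note that "\\d" matches any digit.

strcapture("# of Stalls: (\\d+)", xx, list(stalls = integer()))
##   stalls
## 1    244
## 2     40

2a) strapply The strapply function in the gsubfn package is similar to strcapture but uses an apply paradigm, where the first argument is the input string, the second is the pattern, and the third is the function applied to the capture group.

library(gsubfn)
strapply(xx, "# of Stalls: (\\d+)", as.integer, simplify = TRUE)
## [1] 244  40

Note: the input used, xx, is the same as x in the question.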
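
For readers who want to run the base R variants end to end, here is a minimal sketch. It assumes a shortened stand-in for xx that keeps the question's layout, including the leading street number. The result names r1, r1a and r2 are made up for illustration. strcapture ships in the utils package in current R releases, and the prototype is written as a data.frame here, which is the form its documentation describes.

```r
# Shortened stand-in for the question's vector (assumption: same field layout).
xx <- c("1345 W. Pacific Coast Highway<br/>Wilmington 90710<br/># of Stalls: 244<br/>Cost: Free",
        "20601 La Puente Ave<br/>Walnut 91789<br/>Operator: Caltrans<br/># of Stalls: 40")

# (1) one gsub call: remove the prefix, then everything from the first non-digit on
r1 <- as.integer(gsub(".*# of Stalls: |\\D.*$", "", xx))

# (1a) two sub calls: the inner one strips the prefix, the outer one strips the suffix
r1a <- as.integer(sub("\\D.*$", "", sub(".*# of Stalls: ", "", xx)))

# (2) strcapture: match what we want; the prototype fixes the column name and type
r2 <- strcapture("# of Stalls: (\\d+)", xx, data.frame(stalls = integer()))

r1
r1a
r2$stalls
```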

Extracting a number in the middle or at the end of a string in R
