How do I extract the first number from each string in a vector in R?

Asked: 2014-09-17 08:05:58

Tags: regex r vector

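I am new to regular expressions in R. I have a vector of strings, and I want to extract the first number that occurs in each string.

The vector is called "shootsummary" and looks like this.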

> head(shootsummary)
[1] Aaron Alexis, 34, a military veteran and contractor from Texas, opened fire in the Navy installation, killing 12 people and wounding 8 before being shot dead by police.                                         
[2] Pedro Vargas, 42, set fire to his apartment, killed six people in the complex, and held another two hostages at gunpoint before a SWAT team stormed the building and fatally shot him.                           
[3] John Zawahri, 23, armed with a homemade assault rifle and high-capacity magazines, killed his brother and father at home and then headed to Santa Monica College, where he was eventually killed by police.      
[4] Dennis Clark III, 27, shot and killed his girlfriend in their shared apartment, and then shot two witnesses in the building's parking lot and a third victim in another apartment, before being killed by police.
[5] Kurt Myers, 64, shot six people in neighboring towns, killing two in a barbershop and two at a car care business, before being killed by officers in a shootout after a nearly 19-hour standoff.  

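The first number in each string is the person's age. I want to extract those ages without mixing them up with the other numbers that appear later in the same string.

I tried: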

as.numeric(gsub("\\D", "", shootsummary))

The result is:

[1]  34128     42     23     27   6419  

I am looking for a result like the following, containing only the ages and none of the numbers that appear after them.

[1]  34     42     23     27   64

7 answers:

Answer 0 (score: 3)

stringi would be faster:

library(stringi)
stri_extract_first(shootsummary, regex="\\d+")
#[1] "34" "42" "23" "27" "64"

Answer 1 (score: 2)

One option is str_extract from stringr, wrapped in as.numeric.

> library(stringr)
> as.numeric(str_extract(shootsummary, "[0-9]+"))
# [1] 34 42 23 27 64

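Update: to answer the question you asked in the comments on this answer, here is a short explanation. For a full description of each function, see its help file.

  • str_extract returns the first match of the regular expression. It is vectorized over the character vector given as its first argument.
  • The regular expression [0-9]+ matches any character from '0' to '9', one or more times.
  • as.numeric converts the resulting character vector to a numeric vector.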
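
The three steps described above can also be traced in base R without stringr. This is only a minimal sketch, not the answer's own code: the vector `x` is a shortened, hypothetical stand-in for shootsummary, and regexpr() plus substring() are used here in place of str_extract().

```r
# Hypothetical, shortened stand-in for shootsummary
x <- c("Aaron Alexis, 34, ... killing 12 people and wounding 8 ...",
       "Kurt Myers, 64, ... after a nearly 19-hour standoff.")

# regexpr() locates only the first match of the pattern in each element
m <- regexpr("[0-9]+", x)

# Start position and length of that first run of digits
start <- as.integer(m)
len <- attr(m, "match.length")

# Cut the matched digits out of each string, then convert to numeric
ages <- as.numeric(substring(x, start, start + len - 1))
ages
```

Because only the first match is located, the later numbers (12, 8, 19) never enter the result.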

Answer 2 (score: 2)

You could try the following sub command:

> test
[1] "Aaron Alexis, 34, a military veteran and contractor from Texas, opened fire in the Navy installation, killing 12 people and wounding 8 before being shot dead by police."              
[2] "Pedro Vargas, 42, set fire to his apartment, killed six people in the complex, and held another two hostages at gunpoint before a SWAT team stormed the building and fatally shot him."
> sub("^\\D*(\\d+).*$", "\\1", test)
[1] "34" "42"

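Pattern explanation:

  • ^ asserts that we are at the start of the string.
  • \D* matches zero or more non-digit characters.
  • (\d+) then captures the following one or more digits into group 1 (the first number).
  • .* matches any character, zero or more times.
  • $ asserts that we are at the end of the string.
  • Finally, everything that was matched is replaced by the contents of group 1.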
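
One caveat with this pattern, sketched here under the assumption that some elements might contain no digits at all: sub() leaves a non-matching string unchanged, so the whole sentence comes back instead of a number. The vector `test2` and the grepl() guard are illustrative additions, not part of the answer.

```r
# Hypothetical input: the second element contains no digits
test2 <- c("Pedro Vargas, 42, set fire to his apartment", "no digits here")

extracted <- sub("^\\D*(\\d+).*$", "\\1", test2)
# The second element did not match, so sub() returned it unchanged

# Keep the result only where a digit exists; use NA elsewhere
first_num <- ifelse(grepl("\\d", test2), extracted, NA_character_)
as.numeric(first_num)
```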

Answer 3 (score: 1)

How about splitting on commas and taking the second field:

splitbycomma <- strsplit(shootsummary, ",")
as.numeric(  sapply(splitbycomma, "[", 2)  )

Answer 4 (score: 1)

R's regmatches() function, combined with regexpr(), returns a vector containing the first regex match in each element:

regmatches(shootsummary, regexpr("\\d+", shootsummary, perl=TRUE));
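
Note that regmatches() with regexpr() drops elements that have no match, so the result can be shorter than the input. A minimal sketch of one way to keep the lengths aligned follows; the vector `x` and the NA-filling step are illustrative assumptions, not part of the answer.

```r
# Hypothetical input: the middle element contains no digits
x <- c("Kurt Myers, 64, shot six people",
       "no digits here",
       "Dennis Clark III, 27, shot")

m <- regexpr("\\d+", x, perl = TRUE)
matches <- regmatches(x, m)   # only the matching elements are returned

# Pre-fill with NA and assign the matches to the positions that matched
first_num <- rep(NA_character_, length(x))
first_num[m != -1] <- matches
as.numeric(first_num)
```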

Answer 5 (score: 1)

You can use sub:

test <- ("xff 34 sfsdg 352 efsrg")

sub(".*?(\\d+).*", "\\1", test)
# [1] "34"

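How does the regex work?

. matches any character. The quantifier * means any number of occurrences. The ? makes .* lazy, so it matches only the characters up to the first occurrence of \\d (a digit). The quantifier + means one or more occurrences. The parentheses around \\d+ form the first capture group. It may be followed by further characters (.*). The second argument (\\1) replaces the whole string with the first capture group, i.e. the first number.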
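
To see why the ? matters, the following sketch compares the lazy pattern with a greedy variant on the same test string. It passes perl = TRUE, which the answer does not use, purely to make the quantifier semantics explicit; that choice is an assumption of this example.

```r
test <- "xff 34 sfsdg 352 efsrg"

# Lazy .*? stops as early as possible, so the group captures the first number
lazy <- sub(".*?(\\d+).*", "\\1", test, perl = TRUE)

# Greedy .* consumes as much as it can, leaving only the last digit for the group
greedy <- sub(".*(\\d+).*", "\\1", test, perl = TRUE)

c(lazy = lazy, greedy = greedy)
```

With the greedy pattern the capture is "2", the final digit of 352, while the lazy pattern yields "34".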

Answer 6 (score: 0)

You can do this nicely with the str_first_number() function from the strex package, or, for more general needs, with the str_nth_number() function.

pacman::p_load(strex)
shootsummary <- c("Aaron Alexis, 34, a military veteran and contractor ...",
                  "Pedro Vargas, 42, set fire to his apartment, killed six ...",
                  "John Zawahri, 23, armed with a homemade assault rifle ...",
                  "John Zawahri, 23, armed with a homemade assault rifle ...",
                  "Dennis Clark III, 27, shot and killed his girlfriend ...",
                  "Kurt Myers, 64, shot six people in neighboring ..."
)
str_first_number(shootsummary)
#> [1] 34 42 23 23 27 64
str_nth_number(shootsummary, n = 1)
#> [1] 34 42 23 23 27 64

Created on 2018-09-03 by the reprex package (v0.2.0).