Question

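I have a character vector that I need to clean up. Specifically, I want to remove the number that comes before the word "Votes". Note that the number uses a comma as a thousands separator, so it is easier to treat it as a string.

I know that gsub(".*Votes", "", text) will remove everything, but how do I remove just the number? Also, how do I collapse the repeated spaces into a single space?

Thanks for your help!

Sample data: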

text <- "STATE QUESTION NO. 1                       Amendment to Title 15 of the Nevada Revised Statutes Shall Chapter 202 of the Nevada Revised Statutes be amended to prohibit, except in certain circumstances, a person from selling or transferring a firearm to another person unless a federally-licensed dealer first conducts a federal background check on the potential buyer or transferee?                    558,586 Votes"

Answer 1

You can use

text <- "STATE QUESTION NO. 1                       Amendment to Title 15 of the Nevada Revised Statutes Shall Chapter 202 of the Nevada Revised Statutes be amended to prohibit, except in certain circumstances, a person from selling or transferring a firearm to another person unless a federally-licensed dealer first conducts a federal background check on the potential buyer or transferee?                    558,586 Votes"
trimws(gsub("(\\s){2,}|\\d[0-9,]*\\s*(Votes)", "\\1\\2", text))
# => [1] "STATE QUESTION NO. 1 Amendment to Title 15 of the Nevada Revised Statutes Shall Chapter 202 of the Nevada Revised Statutes be amended to prohibit, except in certain circumstances, a person from selling or transferring a firearm to another person unless a federally-licensed dealer first conducts a federal background check on the potential buyer or transferee? Votes"

See the online R demo and the online regex demo.

Details

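(\\s){2,} - matches 2 or more whitespace characters, capturing only the last one matched; the \1 placeholder in the replacement pattern puts it back
| - or
\\d - a digit
[0-9,]* - 0 or more digits or commas
\\s* - 0 or more whitespace characters
(Votes) - Group 2 (restored in the output by the \2 placeholder): the substring Votes.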

Note that trimws removes any leading/trailing whitespace.
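If the single combined pattern is hard to read, the same cleanup can be sketched as separate steps. This is only an illustration: the shortened sample string and the variable names no_number, collapsed and result are made up here, and the lookahead (?=Votes) requires perl = TRUE.

```r
# Shortened stand-in for the sample data (an assumption for illustration).
text <- "STATE QUESTION NO. 1      Amendment to Title 15?      558,586 Votes"

# Step 1: drop the number (digits and commas) plus the whitespace after it,
# but only where "Votes" follows. The lookahead keeps "Votes" in place.
no_number <- gsub("\\d[0-9,]*\\s*(?=Votes)", "", text, perl = TRUE)

# Step 2: collapse every run of 2 or more whitespace characters to one space.
collapsed <- gsub("\\s{2,}", " ", no_number)

# Step 3: remove any leading/trailing whitespace.
result <- trimws(collapsed)
print(result)
```

Splitting the work this way means the other numbers in the text (1 and 15 here) are left alone, because neither is followed by "Votes".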

Answer 2

The simplest way is with stringr:

> library(stringr)
> regexp <- "-?[[:digit:]]+\\.*,*[[:digit:]]*\\.*,*[[:digit:]]* Votes+"
> str_extract(text,regexp)
[1] "558,586 Votes"

To do the same thing but extract only the number, wrap it in gsub:

> gsub('\\s+[[:alpha:]]+', '', str_extract(text,regexp))
[1] "558,586"
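If the extracted string is later needed as a number, one possible follow-up is to strip the thousands separator before converting. This is a base R sketch; the shortened sample string and the names m and votes are assumptions for illustration.

```r
# Shortened stand-in for the sample data (an assumption for illustration).
text <- "potential buyer or transferee?      558,586 Votes"

# Pull out the digits-and-commas run that is directly followed by " Votes".
m <- regmatches(text, regexpr("[0-9][0-9,]*(?= Votes)", text, perl = TRUE))

# Remove the commas, then convert the remaining digits to a number.
votes <- as.numeric(gsub(",", "", m, fixed = TRUE))
print(votes)
```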

This base R version pulls out any number that precedes "Votes", even if it contains commas or periods:

> gsub('\\s+[[:alpha:]]+', '', unlist(regmatches (text,gregexpr("-?[[:digit:]]+\\.*,*[[:digit:]]*\\.*,*[[:digit:]]* Votes+",text) )) )
[1] "558,586"

If you want the label as well, just drop the gsub part:

> unlist(regmatches (text,gregexpr("-?[[:digit:]]+\\.*,*[[:digit:]]*\\.*,*[[:digit:]]* Votes+",text) )) 
[1] "558,586 Votes"

If you want to extract all the numbers:

> unlist(regmatches (text,gregexpr("-?[[:digit:]]+\\.*,*[[:digit:]]*\\.*,*[[:digit:]]*",text) ))
[1] "1"       "15"      "202"     "558,586"
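The snippets above extract the number. For removing it and collapsing the spaces, as the question asks, a stringr sketch might look like the following. It assumes a reasonably recent stringr that provides str_remove and str_squish; the shortened sample string and the name cleaned are made up for illustration.

```r
library(stringr)

# Shortened stand-in for the sample data (an assumption for illustration).
text <- "STATE QUESTION NO. 1      Amendment to Title 15?      558,586 Votes"

# str_remove drops the first match: the number and trailing whitespace
# that sit directly before "Votes" (the lookahead leaves "Votes" intact).
# str_squish then trims the ends and collapses internal whitespace runs.
cleaned <- str_squish(str_remove(text, "\\d[0-9,]*\\s*(?=Votes)"))
print(cleaned)
```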

Remove the string before a certain word with R
