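I have a character vector that I need to clean up. Specifically, I want to remove the number that comes before the word "Votes". Note that the number uses a comma as the thousands separator, so it is easier to treat it as a string.
I know that gsub(".*Votes", "", text) removes everything, but how do I remove just the number? Also, how do I collapse repeated spaces into a single space?
Thanks for your help!
Example data: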
text <- "STATE QUESTION NO. 1 Amendment to Title 15 of the Nevada Revised Statutes Shall Chapter 202 of the Nevada Revised Statutes be amended to prohibit, except in certain circumstances, a person from selling or transferring a firearm to another person unless a federally-licensed dealer first conducts a federal background check on the potential buyer or transferee? 558,586 Votes"
Answer 0 (score: 1)
You can use
text <- "STATE QUESTION NO. 1 Amendment to Title 15 of the Nevada Revised Statutes Shall Chapter 202 of the Nevada Revised Statutes be amended to prohibit, except in certain circumstances, a person from selling or transferring a firearm to another person unless a federally-licensed dealer first conducts a federal background check on the potential buyer or transferee? 558,586 Votes"
trimws(gsub("(\\s){2,}|\\d[0-9,]*\\s*(Votes)", "\\1\\2", text))
# => [1] "STATE QUESTION NO. 1 Amendment to Title 15 of the Nevada Revised Statutes Shall Chapter 202 of the Nevada Revised Statutes be amended to prohibit, except in certain circumstances, a person from selling or transferring a firearm to another person unless a federally-licensed dealer first conducts a federal background check on the potential buyer or transferee? Votes"
See the online R demo and the online regex demo.
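Details

- (\\s){2,} - matches 2 or more whitespace characters, capturing the last one matched so that the \1 placeholder in the replacement pattern reinserts it
- | - or
- \\d - a digit
- [0-9,]* - 0 or more digits or commas
- \\s* - 0 or more whitespace characters
- (Votes) - Group 2, the Votes substring, which the \2 placeholder restores in the output

Note that trimws removes all leading and trailing whitespace.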
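The same two tasks can also be done in separate steps, which some readers may find easier to follow than the single alternation. The following is a minimal sketch, not taken from the answer itself: it assumes a shortened, hypothetical sample string (sample_text) that contains repeated spaces, and it uses perl = TRUE so that a lookahead can leave "Votes" in place.

```r
# Shortened, hypothetical sample string with repeated spaces
sample_text <- "Shall  Chapter 202   be amended? 558,586 Votes"

# Step 1: drop the number (digits and commas) plus any whitespace
# that sits directly before "Votes"; the lookahead keeps "Votes" itself
no_number <- gsub("\\d[0-9,]*\\s*(?=Votes)", "", sample_text, perl = TRUE)

# Step 2: collapse every run of whitespace to a single space,
# then strip leading/trailing whitespace
cleaned <- trimws(gsub("\\s+", " ", no_number))

print(cleaned)
```

Other numbers in the string, such as 202, are left alone because they are not followed by "Votes".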
Answer 1 (score: 0)
The simplest way is with stringr:
> library(stringr)
> regexp <- "-?[[:digit:]]+\\.*,*[[:digit:]]*\\.*,*[[:digit:]]* Votes+"
> str_extract(text,regexp)
[1] "558,586 Votes"
To do the same thing but extract only the number, wrap it in gsub:
> gsub('\\s+[[:alpha:]]+', '', str_extract(text,regexp))
[1] "558,586"
This base R version pulls out every number that precedes "Votes", even if it contains commas or periods:
> gsub('\\s+[[:alpha:]]+', '', unlist(regmatches (text,gregexpr("-?[[:digit:]]+\\.*,*[[:digit:]]*\\.*,*[[:digit:]]* Votes+",text) )) )
[1] "558,586"
If you also want the label, drop the gsub part:
> unlist(regmatches (text,gregexpr("-?[[:digit:]]+\\.*,*[[:digit:]]*\\.*,*[[:digit:]]* Votes+",text) ))
[1] "558,586 Votes"
If you want to extract all numbers:
> unlist(regmatches (text,gregexpr("-?[[:digit:]]+\\.*,*[[:digit:]]*\\.*,*[[:digit:]]*",text) ))
[1] "1" "15" "202" "558,586"
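If the vote count is needed as a number rather than a string, the thousands separators have to be stripped before conversion. A minimal base R sketch, assuming a shortened hypothetical sample string and the illustrative names sample_text, votes_chr and votes_num (none of which come from the answer above):

```r
# Shortened, hypothetical sample string
sample_text <- "potential buyer or transferee? 558,586 Votes"

# Extract the run of digits and commas that is directly followed by " Votes"
votes_chr <- regmatches(
  sample_text,
  regexpr("[0-9][0-9,]*(?= Votes)", sample_text, perl = TRUE)
)

# Remove the commas literally, then convert to numeric
votes_num <- as.numeric(gsub(",", "", votes_chr, fixed = TRUE))

print(votes_chr)
print(votes_num)
```

Keeping the comma-separated string and the numeric value in separate variables makes it easy to use whichever form a later step needs.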