Inserting a special character around existing words in R

Date: 2017-03-02 08:57:52

Tags: r regex string stringr

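I am looking for an intuitive solution to my problem. I have a huge list of words, and I need to insert a special character according to certain criteria: whenever a two- or three-letter word occurs in a cell, I want to add "+" on both its left and right side.

Examples

global b2b banking would become global +b2b+ banking

how to finance commercial ale estate would become how +to+ finance commercial +ale+ estate

Here is a sample dataset: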

sample <- c("commercial funding",
            "global b2b banking",
            "how to finance commercial ale estate",
            "opening a commercial account",
            "international currency account",
            "miami imports banking",
            "hsbc supply chain financing",
            "international business expansion",
            "grow business in Us banking",
            "commercial trade Asia Pacific",
            "business line of credits hsbc",
            "Britain commercial banking",
            "fx settlement hsbc",
            "W Hotels")
data <- data.frame(sample)

Also, is it possible to remove the rows that contain a word of length 1? For example:

W Hotels

To get rid of all the single-letter words, I tried gsub:

gsub(" *\\b[[:alpha:]]{1,1}\\b *", " ", sample) 

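Rows like this should be removed from the dataset entirely.

Any help is much appreciated.

Edit 1

Thanks for the help. I have added a few more steps: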

sample <- c("commercial funding", "global b2b banking", "how to finance commercial ale estate", "opening a commercial account","international currency account","miami imports banking","hsbc supply chain financing","international business expansion","grow business in Us banking", "commercial trade Asia Pacific","business line of credits hsbc","Britain commercial banking","fx settlement hsbc", "W Hotels")
sample <- sample[!grepl("\\b[[:alpha:]]\\b",sample)]
sample <- gsub("\\b([[:alpha:][:digit:]]{2,3})\\b", "+\\1+", sample)
sample <- gsub(" ",",",sample)
sample <- gsub("+,","+",sample)
sample <- gsub(",+","+",sample)
sample <- tolower(sample)
sample <- ifelse(substr(sample, 1, 1) == "+", sub("^.", "", sample), sample)
data <- data.frame(sample)
data




                                          sample
1                             commercial++funding
2                          global+++b2b+++banking
3  how++++to+++finance++commercial+++ale+++estate
4                international++currency++account
5                         miami++imports++banking
6                  hsbc++supply++chain++financing
7              international++business++expansion
8             grow++business+++in++++us+++banking
9                commercial++trade++asia++pacific
10            business++line+++of+++credits++hsbc
11                   britain++commercial++banking
12                          fx+++settlement++hsbc

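Somehow I cannot get rid of the "+," sequences with gsub. What am I doing wrong? For example, "fx+,settlement,hsbc" should become "fx+settlement,hsbc", but instead the commas are replaced and extra + characters appear.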

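1 answer:

Answer 0 (score: 2)

You need to do this in two steps: first remove the items that contain a whole word of 1 letter, then add + around the 2-3 letter words.

Use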

sample <- c("commercial funding", "global b2b banking", "how to finance commercial ale estate", "opening a commercial account","international currency account","miami imports banking","hsbc supply chain financing","international business expansion","grow business in Us banking", "commercial trade Asia Pacific","business line of credits hsbc","Britain commercial banking","fx settlement hsbc", "W Hotels")
sample <- sample[!grepl("\\b[[:alnum:]]\\b",sample)]
sample <- gsub("\\b([[:alnum:]]{2,3})\\b", "+\\1+", sample)
data <- data.frame(sample)
data

See the R demo.

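sample[!grepl("\\b[[:alnum:]]\\b",sample)] removes the items that match a pattern made of a word boundary (\b), a single letter or digit ([[:alnum:]]), and another word boundary, that is, items containing a standalone one-character word.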
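As a minimal sketch of this filtering step on its own, the snippet below runs the same grepl pattern on a hypothetical three-element vector. The names x, has_single and keep are illustrative only and do not appear in the code above.

```r
# Hypothetical miniature input (not the full dataset).
x <- c("W Hotels", "opening a commercial account", "global b2b banking")

# TRUE for elements that contain a standalone one-character word.
# "b2b" does not match: there is no word boundary inside it.
has_single <- grepl("\\b[[:alnum:]]\\b", x)

# Keep only the elements without such a word.
keep <- x[!has_single]
print(keep)
# [1] "global b2b banking"
```

Note that this drops the whole element, whereas a gsub with the same pattern would only delete the one-character word and leave the rest of the string in place.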

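The gsub("\\b([[:alnum:]]{2,3})\\b", "+\\1+", sample) line matches every whole word of 2-3 alphanumeric characters, captures it in group 1, and replaces it with the same word wrapped in + signs.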

Result:

                                       sample
1                          commercial funding
2                        global +b2b+ banking
3  +how+ +to+ finance commercial +ale+ estate
4              international currency account
5                       miami imports banking
6                 hsbc supply chain financing
7            international business expansion
8             grow business +in+ +Us+ banking
9               commercial trade Asia Pacific
10            business line +of+ credits hsbc
11                 Britain commercial banking
12                       +fx+ settlement hsbc

Note that W Hotels and opening a commercial account have been filtered out.
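To see the wrapping step in isolation, here is a small sketch that applies the same gsub call to a hypothetical three-element vector. The names x and wrapped are assumptions made for illustration.

```r
# Hypothetical miniature input with 2- and 3-character words.
x <- c("global b2b banking", "fx settlement hsbc", "commercial funding")

# Capture each whole word of 2-3 alphanumeric characters and
# re-insert it (\\1) between two literal plus signs.
wrapped <- gsub("\\b([[:alnum:]]{2,3})\\b", "+\\1+", x)
print(wrapped)
# [1] "global +b2b+ banking" "+fx+ settlement hsbc" "commercial funding"
```

In the replacement string + has no special meaning, so it does not need escaping there; only the pattern argument is interpreted as a regex.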

Answer to the edit

You added some replacement operations to your code, but they replace literal strings, so you only need to pass the fixed=TRUE argument:

sample <- gsub(" ",",",sample, fixed=TRUE)
sample <- gsub("+,","+",sample, fixed=TRUE)
sample <- gsub(",+","+",sample, fixed=TRUE)

Otherwise, + is treated as a regex quantifier and has to be escaped in order to be treated as a literal plus sign.
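The following sketch compares the fixed=TRUE approach with two escaped regex variants on the string from the question. The variable names are illustrative; all three calls should give the same result.

```r
x <- "fx+,settlement,hsbc"

# Literal matching: "+," is searched for as plain text.
literal <- gsub("+,", "+", x, fixed = TRUE)

# Regex matching with the plus sign escaped by a backslash ...
escaped <- gsub("\\+,", "+", x)

# ... or placed inside a bracket expression, where it is literal too.
bracketed <- gsub("[+],", "+", x)

print(literal)
# [1] "fx+settlement,hsbc"
```

Which form to use is mostly a matter of taste; fixed=TRUE is arguably the simplest choice when no regex features are needed at all.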

Also, if you need to remove all the + characters from the start of the string, use

sample <- sub("^\\++", "", sample)
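As a quick illustration of what this pattern does, the sketch below applies it to a hypothetical vector with zero, one and two leading plus signs. The names x and stripped are assumptions.

```r
# Hypothetical inputs with different numbers of leading plus signs.
x <- c("+fx+ settlement hsbc", "++how+ to", "commercial funding")

# ^ anchors at the start; \\+ is a literal plus; the trailing + means
# "one or more", so the whole leading run is removed in one go.
stripped <- sub("^\\++", "", x)
print(stripped)
# [1] "fx+ settlement hsbc" "how+ to"             "commercial funding"
```

Unlike the ifelse/substr approach in the question, which strips a single leading character, this removes every leading + and leaves strings without one untouched.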