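I'm looking for an intuitive solution to my problem. I have a huge list of words and need to insert a special character based on certain criteria: whenever a two- or three-letter word appears in a cell, I want to add "+" on both sides of it.
Example: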
global b2b banking
would be converted to global +b2b+ banking
how to finance commercial ale estate
would be converted to how +to+ finance commercial +ale+ estate
Here is a sample dataset:
sample <- c("commercial funding",
"global b2b banking",
"how to finance commercial ale estate",
"opening a commercial account",
"international currency account",
"miami imports banking",
"hsbc supply chain financing",
"international business expansion",
"grow business in Us banking",
"commercial trade Asia Pacific",
"business line of credits hsbc",
"Britain commercial banking",
"fx settlement hsbc",
"W Hotels")
data <- data.frame(sample)
Also, is it possible to remove the rows that contain a word of length 1? For example:
W Hotels
To get rid of all single-letter words I tried gsub:
gsub(" *\\b[[:alpha:]]{1,1}\\b *", " ", sample)
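Entries like this should be removed from the dataset altogether.
Any help is greatly appreciated.
Edit 1
Thanks for the help. I have added a few more steps: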
sample <- c("commercial funding", "global b2b banking", "how to finance commercial ale estate", "opening a commercial account","international currency account","miami imports banking","hsbc supply chain financing","international business expansion","grow business in Us banking", "commercial trade Asia Pacific","business line of credits hsbc","Britain commercial banking","fx settlement hsbc", "W Hotels")
sample <- sample[!grepl("\\b[[:alpha:]]\\b",sample)]
sample <- gsub("\\b([[:alpha:][:digit:]]{2,3})\\b", "+\\1+", sample)
sample <- gsub(" ",",",sample)
sample <- gsub("+,","+",sample)
sample <- gsub(",+","+",sample)
sample <- tolower(sample)
sample <- ifelse(substr(sample, 1, 1) == "+", sub("^.", "", sample), sample)
data <- data.frame(sample)
data
sample
1 commercial++funding
2 global+++b2b+++banking
3 how++++to+++finance++commercial+++ale+++estate
4 international++currency++account
5 miami++imports++banking
6 hsbc++supply++chain++financing
7 international++business++expansion
8 grow++business+++in++++us+++banking
9 commercial++trade++asia++pacific
10 business++line+++of+++credits++hsbc
11 britain++commercial++banking
12 fx+++settlement++hsbc
Somehow I can't replace "+," with "+" using gsub. What am I doing wrong?
So "fx+,settlement,hsbc"
should become "fx+settlement,hsbc",
but the commas are being replaced as well, and extra ++ sequences appear.
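Answer 0 (score: 2)
You need to do this in two steps: first remove the items that contain a whole word consisting of a single character, then add + around the whole words of 2-3 characters.
Use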
sample <- c("commercial funding", "global b2b banking", "how to finance commercial ale estate", "opening a commercial account","international currency account","miami imports banking","hsbc supply chain financing","international business expansion","grow business in Us banking", "commercial trade Asia Pacific","business line of credits hsbc","Britain commercial banking","fx settlement hsbc", "W Hotels")
sample <- sample[!grepl("\\b[[:alnum:]]\\b",sample)]
sample <- gsub("\\b([[:alnum:]]{2,3})\\b", "+\\1+", sample)
data <- data.frame(sample)
data
See the R demo.
sample[!grepl("\\b[[:alnum:]]\\b",sample)]
removes the items that match the pattern: a word boundary (\b), a single letter or digit ([[:alnum:]]), and another word boundary.
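To see what the filter does on its own, here is a minimal sketch. The vector x is a hypothetical mini-sample made up for illustration, not the full dataset from the question.

```r
# Hypothetical mini-sample (not the full dataset from the question)
x <- c("W Hotels",
       "opening a commercial account",
       "global b2b banking",
       "fx settlement hsbc")

# TRUE for every element that contains a one-character whole word
has_single <- grepl("\\b[[:alnum:]]\\b", x)

# Keep only the elements without such a word
kept <- x[!has_single]

print(has_single)
print(kept)
```

Because the whole element is dropped by logical indexing, nothing is left behind, unlike a gsub call that only deletes the matched word.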
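gsub("\\b([[:alnum:]]{2,3})\\b", "+\\1+", sample)
replaces every whole word of 2-3 alphanumeric characters with the same word wrapped in + (\\1 in the replacement refers to the text captured by the parentheses).
Result: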
sample
1 commercial funding
2 global +b2b+ banking
3 +how+ +to+ finance commercial +ale+ estate
4 international currency account
5 miami imports banking
6 hsbc supply chain financing
7 international business expansion
8 grow business +in+ +Us+ banking
9 commercial trade Asia Pacific
10 business line +of+ credits hsbc
11 Britain commercial banking
12 +fx+ settlement hsbc
Note that W Hotels and opening a commercial account have been filtered out.
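If the two steps need to be reused, they could be wrapped in a small helper. This is only a sketch: the name mark_short_words is hypothetical and not part of the original code, and the input vector is a made-up subset of the sample data.

```r
# Hypothetical helper (the name is an assumption, not from the original code)
mark_short_words <- function(x) {
  # Step 1: drop elements containing a one-character whole word
  x <- x[!grepl("\\b[[:alnum:]]\\b", x)]
  # Step 2: wrap whole words of 2-3 alphanumeric characters in "+"
  gsub("\\b([[:alnum:]]{2,3})\\b", "+\\1+", x)
}

res <- mark_short_words(c("global b2b banking",
                          "W Hotels",
                          "business line of credits hsbc"))
print(res)
```

The order matters: filtering first means the second step never has to deal with one-character words.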
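Answer to the edit
You added some replacement operations to your code, but you are replacing literal strings, so all you need is to pass the fixed=TRUE argument: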
sample <- gsub(" ",",",sample, fixed=TRUE)
sample <- gsub("+,","+",sample, fixed=TRUE)
sample <- gsub(",+","+",sample, fixed=TRUE)
Otherwise, + is treated as a regex quantifier and must be escaped to be treated as a literal plus sign.
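As a quick illustration, the three calls below should be equivalent ways of matching a literal "+," in the string from the question. The variable names are made up for this sketch.

```r
x <- "+fx+,settlement,hsbc"

# Literal matching: no character in the pattern has a special meaning
lit <- gsub("+,", "+", x, fixed = TRUE)

# Regex matching with the plus sign escaped
esc <- gsub("\\+,", "+", x)

# Regex matching with the plus sign inside a bracket expression
cls <- gsub("[+],", "+", x)

print(c(lit, esc, cls))
```

In the replacement string only backslashes are special, so "+" needs no escaping there.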
Also, if you need to remove all + characters from the start of the string, use
sample <- sub("^\\++", "", sample)
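
Putting the pieces together, the whole pipeline from the edit might look like the sketch below. It runs on a hypothetical three-element vector x rather than on the full sample, and assumes the steps are applied in the same order as in the question.

```r
# Hypothetical mini-sample (not the full dataset from the question)
x <- c("global b2b banking", "fx settlement hsbc", "W Hotels")

x <- x[!grepl("\\b[[:alnum:]]\\b", x)]             # drop items with 1-char words
x <- gsub("\\b([[:alnum:]]{2,3})\\b", "+\\1+", x)  # wrap 2-3 char words in "+"
x <- gsub(" ", ",", x, fixed = TRUE)               # spaces become commas
x <- gsub("+,", "+", x, fixed = TRUE)              # literal "+," becomes "+"
x <- gsub(",+", "+", x, fixed = TRUE)              # literal ",+" becomes "+"
x <- tolower(x)
x <- sub("^\\++", "", x)                           # strip any leading "+"

print(x)
```

With this input the second element comes out as "fx+settlement,hsbc", which is the form asked for in the question.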