Remove a word from strings without affecting other names that contain that word

Time: 2017-06-13 12:28:43

Tags: r

CompanyName            Desired Output
Abbey Company.Com      abbey company
Manisd Company .com    manisd company
Idely.com              idely

How can I remove ".com" while making sure that the "com" in "Company" is not affected? I tried the following code:

     stopwords = c("limited"," l.c.", " llc","corporation"," &"," ltd.","llp ",
                      "l.l.c","incorporated","association","s.p.a"," l.p.","l.l.l.p","p.a  ","p.c  ",
                      "chtd  ","chtd.  ","r.l.l.l.p  ","rlllp  ", "the "," lmft", " inc.", ".com")

   file_new1$CompanyName<-gsub(paste0(stopwords,collapse = "|"),"", file_new1$CompanyName)

I have already consulted this link.


2 answers:

Answer 0: (score: 3)

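You can use gsub("\\.Com","",dt$CompanyName), assuming your data.table is called dt. Note that this pattern is case-sensitive, so it removes ".Com" but not ".com"; adding ignore.case = TRUE covers both.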
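As a minimal sketch of that call on the sample names from the question: the vector `company` below is a hypothetical stand-in for dt$CompanyName, and the `$` anchor, `ignore.case = TRUE`, `tolower()` and `trimws()` steps are additions of mine to reach the desired output, not part of the answer above.

```r
# Hypothetical stand-in for dt$CompanyName, taken from the question's sample.
company <- c("Abbey Company.Com", "Manisd Company .com", "Idely.com")

# "\\." matches a literal dot, so the "Com" inside "Company" is left alone.
# ignore.case = TRUE covers ".Com" and ".com"; "$" restricts the match
# to the end of the name.
cleaned <- gsub("\\.com$", "", company, ignore.case = TRUE)

# Lower-case the result and trim the space left behind in "Manisd Company ".
cleaned <- trimws(tolower(cleaned))
print(cleaned)
```

Anchoring at the end of the string is a design choice: it keeps a ".com" in the middle of a name intact, which may or may not be what you want for your data.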

Update

Another solution might be to keep only the "stuff" before the dot (".").

So:

CompanyName <- data.table(V1=c("Abbey Company.Com", "Manisd Company .com", "Idely.com"))

> CompanyName
                    V1
1:   Abbey Company.Com
2: Manisd Company .com
3:           Idely.com

CompanyName$V1 <- sel_strsplit(CompanyName$V1,"\\.",1)
> CompanyName
                V1
1:   Abbey Company
2: Manisd Company 
3:           Idely

With this approach you do not need to care whether the suffix is ".com", ".co.uk", or some other variant.
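sel_strsplit is not a base R function, and the answer does not say where it comes from. A rough base R sketch of the same idea (keep everything before the first dot) could look like this; the variable names are hypothetical, and the `trimws()` call is an extra step the answer does not perform.

```r
# Sample names from the question.
company <- c("Abbey Company.Com", "Manisd Company .com", "Idely.com")

# Drop the first literal dot and everything after it.
before_dot <- sub("\\..*$", "", company)

# Remove the trailing space left in "Manisd Company ".
before_dot <- trimws(before_dot)
print(before_dot)
```

Be aware that this also truncates names that contain a dot earlier in the string (initials or abbreviations, for example), so it only suits data where the first dot always starts the suffix.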

Answer 1: (score: 3)

If you have:

CompanyName <- c("Abbey Company.Com", "Manisd Company .com", "Idely.com")

you can try:

gsub(paste0(gsub("\\.","\\\\.",stopwords),collapse = "|"),"",
     tolower(CompanyName))
#[1] "abbey company"   "manisd company " "idely"
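The inner gsub is what makes this work: it turns every "." in the stop words into "\\.", so the regex engine treats it as a literal dot rather than "any character". The sketch below contrasts the two behaviours; the shortened `stopwords` vector is a hypothetical subset of the question's list, and the final `trimws()` is an addition to drop the trailing space visible in the output above.

```r
# Sample names from the question.
company <- c("Abbey Company.Com", "Manisd Company .com", "Idely.com")

# Hypothetical, shortened stop-word list for illustration.
stopwords <- c(" inc.", " llc", ".com")

# Unescaped: "." matches any character, so ".com" also matches " com"
# at the start of " company" and "abbey company.com" becomes "abbeypany".
naive <- gsub(paste0(stopwords, collapse = "|"), "", tolower(company))

# Escaped: each "." becomes "\\.", so only a literal dot is matched.
escaped <- gsub("\\.", "\\\\.", stopwords)
pattern <- paste0(escaped, collapse = "|")
cleaned <- trimws(gsub(pattern, "", tolower(company)))

print(naive)
print(cleaned)
```

Lower-casing the input first means the stop words only need to be listed in lower case.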