I have the following data frame in R:
Id titles
1 emami paper mills slips 10% on dismal q4 numbers
2 jsw steel q4fy17 standalone net profit rises 173.33%
3 fmcg major hul q4fy17 standalone net profit rises 6.2
4 chennai petroleum, allsec tech slip 6-7% on poor q4
I also have company names in a vector:
names <- c("emami ltd","jsw steel ltd","abc","hul india ltd","tcs","chennai petroleum corp ltd")
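I want to match the titles column of the data frame against the strings in the vector and put the corresponding string in a new column. My desired data frame is: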
Id titles names
1 emami paper mills slips 10% on dismal q4 numbers emami ltd
2 jsw steel q4fy17 standalone net profit rises 173.33% jsw steel ltd
3 fmcg major hul q4fy17 standalone net profit rises 6.2 hul india ltd
4 chennai petroleum, allsec tech slip 6-7% on poor q4 chennai petroleum corp ltd
I am using the following code, but it does not give me what I want.
df[grepl(paste(names, collapse="|"), df$titles),]
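How can I do this in R?
Answer 0 (score: 2)
If I understand correctly, you can use base R's gregexpr and regmatches together with gsub to accomplish this.
Data (edited after the OP changed the question):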
options(stringsAsFactors = F)
df <- data.frame(titles = c("emami paper mills slips 10% on dismal q4 numbers",
"jsw steel q4fy17 standalone net profit rises 173.33%",
"fmcg major hul q4fy17 standalone net profit rises 6.2",
"chennai petroleum, allsec tech slip 6-7% on poor q4"),stringsAsFactors = F)
names <- c("emami ltd","jsw steel ltd","abc","hul india ltd","tcs","chennai petroleum corp ltd")
Regex:
library(dplyr)  # for left_join()
# Keep only the first word of each name as the search key
newnames <- gsub("^(\\w+).*","\\1",names)
# Extract every key that occurs in each title
regmat <- regmatches(df$titles,gregexpr(paste0(newnames,collapse="|"),df$titles))
# Titles without any match get NA
regmat[lengths(regmat) == 0] <- NA
df <- data.frame(cbind(df,newnames =do.call("rbind",regmat)),stringsAsFactors = F)
# Lookup table from search key back to the full name
df1 <- data.frame(names=names,newnames=newnames,stringsAsFactors = F)
left_join(df,df1,by="newnames")
You can also use the stringr library, as below:
library(dplyr)    # for left_join()
library(stringr)
newnames <- str_replace(names,"^(\\w+).*","\\1")
df$newnames <- str_extract(df$titles,paste0(newnames,collapse="|"))
df1 <- data.frame(names=names,newnames=newnames,stringsAsFactors = F)
left_join(df,df1,by="newnames")
Output:
> left_join(df,df1,by="newnames")
titles newnames names
1 emami paper mills slips 10% on dismal q4 numbers emami emami ltd
2 jsw steel q4fy17 standalone net profit rises 173.33% jsw jsw steel ltd
3 fmcg major hul q4fy17 standalone net profit rises 6.2 hul hul india ltd
4 chennai petroleum, allsec tech slip 6-7% on poor q4 chennai chennai petroleum corp ltd
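For comparison, here is a minimal base-R sketch of the same first-word lookup without dplyr or stringr. It is not part of the answer above: the variable names (titles, company_names, keys, found) are illustrative, and the word-boundary anchors (\\b) are an added assumption so that a short key such as "abc" cannot match inside a longer word.

```r
titles <- c("emami paper mills slips 10% on dismal q4 numbers",
            "jsw steel q4fy17 standalone net profit rises 173.33%",
            "fmcg major hul q4fy17 standalone net profit rises 6.2",
            "chennai petroleum, allsec tech slip 6-7% on poor q4")
company_names <- c("emami ltd", "jsw steel ltd", "abc", "hul india ltd", "tcs",
                   "chennai petroleum corp ltd")

# The first word of each company name serves as the search key
keys <- sub("^(\\w+).*", "\\1", company_names)

# One alternation pattern; \\b keeps keys from matching inside longer words
pattern <- paste0("\\b(", paste(keys, collapse = "|"), ")\\b")

# regexpr() returns -1 where nothing matches; regmatches() silently drops
# those positions, so the non-matches are filled with NA by hand
hit <- regexpr(pattern, titles, perl = TRUE)
found <- rep(NA_character_, length(titles))
found[hit > 0] <- regmatches(titles, hit)

# Map each found key back to its full company name
result <- data.frame(titles = titles,
                     names = company_names[match(found, keys)],
                     stringsAsFactors = FALSE)
print(result)
```

Only the first key found in each title is used, so a title that mentions two companies would get just one name in this sketch.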
Answer 1 (score: 0)
Remove "ltd" from your names:
names <- gsub(" ltd","",names)
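As a rough check of how far stripping the suffix gets you on its own, the sketch below tests whether each stripped name occurs literally in any title. The variable names are illustrative, and the `$` anchor (strip only a trailing " ltd") is an addition in this sketch, not part of the line above.

```r
titles <- c("emami paper mills slips 10% on dismal q4 numbers",
            "jsw steel q4fy17 standalone net profit rises 173.33%",
            "fmcg major hul q4fy17 standalone net profit rises 6.2",
            "chennai petroleum, allsec tech slip 6-7% on poor q4")
company_names <- c("emami ltd", "jsw steel ltd", "abc", "hul india ltd", "tcs",
                   "chennai petroleum corp ltd")

# Strip a trailing " ltd" only
stripped <- sub(" ltd$", "", company_names)

# For each stripped name: does it occur as a literal substring of any title?
matched <- vapply(stripped,
                  function(key) any(grepl(key, titles, fixed = TRUE)),
                  logical(1))
print(matched)
```

With the sample data only "emami" and "jsw steel" still occur in a title; "hul india" and "chennai petroleum corp" do not, which is why the other answers reduce each name to its first word.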
Answer 2 (score: 0)
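You can also use sqldf for this kind of "fuzzy" merge.
Build the lookup:
names <- data.frame(name = c("emami ltd","jsw steel ltd","abc","hul india ltd","tcs"))
names$lookup <- gsub("(\\w+).*", "\\1", names$name)
Perform the merge:
library(sqldf)
res <- sqldf("SELECT l.*, r.name
FROM df as l
LEFT JOIN names as r
ON l.titles LIKE '%'||r.lookup||'%'")
Some notes: I extract the first word of each name for the lookup, because the titles only say "hul" rather than "hul india". Also, in SQL, || means concatenation and % is a wildcard (it matches anything), so this matches whenever a lookup value appears anywhere in the text, regardless of what comes before or after it.
Another option, using Reduce and then merge, is:
df$lookup <- Reduce( function(x, y) {x[grepl(y,x)] <- y; x}, c(list(df$titles), names$lookup))
merge(df, names)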
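To illustrate what the LIKE '%'||r.lookup||'%' condition in the sqldf query above tests, here is a small base-R sketch of the same "lookup appears anywhere in the title" check. It is only an analogy, not what sqldf runs internally, and the variable names (titles, lookup, hits) are illustrative.

```r
titles <- c("emami paper mills slips 10% on dismal q4 numbers",
            "jsw steel q4fy17 standalone net profit rises 173.33%",
            "fmcg major hul q4fy17 standalone net profit rises 6.2")
lookup <- c("emami", "jsw", "abc", "hul", "tcs")

# '%' || key || '%' means "key preceded and followed by anything", i.e. a
# plain substring test; fixed = TRUE treats the key literally, not as a regex
hits <- sapply(lookup, function(key) grepl(key, titles, fixed = TRUE))

# Rows are titles, columns are lookup keys; TRUE marks a pair the join would keep
print(hits)
```

Note that SQLite's LIKE is case-insensitive for ASCII letters by default, whereas grepl(fixed = TRUE) is case-sensitive; the sample data is all lowercase, so the difference does not show here.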
Answer 3 (score: 0)
To add to the previous answers, I wrote a function that incorporates some of the earlier comments:
df <- data.frame(title=c("emami paper mills slips 10% on dismal q4 numbers",
                         "jsw steel q4fy17 standalone net profit rises 173.33%",
                         "fmcg major hul q4fy17 standalone net profit rises 6.2"),
                 stringsAsFactors = FALSE)
names <- c("emami ltd","jsw steel ltd","abc","hul india ltd","tcs")
find_string <- function(data,names){
  ### Clean the names: keep only the first word of each
  newnames <- gsub("^(\\w+).*","\\1",names)
  ### Start with NA so titles without a match stay missing
  data$names <- NA_character_
  ### Loop over the names to find which titles contain each one
  for(i in seq_along(newnames)){
    hits <- grep(newnames[i],data$title)
    if(length(hits) != 0){
      ### Store the full name, as the question asks
      data$names[hits] <- names[i]
    }else{
      print(paste(names[i],"not found in the data!"))
    }
  }
  return(data)
}
### Run the function
find_string(df,names)
Hope this helps!