How to match a data frame column against a vector of strings in R

Time: 2017-05-17 12:41:38

Tags: r

I have the following data frame in R:
Id    titles
1     emami paper mills slips 10% on dismal q4 numbers
2     jsw steel q4fy17 standalone net profit rises 173.33%
3     fmcg major hul q4fy17 standalone net profit rises 6.2
4     chennai petroleum, allsec tech slip 6-7% on poor q4

I also have names in a vector:

names <- c("emami ltd","jsw steel ltd","abc","hul india ltd","tcs","chennai petroleum corp ltd")

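I want to match the data frame column titles against the strings in the vector and put the corresponding string in a new column. My desired data frame is: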

 Id    titles                                                    names
1     emami paper mills slips 10% on dismal q4 numbers           emami ltd
2     jsw steel q4fy17 standalone net profit rises 173.33%       jsw steel ltd
3     fmcg major hul q4fy17 standalone net profit rises 6.2      hul india ltd
4     chennai petroleum, allsec tech slip 6-7% on poor q4        chennai petroleum corp ltd

I am using the following code, but it does not give me what I want:

df[grepl(paste(names, collapse="|"), df$titles),]

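How can I do this in R?

4 answers:

Answer 0: (score: 2)

If I understand correctly, you can use base R's gregexpr together with regmatches and gsub to accomplish your task.

Data: edited after the OP changed the question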

options(stringsAsFactors = F)
df <- data.frame(titles = c("emami paper mills slips 10% on dismal q4 numbers",
                            "jsw steel q4fy17 standalone net profit rises 173.33%",
                            "fmcg major hul q4fy17 standalone net profit rises 6.2",
                            "chennai petroleum, allsec tech slip 6-7% on poor q4"),stringsAsFactors = F)

names <- c("emami ltd","jsw steel ltd","abc","hul india ltd","tcs","chennai petroleum corp ltd")

Regex:

library(dplyr)
library(stringr)

newnames <- gsub("^(\\w+).*","\\1",names)
regmat <- regmatches(df$titles,gregexpr(paste0(newnames,collapse="|"),df$titles))
regmat[lapply(regmat,length) == 0] <- NA
df <- data.frame(cbind(df,newnames =do.call("rbind",regmat)),stringsAsFactors = F)
df1 <- data.frame(names=names,newnames=newnames,stringsAsFactors = F)
left_join(df,df1,by="newnames")

You can also use the stringr library, as below:

library(stringr)
newnames <- str_replace(names,"^(\\w+).*","\\1")
df$newnames <- str_extract(df$titles,paste0(newnames,collapse="|"))
df1 <- data.frame(names=names,newnames=newnames,stringsAsFactors = F)
left_join(df,df1,by="newnames")

Output:

    > left_join(df,df1,by="newnames")
                                                 titles newnames                      names
1      emami paper mills slips 10% on dismal q4 numbers    emami                  emami ltd
2  jsw steel q4fy17 standalone net profit rises 173.33%      jsw              jsw steel ltd
3 fmcg major hul q4fy17 standalone net profit rises 6.2      hul              hul india ltd
4   chennai petroleum, allsec tech slip 6-7% on poor q4  chennai chennai petroleum corp ltd
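
The lookup-and-join step above can also be sketched in base R alone, using match() in place of left_join. This is only an illustration under stated assumptions: the variable names (company_names, keys, found, result) are made up for the example, and the \\b word boundaries are an addition not present in the answer, included so that a short key cannot match inside a longer word.

```r
# Data from the question (the vector is renamed to avoid masking base::names)
titles <- c("emami paper mills slips 10% on dismal q4 numbers",
            "jsw steel q4fy17 standalone net profit rises 173.33%",
            "fmcg major hul q4fy17 standalone net profit rises 6.2",
            "chennai petroleum, allsec tech slip 6-7% on poor q4")
company_names <- c("emami ltd", "jsw steel ltd", "abc",
                   "hul india ltd", "tcs", "chennai petroleum corp ltd")

# First word of each name serves as the search key
keys <- sub("^(\\w+).*", "\\1", company_names)

# One alternation pattern, anchored on word boundaries (an assumption)
pattern <- paste0("\\b(", paste(keys, collapse = "|"), ")\\b")

# regexpr() returns -1 where nothing matches, so fill those rows with NA
m <- regexpr(pattern, titles, perl = TRUE)
found <- rep(NA_character_, length(titles))
found[m > 0] <- regmatches(titles, m)

# Map each found key back to the full name
result <- data.frame(titles = titles,
                     names = company_names[match(found, keys)],
                     stringsAsFactors = FALSE)
print(result)
```

Titles without any matching key keep NA in the names column, which mirrors what the left join produces.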

Answer 1: (score: 0)

Remove "ltd" from your names:

names <- gsub(" ltd","",names)
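
As a rough check of how far this suggestion gets on the question's data, the sketch below strips " ltd" and then looks for each shortened name as a literal substring of every title. The helper first_hit and the variable names are hypothetical, introduced only for this illustration.

```r
titles <- c("emami paper mills slips 10% on dismal q4 numbers",
            "jsw steel q4fy17 standalone net profit rises 173.33%",
            "fmcg major hul q4fy17 standalone net profit rises 6.2",
            "chennai petroleum, allsec tech slip 6-7% on poor q4")
company_names <- c("emami ltd", "jsw steel ltd", "abc",
                   "hul india ltd", "tcs", "chennai petroleum corp ltd")

# Drop the " ltd" suffix, as suggested above
short <- gsub(" ltd", "", company_names)

# Hypothetical helper: first shortened name that occurs literally in a title
first_hit <- function(title) {
  hits <- short[vapply(short, function(p) grepl(p, title, fixed = TRUE), logical(1))]
  if (length(hits) > 0) hits[1] else NA_character_
}

matched <- vapply(titles, first_hit, character(1), USE.NAMES = FALSE)
print(matched)
```

On this data only the first two titles find a match: "hul india" and "chennai petroleum corp" do not occur verbatim in their titles, so stripping the suffix alone may not be enough.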

Answer 2: (score: 0)

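You could also use sqldf for this kind of "fuzzy" merge.

Build the lookup:

names <- data.frame(name = c("emami ltd","jsw steel ltd","abc","hul india ltd","tcs"))
names$lookup <- gsub("(\\w+).*", "\\1", names$name)

Perform the merge:

library(sqldf)
res <- sqldf("SELECT l.*, r.name
       FROM df as l
       LEFT JOIN names as r
       ON l.titles LIKE '%'||r.lookup||'%'")

Some notes: I extract the first word of each name for the lookup, because your titles only say "hul" rather than "hul india". Also, in SQL, || means concatenation and % is a wildcard (it matches anything), so this matches whenever a lookup appears anywhere in the text, regardless of what comes before or after it.

Another option, using Reduce and then merging, is:

df$lookup <- Reduce( function(x, y) {x[grepl(y,x)] <- y; x}, c(list(df$titles), names$lookup))
merge(df, names)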
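
To make the Reduce call easier to follow, it can be unrolled into a plain loop. This is a sketch on the three-title data used in this answer; the variable names (titles, lookup, x, reduced) are chosen for the illustration only.

```r
titles <- c("emami paper mills slips 10% on dismal q4 numbers",
            "jsw steel q4fy17 standalone net profit rises 173.33%",
            "fmcg major hul q4fy17 standalone net profit rises 6.2")
lookup <- c("emami", "jsw", "abc", "hul", "tcs")

# Loop form: every title that contains the current key is overwritten by it
x <- titles
for (y in lookup) {
  x[grepl(y, x)] <- y
}

# Reduce form from the answer: start from the titles, fold in one key at a time
reduced <- Reduce(function(x, y) { x[grepl(y, x)] <- y; x },
                  c(list(titles), lookup))

print(x)
print(identical(x, reduced))
```

Two things are worth keeping in mind with this approach: a title that matches no key keeps its original text, and later keys are tested against values that may already have been replaced, so a key that happens to be a substring of an earlier key could overwrite it.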


Answer 3: (score: 0)

To add to the previous answers, I have written a function that incorporates some of the earlier comments:

df <- data.frame(title=c("emami paper mills slips 10% on dismal q4 numbers",
                         "jsw steel q4fy17 standalone net profit rises 173.33%",
                         "fmcg major hul q4fy17 standalone net profit rises 6.2"),
                 stringsAsFactors = FALSE)


names <- c("emami ltd","jsw steel ltd","abc","hul india ltd","tcs")

find_string <- function(data,names){

    ### Clean the names: keep only the first word of each name
    newnames <- gsub("^(\\w+).*","\\1",names)

    ### Start with an empty column so that assignment by index is safe
    data$names <- NA_character_

    ### Loop over the names to find which titles contain each one
    for(i in seq_along(newnames)){

        hits <- grep(newnames[i],data$title)

        if(length(hits) != 0){
            ### Stores the short key; use names[i] here to store the full name
            data$names[hits] <- newnames[i]

        }else{
            print(paste(names[i],"not found in the data!"))
        }
    }
    return(data)
}

### Run the function

find_string(df,names)

Hope this helps!