R - Turning unstructured Rosette API results into structured data with good accuracy

Time: 2018-07-13 09:26:03

Tags: r nlp text-processing

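I am currently working with a dataset of people / key-contact data. I send URLs to the Rosette API (entity detection), and the response is parsed into separate entities together with their positions in the text. I tried the following, but the accuracy is not high.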

     website                                          website                                          web             type                 mention
1 http://www.sheibanigroup.com               http://www.sheibanigroup.com/#           PERSON Ahmed Abdullah Sheibani
2 http://www.sheibanigroup.com               http://www.sheibanigroup.com/#           PERSON       Abubaker Sheibani
3 http://www.sheibanigroup.com               http://www.sheibanigroup.com/#            TITLE           Vice Chairman
4 http://www.sheibanigroup.com               http://www.sheibanigroup.com/#            TITLE       Managing Director
5 http://www.sheibanigroup.com               http://www.sheibanigroup.com/# IDENTIFIER:EMAIL  info@sheibanigroup.com
6 http://www.kinconsortium.com http://www.kinconsortium.com/structured.html            TITLE               directors
               normalized count mentionOffsets.startOffset mentionOffsets.endOffset positions
1 Ahmed Abdullah Sheibani     4                        921                      944        NA
2       Abubaker Sheibani     2                       1205                     1222       261
3           Vice Chairman     1                       1231                     1244         9
4       Managing Director     1                       1255                     1272        11
5  info@sheibanigroup.com     1                       2232                     2254       960
6               directors     1                       1729                     1738        NA

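What I did was compute the position difference between consecutive mentions (the start offset of a mention minus the end offset of the previous one) and pair up entities based on that difference. This of course introduces some errors. Here is the code: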

    library(dplyr)

    # Gap between the start of each mention and the end of the previous one, per page
    NLP2 <- NLP2 %>%
      group_by(web) %>%
      arrange(web, mentionOffsets.startOffset) %>%
      mutate(positions = mentionOffsets.startOffset - lag(mentionOffsets.endOffset))

    # Drop e-mail entities, then find mentions that start within 15 characters of the previous one
    g <- NLP2[NLP2$type != "IDENTIFIER:EMAIL", ]
    inds <- which(g$positions < 15 & g$positions > 0)
    rows <- lapply(inds, function(f) (f - 1):f)

    # Pair each such mention with the row directly above it
    final_officers <- data.frame()
    for (i in seq_along(rows)) {
      l <- g[rows[[i]], ][1, c(1:5, 9)]
      p <- g[rows[[i]], ][2, c(4:5, 9)]
      intermediate_officers <- data.frame(website = l$website, web = l$web, About = l$About, type = l$type, mention = l$mention, type2 = p$type, mention2 = p$mention, position1 = l$mentionOffsets.endOffset, position2 = p$mentionOffsets.endOffset)
      final_officers <- rbind(final_officers, intermediate_officers)
      print(i)
    }
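
As one possible way to tighten the pairing, the sketch below redoes the gap computation without the loop. It is only an illustration under assumptions that are not part of the code above: the names `mentions`, `max_gap` and `pairs` are hypothetical, and so is the rule that a pair must be a PERSON immediately followed by a TITLE. The 15-character threshold is taken from the loop, and the toy data is copied from the sample rows. E-mail rows are dropped before the gaps are measured, so each gap refers to the row it is actually paired with.

```r
library(dplyr)

# Toy data copied from the sample rows shown earlier (only the columns needed here).
mentions <- data.frame(
  web = c(rep("http://www.sheibanigroup.com/#", 5),
          "http://www.kinconsortium.com/structured.html"),
  type = c("PERSON", "PERSON", "TITLE", "TITLE", "IDENTIFIER:EMAIL", "TITLE"),
  mention = c("Ahmed Abdullah Sheibani", "Abubaker Sheibani", "Vice Chairman",
              "Managing Director", "info@sheibanigroup.com", "directors"),
  mentionOffsets.startOffset = c(921, 1205, 1231, 1255, 2232, 1729),
  mentionOffsets.endOffset   = c(944, 1222, 1244, 1272, 2254, 1738),
  stringsAsFactors = FALSE
)

max_gap <- 15  # assumption: same threshold as in the loop above

pairs <- mentions %>%
  filter(type != "IDENTIFIER:EMAIL") %>%            # drop e-mails BEFORE measuring gaps
  group_by(web) %>%
  arrange(mentionOffsets.startOffset, .by_group = TRUE) %>%
  mutate(
    prev_type    = lag(type),                       # entity type of the preceding mention
    prev_mention = lag(mention),                    # text of the preceding mention
    gap          = mentionOffsets.startOffset - lag(mentionOffsets.endOffset)
  ) %>%
  ungroup() %>%
  # assumption: only keep a PERSON directly followed by a nearby TITLE
  filter(!is.na(gap), gap > 0, gap < max_gap,
         prev_type == "PERSON", type == "TITLE") %>%
  select(web, person = prev_mention, title = mention, gap)

print(pairs)
```

On the six sample rows this keeps a single pair, Abubaker Sheibani with Vice Chairman (gap 9). The Managing Director row is dropped because the mention before it is another TITLE, which the plain gap rule would have paired anyway.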

I am trying to wrap my head around how to improve the accuracy. Any suggestions would be helpful.

0 answers:

No answers yet