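I am currently working with a dataset of officers/key contacts. I send URLs to the Rosette API (entity detection), and the response is parsed into separate entities together with their positions. I tried the approach below, but the accuracy is not great.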
website website web type mention
1 http://www.sheibanigroup.com               http://www.sheibanigroup.com/#           PERSON Ahmed Abdullah Sheibani
2 http://www.sheibanigroup.com               http://www.sheibanigroup.com/#           PERSON       Abubaker Sheibani
3 http://www.sheibanigroup.com               http://www.sheibanigroup.com/#            TITLE           Vice Chairman
4 http://www.sheibanigroup.com               http://www.sheibanigroup.com/#            TITLE       Managing Director
5 http://www.sheibanigroup.com               http://www.sheibanigroup.com/# IDENTIFIER:EMAIL  info@sheibanigroup.com
6 http://www.kinconsortium.com http://www.kinconsortium.com/structured.html            TITLE               directors
               normalized count mentionOffsets.startOffset mentionOffsets.endOffset positions
1 Ahmed Abdullah Sheibani     4                        921                      944        NA
2       Abubaker Sheibani     2                       1205                     1222       261
3           Vice Chairman     1                       1231                     1244         9
4       Managing Director     1                       1255                     1272        11
5  info@sheibanigroup.com     1                       2232                     2254       960
6               directors     1                       1729                     1738        NA
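What I did was compute the gap between the positions of consecutive mentions and pair up mentions based on that gap. This, of course, produces some errors. Here is the code: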
library(dplyr)

# Drop e-mail mentions first, so the gap is measured between the rows that are actually paired
g <- NLP2 %>%
  filter(type != "IDENTIFIER:EMAIL") %>%
  group_by(web) %>%
  arrange(web, mentionOffsets.startOffset) %>%
  # Gap (in characters) between this mention and the previous one on the same page
  mutate(positions = mentionOffsets.startOffset - lag(mentionOffsets.endOffset)) %>%
  ungroup()

# Rows whose mention starts fewer than 15 characters after the previous mention ends
inds <- which(g$positions < 15 & g$positions > 0)

# Pair each such row with the row just before it
final_officers <- data.frame()
for (i in inds) {  # iterating over inds directly is safe when there are no matches
  l <- g[i - 1, ]  # first mention of the pair
  p <- g[i, ]      # second mention of the pair
  final_officers <- rbind(final_officers, data.frame(
    website = l$website, web = l$web, About = l$About,
    type = l$type, mention = l$mention,
    type2 = p$type, mention2 = p$mention,
    position1 = l$mentionOffsets.endOffset,
    position2 = p$mentionOffsets.endOffset))
}
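For anyone who wants to reproduce this, here is a minimal, self-contained sketch of the same gap-based pairing, written in base R so it runs without extra packages. The sample rows are copied from the table above (only the columns the pairing needs). The helper name `pair_adjacent_mentions` and its `max_gap` parameter are hypothetical, introduced only for illustration, and the e-mail rows are assumed to be dropped before the gap is measured.

```r
# Sample rows copied from the table above (only the columns the pairing needs).
mentions <- data.frame(
  web = c(rep("http://www.sheibanigroup.com/#", 5),
          "http://www.kinconsortium.com/structured.html"),
  type = c("PERSON", "PERSON", "TITLE", "TITLE", "IDENTIFIER:EMAIL", "TITLE"),
  mention = c("Ahmed Abdullah Sheibani", "Abubaker Sheibani", "Vice Chairman",
              "Managing Director", "info@sheibanigroup.com", "directors"),
  mentionOffsets.startOffset = c(921, 1205, 1231, 1255, 2232, 1729),
  mentionOffsets.endOffset   = c(944, 1222, 1244, 1272, 2254, 1738),
  stringsAsFactors = FALSE
)

# Hypothetical helper: pair each mention with the previous mention on the same
# page when the gap between them is greater than 0 and less than max_gap characters.
pair_adjacent_mentions <- function(df, max_gap = 15) {
  df <- df[df$type != "IDENTIFIER:EMAIL", ]                    # ignore e-mail mentions
  df <- df[order(df$web, df$mentionOffsets.startOffset), ]     # sort within each page
  n <- nrow(df)
  if (n < 2) return(data.frame())
  prev <- seq_len(n - 1)  # candidate first element of each pair
  nxt  <- prev + 1        # candidate second element of each pair
  gap  <- df$mentionOffsets.startOffset[nxt] - df$mentionOffsets.endOffset[prev]
  keep <- df$web[nxt] == df$web[prev] & gap > 0 & gap < max_gap
  data.frame(
    web      = df$web[prev][keep],
    type     = df$type[prev][keep],
    mention  = df$mention[prev][keep],
    type2    = df$type[nxt][keep],
    mention2 = df$mention[nxt][keep],
    gap      = gap[keep],
    stringsAsFactors = FALSE
  )
}

pairs <- pair_adjacent_mentions(mentions)
print(pairs)
```

On these sample rows the sketch returns two pairs: "Abubaker Sheibani" with "Vice Chairman" (gap 9) and "Vice Chairman" with "Managing Director" (gap 11). The second pair joins two TITLE mentions, which looks like the kind of error I am running into.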
I am trying to wrap my head around how to improve the accuracy. Any suggestions would be helpful.