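I've been doing some data wrangling based on a column containing free text. I want to identify a specific set of strings in this text, create a column that records each match, and then duplicate a row if more than one string matches in a given field. This is what I've got so far (apologies to anyone who doesn't like the festive season):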
#Example dataframe
library(stringr)
library(reshape2) # provides melt(), used in the re-shape step
dats<-data.frame(ID=c(1:5),text=c("rudolph","rudolph the","rudolph the red","rudolph the red nosed","rudolph the red nosed reindeer"))
dats
#Regular expression
patt<-c("rudolph","the","red","nosed","reindeer")
reg.patt<-paste(patt,collapse="|")
# simplify=TRUE already returns a character matrix (one column per match), so no unlist() is needed
dats$matched<-str_extract_all(dats$text,reg.patt,simplify=TRUE)
#Re-shape data
# Use the full column name (matched) rather than relying on partial matching of $match
dats2<-data.frame("ID"=dats$ID, "text"=dats$text,"match1"=dats$matched[,1],"match2"=dats$matched[,2],"match3"=dats$matched[,3],"match4"=dats$matched[,4],"match5"=dats$matched[,5])
dats3<-melt(dats2,id.vars=c("ID","text"))
dats3<-dats3[dats3$value!="",]
dats3$variable<-NULL
dats3<-dats3[order(dats3$ID,decreasing=FALSE),]
dats3
This works perfectly well, but I'm sure there is a more efficient way of doing it. Does anyone have any suggestions?
Merry Christmas!
Answer 0 (score: 2)
Try this:
library(quanteda)
s <- "rudolph the red nosed reindeer"
words <- strsplit(s, " ")[[1]]
do.call(rbind, lapply(words, kwic, x = s))
which gives:
                      contextPre   keyword    contextPost
[text1, 1]                       [ rudolph  ] the red nosed reindeer
[text1, 2]               rudolph [ the      ] red nosed reindeer
[text1, 3]           rudolph the [ red      ] nosed reindeer
[text1, 4]       rudolph the red [ nosed    ] reindeer
[text1, 5] rudolph the red nosed [ reindeer ]
Answer 1 (score: 2)
Try cSplit from the splitstackshape package:
library(splitstackshape)
dats$value <- lapply(str_extract_all(dats$text, reg.patt), toString)
cSplit(dats, 'value', direction="long")
#     ID                           text    value
#  1:  1                        rudolph  rudolph
#  2:  2                    rudolph the  rudolph
#  3:  2                    rudolph the      the
#  4:  3                rudolph the red  rudolph
#  5:  3                rudolph the red      the
#  6:  3                rudolph the red      red
#  7:  4          rudolph the red nosed  rudolph
#  8:  4          rudolph the red nosed      the
#  9:  4          rudolph the red nosed      red
# 10:  4          rudolph the red nosed    nosed
# 11:  5 rudolph the red nosed reindeer  rudolph
# 12:  5 rudolph the red nosed reindeer      the
# 13:  5 rudolph the red nosed reindeer      red
# 14:  5 rudolph the red nosed reindeer    nosed
# 15:  5 rudolph the red nosed reindeer reindeer
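For comparison, the same long format can probably be produced without any extra packages. What follows is only a minimal base R sketch under the question's example data, not something taken from either answer; the names hits, n and long are made up for illustration.

```r
# Example data and pattern, as in the question
dats <- data.frame(
  ID = 1:5,
  text = c("rudolph", "rudolph the", "rudolph the red",
           "rudolph the red nosed", "rudolph the red nosed reindeer"),
  stringsAsFactors = FALSE
)
patt <- c("rudolph", "the", "red", "nosed", "reindeer")
reg.patt <- paste(patt, collapse = "|")

# One character vector of matches per row of dats
hits <- regmatches(dats$text, gregexpr(reg.patt, dats$text))

# Repeat each row once per match, then attach the matches as a new column
n <- lengths(hits)
long <- dats[rep(seq_len(nrow(dats)), n), ]
long$value <- unlist(hits)
rownames(long) <- NULL
long
```

Rows with no match get a repeat count of zero and so drop out, which mirrors the filtering on empty strings in the question. Note that the pattern matches substrings, so "the" would also be found inside "other"; wrapping the alternatives in word boundaries, for example paste0("\\b(", reg.patt, ")\\b"), is one way to avoid that if whole words are wanted.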