Using R to search text for specific words and add them to a new data frame

Time: 2015-12-09 13:59:33

Tags: r grep plyr

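I adapted the following code, which searches a text file line by line for specific keywords (in this case, behaviors). In addition to counting the number of behaviors in each line, I would also like to add the identified words (the specific behaviors) to a new data frame that contains a list of all the words found throughout the text. Any help would be greatly appreciated.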

# required packages
library(plyr)
library(stringr)

# load behavior database
behavior.words=scan('C:/databases/behaviors.txt',what='character',comment.char=';')

# routine to find behaviors in text files and append them to the data frame
indentify.behavior = function(sentences, behavior.words, .progress='none')
{
    require(plyr)
    require(stringr)

    # we got a vector of sentences. plyr will handle a list
    # or a vector as an "l" for us
    # we want a data.frame of scores back, so we use
    # "l" + "d" + "ply" = "ldply":
    behaviors.df = ldply(sentences, function(sentence, behavior.words) {

        # clean up sentences with R's regex-driven global substitute, gsub():
        sentence = gsub('[[:punct:]]', '', sentence)
        sentence = gsub('[[:cntrl:]]', '', sentence)
        sentence = gsub('\\d+', '', sentence)
        # and convert to lower case:
        sentence = tolower(sentence)

        # split into words. str_split is in the stringr package
        word.list = str_split(sentence, '\\s+')
        # sometimes a list() is one level of hierarchy too much
        words = unlist(word.list)

        # compare our words to the list of behavior words
        behavior.matches = match(words, behavior.words)

        # match() returns the position of the matched term or NA
        # we just want a TRUE/FALSE:
        behavior.matches = !is.na(behavior.matches)

        # and conveniently enough, TRUE/FALSE will be treated as 1/0 by sum():
        behavior.count = sum(behavior.matches)
        behavior = behavior.count

        return(data.frame(behavior.count = behavior.count))
    }, behavior.words, .progress=.progress )

    # append sentence text:
    behaviors.df$text = sentences
    return(behaviors.df)
}

# Load the narrative
StoryText=scan("Narratives/story100.txt", character(0), sep = ".") 

# Calculate results for identifying behaviors
CTA.df=indentify.behavior(StoryText,behavior.words,.progress='text') # counts the behaviors found in each line of text
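
One possible way to also collect the matched words themselves is sketched below. It is only a minimal sketch and rests on several assumptions: the helper name `collect.behaviors`, the result name `found.df`, and the sample `behavior.words` and `sentences` vectors are all hypothetical stand-ins for the real database and narrative files, and it uses base R (`strsplit`, `%in%`, `lapply`) in place of plyr and stringr so that it runs without extra packages. The idea is to keep the words that match, rather than only counting them, and to return one row per matched word together with the line it came from.

```r
# Hypothetical stand-ins for behaviors.txt and the narrative file
behavior.words <- c("run", "jump", "laugh")
sentences <- c("He began to run, then jump!",
               "Nothing happened here.",
               "They laugh and run and run.")

# Hypothetical helper: returns one row per matched word occurrence
collect.behaviors <- function(sentences, behavior.words) {
    rows <- lapply(seq_along(sentences), function(i) {
        # same clean-up steps as in the counting routine
        sentence <- gsub('[[:punct:]]', '', sentences[i])
        sentence <- gsub('[[:cntrl:]]', '', sentence)
        sentence <- gsub('\\d+', '', sentence)
        sentence <- tolower(sentence)

        # split into words (base R equivalent of str_split + unlist)
        words <- unlist(strsplit(sentence, '\\s+'))

        # keep the matching words instead of only counting them
        found <- words[words %in% behavior.words]

        data.frame(line = rep(i, length(found)),
                   behavior = found,
                   stringsAsFactors = FALSE)
    })
    # stack the per-line results into a single data frame
    do.call(rbind, rows)
}

found.df <- collect.behaviors(sentences, behavior.words)
print(found.df)
```

With the real data, the call would presumably be `collect.behaviors(StoryText, behavior.words)`. Lines without any match contribute zero rows, so the `line` column can be used to join the result back to the per-line counts, and `table(found.df$behavior)` gives a tally of each behavior across the whole text.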

0 answers:

No answers yet.