Dividing text in R with the tm package - identifying speakers

Time: 2012-01-11 01:51:23

Tags: regex r data-mining text-mining

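I am trying to identify the most frequently used words in congressional speeches, and I have to separate them by congressperson. I am just starting to learn R and the tm package. I have code that finds the most frequent words, but what kind of code can I use to automatically identify and store the speaker of each speech?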

The text looks like this:

OPENING STATEMENT OF SENATOR HERB KOHL, CHAIRMAN

    The Chairman. Good afternoon to everybody, and thank you 
very much for coming to this hearing this afternoon.
    In today's tough economic climate, millions of seniors have 
lost a big part of their retirement and investments in only a 
matter of months. Unlike younger Americans, they do not have 
time to wait for the markets to rebound in order to recoup a 
lifetime of savings.
[....]

   STATEMENT OF SENATOR MEL MARTINEZ, RANKING MEMBER
[....]

I would like to be able to extract these names, or to split the text by person. I hope you can help me. Thanks a lot.

2 answers:

Answer 0: (score: 0)

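Would it be correct to say that you want to split the file so that you have one text object per speaker, and then use a regular expression to grab the speaker's name for each object? You can then write a function to collect word frequencies, etc. on each object and put them in a table where the row or column names are the speakers' names.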

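If so, you might say x is your text, then use strsplit(x, "STATEMENT OF") to split on the words STATEMENT OF, and then use grep() or str_extract() to return the 2 or 3 words after SENATOR (do they always have only two names, as in your example?).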
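As a rough illustration of the split-then-extract idea, here is a minimal base R sketch. The sample string `txt`, the helper name `get_speaker`, and the regular expression are all assumptions made for this example. It presumes that every heading reads "STATEMENT OF SENATOR <NAME>," with a comma ending the name.

```r
# Hypothetical sample text; real transcripts will be longer and messier.
txt <- paste(
  "OPENING STATEMENT OF SENATOR HERB KOHL, CHAIRMAN",
  "The Chairman. Good afternoon to everybody.",
  "STATEMENT OF SENATOR MEL MARTINEZ, RANKING MEMBER",
  "Thank you, Mr. Chairman.",
  sep = "\n"
)

# Split on the heading phrase; the first piece is whatever precedes
# the first heading, so it is dropped.
pieces <- unlist(strsplit(txt, "STATEMENT OF", fixed = TRUE))
pieces <- pieces[-1]

# Hypothetical helper: capture everything between "SENATOR " and the
# first comma, returning NA when the pattern is not found.
get_speaker <- function(s) {
  m <- regmatches(s, regexec("SENATOR ([^,]+),", s))
  vapply(m, function(hit) if (length(hit) >= 2) hit[2] else NA_character_,
         character(1))
}

speakers <- get_speaker(pieces)
print(speakers)
```

Capturing up to the first comma means the number of words in a name does not need to be known in advance, though it will fail on headings that lack a comma.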

Have a look here for more on the use of these functions and on text processing in R: http://en.wikibooks.org/wiki/R_Programming/Text_Processing

UPDATE: Here's a more complete answer...

# create object containing all text (each heading sits on its own line)
x <- c("OPENING STATEMENT OF SENATOR HERB KOHL, CHAIRMAN
The Chairman. Good afternoon to everybody, and thank you
very much for coming to this hearing this afternoon.
    In today's tough economic climate, millions of seniors have
lost a big part of their retirement and investments in only a
matter of months. Unlike younger Americans, they do not have
time to wait for the markets to rebound in order to recoup a
lifetime of savings.
STATEMENT OF SENATOR BIG APPLE KOHL, CHAIRMAN
I am trying to identify the most frequently used words in the
congress speeches, and have to separate them by the congressperson.
I am just starting to learn about R and the tm package. I have a code
that can find the most frequent words, but what kind of a code can I
use to automatically identify and store the speaker of the speech
STATEMENT OF SENATOR LITTLE ORANGE, CHAIRMAN
Would it be correct to say that you want
to split the file so you have one text object
per speaker? And then use a regular expression
to grab the speaker's name for each object? Then
you can write a function to collect word frequencies,
etc. on each object and put them in a table where the
row or column names are the speaker's names.")

# split object on the words STATEMENT OF
y <- unlist(strsplit(x, "STATEMENT OF"))

# load library containing handy function
library(stringr)

# use word() to return words in positions 3 to 4 of each string,
# which is where the first and last names are
# note that the first element of the character vector y has only one word,
# and this function gives an error if there are not enough words in the string
z <- word(y[2:4], 3, 4)

z # have a look at the result...
[1] "HERB KOHL," "BIG APPLE" "LITTLE ORANGE,"

No doubt a regex wizard could come up with something quicker and neater!

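Anyway, from here you can run a function to calculate word frequencies on each element of the vector y (i.e. each speaker's speech), and then make another object that combines the word frequency results with the names for further analysis.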
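One way this last step might look in base R is sketched below. The named vector `speeches`, the helper `word_freq`, and the tokenising rule (lower-case, split on anything that is not a letter or apostrophe) are assumptions for illustration, not part of the answer above.

```r
# Hypothetical input: one string per speaker, named by speaker.
speeches <- c(
  "HERB KOHL"    = "Good afternoon to everybody, and thank you very much for coming to this hearing this afternoon.",
  "MEL MARTINEZ" = "Thank you. Thank you all for coming."
)

# Hypothetical helper: lower-case the text, split on anything that is
# not a letter or apostrophe, drop empty tokens, and count the rest.
word_freq <- function(s) {
  words <- unlist(strsplit(tolower(s), "[^a-z']+"))
  words <- words[nzchar(words)]
  sort(table(words), decreasing = TRUE)
}

# One frequency table per speaker; lapply() keeps the speaker names.
freqs <- lapply(speeches, word_freq)

# Combine into a long data frame with columns speaker, word, count.
freq_df <- do.call(rbind, lapply(names(freqs), function(nm) {
  data.frame(speaker = nm,
             word    = names(freqs[[nm]]),
             count   = as.integer(freqs[[nm]]),
             stringsAsFactors = FALSE)
}))

print(head(freq_df))
```

The long format keeps the speaker name next to every count, so it can be filtered or reshaped per speaker later. A real analysis would probably also remove stop words, which this sketch does not attempt.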

Answer 1: (score: 0)

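Here is how I would approach it using Ben's example (use qdap to parse the text and create a data frame, then convert it to a Corpus with 3 documents; note that qdap was designed for transcript data like this, and a Corpus may not be the best data format):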

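library(qdap)

# split the text into lines
dat <- unlist(strsplit(x, "\\n"))

# locate the heading lines and pull the speaker name out of each
locs <- grep("STATEMENT OF ", dat)
nms <- sapply(strsplit(dat[locs], "STATEMENT OF |,"), "[", 2)

# mark the headings, rejoin the remaining text, split it per speaker,
# and convert the resulting data frame to a tm corpus
dat[locs] <- "SPLIT_HERE"
corp <- with(data.frame(person=nms, dialogue = 
    Trim(unlist(strsplit(paste(dat[-1], collapse=" "), "SPLIT_HERE")))),
    df2tm_corpus(dialogue, person))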

tm::inspect(corp)

## A corpus with 3 text documents
## 
## The metadata consists of 2 tag-value pairs and a data frame
## Available tags are:
##   create_date creator 
## Available variables in the data frame are:
##   MetaID 
## 
## $`SENATOR BIG APPLE KOHL`
## I am trying to identify the most frequently used words in the  congress speeches, and have to separate them by the congressperson.  I am just starting to learn about R and the tm package. I have a code  that can find the most frequent words, but what kind of a code can I   use to automatically identify and store the speaker of the speech
## 
## $`SENATOR HERB KOHL`
## The Chairman. Good afternoon to everybody, and thank you  very much for coming to this hearing this afternoon.     In today's tough economic climate, millions of seniors have  lost a big part of their retirement and investments in only a  matter of months. Unlike younger Americans, they do not have  time to wait for the markets to rebound in order to recoup a  lifetime of savings.
## 
## $`SENATOR LITTLE ORANGE`
## Would it be correct to say that you want  to split the file so you have one text object  per speaker? And then use a regular expression  to grab the speaker's name for each object? Then  you can write a function to collect word frequencies,  etc. on each object and put them in a table where the  row or column names are the speaker's names.
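For readers without qdap, the parsing step can be approximated in base R alone. The following is only a sketch under stated assumptions: the sample vector `transcript` is made up, each heading sits on its own line and contains a comma after the name, every heading is followed by at least one line of speech, and no text precedes the first heading. It produces a plain data frame rather than a Corpus.

```r
# Hypothetical transcript, one element per line.
transcript <- c(
  "OPENING STATEMENT OF SENATOR HERB KOHL, CHAIRMAN",
  "The Chairman. Good afternoon to everybody.",
  "Thank you for coming.",
  "STATEMENT OF SENATOR MEL MARTINEZ, RANKING MEMBER",
  "Thank you, Mr. Chairman."
)

# Heading lines mark where a new speaker starts.
is_heading <- grepl("STATEMENT OF ", transcript, fixed = TRUE)

# Name = text between "STATEMENT OF " and the first comma.
person <- sub(".*STATEMENT OF ([^,]+),.*", "\\1", transcript[is_heading])

# cumsum() gives every line the index of the most recent heading.
group <- cumsum(is_heading)

# Paste each speaker's non-heading lines into a single string.
dialogue <- vapply(
  split(transcript[!is_heading], group[!is_heading]),
  paste, character(1), collapse = " "
)

speech_df <- data.frame(person = person, dialogue = unname(dialogue),
                        stringsAsFactors = FALSE)
print(speech_df)
```

Grouping lines with cumsum() avoids pasting the whole text together and splitting it again on a marker string, at the cost of relying on the layout assumptions listed above.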