Parsing text by uppercase in R

Time: 2015-03-31 00:56:50

Tags: r text text-mining uppercase

I have many large text files with the following basic structure:

text<-"this is a speech text. FIRST PERSON: hi all, thank you for coming. SECOND PERSON: thank you for inviting us"

As you can see, it consists of: 1) random text, 2) a person in uppercase, 3) speech.

I managed to split the text into individual words using:

textw<-unlist(strsplit(text," "))

Then I find the positions of all uppercase words:

grep(pattern = "^[[:upper:]]*$",x = textw)

And I have extracted the person names into a vector:

upperv<-textw[grep(pattern = "^[[:upper:]]*$",x = textw)]

The desired result would be a data frame or table like this:

Result<-data.frame(person=c(" ","FIRST PERSON","SECOND PERSON"),
         message=c("this is a speech text.","hi all, thank you for coming.","thank you for inviting us"))

Result
         person                       message
1                      this is a speech text.
2  FIRST PERSON hi all, thank you for coming.
3 SECOND PERSON     thank you for inviting us

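I am having trouble "linking" each message to its author.

Also note: there are uppercase words that are not authors, such as "I". How can I specify that the split should happen only where 2 or more uppercase words are next to each other?

In other words, if positions 2 and 3 are uppercase, the message is everything from position 4 up to the next occurrence of two adjacent uppercase words.

Any help is appreciated.

2 Answers:

Answer 0: (score: 8)

Here is one approach using the stringi package: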

text <- "this is a speech text. FIRST PERSON: hi all, thank you for coming. SECOND PERSON: thank you for inviting us"

library(stringi)
txt <- unlist(stri_split_regex(text, "(?<![A-Z]{2,1000})\\s+(?=[A-Z]{2,1000})"))

data.frame(
    person = stri_extract_first_regex(txt, "[A-Z ]+(?=(:\\s))"),
    message = stri_replace_first_regex(txt, "[A-Z ]+:\\s+", "")
)


##          person                       message
## 1          <NA>        this is a speech text.
## 2  FIRST PERSON hi all, thank you for coming.
## 3 SECOND PERSON     thank you for inviting us
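
For comparison, here is a minimal base R sketch of the same idea that needs no extra packages. It is not part of the original answer. The helper name split_speakers is hypothetical, and it assumes that a speaker label is two or more adjacent uppercase words followed directly by a colon, so a lone uppercase word such as "I" is left inside the message.

```r
# Hypothetical helper (not from the answer): one row per speaker turn.
# Assumption: a label is 2+ adjacent uppercase words followed by a colon.
split_speakers <- function(text) {
  label_pattern <- "[[:upper:]]+( [[:upper:]]+)+:"
  m <- gregexpr(label_pattern, text)
  labels <- regmatches(text, m)[[1]]                 # the matched labels
  pieces <- regmatches(text, m, invert = TRUE)[[1]]  # the text between labels
  data.frame(
    person  = c(NA, sub(":$", "", labels)),  # text before the first label has no speaker
    message = trimws(pieces),
    stringsAsFactors = FALSE
  )
}

text <- "this is a speech text. FIRST PERSON: hi all, thank you for coming. SECOND PERSON: thank you for inviting us"
result <- split_speakers(text)

# A lone "I" is not treated as a speaker label
result_i <- split_speakers("intro. FIRST PERSON: I agree. SECOND PERSON: thank you")
```

The first row always holds whatever precedes the first label, with person set to NA, which mirrors the output shown above.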

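Answer 1: (score: 2)

Basic approach

1) Get the messages: following Tyler Rinker's approach, I split the text at sequences of one or more (+) uppercase letters ([[:upper:]]) that may be followed by further uppercase letters, spaces, and colons ([ [:upper:]:]): "[[:upper:]]+[ [:upper:]:]+"

2) Extract the persons using almost the same regular expression (colons are no longer allowed): "[[:upper:]]+[ [:upper:]]+" (again, the basic idea is stolen from Tyler Rinker)

stringr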
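
One caveat with the split pattern in step 1, sketched here in base R rather than with the packages used in this answer: because the second character class also accepts a space, a single uppercase word followed by a space (such as "I ") matches as well, so it is swallowed into the delimiter. The variable names and the stricter pattern below are illustrative assumptions, not part of the answer. The stricter pattern requires the colon directly after the uppercase words.

```r
x <- "intro. FIRST PERSON: I agree. SECOND PERSON: thanks"

# Pattern from step 1: the label match extends over the following "I "
loose_pattern <- "[[:upper:]]+[ [:upper:]:]+"
loose_labels  <- regmatches(x, gregexpr(loose_pattern, x))[[1]]
loose_pieces  <- strsplit(x, loose_pattern)[[1]]   # the "I" is lost from the message

# Assumed alternative: uppercase words must be followed directly by a colon
strict_pattern <- "[[:upper:]]+( [[:upper:]]+)*: *"
strict_pieces  <- strsplit(x, strict_pattern)[[1]] # the "I" stays in the message
```

The stricter pattern still treats a single uppercase word followed by a colon (for example "NOTE:") as a label, so it only helps when labels always end in a colon and ordinary text does not.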

require(stringr)

text <- "this is a speech text. FIRST PERSON: hi all, thank you for coming. SECOND PERSON: thank you for inviting us"

data.frame (
    person  = c( NA,
                 unlist(str_extract_all(text, "[[:upper:]]+[ [:upper:]]+"))
                ),
    message = unlist(str_split(text, "[[:upper:]]+[ [:upper:]:]+"))
    )

##          person                        message
## 1          <NA>        this is a speech text. 
## 2  FIRST PERSON hi all, thank you for coming. 
## 3 SECOND PERSON      thank you for inviting us

stringi

require(stringi)

text <- "this is a speech text. FIRST PERSON: hi all, thank you for coming. SECOND PERSON: thank you for inviting us"

data.frame (
    person  = c( NA,
                 unlist(stri_extract_all(text, regex="[[:upper:]]+[ [:upper:]]+"))
                ),
    message = unlist(stri_split(text, regex="[[:upper:]]+[ [:upper:]:]+"))
    )

##          person                        message
## 1          <NA>        this is a speech text. 
## 2  FIRST PERSON hi all, thank you for coming. 
## 3 SECOND PERSON      thank you for inviting us

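Tips (these reflect my preferences rather than rules)

1) I prefer "[A-Z]+" over "[A-Z]{1,1000}", because with the first form there is no need to decide what a reasonable upper bound might be.

2) I prefer "[[:upper:]]" over "[A-Z]", because the former does this...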

str_extract("Á", "[[:upper:]]")
## [1] "Á"

...while the latter does this...

str_extract("Á", "[A-Z]")
## [1] NA

...when the text contains special characters such as accented letters.