I have many large text files with the following basic structure:
text<-"this is a speech text. FIRST PERSON: hi all, thank you for coming. SECOND PERSON: thank you for inviting us"
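As you can see, it consists of: 1) random text, 2) a person's name in uppercase, 3) speech.
I have managed to split the text into a vector of words using: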
textw<-unlist(strsplit(text," "))
Then I find the positions of all uppercase words:
grep(pattern = "^[[:upper:]]*$",x = textw)
And I have extracted the people's names into a vector:
upperv<-textw[grep(pattern = "^[[:upper:]]*$",x = textw)]
The desired result would be a data frame or table like this:
Result<-data.frame(person=c(" ","FIRST PERSON","SECOND PERSON"),
message=c("this is a speech text.","hi all, thank you for coming.","thank you for inviting us"))
Result
         person                       message
1                      this is a speech text.
2  FIRST PERSON hi all, thank you for coming.
3 SECOND PERSON     thank you for inviting us
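I am having trouble "linking" each message to its author.
Also note: there are uppercase words that are not authors, for example "I". How can I split only where two or more uppercase words are next to each other?
In other words, if positions 2 and 3 are uppercase, then take as the message everything from position 4 up to the next occurrence of two adjacent uppercase words.
Any help is appreciated.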
Answer 0 (score: 8)
Here is one approach using the stringi package:
text <- "this is a speech text. FIRST PERSON: hi all, thank you for coming. SECOND PERSON: thank you for inviting us"
library(stringi)
txt <- unlist(stri_split_regex(text, "(?<![A-Z]{2,1000})\\s+(?=[A-Z]{2,1000})"))
data.frame(
person = stri_extract_first_regex(txt, "[A-Z ]+(?=(:\\s))"),
message = stri_replace_first_regex(txt, "[A-Z ]+:\\s+", "")
)
##          person                       message
## 1          <NA>        this is a speech text.
## 2  FIRST PERSON hi all, thank you for coming.
## 3 SECOND PERSON     thank you for inviting us
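For readers who cannot install stringi, the same split-then-extract idea can be sketched in base R. This is only an illustration, not part of the answer above: the names `parts`, `has_label` and `result` are made up here, and the lookarounds are shortened to `{2}`, which should behave the same as `{2,1000}` for this purpose. A lone capital such as "I" is not treated as a speaker because the lookahead requires two capitals in a row, but an all-caps word such as an acronym would still trigger a split.

```r
# Base R sketch of the split-then-extract idea (no packages needed).
text <- "this is a speech text. FIRST PERSON: hi all, thank you for coming. SECOND PERSON: thank you for inviting us"

# Split at whitespace that is followed by two capitals but not preceded by
# two capitals, so the space inside "FIRST PERSON" is left alone.
parts <- strsplit(text, "(?<![A-Z]{2})\\s+(?=[A-Z]{2})", perl = TRUE)[[1]]

# A piece carries a speaker label if it starts with capitals/spaces and a colon.
has_label <- grepl("^[A-Z ]+:\\s", parts, perl = TRUE)

result <- data.frame(
  # Keep everything before the first ": " as the person, NA if there is no label.
  person  = ifelse(has_label, sub(":\\s.*$", "", parts, perl = TRUE), NA_character_),
  # Drop the leading "LABEL: " from the message, if present.
  message = sub("^[A-Z ]+:\\s+", "", parts, perl = TRUE),
  stringsAsFactors = FALSE
)
result
```

Using `perl = TRUE` throughout keeps `[A-Z]` restricted to ASCII capitals regardless of locale.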
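Answer 1 (score: 2)
Basic approach
1) Split the text: I would follow Tyler Rinker's approach and split the text on a sequence of one or more (+) uppercase letters ([[:upper:]]), which may also be accompanied by spaces and colons ([ [:upper:]:]): "[[:upper:]]+[ [:upper:]:]+"
2) Extract the persons using almost the same regex (colons are no longer allowed): "[[:upper:]]+[ [:upper:]]+" (again, the basic idea is stolen from Tyler Rinker)
stringr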
require(stringr)
text <- "this is a speech text. FIRST PERSON: hi all, thank you for coming. SECOND PERSON: thank you for inviting us"
data.frame (
person = c( NA,
unlist(str_extract_all(text, "[[:upper:]]+[ [:upper:]]+"))
),
message = unlist(str_split(text, "[[:upper:]]+[ [:upper:]:]+"))
)
##          person                       message
## 1          <NA>        this is a speech text.
## 2  FIRST PERSON hi all, thank you for coming.
## 3 SECOND PERSON     thank you for inviting us
stringi
require(stringi)
text <- "this is a speech text. FIRST PERSON: hi all, thank you for coming. SECOND PERSON: thank you for inviting us"
data.frame (
person = c( NA,
unlist(stri_extract_all(text, regex="[[:upper:]]+[ [:upper:]]+"))
),
message = unlist(stri_split(text, regex="[[:upper:]]+[ [:upper:]:]+"))
)
##          person                       message
## 1          <NA>        this is a speech text.
## 2  FIRST PERSON hi all, thank you for coming.
## 3 SECOND PERSON     thank you for inviting us
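One detail the printed tables do not show: because the split pattern also consumes the space after the colon, the speech itself starts cleanly, but a piece that precedes a speaker label keeps its trailing space. A minimal base R sketch of the same two patterns with the whitespace trimmed might look like this (the variable names are illustrative, and `trimws()` is an addition that is not in the answer above). Note too that both patterns match a lone capital followed by a space, such as "I ", so text containing such words would need a stricter pattern.

```r
# Base R sketch of the two-pattern approach, with trailing whitespace trimmed.
text <- "this is a speech text. FIRST PERSON: hi all, thank you for coming. SECOND PERSON: thank you for inviting us"

split_pattern <- "[[:upper:]]+[ [:upper:]:]+"  # speaker label incl. colon and spaces
name_pattern  <- "[[:upper:]]+[ [:upper:]]+"   # speaker name only, no colon

# Pieces between the labels; the first piece still ends with a space.
raw_message <- strsplit(text, split_pattern)[[1]]

# All speaker names, with NA in front for the unlabeled leading text.
person <- c(NA, regmatches(text, gregexpr(name_pattern, text))[[1]])

result <- data.frame(
  person  = person,
  message = trimws(raw_message),  # remove leading/trailing whitespace
  stringsAsFactors = FALSE
)
result
```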
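Hints (these reflect my preferences rather than rules)
1) I prefer "[A-Z]+" over "[A-Z]{1,1000}", because with the former one does not have to decide what a reasonable upper limit might be.
2) I prefer "[[:upper:]]" over "[A-Z]", because the former does this ...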
str_extract("Á", "[[:upper:]]")
## [1] "Á"
... while the latter does this ...
str_extract("Á", "[A-Z]")
## [1] NA
... in the case of special characters.