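I am using the excellent tidytext package to tokenize sentences in several paragraphs. For example, I want to take the following paragraph:
"I am perfectly convinced by it that Mr. Darcy has no defect. He owns it himself without disguise."
and tokenize it into two sentences.
However, when I use tidytext's default sentence tokenizer, I get three sentences.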
Code
library(dplyr)
library(tidytext)
df <- tibble(Example_Text = c("I am perfectly convinced by it that Mr. Darcy has no defect. He owns it himself without disguise."))
unnest_tokens(df, input = "Example_Text", output = "Sentence", token = "sentences")
Result
# A tibble: 3 x 1
Sentence
<chr>
1 i am perfectly convinced by it that mr.
2 darcy has no defect.
3 he owns it himself without disguise.
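What is a simple way to tokenize sentences with tidytext without running into the problem of common abbreviations such as "Mr." or "Dr." being interpreted as sentence endings?
Answer 0 (score: 2)
You can use a regex as the splitting condition, but there is no guarantee that it covers all common honorifics: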
unnest_tokens(df, input = "Example_Text", output = "Sentence", token = "regex",
pattern = "(?<!\\b\\p{L}r)\\.")
Result:
# A tibble: 2 x 1
Sentence
<chr>
1 i am perfectly convinced by it that mr. darcy has no defect
2 he owns it himself without disguise
You can of course create your own list of common titles and build a regex from that list:
titles = c("Mr", "Dr", "Mrs", "Ms", "Sr", "Jr")
regex = paste0("(?<!\\b(", paste(titles, collapse = "|"), "))\\.")
# > regex
# [1] "(?<!\\b(Mr|Dr|Mrs|Ms|Sr|Jr))\\."
unnest_tokens(df, input = "Example_Text", output = "Sentence", token = "regex",
pattern = regex)
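To see what this pattern does outside of unnest_tokens, here is a minimal sketch using stringr, which to my knowledge uses the same ICU regex flavour. The helper name make_sentence_pattern is made up for illustration and is not part of any package.

```r
library(stringr)

# Hypothetical helper: builds a "split on a period unless a title precedes it"
# pattern from a character vector of titles (given without the trailing period).
make_sentence_pattern <- function(titles) {
  paste0("(?<!\\b(", paste(titles, collapse = "|"), "))\\.")
}

titles <- c("Mr", "Dr", "Mrs", "Ms", "Sr", "Jr")
pattern <- make_sentence_pattern(titles)

text <- "I am perfectly convinced by it that Mr. Darcy has no defect. He owns it himself without disguise."

# Split on every period that is not directly preceded by one of the titles
pieces <- str_split(text, pattern)[[1]]

# Trim surrounding whitespace and drop the empty piece left after the final period
pieces <- str_trim(pieces)
pieces <- pieces[pieces != ""]
pieces
```

For the example text this should leave two pieces, with "Mr. Darcy" kept together in the first. Note that the period used as the split point is consumed, which is also why the unnest_tokens output above has no trailing periods.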
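Answer 1 (score: 2)
Both corpus and quanteda have special handling for abbreviations when determining sentence boundaries. Here is how to split sentences with corpus: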
library(dplyr)
library(corpus)
df <- tibble(Example_Text = c("I am perfectly convinced by it that Mr. Darcy has no defect. He owns it himself without disguise."))
text_split(df$Example_Text, "sentences")
## parent index text
## 1 1 1 I am perfectly convinced by it that Mr. Darcy has no defect.
## 2 1 2 He owns it himself without disguise.
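If you still want a one-row-per-sentence tibble at the end, one possible sketch is to wrap the text_split() result yourself. This assumes the text column converts cleanly with as.character(); the column name Sentence is only chosen to mirror the question.

```r
library(corpus)
library(tibble)

example_text <- "I am perfectly convinced by it that Mr. Darcy has no defect. He owns it himself without disguise."

# text_split() returns one row per sentence with parent, index and text columns
sentences <- text_split(example_text, "sentences")

# Convert the corpus text column to plain character and trim stray whitespace
tidy_sentences <- tibble(
  Sentence = trimws(as.character(sentences$text))
)
tidy_sentences
```

Unlike unnest_tokens, this keeps the original capitalisation and the trailing periods.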
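If you want to stick with unnest_tokens but want a more exhaustive list of English abbreviations, you can follow @useR's advice but use the corpus abbreviation list (most of which were taken from the Common Locale Data Repository):
abbreviations_en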
## [1] "A." "A.D." "a.m." "A.M." "A.S." "AA."
## [7] "AB." "Abs." "AD." "Adj." "Adv." "Alt."
## [13] "Approx." "Apr." "Aug." "B." "B.V." "C."
## [19] "C.F." "C.O.D." "Capt." "Card." "cf." "Col."
## [25] "Comm." "Conn." "Cont." "D." "D.A." "D.C."
## (etc., 155 total)