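I am working on a text-mining task in R. I need to:
1) Count sentences
2) Identify quotes and save them in a vector
False full stops (e.g. "...") and periods in titles (e.g. "Mr.") have to be handled.
The text body will definitely contain quotes, and those quotes will contain "...". I am thinking of extracting these quotes from the body and saving them in a vector. (Some manipulation needs to be done on them as well.)
Important note: my text data is in a Word document. I load it into R with readtext("path to .docx file"). When I view the text, the quotes appear as a plain " rather than \", unlike in the reproducible text below.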
library(readtext)
path <- "C:/Users/.../"
a <- readtext(paste(path, "Text.docx", sep = ""))
title <- a$doc_id
text <- a$text
text <- "Mr. and Mrs. Keyboard have two children. Keyboard Jr. and Miss. Keyboard. ...
However, Miss. Keyboard likes being called Miss. K [Miss. Keyboard is a bit of a princess ...]
\"Mom how are you o.k. with being called Mrs. Keyboard? I'll never get it...\". "
# splitting by "."
unlist(strsplit(text, "\\."))
The problem is that it also splits on the false full stops. The solution I tried:
# getting rid of . in titles
vec <- c("Mr.", "Mrs.", "Ms.", "Miss.", "Dr.", "Jr.")
vec.rep <- c("Mr", "Mrs", "Ms", "Miss", "Dr", "Jr")
library(gsubfn)
# replacing . in titles
gsubfn("\\S+", setNames(as.list(vec.rep), vec), text)
The problem with this is that it does not replace "[Miss." with "[Miss".
To identify the quotes:
library(stringi)
stri_extract_all_regex(text, '"\\S+"')
But that does not work either. (It does work with \" in the code below.)
stri_extract_all_regex("some text \"quote\" some other text", '"\\S+"')
The exact expected vector is:
sentences <- c("Mr and Mrs Keyboard have two children. ", "Keyboard Jr and Miss Keyboard.", "However, Miss Keyboard likes being called Miss K [Miss Keyboard is a bit of a princess ...]", "\"Mom how are you ok with being called Mrs Keyboard? I'll never get it...\"")
I want the sentences separated (so that I can count how many sentences there are in each paragraph), and the quotes separated as well.
quotes <- "\"Mom how are you ok with being called Mrs Keyboard? I'll never get it...\""
Answer 0 (score: 1)
You can match all the current vec values using
gsubfn("\\w+\\.", setNames(as.list(vec.rep), vec), text)
That is, \w+ matches 1 or more word characters, and \. matches a dot.
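To illustrate why the \S+ pattern from the question misses "[Miss.", here is a minimal base R sketch. The sample string x and the names titles, pattern and cleaned are made up for illustration, and the gsub() call is a base R alternative to gsubfn, not the method used in this answer.

```r
# Hypothetical sample string, shortened from the text in the question.
x <- "Mr. and Mrs. Keyboard like Miss. K [Miss. Keyboard is o.k. with it ...]"

# \S+ grabs whole whitespace-delimited tokens, so the bracket stays attached
# to the title and the token "[Miss." is not one of the names in vec.
tokens <- regmatches(x, gregexpr("\\S+", x))[[1]]
"[Miss." %in% tokens

# Base R alternative (an assumption, not part of the answer): strip the dot
# only when it directly follows one of the known titles.
titles  <- c("Mr", "Mrs", "Ms", "Miss", "Dr", "Jr")
pattern <- paste0("\\b(", paste(titles, collapse = "|"), ")\\.")
cleaned <- gsub(pattern, "\\1", x, perl = TRUE)
cleaned
```

Because the dot is removed only after a listed title, abbreviations such as "o.k." and the "..." are left alone.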
Next, if you only want to extract the quotes, use
regmatches(text, gregexpr('"[^"]*"', text))
Here " matches a ", and [^"]* matches 0 or more characters other than ".
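A small sketch of the difference between the two quote patterns, using a made-up string x. The normalisation step is only an assumption: it applies if the Word document contains typographic quotes (U+201C and U+201D) rather than the plain ", which would need checking against the real data.

```r
# Hypothetical input mixing curly quotes (as Word may produce) and a plain quote pair.
x <- 'She said \u201Cfine by me\u201D and "ok" twice.'

# Assumption: map curly double quotes to the plain ASCII quote first.
normalized <- gsub("[\u201C\u201D]", "\"", x)

# "\\S+" cannot cross a space, so it only finds single-word quotes.
naive <- regmatches(normalized, gregexpr('"\\S+"', normalized))[[1]]

# [^"]* allows spaces, so multi-word quotes are found as well.
quoted <- regmatches(normalized, gregexpr('"[^"]*"', normalized))[[1]]
quoted
```

This is also why the '"\\S+"' attempt in the question works on "some text \"quote\" some other text" but not on a quote that contains several words.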
If you plan to match the sentences along with the quotes, you may consider
regmatches(text, gregexpr('\\s*"[^"]*"|[^"?!.]+[[:space:]?!.]+[^"[:alnum:]]*', trimws(text)))
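Details
\\s* - 0 or more whitespace characters
"[^"]*" - a ", then 0 or more characters other than ", then a "
| - or
[^"?!.]+ - 1 or more characters other than ?, ", ! and .
[[:space:]?!.]+ - 1 or more whitespace, ?, ! or . characters
[^"[:alnum:]]* - 0 or more characters that are neither alphanumeric nor "
R sample code: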
> vec <- c("Mr.", "Mrs.", "Ms.", "Miss.", "Dr.", "Jr.")
> vec.rep <- c("Mr", "Mrs", "Ms", "Miss", "Dr", "Jr")
> library(gsubfn)
> text <- gsubfn("\\w+\\.", setNames(as.list(vec.rep), vec), text)
> regmatches(text, gregexpr('\\s*"[^"]*"|[^"?!.]+[[:space:]?!.]+[^"[:alnum:]]*', trimws(text)))
[[1]]
[1] "Mr and Mrs Keyboard have two children. "
[2] "Keyboard Jr and Miss Keyboard. ... \n"
[3] "However, Miss Keyboard likes being called Miss K [Miss Keyboard is a bit of a princess ...]\n "
[4] "\"Mom how are you o.k. with being called Mrs Keyboard? I'll never get it...\""
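Since the goal in the question was to count sentences and keep the quotes separately, the result can be post-processed roughly as follows. This is only a sketch on a made-up string x from which the title periods are already removed; the names pieces, n_sentences and quotes are illustrative.

```r
# Hypothetical sample, already stripped of the periods in titles.
x <- 'Mr and Mrs K have two children. They like tea! "Is it ok? Sure."'

# Same pattern as above: a whole quoted span, or a run of text up to its
# closing punctuation.
pattern <- '\\s*"[^"]*"|[^"?!.]+[[:space:]?!.]+[^"[:alnum:]]*'
pieces  <- trimws(regmatches(x, gregexpr(pattern, x))[[1]])

# Number of pieces (each quoted span counts as one piece).
n_sentences <- length(pieces)

# Keep only the pieces that start with a double quote.
quotes <- pieces[startsWith(pieces, '"')]
```

Note that a quoted span is kept as a single piece even when it contains several sentences, so the count treats each quote as one unit.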