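I've been told I shouldn't use R to scan text (I keep doing it anyway, until I pick up other skills), and I've run into a problem that puzzles me enough to bring me back to these forums. Thanks in advance for any help.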
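I'm trying to store a large body of text (e.g., a short story) as a vector of strings, with each string being a single sentence. I've been using the scan() function to do this, but I've hit two basic problems: (1) scan() seems to allow only a single separator character, whereas sentences can obviously end in several ways. I know how to mark the end of a sentence with a regular expression (e.g., [!?\.]), but I don't know of a function in R that splits text using a regular expression. (2) scan() seems to treat every new line as a new field automatically, whereas I want it to ignore new lines unless they coincide with the end of a sentence.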
download.file("http://www.textfiles.com/stories/3lpigs.txt","threelittlepigs.txt")
threelittlepigs_s <- scan("threelittlepigs.txt", character(0),
                          sep = ".", quote = NULL)
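If I don't include the 'quote = NULL' option, scan() throws a warning that an EOF (end of file, I assume) fell within a quoted string. This produces some multi-line elements/fields, but very erratically. I can't discern a pattern.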
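Apologies if this has been asked before. I'm sure there is a simple solution. I would prefer an answer that helps me understand why scan() doesn't work the way I expect, but if there is a better tool for reading text into R, please let me know.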
Answer 0 (score: 3)
R has some very capable text-mining tools, spread across a number of packages, for example tm, rvest, stringi and others.
But here is a simple example that does this almost entirely in base R. I only use the %>% pipe from magrittr, because I think it makes the code more readable.
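The specific answer to your question is that you can use a regular expression to search for several punctuation marks at once. In the example below I use "[\\.?!] ", which means a period, question mark or exclamation mark followed by a space. You may need to experiment.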
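To see what the pattern does without downloading anything, here is a minimal sketch on a made-up string (the names `txt` and `sentences` are only for illustration). It also shows that strsplit() discards whatever the pattern matches, so the closing punctuation is dropped from every piece except the last, which has no trailing space.

```r
# Made-up sample text; not taken from the story
txt <- "It rained. Was it cold? Yes! They built a house."

# Split after a period, question mark or exclamation mark followed by a space.
# strsplit() returns a list with one element per input string.
sentences <- strsplit(txt, split = "[\\.?!] ")[[1]]

print(sentences)
```

The last element keeps its period because nothing follows it for the pattern to match.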
Try this:
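library("magrittr")
url <- "http://www.textfiles.com/stories/3lpigs.txt"
# Piping url into paste() prefixes every line with the URL before the lines
# are joined by spaces; the gsub() call then strips those prefixes again.
corpus <- url %>%
  paste(readLines(url), collapse = " ") %>%
  gsub("http://www.textfiles.com/stories/3lpigs.txt", "", .)
head(corpus)
# Collapse runs of spaces, then split after ., ? or ! followed by a space.
z <- corpus %>%
  gsub(" +", " ", .) %>%
  strsplit(split = "[\\.?!] ")
# strsplit() returns a list; the sentences are its first element.
z[[1]]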
Result:
z[[1]]
[1] " THE THREE LITTLE PIGS Once upon a time "
[2] ""
[3] ""
[4] "there were three little pigs, who left their mummy and daddy to see the world"
[5] "All summer long, they roamed through the woods and over the plains,playing games and having fun"
[6] "None were happier than the three little pigs, and they easily made friends with everyone"
[7] "Wherever they went, they were given a warm welcome, but as summer drew to a close, they realized that folk were drifting back to their usual jobs, and preparing for winter"
[8] "Autumn came and it began to rain"
[9] "The three little pigs started to feel they needed a real home"
[10] "Sadly they knew that the fun was over now and they must set to work like the others, or they'd be left in the cold and rain, with no roof over their heads"
...etc
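The split above drops the closing punctuation and leaves stray leading spaces. If that matters, one possible refinement (a sketch, not part of the approach above) is to split on the whitespace that follows a terminator, using a Perl-style lookbehind so the terminator itself is kept, and then trim the pieces. The helper name `split_sentences` and the sample string are hypothetical stand-ins for the downloaded text.

```r
# Hypothetical helper: split text into sentences, keeping end punctuation
split_sentences <- function(text) {
  # Collapse any run of whitespace (including newlines) to a single space
  text <- gsub("\\s+", " ", text)
  # Split on whitespace that is preceded by ., ? or !
  # Lookbehind (?<=...) is only available with perl = TRUE
  pieces <- strsplit(text, "(?<=[.?!])\\s+", perl = TRUE)[[1]]
  # Remove leading/trailing whitespace and drop empty pieces
  pieces <- trimws(pieces)
  pieces[nzchar(pieces)]
}

# Made-up sample with a line break and a double space in the middle
sample_text <- " Once upon a time there were three pigs.\nThey left home!  Why? Nobody knows."
result <- split_sentences(sample_text)
print(result)
```

A spaced-out ellipsis such as ". . ." would still come back as separate "." elements with this pattern, so some experimenting with the regular expression is still likely to be needed.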