How do I find all the words in a string in R?

Time: 2016-03-21 22:05:21

Tags: r

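I want to find all the words in a character vector, but I want to assume that words can be separated by punctuation characters as well, not only by spaces.

I can always do something like s <- strsplit(x, " ")[[1]] to split the words on spaces, but what if they are separated by other punctuation and the user simply forgot to include a space?

I believe I need to write some kind of regular expression that matches words and ignores punctuation.

Edit

I just want to split my string into words. If I have something like I,love pizza-because/it tastes.good, I want to get all the words, meaning "I", "love", "pizza", "because", "it", "tastes", "good". As I said, this is easy when the words are separated only by spaces, but what if they are separated by different punctuation characters?

I mean, I could always use something like str_replace_all(x, "[[:punct:]]", " ") and then split on spaces, but I don't want to depend on an external package or alter the original string.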

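5 answers:

Answer 0: (score: 3)

You can use the POSIX class [[:punct:]] for punctuation, or \\w for word characters. The R regex help page (?regex) discusses the character classes.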

tst <- "I,love pizza-because/it tastes.good"
regmatches(tst, gregexpr("\\w+", tst))
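As a minimal sketch of what the call above returns, the following unwraps the list with [[1]] and compares \\w+ with [[:alpha:]]+. The variable names and the `mixed` example string are illustrative assumptions, not part of the original answer.

```r
tst <- "I,love pizza-because/it tastes.good"

# gregexpr() locates every run of word characters and regmatches() extracts
# them. The result is a list with one element per input string, so [[1]]
# unwraps the words for the single string used here.
words <- regmatches(tst, gregexpr("\\w+", tst))[[1]]
print(words)

# \\w also matches digits and underscores, while [[:alpha:]] matches letters
# only. The string below is a made-up example to show the difference.
mixed <- "file_name2,is here"
word_chars <- regmatches(mixed, gregexpr("\\w+", mixed))[[1]]
letters_only <- regmatches(mixed, gregexpr("[[:alpha:]]+", mixed))[[1]]
print(word_chars)    # "file_name2" "is" "here"
print(letters_only)  # "file" "name" "is" "here"
```

Which class fits depends on whether identifiers such as file_name2 should count as a single word.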

Answer 1: (score: 3)

Here is an option using [[:punct:]]:

> strsplit("I,love pizza-because/it tastes.good", "[[:punct:] ]")
[[1]]
[1] "I"       "love"    "pizza"   "because" "it"      "tastes"  "good"
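One thing to watch with this pattern: the bracket expression lists a literal space, so tabs and newlines are not treated as separators. A hedged sketch of a variant using [[:space:]] follows; the input string with a tab and a newline is an invented example.

```r
# Made-up input where two word boundaries are a tab and a newline.
x <- "I,love\tpizza-because/it\ntastes.good"

# The pattern with a literal space does not split on the tab or the newline.
space_only <- strsplit(x, "[[:punct:] ]")[[1]]

# [[:space:]] covers spaces, tabs and newlines; the + treats a run of
# separators as a single split point.
any_space <- strsplit(x, "[[:punct:][:space:]]+")[[1]]

print(space_only)
print(any_space)
```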

Answer 2: (score: 2)

Splitting on the negated word class (\\W, any non-word character) should do the trick.

x <- "Lorem ipsum dolor sit amet, omnes inermis inimicus his an. Impedit
phaedrum torquatos vix ea. Pro ex atqui novum sonet, ut odio graece ridens
vel. Elitr bonorum in sea."

strsplit(x, "\\W")

[[1]]
 [1] "Lorem"           "ipsum"           "dolor"           "sit"             "amet"           
 [6] ""                "omnes"           "inermis"         "inimicus"        "his"            
[11] "an"              ""                "Impedit"         "phaedrum"        "torquatos" 

y <- "I,love pizza-because/it tastes.good"

strsplit(y, "\\W")

[[1]]
[1] "I"       "love"    "pizza"   "because" "it"      "tastes"  "good"   
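Since the question mentions a character vector, it may help to see that strsplit is vectorized and returns one list element per input string. This is a small sketch with an invented two-element vector; unlist is one way to pool the words if the per-string grouping is not needed.

```r
# Hypothetical two-element input vector.
x <- c("I,love pizza", "because/it tastes.good")

# One list element per input string, each holding that string's words.
per_string <- strsplit(x, "\\W+")

# Flatten into a single character vector of all words.
all_words <- unlist(per_string)

print(per_string)
print(all_words)
```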

Answer 3: (score: 1)

Use \\W for non-word characters:

> strsplit("I,love pizza-because/it tastes.good","\\W")
[[1]]
[1] "I"       "love"    "pizza"   "because" "it"      "tastes"  "good"   

> strsplit("I,love pizza-because/it,, tastes.good","\\W")
[[1]]
[1] "I"       "love"    "pizza"   "because" "it"      ""        ""        "tastes"  "good"   

> strsplit("I,love pizza-because/it,, tastes.good","\\W+")
[[1]]
[1] "I"       "love"    "pizza"   "because" "it"      "tastes"  "good"   
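Even with \\W+, a string that starts with a separator still yields an empty first element, because strsplit emits the (empty) text before the first match. A minimal sketch of dropping such entries with nzchar follows; the input string is an invented variation on the one above.

```r
# Made-up input with leading and trailing punctuation.
x <- ",,I,love pizza-because/it,, tastes.good!"

parts <- strsplit(x, "\\W+")[[1]]
# parts[1] is "" because the string begins with a separator;
# a trailing separator does not add an empty element at the end.

# Keep only the non-empty pieces.
words <- parts[nzchar(parts)]

print(parts)
print(words)
```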

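Answer 4: (score: 0)

Another option is stri_extract_all from stringi. It was already mentioned in a comment, but not in answer format.

library(stringi)
stri_extract_all_regex(tst, "\\w+")[[1]]
#[1] "I"       "love"    "pizza"   "because" "it"      "tastes"  "good"

Or we can use gsub from base R to replace all the punct characters with a single delimiter and then scan the string. Note that spaces are not replaced here, so words separated only by a space stay together in the output.

scan(text=gsub("[[:punct:]]", ",", tst), what="", 
                 sep=",", quiet=TRUE)
#[1] "I"          "love pizza" "because"    "it tastes"  "good"

Data

tst <- "I,love pizza-because/it tastes.good"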