I want to split a text, and I am following Example 1:
Example 1:
> x <- "Split the words in a sentence."
> strsplit(x, " ")
[[1]]
[1] "Split" "the" "words" "in"
[5] "a" "sentence."
So I tried to split NewString:
> NewString
[1] "s14 v13 s13 s13 v12 s12 v11 s11 v10 s10 s10 v09 s09 v08 s08 v07 s07 v06 s06 v05 s05 v04 s04 v03 s03 v02 s02 s01 v00 "
> strsplit(NewString,' ')
[[1]]
[1] "s14 v13 s13 s13 v12 s12 v11 s11 v10 s10 s10 v09 s09 v08 s08 v07 s07 v06 s06 v05 s05 v04 s04 v03 s03 v02 s02 s01 v00 "
The function does not split the text. Strangely, if I copy the printed output of NewString and paste it into strsplit() as a literal, it works:
>strsplit("s14 v13 s13 s13 v12 s12 v11 s11 v10 s10 s10 v09 s09 v08 s08 v07 s07 v06 s06 v05 s05 v04 s04 v03 s03 v02 s02 s01 v00 ",' ')
[[1]]
[1] "s14" "v13" "s13" "s13" "v12" "s12" "v11" "s11" "v10" "s10" "s10" "v09" "s09"
[14] "v08" "s08" "v07" "s07" "v06" "s06" "v05" "s05" "v04" "s04" "v03" "s03" "v02"
[27] "s02" "s01" "v00"
What could be the problem?
(NewString was produced with the rvest package.)
Edit: charToRaw gives the following output:
> charToRaw(lol)
[1] 73 31 34 c2 a0 76 31 33 c2 a0 73 31 33 c2 a0 73 31 33 c2 a0 76 31 32 c2 a0
[26] 73 31 32 c2 a0 76 31 31 c2 a0 73 31 31 c2 a0 76 31 30 c2 a0 73 31 30 c2 a0
[51] 73 31 30 c2 a0 76 30 39 c2 a0 73 30 39 c2 a0 76 30 38 c2 a0 73 30 38 c2 a0
[76] 76 30 37 c2 a0 73 30 37 c2 a0 76 30 36 c2 a0 73 30 36 c2 a0 76 30 35 c2 a0
[101] 73 30 35 c2 a0 76 30 34 c2 a0 73 30 34 c2 a0 76 30 33 c2 a0 73 30 33 c2 a0
[126] 76 30 32 c2 a0 73 30 32 c2 a0 73 30 31 c2 a0 76 30 30 c2 a0
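Answer 0 (score: 2)
This can be done with the stringi package and stri_split.
First, build a string whose tokens are separated by the same character (194 and 160 are the decimal values of the hex bytes C2 A0, which is the UTF-8 encoding of U+00A0, the no-break space):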
s=rawToChar(as.raw(c(65,66,48,194, 160,65,67,49,194,160,65,68,50)))
> s
[1] "AB0 AC1 AD2"
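The printed string looks as if it contained ordinary spaces, so it can help to inspect the code points directly. The following is a minimal sketch in base R, assuming the bytes are UTF-8 (as the charToRaw output in the question suggests); the variable name code_points is only illustrative.

```r
# Same bytes as the example string above: "AB0", "AC1", "AD2" joined by c2 a0
s <- rawToChar(as.raw(c(65, 66, 48, 194, 160, 65, 67, 49, 194, 160, 65, 68, 50)))
Encoding(s) <- "UTF-8"   # assumption: declare the bytes as UTF-8

# Decode the string into Unicode code points
code_points <- utf8ToInt(s)

# 160 is U+00A0 (no-break space); an ordinary space would be 32
which(code_points == 160L)   # positions of the no-break spaces
any(code_points == 32L)      # is there any ordinary space at all?
```

If the separators show up as 160 rather than 32, splitting on " " cannot match them, which would explain the unsplit result in the question.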
Plain str_split does not work:
> str_split(s,"\\s+")
[[1]]
[1] "AB0 AC1 AD2"
But with stringi installed and loaded:
> stri_split(s,regex="\\s+")
[[1]]
[1] "AB0" "AC1" "AD2"
I suspect that stringi has a broader notion of whitespace (\s), one that also covers the no-break space.
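If adding a package is not desirable, base R may be enough once the separator is named explicitly. This is only a sketch under the assumption that the separators are U+00A0 throughout; the string s below is a hypothetical, shortened stand-in for NewString, including its trailing separator.

```r
# Hypothetical stand-in for the scraped string: tokens separated by U+00A0
s <- "s14\u00a0v13\u00a0s13\u00a0"

# Option 1: split on the no-break space itself, as a fixed (non-regex) pattern
parts_fixed <- strsplit(s, "\u00a0", fixed = TRUE)[[1]]

# Option 2: turn no-break spaces into ordinary spaces, then split as usual
normalised <- gsub("\u00a0", " ", s, fixed = TRUE)
parts_norm <- strsplit(trimws(normalised), " +")[[1]]

parts_fixed
parts_norm
```

Option 2 is convenient when the text is reused elsewhere, since every later step can then treat it as ordinary space-separated text.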