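I'm having a hard time extracting the elements between a / and a blank space. I can do this when there are two delimiter characters, such as < and >, but the space is throwing me off. I'd like the most efficient way to do this in base R, because it will be lapplied to thousands of vectors.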
I'd like to turn this:
x <- "This/DT is/VBZ a/DT short/JJ sentence/NN consisting/VBG of/IN some/DT nouns,/JJ verbs,/NNS and/CC adjectives./VBG"
into this:
[1] "DT" "VBZ" "DT" "JJ" "NN" "VBG" "IN" "DT" "JJ" "NNS" "CC" "VBG"
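EDIT
Thank you all for the answers. I'm going for speed, so Andres's code wins out. DWin's code wins for the shortest amount of code. Dirk, yours was the second fastest. The stringr solution was the slowest (as I figured it would be) and isn't base R, but it is very easy to understand (which I think is really the intent of the stringr package, as this seems to be Hadley's philosophy with most things).
I appreciate the assistance. Thanks again.
I thought I'd include the benchmarking, since this will be lapplied across several thousand vectors: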
    test replications elapsed relative user.self sys.self
1 ANDRES        10000    1.06 1.000000      1.05        0
3   DIRK        10000    1.29 1.216981      1.20        0
2   DWIN        10000    1.56 1.471698      1.43        0
4 FLODEL        10000    8.46 7.981132      7.70        0
Answer 0 (score: 5)
Something similar but more succinct:
#1- Separate the elements by the blank space
y=unlist(strsplit(x,' '))
#2- extract just what you want from each element:
sub('^.*/([^ ]+).*$','\\1',y)
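Here the start and end anchors are ^ and $ respectively; .* matches any run of characters (including none); [^ ]+ matches one or more non-blank characters; and \\1 refers to the first captured group, that is, the text matched inside the parentheses.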
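Since the question plans to lapply this over thousands of vectors, the two steps above can be wrapped in a small function. This is a minimal sketch: the name extract_tags and the sample inputs are assumptions made for illustration, and it presumes tokens are separated by single spaces.

```r
# Hypothetical helper wrapping the split-then-sub steps; the name is illustrative.
extract_tags <- function(s) {
  tokens <- unlist(strsplit(s, " ", fixed = TRUE))  # step 1: split on single spaces
  sub("^.*/([^ ]+).*$", "\\1", tokens)              # step 2: keep what follows the last slash
}

sentence <- "This/DT is/VBZ a/DT short/JJ sentence/NN"
tags <- extract_tags(sentence)
print(tags)

# Applying the helper to several sentences at once
tag_list <- lapply(list(sentence, "some/DT nouns,/JJ"), extract_tags)
```

Because .* is greedy, a token containing more than one slash would keep only the text after its last slash.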
Answer 1 (score: 3)
Use a regex pattern that matches either a forward slash or a space:
strsplit(x, "/|\\s" )
[[1]]
[1] "This" "DT" "is" "VBZ" "a" "DT" "short"
[8] "JJ" "sentence" "NN" "consisting" "VBG" "of" "IN"
[15] "some" "DT" "nouns," "JJ" "verbs," "NNS" "and"
[22] "CC" "adjectives." "VBG"
I didn't read the question closely enough at first. That result can be used to extract the even-numbered elements:
strsplit(x, "/|\\s")[[1]][seq(2, 24, by=2)]
[1] "DT" "VBZ" "DT" "JJ" "NN" "VBG" "IN" "DT" "JJ" "NNS" "CC" "VBG"
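The index seq(2, 24, by=2) is tied to this particular sentence's length. A sketch of a length-independent variant uses a recycled logical index; it assumes every token contains exactly one slash and tokens are separated by single whitespace characters, so that tags always land in the even positions. The variable names are illustrative.

```r
sentence <- "This/DT is/VBZ a/DT short/JJ"
parts <- strsplit(sentence, "/|\\s")[[1]]

# c(FALSE, TRUE) is recycled along the vector: odd positions dropped, even kept
tags <- parts[c(FALSE, TRUE)]
print(tags)
```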
Answer 2 (score: 2)
Here is a one-liner:
R> x <- paste("This/DT is/VBZ a/DT short/JJ sentence/NN consisting/VBG",
+             "of/IN some/DT nouns,/JJ verbs,/NNS and/CC adjectives./VBG")
R> matrix(do.call(c, strsplit(gsub("[a-zA-Z.,]*/", " ", x), " ")),
+ ncol=2, byrow=TRUE)[,2]
[1] "DT" "VBZ" "DT" "JJ" "NN" "VBG" "IN" "DT" "JJ" "NNS" "CC" "VBG"
R>
The key is to get rid of the 'text before the slash':
R> gsub("[a-zA-Z.,]*/", " ", x)
[1] " DT  VBZ  DT  JJ  NN  VBG  IN  DT  JJ  NNS  CC  VBG"
R>
After that it is just a matter of splitting the string
R> strsplit(gsub("[a-zA-Z.,]*/", " ", x), " ")
[[1]]
[1] "" "DT" "" "VBZ" "" "DT" "" "JJ" "" "NN"
[11] "" "VBG" "" "IN" "" "DT" "" "JJ" "" "NNS"
[21] "" "CC" "" "VBG"
and filtering out the "" entries. There is probably a more compact way to do that last bit.
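One possibly more compact way to do that last filtering step, offered as a sketch rather than as part of the original answer, is to drop the empty strings with nzchar() instead of reshaping into a matrix. The variable names are illustrative.

```r
sentence <- "This/DT is/VBZ a/DT short/JJ"

# Replace everything up to each slash with a space, then split on spaces
pieces <- strsplit(gsub("[a-zA-Z.,]*/", " ", sentence), " ")[[1]]

# nzchar() is TRUE for non-empty strings, so this discards the "" entries
tags <- pieces[nzchar(pieces)]
print(tags)
```

Unlike the matrix approach, this does not rely on the empty strings and the tags strictly alternating.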
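Answer 3 (score: 1)
The stringr package has nice functions for working with strings, with very intuitive names. Here you can use str_extract_all to get all the matches (including the leading slash), then str_sub to remove the slash: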
library(stringr)
str_extract_all(x, "/\\w*")
# [[1]]
# [1] "/DT" "/VBZ" "/DT" "/JJ" "/NN" "/VBG" "/IN" "/DT" "/JJ" "/NNS"
# [11] "/CC" "/VBG"
str_sub(str_extract_all(x, "/\\w*")[[1]], start = 2)
# [1] "DT" "VBZ" "DT" "JJ" "NN" "VBG" "IN" "DT" "JJ" "NNS" "CC" "VBG"
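For comparison, the same extract-the-match idea can be sketched in base R with gregexpr() and regmatches(). This is not from the original answer: it swaps in a Perl-style lookbehind, (?<=/), so that the slash is required before the match but never included in it, which removes the need for a separate trimming step. The variable names are illustrative.

```r
sentence <- "This/DT is/VBZ a/DT short/JJ"

# (?<=/) asserts a preceding slash without consuming it; \\w+ is the tag itself
m <- gregexpr("(?<=/)\\w+", sentence, perl = TRUE)
tags <- regmatches(sentence, m)[[1]]
print(tags)
```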