Extract elements between a character and a space

Time: 2012-03-31 19:51:55

Tags: r

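I'm having trouble extracting the elements between a / and a blank space. I can do this when I have two delimiter characters, for instance < and >, but the space is throwing me off. I'd like the most efficient way to do this in base R, since it will be lapplied to thousands of vectors.

I'd like to turn this: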

x <- "This/DT is/VBZ a/DT short/JJ sentence/NN consisting/VBG of/IN some/DT nouns,/JJ verbs,/NNS and/CC adjectives./VBG"

into this:

 [1] "DT"  "VBZ" "DT"  "JJ"  "NN"  "VBG" "IN"  "DT"  "JJ"  "NNS" "CC"  "VBG"

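EDIT

Thank you all for your answers. I'm going for speed, so Andres's code wins out. DWin's code wins for the shortest amount of code. Dirk, yours was the second fastest. The stringr solution was the slowest (as I figured it would be) and isn't base R, but it is very easy to understand (which really is the intent of the stringr package, I think, as this seems to be Hadley's philosophy with most things).

I appreciate your assistance. Thanks again.

I thought I'd include the benchmarking, since this will be lapplied over several thousand vectors: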

    test replications elapsed relative user.self sys.self
1 ANDRES        10000    1.06 1.000000      1.05        0
3   DIRK        10000    1.29 1.216981      1.20        0
2   DWIN        10000    1.56 1.471698      1.43        0
4 FLODEL        10000    8.46 7.981132      7.70        0
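
The script that produced the table above is not shown. As a rough sketch only, a comparison of two of the approaches could be timed in base R with system.time. The function names andres and dwin are labels chosen here for illustration, the shortened x and the replication count n are assumptions, and dwin uses logical recycling instead of the hard-coded index from that answer:

```r
# Shortened sample input (assumption: every token has the form word/TAG).
x <- "This/DT is/VBZ a/DT short/JJ sentence/NN"

# Illustrative wrappers around two of the approaches from the answers.
andres <- function(x) sub("^.*/([^ ]+).*$", "\\1", unlist(strsplit(x, " ")))
dwin   <- function(x) strsplit(x, "/|\\s")[[1]][c(FALSE, TRUE)]

n <- 1000  # number of replications (assumed; the table above used 10000)

# Elapsed wall-clock seconds for n calls of each function.
timings <- c(
  ANDRES = system.time(for (i in seq_len(n)) andres(x))[["elapsed"]],
  DWIN   = system.time(for (i in seq_len(n)) dwin(x))[["elapsed"]]
)
print(timings)
```

Absolute numbers depend on the machine and R version, so only the relative ordering is worth comparing.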

4 Answers:

Answer 0: (score: 5)

Something similar but more succinct:

#1- Separate the elements by the blank space

    y=unlist(strsplit(x,' '))

#2- extract just what you want from each element:

    sub('^.*/([^ ]+).*$','\\1',y)

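The anchors ^ and $ mark the start and end of the string, and .* matches any run of characters. [^ ]+ matches one or more non-space characters, and \\1 refers to the first captured (parenthesized) group.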
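
Since the question mentions applying this over many vectors, the two steps could be wrapped in a small helper and passed to lapply. This is a minimal sketch: the name extract_tags and the sample list sentences are hypothetical, not part of the original answer.

```r
# Hypothetical helper combining the split step and the sub() step.
# Tokens that contain no "/" are returned unchanged by sub().
extract_tags <- function(x) {
  tokens <- unlist(strsplit(x, " ", fixed = TRUE))
  sub("^.*/([^ ]+).*$", "\\1", tokens)
}

# Illustrative input: a list of tagged sentences.
sentences <- list(
  "This/DT is/VBZ a/DT short/JJ sentence/NN",
  "of/IN some/DT nouns,/JJ verbs,/NNS"
)

tags <- lapply(sentences, extract_tags)
print(tags)
```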

Answer 1: (score: 3)

Use a regex pattern that matches either a forward slash or whitespace:

strsplit(x, "/|\\s" )
[[1]]
 [1] "This"        "DT"          "is"          "VBZ"         "a"           "DT"          "short"      
 [8] "JJ"          "sentence"    "NN"          "consisting"  "VBG"         "of"          "IN"         
[15] "some"        "DT"          "nouns,"      "JJ"          "verbs,"      "NNS"         "and"        
[22] "CC"          "adjectives." "VBG"   

I didn't read the question closely enough. You can use that result to extract the even-numbered elements:

strsplit(x, "/|\\s")[[1]][seq(2, 24, by=2)]
 [1] "DT"  "VBZ" "DT"  "JJ"  "NN"  "VBG" "IN"  "DT"  "JJ"  "NNS" "CC"  "VBG"
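
The index seq(2, 24, by=2) is tied to this particular sentence length. One possible way to avoid the hard-coded 24, assuming every token has exactly one slash, is to let a logical vector recycle over the split result. The shortened x below is only for illustration:

```r
# Shortened sample input (assumption: each token is word/TAG).
x <- "This/DT is/VBZ a/DT short/JJ"

parts <- strsplit(x, "/|\\s")[[1]]

# c(FALSE, TRUE) is recycled along parts, keeping elements 2, 4, 6, ...
tags <- parts[c(FALSE, TRUE)]
print(tags)
```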

Answer 2: (score: 2)

Here is a one-liner:

R> x <- paste("This/DT is/VBZ a/DT short/JJ sentence/NN consisting/VBG",
+             "of/IN some/DT nouns,/JJ verbs,/NNS and/CC adjectives./VBG")
R> matrix(do.call(c, strsplit(gsub("[a-zA-Z.,]*/", " ", x), " ")), 
+         ncol=2, byrow=TRUE)[,2]
 [1] "DT"  "VBZ" "DT"  "JJ"  "NN"  "VBG" "IN"  "DT"  "JJ"  "NNS" "CC"  "VBG"
R> 

The key is to get rid of the 'text before the slash':

R> gsub("[a-zA-Z.,]*/", " ", x)
[1] " DT  VBZ  DT  JJ  NN  VBG  IN  DT  JJ  NNS  CC  VBG"
R> 

After that, it is just a matter of splitting the string

R> strsplit(gsub("[a-zA-Z.,]*/", " ", x), " ")
[[1]]
 [1]  ""    "DT"  ""    "VBZ" ""    "DT"  ""    "JJ"  ""    "NN"
 [11] ""    "VBG" ""    "IN"  ""    "DT"  ""    "JJ"  ""    "NNS" 
 [21] ""    "CC"  ""    "VBG"

and filtering out the "". There is probably a more compact way to do that last bit.
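
One possibly more compact way to do the filtering, offered as a sketch and not as the answer's own method, is to drop the empty strings with nzchar instead of reshaping into a matrix. The shortened x is an assumption for illustration:

```r
# Shortened sample input (assumption: each token is word/TAG).
x <- "This/DT is/VBZ a/DT short/JJ"

# Same gsub/strsplit steps as above; the result alternates "" and tags.
parts <- strsplit(gsub("[a-zA-Z.,]*/", " ", x), " ")[[1]]

# nzchar() is TRUE for non-empty strings, so this keeps only the tags.
tags <- parts[nzchar(parts)]
print(tags)
```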

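Answer 3: (score: 1)

The stringr package has nice functions for working with strings, with very intuitive names. Here you can use str_extract_all to get all matches (including the leading slash), then str_sub to remove the slashes:

library(stringr)

str_extract_all(x, "/\\w*")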
# [[1]]
#  [1] "/DT"  "/VBZ" "/DT"  "/JJ"  "/NN"  "/VBG" "/IN"  "/DT"  "/JJ"  "/NNS"
# [11] "/CC"  "/VBG"

str_sub(str_extract_all(x, "/\\w*")[[1]], start = 2)
#  [1] "DT"  "VBZ" "DT"  "JJ"  "NN"  "VBG" "IN"  "DT"  "JJ"  "NNS" "CC"  "VBG"
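
For comparison, a similar extract-all idea can be sketched in base R with gregexpr and regmatches. This is not part of the original answer; it assumes tags consist only of word characters and uses a Perl-style lookbehind so that the slash is never included in the match:

```r
# Shortened sample input (assumption: tags contain only word characters).
x <- "This/DT is/VBZ nouns,/JJ adjectives./VBG"

# (?<=/) requires a preceding "/" without consuming it; needs perl = TRUE.
matches <- gregexpr("(?<=/)\\w+", x, perl = TRUE)
tags <- regmatches(x, matches)[[1]]
print(tags)
```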