Question

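I have a 2-column data frame in R. In column 1, I need to find every string that sits between certain marker characters in each row. I then need to put those strings into a new table or data frame, each one paired with the column 2 value from the same row of the original data frame. Here is an example: I need to take all the words between xx...xx in column 1 of the data frame foo.df and put them in a new table whose second column shows the user from the corresponding row of foo.df: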

The data frame can be built like this:

text <- c('hello xxthisxx is a xxtestxx of','we xxarexx very happy','you will xxwantxx to help') 
user <- c('person1','person2','person3') 
foo.df <- data.frame(text,user)

I then want to copy out the words between the xx markers, so that the final result looks like this:

 text      user
 this   person1
 test   person1
 are    person2
 want   person3

Nothing I have tried seems to work. Thanks.

Answer 1

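Here is an idea that uses cSplit from the splitstackshape package to split the text on spaces and convert the data to long format (one word per row). After that, we keep only the entries of the form xx...xx and finally strip the leading and trailing xx, i.e.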

library(splitstackshape)

cSplit(foo.df, 'text', ' ', 'long')[grepl('xx.*xx', text),][,text := gsub('xx(.*)xx', '\\1', text)][]
#   text    user
#1: this person1
#2: test person1
#3:  are person2
#4: want person3

Answer 2

A tidyverse approach, using lookahead and lookbehind regular expressions
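A minimal sketch of what such a tidyverse approach might look like, assuming the dplyr, stringr and tidyr packages are installed. The pipeline and the name `res` are illustrative choices and not necessarily what this answer originally used; the pattern is the same lookbehind/lookahead idea, matching a run of non-space characters that is preceded and followed by xx.

```r
library(dplyr)
library(stringr)
library(tidyr)

# Rebuild the example data from the question (strings kept as character)
text <- c('hello xxthisxx is a xxtestxx of', 'we xxarexx very happy', 'you will xxwantxx to help')
user <- c('person1', 'person2', 'person3')
foo.df <- data.frame(text, user, stringsAsFactors = FALSE)

res <- foo.df %>%
  # (?<=xx) is a lookbehind and (?=xx) a lookahead, so the xx markers
  # themselves are not part of the match; each row becomes a list of matches
  mutate(text = str_extract_all(text, "(?<=xx)[^ ]+(?=xx)")) %>%
  # One output row per match; the user value is repeated as needed
  unnest(text)

res
```

Rows with no match at all are dropped by unnest() by default, which fits the desired output here.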


Answer 3

Here is a base R option using regmatches/gregexpr

out <- stack(setNames(regmatches(foo.df$text, 
   gregexpr("(?<=xx)[^ ]+(?=xx)", foo.df$text, perl = TRUE)), foo.df$user))
names(out) <- names(foo.df)
out    
#  text    user
#1 this person1
#2 test person1
#3  are person2
#4 want person3

Copy strings matching a pattern from rows in a data frame and put them in a new column in a new data frame

3 answers: