Question

I have a large data frame in R with a column that looks like this, where each sentence is one row:

data <- data.frame(
   datalist = c("anarchism is a wiki/political_philosophy that advocates wiki/self-governance societies based on voluntary institutions",
   "these are often described as wiki/stateless_society although several authors have defined them more specifically as institutions based on non- wiki/hierarchy or wiki/free_association_(communism_and_anarchism)",
   "anarchism holds the wiki/state_(polity) to be undesirable unnecessary and harmful",
   "while wiki/anti-statism is central anarchism specifically entails opposing authority or hierarchical organisation in the conduct of all human relations"),
   stringsAsFactors=FALSE)

I want to extract all the words that come after "wiki/" and put them in another column.

So for the first row the result should be "political_philosophy self-governance". The second row should look like "hierarchy free_association_(communism_and_anarchism)". The third row should be "state_(polity)" and the fourth row should be "anti-statism".

I definitely want to use stringi because the data frame is huge. Thanks in advance for any help.

I have tried

stri_extract_all_fixed(data$datalist, "wiki")[[1]]

but this only extracts the word wiki itself.

Answer 1

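You can do this with a regular expression. By using the stri_match_ functions instead of stri_extract_, we can use parentheses to create capture groups, which lets us pull out only part of each regex match. In the result below, each row of the data frame yields one list item: a character matrix with the whole match in the first column and each capture group in the following columns: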

library(stringi)
match <- stri_match_all_regex(data$datalist, "wiki/([\\w-()]*)")
match

[[1]]
     [,1]                        [,2]                  
[1,] "wiki/political_philosophy" "political_philosophy"
[2,] "wiki/self-governance"      "self-governance"     

[[2]]
     [,1]                                              [,2]                                        
[1,] "wiki/stateless_society"                          "stateless_society"                         
[2,] "wiki/hierarchy"                                  "hierarchy"                                 
[3,] "wiki/free_association_(communism_and_anarchism)" "free_association_(communism_and_anarchism)"

[[3]]
     [,1]                  [,2]            
[1,] "wiki/state_(polity)" "state_(polity)"

[[4]]
     [,1]                [,2]          
[1,] "wiki/anti-statism" "anti-statism"

You can then use an apply function to reshape the data into whatever form you want:

match <- stri_match_all_regex(data$datalist, "wiki/([\\w-()]*)")
sapply(match, function(x) paste(x[,2], collapse = " "))

[1] "political_philosophy self-governance"                                  
[2] "stateless_society hierarchy free_association_(communism_and_anarchism)"
[3] "state_(polity)"                                                        
[4] "anti-statism"
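
As a minimal sketch of the "put them in another column" step, the collapsed strings can be assigned straight back to the data frame. The column name wiki_terms, the simplified pattern wiki/(\\S+), and the shortened sample sentences are assumptions made for illustration. The sketch also covers rows that contain no wiki/ link at all, which the sample data does not have: with omit_no_match = TRUE, stri_match_all_regex should return a zero-row matrix for such rows, so they collapse to an empty string rather than the text "NA".

```r
library(stringi)

data <- data.frame(
  datalist = c(
    "anarchism is a wiki/political_philosophy that advocates wiki/self-governance societies",
    "a sentence without any link"
  ),
  stringsAsFactors = FALSE
)

# omit_no_match = TRUE gives a zero-row matrix instead of a row of NAs
match <- stri_match_all_regex(data$datalist, "wiki/(\\S+)", omit_no_match = TRUE)

# Column 2 holds the capture group; an empty selection collapses to ""
data$wiki_terms <- vapply(
  match,
  function(x) paste(x[, 2], collapse = " "),
  character(1)
)
```

vapply is used here instead of sapply so that the result is always a character vector, even when the data frame has zero rows.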

Answer 2

You can use a lookbehind in the regular expression.

library(dplyr)
library(stringi)

text <- c("anarchism is a wiki/political_philosophy that advocates wiki/self-governance societies based on voluntary institutions",
 "these are often described as wiki/stateless_society although several authors have defined them more specifically as institutions based on non- wiki/hierarchy or wiki/free_association_(communism_and_anarchism)",
 "anarchism holds the wiki/state_(polity) to be undesirable unnecessary and harmful",                                                                                                                               
 "while wiki/anti-statism is central anarchism specifically entails opposing authority or hierarchical organisation in the conduct of all human relations")

df <- data.frame(text, stringsAsFactors = FALSE)

df %>% 
  mutate(words = stri_extract_all(text, regex = "(?<=wiki\\/)\\S+"))
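
The words column produced this way is a list column, with one character vector of matches per row. If a single space-separated string per row is wanted instead, as in the question, the matches can be collapsed afterwards. The following is a sketch in base R plus stringi only, using shortened sample sentences and assuming every row has at least one match:

```r
library(stringi)

text <- c(
  "anarchism is a wiki/political_philosophy that advocates wiki/self-governance societies",
  "anarchism holds the wiki/state_(polity) to be undesirable"
)
df <- data.frame(text, stringsAsFactors = FALSE)

# One character vector of matches per row, returned as a list
hits <- stri_extract_all_regex(df$text, "(?<=wiki/)\\S+")

# Collapse each row's matches into a single space-separated string
df$words <- vapply(hits, paste, character(1), collapse = " ")
```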

Answer 3

You can use

> trimws(gsub("wiki/(\\S+)|(?:(?!wiki/\\S).)+", " \\1", data$datalist, perl=TRUE))
[1] "political_philosophy  self-governance" 
[2] "stateless_society  hierarchy  free_association_(communism_and_anarchism)"
[3] "state_(polity)"                                                           
[4] "anti-statism"

See the online R code demo.

Details

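wiki/(\\S+) - matches wiki/ and captures one or more non-whitespace characters into Group 1
| - or
(?:(?!wiki/\\S).)+ - a tempered greedy token that matches any character other than a line break, one or more times, as long as that character does not start a wiki/ + non-whitespace character sequence.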

If you need to remove the extra whitespace from the result, you can use another gsub call:

> gsub("^\\s+|\\s+$|\\s+(\\s)", "\\1", gsub("wiki/(\\S+)|(?:(?!wiki/\\S).)+", " \\1", data$datalist, perl=TRUE))
[1] "political_philosophy self-governance"                                   
[2] "stateless_society hierarchy free_association_(communism_and_anarchism)"
[3] "state_(polity)"                                                         
[4] "anti-statism"

Extracting multiple substrings that follow certain characters in a string using stringi in R

3 answers: