Question

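I am trying to tokenize a data frame that contains strings. Some of the strings contain hyphens, and I want to split on those hyphens using unnest_tokens().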

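I tried upgrading tidytext from 0.1.9 to 0.2.0. I also tried several variations of the regular expression to capture the hyphen among the following delimiter characters:

library(dplyr)
library(tidytext)

df <- data.frame(words = c("Solutions for the public sector | IT for business", "Transform the IT experience - IT Transformation - ITSM"))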

df %>%
  unnest_tokens(query, words,
                token = "regex",
                pattern = "(?:\\||\\:|[-]|,)")

I expected to see:

query
solutions for the public sector
it for business
transform the it experience
it transformation
itsm

Instead, the row containing hyphens is not tokenized at all:

query
solutions for the public sector
it for business

Answer 1

You can use

library(stringr)
df %>%  
  unnest_tokens(query, words, token = stringr::str_split, pattern = "[-:,|]")

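This command uses stringr::str_split to split on the [-:,|] pattern, which matches a single -, :, , or | character. Note that none of these need to be escaped inside a character class (bracket expression): a hyphen needs no escaping when it is the first or last character in the class, and the other characters are not special inside a character class either.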
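
To see what the character class does on its own, here is a minimal sketch in base R, outside of unnest_tokens(). It makes two assumptions that are not part of the answer above: the pattern is widened with [[:space:]]* so that whitespace around each delimiter is absorbed, and the result is lowercased by hand to mimic the default behaviour of unnest_tokens(). The variable names are illustrative only.

```r
# Sample input copied from the question
words <- c(
  "Solutions for the public sector | IT for business",
  "Transform the IT experience - IT Transformation - ITSM"
)

# Split on any one of - : , | and swallow surrounding whitespace.
# The hyphen comes first inside the brackets, so it is a literal hyphen.
# (Assumption: the [[:space:]]* padding is an addition, not from the answer.)
pieces <- unlist(strsplit(words, "[[:space:]]*[-:,|][[:space:]]*"))

# Lowercase and trim, mirroring the output shape expected in the question
query <- tolower(trimws(pieces))
print(query)
```

Be aware that this pattern also splits hyphenated words such as "follow-up"; if that matters, requiring whitespace around the hyphen is probably the safer choice.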
