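How to remove '\' from a string in sparklyr

Time: 2018-09-03 12:31:59

Tags: r apache-spark text sparklyr

I am using sparklyr and have a Spark data frame whose column word contains words, some of which include special characters that I want to remove. Prefixing each special character with \\\\ in regexp_replace worked, like this: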

words.sdf <- words.sdf %>% 
  mutate(word = regexp_replace(word, '\\\\(', '')) %>% 
  mutate(word = regexp_replace(word, '\\\\)', '')) %>% 
  mutate(word = regexp_replace(word, '\\\\+', '')) %>% 
  mutate(word = regexp_replace(word, '\\\\?', '')) %>%
  mutate(word = regexp_replace(word, '\\\\:', '')) %>%
  mutate(word = regexp_replace(word, '\\\\;', '')) %>%
  mutate(word = regexp_replace(word, '\\\\!', ''))

Now I want to remove \. I have tried both:

words.sdf <- words.sdf %>% 
  mutate(word = regexp_replace(word, '\\\\\', ''))

and:

words.sdf <- words.sdf %>% 
  mutate(word = regexp_replace(word, '\', ''))

But neither works...

1 answer:

Answer 0: (score: 1)

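You have to account for escaping on both the R side and the Java side, so what you actually need is "\\\\\\\\". R's parser reduces the eight typed backslashes to four, Spark's SQL parser unescapes those to two, and the Java regex engine reads the remaining two as one literal backslash: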

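library(sparklyr)
library(dplyr)

# Assumes a local Spark installation; adjust master for your own cluster.
sc <- spark_connect(master = "local")
df <- copy_to(sc, tibble(word = "(abc\\zyx: 1)"))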

df %>% mutate(regexp_replace(word, "\\\\\\\\", ""))
# Source:   lazy query [?? x 2]
# Database: spark_shell_connection
  word            `regexp_replace(word, "\\\\\\\\\\\\\\\\", "")`
  <chr>           <chr>                                         
1 "(abc\\zyx: 1)" (abczyx: 1)
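
The escaping layers can be counted in plain R without a Spark connection. The sketch below is only an illustration and not part of the original answer: the variable names are made up, and the SQL parser step is simulated with gsub(), on the assumption that Spark unescapes string literals as it does with its default parser settings.

```r
# Layer 1: R's parser. Eight typed backslashes become four characters.
pattern_in_r <- "\\\\\\\\"
stopifnot(nchar(pattern_in_r) == 4)
writeLines(pattern_in_r)  # shows the four backslashes R really holds

# Layer 2 (assumption): Spark's SQL parser unescapes the string literal,
# turning every pair of backslashes into one. Simulated with a fixed
# (non-regex) replacement of two backslashes by one.
pattern_for_java <- gsub("\\\\", "\\", pattern_in_r, fixed = TRUE)
stopifnot(nchar(pattern_for_java) == 2)

# Layer 3: the regex engine reads the two remaining backslashes as
# "one literal backslash". perl = TRUE is used here because PCRE treats
# this pattern the same way Java's regex syntax does.
word <- "(abc\\zyx: 1)"
cleaned <- gsub(pattern_for_java, "", word, perl = TRUE)
print(cleaned)
```

The same counting explains why '\\\\(' worked in the question: four typed backslashes reach the regex engine as one, which escapes the parenthesis.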

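Depending on your exact requirements, it may be easier to match all the characters at once. For example, you can keep only word characters (\w) and whitespace (\s):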

df %>% mutate(regexp_replace(word, "[^\\\\w+\\\\s+]", ""))
# Source:   lazy query [?? x 2]
# Database: spark_shell_connection
  word            `regexp_replace(word, "[^\\\\\\\\w+\\\\\\\\s+]", "")`
  <chr>           <chr>                                                
1 "(abc\\zyx: 1)" abczyx 1     

or only word characters:

df %>% mutate(regexp_replace(word, "[^\\\\w+]", ""))
# Source:   lazy query [?? x 2]
# Database: spark_shell_connection
  word            `regexp_replace(word, "[^\\\\\\\\w+]", "")`
  <chr>           <chr>                                      
1 "(abc\\zyx: 1)" abczyx1
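
If you would rather remove an explicit list of characters than keep a whitelist, a single character class can replace the chain of mutate() calls from the question. The following is a minimal sketch in base R, using perl = TRUE as a stand-in for Java's regex syntax; the sample string and variable names are hypothetical. It also shows that a + inside a character class is a literal plus sign rather than a quantifier, so the [^\w+] patterns above keep any + in the input.

```r
# Sample value containing every character the question wants to drop.
word <- "a(b)c+d?e:f;g!h\\i"

# One character class lists all unwanted characters. The four typed
# backslashes are the regex \\ , which is one literal backslash.
strip_pattern <- "[()+?:;!\\\\]"
cleaned <- gsub(strip_pattern, "", word, perl = TRUE)
print(cleaned)

# Inside a character class "+" is literal:
# "[^\\w+]" keeps plus signs, "[^\\w]" removes them.
keeps_plus <- gsub("[^\\w+]", "", "a+b: 1", perl = TRUE)
drops_plus <- gsub("[^\\w]", "", "a+b: 1", perl = TRUE)
print(c(keeps_plus, drops_plus))
```

Following the doubling rule from the answer, the sparklyr version of this pattern would presumably be "[()+?:;!\\\\\\\\]" inside regexp_replace. That has not been run against Spark here, so check it on a small sample first.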