How do I split on whitespace in OCaml?

Date: 2016-10-02 03:29:38

Tags: regex string ocaml

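Whitespace is a space, tab, or newline character (that is, a carriage return or line feed).

I assumed that \s covers \t, \n, \r and \f.

But when I try to use \s, it doesn't split the string properly: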

# let line1 = "We the People of the United States, in Order to form a more perfect";;

# let wsp_regex = Str.regexp "\\s+";;
# let words = Str.split wsp_regex line1;;
val words : string list = 
["We the People of the United State"; ", in Order to form a more perfect"]

# let wsp_regex = Str.regexp "[ \\s]+";;
# let words = Str.split wsp_regex line1;;
val words : string list = 
["We"; "the"; "People"; "of"; "the"; "United"; "State"; ","; "in"; "Order"; "to"; "form"; "a"; "more"; "perfect"]

# let wsp_regex = Str.regexp "[\\s]+";;
# let words = Str.split wsp_regex line1;;
val words : string list = 
["We the People of the United State"; ", in Order to form a more perfect"]

# let wsp_regex = Str.regexp "[ \\s\\t\\n\\r]+";;
# let words = Str.split wsp_regex line1;;
val words : string list =                                                         
["We"; "he"; "People"; "of"; "he"; "U"; "i"; "ed"; "S"; "a"; "e"; ","; "i"; "O"; "de"; "o"; "fo"; "m"; "a"; "mo"; "e"; "pe"; "fec"]

# let wsp_regex = Str.regexp "[\s]+";;
Characters 29-31:                                                               
Warning 14: illegal backslash escape in string.                                 
val wsp_regex : Str.regexp = <abstr>   

# let words = Str.split wsp_regex line1;;
val words : string list =                                                         
["We the People of the United State"; ", in Order to form a more perfect"]

# let wsp_regex = Str.regexp "[ \s]+";;
Characters 30-32:                                                               
Warning 14: illegal backslash escape in string.                                 
val wsp_regex : Str.regexp = <abstr>
# let words = Str.split wsp_regex line1;;
val words : string list =                                                         
["We"; "the"; "People"; "of"; "the"; "United"; "State"; ","; "in"; "Order"; "to"; "form"; "a"; "more"; "perfect"]

# let wsp_regex = Str.regexp "[ \t\n\r\f]+";;
Characters 36-38:                                                               
Warning 14: illegal backslash escape in string.                                 
val wsp_regex : Str.regexp = <abstr>  
# let words = Str.split wsp_regex line1;;
val words : string list =                                                         
["We"; "the"; "People"; "o"; "the"; "United"; "States,"; "in"; "Order"; "to"; "orm"; "a"; "more"; "per"; "ect"] 

# let wsp_regex = Str.regexp "[\t\n\r\f]+";;
Characters 35-37:                                                               
Warning 14: illegal backslash escape in string.                                 
val wsp_regex : Str.regexp = <abstr>
# let words = Str.split wsp_regex line1;;
val words : string list =                                                         
["We the People o"; " the United States, in Order to "; "orm a more per"; "ect"]

It seems the only cases that work are:

# let wsp_regex = Str.regexp "[ ]+";;
# let words = Str.split wsp_regex line1;;
val words : string list =                                                         
["We"; "the"; "People"; "of"; "the"; "United"; "States,"; "in"; "Order"; "to"; "form"; "a"; "more"; "perfect"]

# let wsp_regex = Str.regexp "[ \t\n\r]+";;
# let words = Str.split wsp_regex line1;;
val words : string list =                                                         
["We"; "the"; "People"; "of"; "the"; "United"; "States,"; "in"; "Order"; "to"; "form"; "a"; "more"; "perfect"]

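I'm not sure why the second case works, since [ \s]+ does not work properly (OCaml thinks I want to split on s).

All I want is to split on whitespace in general rather than on the space character alone, since I also want to catch \t, \n, \r and \f.

But I can't seem to figure out how to write a regular expression in OCaml that splits on whitespace.

If someone could give me a working expression, I would really appreciate it!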

1 answer:

Answer 0 (score: 7)

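If you read the documentation of the Str module, you will find that \s is not supported. Your first expression therefore splits on runs of the character s, and that is exactly what you see.

Every other attempt that uses \s fails for the same reason. Inside brackets, the backslash and the s are simply two members of the character set, so [ \\s]+ splits on spaces, backslashes, and the letter s.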
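
A minimal sketch of that behaviour, assuming the str library is linked into the program; the function name split_on_backslash_s and the sample text are made up for illustration:

```ocaml
(* Assumes the str library is linked,
   e.g. ocamlfind ocamlopt -package str -linkpkg example.ml *)

(* "\\s+" is the regex text \s+ . Str has no \s class, so the backslash
   just quotes the letter s and the pattern means "one or more s". *)
let split_on_backslash_s (text : string) : string list =
  Str.split (Str.regexp "\\s+") text

let () =
  (* The space survives; the runs of s are what get removed. *)
  List.iter (fun w -> Printf.printf "%S\n" w)
    (split_on_backslash_s "glass house")
```

The words come back cut at every s rather than at the space, which is the same effect as the "United State" / ", in Order" split in the question.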

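Somewhat surprisingly, even \n (the two-character notation) is not supported by Str as a regular expression. To match a newline, the pattern must contain an actual newline character. In other words, the OCaml string should be "\n", not "\\n". The same holds for \r and \t. That is also why your "[ \t\n\r]+" pattern works: the OCaml string syntax turns those escapes into the real characters before Str ever sees them.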
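
The difference between the two spellings can be checked without Str at all, since it is decided by the OCaml string syntax. A small sketch (the value names are illustrative only):

```ocaml
(* "\n" is turned into a single newline character by the OCaml lexer. *)
let real_newline = "\n"

(* "\\n" is two characters: a backslash followed by the letter n. *)
let backslash_then_n = "\\n"

let () =
  (* Prints: 1 2 *)
  Printf.printf "%d %d\n"
    (String.length real_newline)
    (String.length backslash_then_n)
```

Only the first string puts a newline character into the pattern handed to Str.regexp.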

OCaml's string syntax does not accept the escape \f. To match a form feed, use its hexadecimal escape, \x0c.
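
As a quick sanity check, here is a sketch that confirms \x0c is character code 12 (form feed); the decimal escape \012 is shown only as an equivalent spelling, and the value names are illustrative:

```ocaml
(* Hexadecimal escape: 0x0c = 12, the ASCII form feed. *)
let form_feed_hex = "\x0c"

(* OCaml's \ddd escape is decimal, so "\012" is the same character. *)
let form_feed_dec = "\012"

let () =
  (* Prints: true 12 *)
  Printf.printf "%b %d\n"
    (form_feed_hex = form_feed_dec)
    (Char.code form_feed_hex.[0])
```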

Putting this all together, your pattern should be: "[ \n\r\x0c\t]+"

# Str.split (Str.regexp "[ \n\r\x0c\t]+") line1;;
- : string list =
["We"; "the"; "People"; "of"; "the"; "United"; "States,"; "in";
 "Order"; "to"; "form"; "a"; "more"; "perfect"]

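The same pattern can be wrapped in a small helper. This is only a sketch, assuming the str library is linked; the names whitespace_re and split_whitespace are hypothetical, not part of Str:

```ocaml
(* Assumes the str library is linked,
   e.g. ocamlfind ocamlopt -package str -linkpkg example.ml *)

(* Compile the pattern once: space, newline, carriage return,
   form feed (\x0c) and tab, repeated one or more times. *)
let whitespace_re : Str.regexp = Str.regexp "[ \n\r\x0c\t]+"

(* Split text on runs of whitespace. Str.split ignores delimiters at the
   start and end of the string, so no empty words are produced there. *)
let split_whitespace (text : string) : string list =
  Str.split whitespace_re text

let () =
  split_whitespace "  We the\tPeople\nof the\r\nUnited States,\x0c"
  |> List.iter print_endline
```

Building the regexp once at module level avoids recompiling it on every call, which matters if the helper is applied to many lines.
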
There is also a Perl-compatible regular expression package that you may find more comfortable to use: https://opam.ocaml.org/packages/pcre/pcre.7.1.5/