Tokenizing a quoted string

Date: 2011-08-01 16:37:54

Tags: erlang tokenize

I am trying to tokenize a string. As long as it contains no quoted substrings, everything works fine:

string:tokens ("abc def ghi", " ").
["abc","def","ghi"]

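But string:tokens/2 is not much help with quoted strings. It does what it is supposed to do, which is to split on every space, including the ones inside the quotes: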

string:tokens ("abc \"def xyz\" ghi", " ").
["abc","\"def","xyz\"","ghi"]

What I need is a function that takes the string to tokenize, the separator, and the quote character. Something like:

tokens ("abc \"def xyz\" ghi", " ", "\"").
["abc","def xyz","ghi"]

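Before I start reinventing the wheel, my question is:

Is there such a function, or something similar, in the standard library?

Edit

OK, I wrote my own implementation, but I am still very interested in an answer to the original question. Here is my code so far: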

%% Split on spaces, keeping double-quoted substrings together.
tokens (String) -> tokens (String, [], [] ).

%% End of input: flush the buffer (unless it is empty, e.g. after a trailing
%% space or a closing quote), then strip the surrounding quotes.
tokens ( [], Tokens, Buffer) ->
    Pending = [Buffer || Buffer =/= [] ],
    lists:map (fun (Token) -> string:strip (Token, both, $") end, Tokens ++ Pending);

%% A buffer that starts with $" means we are inside a quoted token.
tokens ( [Character | String], Tokens, Buffer) ->
    case {Character, Buffer} of
        {$ , [] } -> tokens (String, Tokens, Buffer);
        {$ , [$" | _] } -> tokens (String, Tokens, Buffer ++ [Character] );
        {$ , _} -> tokens (String, Tokens ++ [Buffer], [] );
        {$", [] } -> tokens (String, Tokens, "\"" );
        {$", [$" | _] } -> tokens (String, Tokens ++ [Buffer ++ "\""], [] );
        {$", _} -> tokens (String, Tokens ++ [Buffer], "\"");
        _ -> tokens (String, Tokens, Buffer ++ [Character] )
    end.

4 answers:

Answer 0 (score: 5)

If regular expressions are acceptable in the general case, you could use:

> re:split("abc \"def xyz\" ghi", " \"|\" ", [{return, list}]).
["abc","def xyz","ghi"]

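If you want to split on any whitespace rather than just spaces, you can use "\\s\"|\"\\s" instead (the backslash must be doubled, because in an Erlang string literal "\s" is simply a space).

If you are parsing this from an input file, you may want to use strip_split/2 from estring.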
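Note that the split pattern above only cuts where a quote sits next to a space, so plain unquoted words separated by spaces stay glued together. A minimal sketch of a regex variant that matches the tokens themselves instead of the separators is shown below. The module name re_tokens is made up for illustration, string:trim/3 assumes OTP 20 or later (on older releases string:strip (Token, both, $") from the question does the same job), and an unclosed quote is silently dropped rather than reported.

```erlang
-module(re_tokens).
-export([tokens/1]).

%% Hypothetical helper, not part of the standard library.
%% Each match is either a double-quoted run or a run of characters
%% that are neither spaces nor double quotes.
tokens(String) ->
    Pattern = "\"[^\"]*\"|[^ \"]+",
    case re:run(String, Pattern, [global, {capture, first, list}]) of
        {match, Matches} ->
            %% Every element of Matches is a one-element list holding the match;
            %% remove the surrounding quotes from quoted tokens.
            [string:trim(Token, both, "\"") || [Token] <- Matches];
        nomatch ->
            []
    end.
```

With this, re_tokens:tokens("abc \"def xyz\" ghi") should return ["abc","def xyz","ghi"].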

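Answer 1 (score: 2)

string:tokens ("abc \"def ghi\" foo.bla", " .\""). tokenizes the string on spaces, dots, and double quotes. Result: ["abc", "def", "ghi", "foo", "bla"]. If you want to keep the quoted parts intact, you may want to consider writing a tokenizer/lexer, because regular expressions are not very good at this job.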
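As a rough idea of what such a hand-written tokenizer could look like, here is a small sketch with the tokens/3 shape asked for in the question. The module name quoted_tokens and all helper names are invented for this example. It also makes some arbitrary choices: a quote does not end a token by itself (so abc"d e" becomes one token), empty tokens are dropped, and an unclosed quote simply runs to the end of the input.

```erlang
-module(quoted_tokens).
-export([tokens/3]).

%% Hypothetical tokens(String, Separators, Quotes) -> [Token].
%% Separators and Quotes are lists of characters, as in string:tokens/2.
tokens(String, Separators, Quotes) ->
    outside(String, Separators, Quotes, [], []).

%% Outside quotes. Acc is the current token (reversed),
%% Tokens the finished tokens (also reversed).
outside([], _Seps, _Quotes, Acc, Tokens) ->
    lists:reverse(flush(Acc, Tokens));
outside([C | Rest], Seps, Quotes, Acc, Tokens) ->
    case {lists:member(C, Quotes), lists:member(C, Seps)} of
        {true, _} ->
            %% Opening quote: remember which character has to close it.
            inside(Rest, C, Seps, Quotes, Acc, Tokens);
        {false, true} ->
            outside(Rest, Seps, Quotes, [], flush(Acc, Tokens));
        {false, false} ->
            outside(Rest, Seps, Quotes, [C | Acc], Tokens)
    end.

%% Inside quotes: separators are ordinary characters here.
inside([], _Quote, _Seps, _Quotes, Acc, Tokens) ->
    %% Unclosed quote: keep what was collected (an arbitrary choice).
    lists:reverse(flush(Acc, Tokens));
inside([Quote | Rest], Quote, Seps, Quotes, Acc, Tokens) ->
    outside(Rest, Seps, Quotes, Acc, Tokens);
inside([C | Rest], Quote, Seps, Quotes, Acc, Tokens) ->
    inside(Rest, Quote, Seps, Quotes, [C | Acc], Tokens).

%% Push the current token onto the result unless it is empty.
flush([], Tokens) -> Tokens;
flush(Acc, Tokens) -> [lists:reverse(Acc) | Tokens].
```

For example, quoted_tokens:tokens("abc \"def xyz\" ghi", " ", "\"") should give ["abc","def xyz","ghi"].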

Answer 2 (score: 1)

You can use the re module. It comes with a split/3 function. For example:

re:split("abc \"def xyz \"ghi", "[(\s\")\s\"]", [{return, list}]).
["abc",[],"def","xyz",[],"ghi"]

The second argument is the regular expression (you may need to adjust my example to get rid of the empty lists...)
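One simple way to get rid of the empty lists is to filter the result with a list comprehension. The sketch below wraps that in a helper. The module and function names are made up for illustration, and it still splits inside the quotes just like the example above.

```erlang
-module(split_nonempty).
-export([split/2]).

%% Hypothetical wrapper around re:split/3 that drops empty pieces.
split(Subject, Pattern) ->
    Pieces = re:split(Subject, Pattern, [{return, list}]),
    [Piece || Piece <- Pieces, Piece =/= []].
```

For example, split_nonempty:split("abc \"def xyz \"ghi", "[ \"]") should return ["abc","def","xyz","ghi"].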

Answer 3 (score: 1)

This is roughly how I would write it (untested!):

tokens(String) -> lists:reverse(tokens(String, outside_quotes, [])).

tokens([], outside_quotes, Tokens) ->
  Tokens;
tokens(String, outside_quotes, Tokens0) ->
  {Token, Rest0} = lists:splitwith(fun(C) -> (C =/= $\s) and (C =/= $") end, String),
  %% skip the empty token produced by adjacent spaces or by a quote after a space
  Tokens = case Token of [] -> Tokens0; _ -> [Token | Tokens0] end,
  case Rest0 of
    [] -> Tokens;
    [$\s | Rest] -> tokens(Rest, outside_quotes, Tokens);
    [$" | Rest] -> tokens(Rest, inside_quotes, Tokens)
  end;
tokens(String, inside_quotes, Tokens) ->
  %% badmatch exception on an unclosed quote
  {Token, [$" | Rest]} = lists:splitwith(fun(C) -> C =/= $" end, String),
  tokens(Rest, outside_quotes, [Token | Tokens]).