Tokenizing a string in Clojure

Time: 2014-06-05 12:25:38

Tags: regex clojure tokenize

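I am trying to tokenize a string using Clojure. The basic tokenization rules require the string to be split into separate symbols as follows:

  1. A string literal of the form "hello world" is a single token
  2. Every word that is not part of a string literal is a single token
  3. Every non-word character is a separate token

For example, given the string: length=Keyboard.readInt("HOW MANY NUMBERS? ");

I would like it to be tokenized as:

    ["length" "=" "Keyboard" "." "readInt" "(" "\"HOW MANY NUMBERS? \"" ")" ";"]

I have been able to write a function that splits the string according to rules 2 and 3 above, but I am having trouble with rule 1. That is, the string above is currently split as follows:

    ["let" "length" "=" "Keyboard" "." "readInt" "(" "\"HOW" "MANY" "NUMBERS?" "\"" ")" ";"]

Here is my function: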

    (require '[clojure.string :as string])

    (defn TokenizeJackLine [LineOfJackFile]
      (filter not-empty
        (-> (string/trim LineOfJackFile)
            ; get rid of all comments
            (string/replace #"(//.*)|(\s*/?\*.*?($|\*/))|([^/\*]*\*/)" "")
            ; split on whitespace, or at the zero-width position before or after a symbol
            (string/split #"\s+|(?<=[\{\}\(\)\[\]\.,;+\-\*/&\|<>=~])|(?=[\{\}\(\)\[\]\.,;+\-\*/&\|<>=~])"))))

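How can I write a function that splits a string into tokens according to the three rules above? Alternatively, what other approach should I take to achieve the desired tokenization? Thanks.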

1 Answer:

Answer 0: (score: 1)

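Removing the initial \s+| from your split regex makes it work the way you want. That alternative was what caused the string to be split on whitespace characters, including the whitespace inside the string literal.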

(defn TokenizeJackLine [LineOfJackFile]
  (filter not-empty
    (-> (clojure.string/trim LineOfJackFile)
        ; get rid of all comments
        (clojure.string/replace #"(//.*)|(\s*/?\*.*?($|\*/))|([^/\*]*\*/)" "")
        ; split at the zero-width position before or after a symbol
        (clojure.string/split #"(?<=[\{\}\(\)\[\]\.,;+\-\*/&\|<>=~])|(?=[\{\}\(\)\[\]\.,;+\-\*/&\|<>=~])"))))

(def input "length=Keyboard.readInt(\"HOW MANY NUMBERS? \");")
(TokenizeJackLine input)

Produces this output:

("length" "=" "Keyboard" "." "readInt" "(" "\"HOW MANY NUMBERS? \"" ")" ";")
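
Note that the split-based version above no longer separates words that are divided only by whitespace (for example, "let length" would stay one token), and it would still split on a symbol that appears inside a string literal. If that matters, one possible alternative is to match tokens instead of splitting on delimiters. The following is only a minimal sketch under stated assumptions: the names token-pattern and tokenize-line are hypothetical, comments are assumed to have been stripped already, and string literals are assumed to contain no escaped quotes.

```clojure
;; Hypothetical pattern, not from the original post.
;; Alternatives are tried left to right at each position:
;;   1. a double-quoted string literal (no escaped quotes assumed)
;;   2. a run of word characters
;;   3. any single character that is neither a word character nor whitespace
(def token-pattern
  #"\"[^\"]*\"|\w+|[^\w\s]")

(defn tokenize-line
  "Returns a vector of the tokens found in one line of text.
  Whitespace between tokens matches no alternative, so it is skipped."
  [line]
  (vec (re-seq token-pattern line)))

(tokenize-line "let length=Keyboard.readInt(\"HOW MANY NUMBERS? \");")
;; => ["let" "length" "=" "Keyboard" "." "readInt" "(" "\"HOW MANY NUMBERS? \"" ")" ";"]
```

Because the string-literal alternative comes first, a quoted string is consumed whole before the word and symbol alternatives get a chance to match inside it, which is how this sketch covers rule 1.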