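I am trying to tokenize a string using Clojure. The basic tokenization rules require the string to be split into separate tokens as follows:
For example, given the string: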
length=Keyboard.readInt("HOW MANY NUMBERS? ");
I would like it to be tokenized as:
["length" "=" "Keyboard" "." "readInt" "(" "\"HOW MANY NUMBERS? \"" ")" ";"]
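I have been able to write a function that splits the string according to rules 2 and 3 above, but I am having trouble with the first rule. That is, the string above is currently split as follows: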
["let" "length" "=" "Keyboard" "." "readInt" "(" "\"HOW" "MANY" "NUMBERS?" "\"" ")" ";"]
Here is my function:
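; the string/ alias used below assumes clojure.string is required like this
(require '[clojure.string :as string])

(defn TokenizeJackLine [LineOfJackFile]
  (filter not-empty
          (-> (string/trim LineOfJackFile)
              ; get rid of all comments
              (string/replace #"(//.*)|(\s*/?\*.*?($|\*/))|([^/\*]*\*/)" "")
              ; split on whitespace, and before/after each symbol using
              ; zero-width look-behind and look-ahead
              (string/split #"\s+|(?<=[\{\}\(\)\[\]\.,;+\-\*/&\|<>=~])|(?=[\{\}\(\)\[\]\.,;+\-\*/&\|<>=~])"))))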
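How can I write a function that splits a string into tokens according to the three rules above? Alternatively, what other approach should I take to achieve the desired tokenization? Thanks.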
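Answer 0 (score: 1)
Remove the leading \s+| alternative from your split regex and it works the way you want. That alternative is what causes the string to be split on whitespace characters.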
; make sure clojure.string is loaded before using its fully qualified names
(require 'clojure.string)

(defn TokenizeJackLine [LineOfJackFile]
  (filter not-empty
          (-> (clojure.string/trim LineOfJackFile)
              ; get rid of all comments
              (clojure.string/replace #"(//.*)|(\s*/?\*.*?($|\*/))|([^/\*]*\*/)" "")
              ; split before/after each symbol using zero-width
              ; look-behind and look-ahead (no whitespace splitting)
              (clojure.string/split #"(?<=[\{\}\(\)\[\]\.,;+\-\*/&\|<>=~])|(?=[\{\}\(\)\[\]\.,;+\-\*/&\|<>=~])"))))
(def input "length=Keyboard.readInt(\"HOW MANY NUMBERS? \");")
(TokenizeJackLine input)
This produces the following output:
("length" "=" "Keyboard" "." "readInt" "(" "\"HOW MANY NUMBERS? \"" ")" ";")
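One caveat with the split-based fix: once \s+ is gone, whitespace outside a string literal is no longer a delimiter either, so an input such as let length=... would keep "let length" together as one token. A symbol inside a quoted string (for example a period) would also still be split. As a possible alternative, here is a minimal sketch that matches tokens with re-seq instead of splitting around them. The names token-pattern and tokenize-line are hypothetical, and it assumes comments have already been stripped and that string literals contain no escaped quotes.

```clojure
(def token-pattern
  ;; Alternatives are tried left to right at each position:
  ;;   1. a double-quoted string literal (escaped quotes are NOT handled)
  ;;   2. a single symbol character (same class as in the split regex)
  ;;   3. a run of characters that are not whitespace, symbols, or quotes
  #"\"[^\"]*\"|[\{\}\(\)\[\]\.,;+\-\*/&\|<>=~]|[^\s\{\}\(\)\[\]\.,;+\-\*/&\|<>=~\"]+")

(defn tokenize-line
  "Hypothetical helper: returns a vector of tokens for one line.
  Assumes comments were removed beforehand."
  [line]
  ;; the pattern has no capturing groups, so re-seq yields plain strings
  (vec (re-seq token-pattern line)))

(tokenize-line "let length=Keyboard.readInt(\"HOW MANY NUMBERS? \");")
;; => ["let" "length" "=" "Keyboard" "." "readInt" "(" "\"HOW MANY NUMBERS? \"" ")" ";"]
```

Because whitespace never matches any alternative, it is skipped between tokens, while the string-literal alternative is tried first and so consumes the whole quoted text, spaces and symbols included.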