Tokenize a string whose words contain non-word characters

Time: 2013-04-28 21:52:21

Tags: javascript regex tokenize

I want to tokenize Twitter messages so that hashtags and cashtags are kept as tokens. A correct tokenization would look like this:

"Bought $AAPL today,because of the new #iphone".match(...);
>>>> ['Bought', '$AAPL', 'today', 'because', 'of', 'the', 'new', '#iphone']

I have tried several regular expressions for this task, namely:

"Bought $AAPL today,because of the new #iphone".match(/\b([\w]+?)\b/g);
>>>> ['Bought', 'AAPL', 'today', 'because', 'of', 'the', 'new', 'iphone']

"Bought $AAPL today,because of the new #iphone".match(/\b([\$#\w]+?)\b/g);
>>>> ['Bought', 'AAPL', 'today', 'because', 'of', 'the', 'new', 'iphone']

"Bought $AAPL today,because of the new #iphone".match(/[\b^#\$]([\w]+?)\b/g);
>>>> ['$AAPL', '#iphone']

Which regular expression can I use to include the leading hash or dollar sign in the token?

1 answer:

Answer 0: (score: 2)

The obvious one:

"Bought $AAPL today,because of the new #iphone".match(/[$#]*\w+/g)
// ["Bought", "$AAPL", "today", "because", "of", "the", "new", "#iphone"]

PS: [$#]* could probably be replaced with [$#]?, depending on the exact requirements.
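
For what it's worth, the attempts in the question fail because \b only matches between a word character and a non-word character, and both $ and # are non-word characters, so there is no boundary between the space and the prefix. Dropping \b and matching an optional prefix followed by \w+ avoids that. Below is a minimal sketch comparing the two variants; the function names tokenize and tokenizeSinglePrefix are made up for illustration, and only the regular expressions come from the answer above.

```javascript
// Hypothetical helper names; only the two regexes come from the answer.

// Variant 1: any number of leading $ or # characters, then word characters.
function tokenize(text) {
  // String.prototype.match returns null when nothing matches,
  // so fall back to an empty array.
  return text.match(/[$#]*\w+/g) || [];
}

// Variant 2: at most one leading $ or # character.
function tokenizeSinglePrefix(text) {
  return text.match(/[$#]?\w+/g) || [];
}

const tweet = "Bought $AAPL today,because of the new #iphone";

// Both variants give the same result on the example from the question:
// ["Bought", "$AAPL", "today", "because", "of", "the", "new", "#iphone"]
console.log(tokenize(tweet));
console.log(tokenizeSinglePrefix(tweet));

// They differ only when a prefix character is repeated.
console.log(tokenize("##wow $$TSLA"));             // ["##wow", "$$TSLA"]
console.log(tokenizeSinglePrefix("##wow $$TSLA")); // ["#wow", "$TSLA"]
```

So the choice between * and ? only matters if inputs such as "##wow" can occur and you care whether the extra prefix characters stay in the token.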