Regex to keep words joined by hyphens and underscores intact while preserving punctuation

Date: 2016-06-28 11:40:42

Tags: php regex

I have been reading, searching, and trying different ways of writing the regex, such as \p{L}, [a-z] and \w, but I can't seem to get the result I'm after.

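Problem

I have an array of complete sentences with punctuation, and I parse it with the following preg_match_all call, which keeps both the words and the punctuation nicely.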

preg_match_all('/(\w+|[.;?!,:])/', $match, $matches)

However, I now have words like these:

  • Word-another-word
  • more_words_like_these

I would like to keep these words intact as they are (connected), but my current preg_match_all call breaks them down into individual words.
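To make the behavior concrete, here is a minimal sketch using a made-up sentence. The `tokenizeOriginal` helper and the sample string are illustrative assumptions, not part of the original code. It suggests that only the hyphenated words get split, because `\w` already includes the underscore.

```php
<?php
// Hypothetical helper wrapping the original pattern from the question.
function tokenizeOriginal(string $text): array
{
    // Runs of word characters (letters, digits, underscore),
    // or a single punctuation mark from the listed set.
    preg_match_all('/(\w+|[.;?!,:])/', $text, $matches);

    // Index 0 holds the full matches in order of appearance.
    return $matches[0];
}

// Made-up sample sentence, not taken from the real data set.
print_r(tokenizeOriginal('Word-another-word and more_words_like_these.'));
// Tokens: Word, another, word, and, more_words_like_these, .
```

The hyphens themselves are dropped here, since neither alternative of the pattern matches them.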

What I have tried

preg_match_all('/(\p{L}-\p{L}+|[.;?!,:])/', $match, $matches)

preg_match_all('/((?i)^[\p{L}0-9_-]+|[.;?!,:])/', $match, $matches)

I found the second pattern elsewhere, but neither one gives me the result I want:

Array ( [0] A, [1] word, [2] like_this, [3] connected, [4] ; ,[5] with-relevant-punctuation)

Ideally, I would also like to account for special characters, since some of these words may have accents.

1 answer:

Answer 0 (score: 3)

Just put the hyphen into the character class that matches the word characters. Note that the hyphen has to sit at the beginning or the end of the character class; elsewhere it may be read as a range operator.

([\w-]+|[.;?!,:])

Example

Live demo

https://regex101.com/r/yI3tM4/2

Sample text

However, I now have words like these:

Word-another-word
more_words_like_these

and I would like to be able to retain the integrity of these words as they are (connected) but my current preg_match breaks them down into individual words.

Sample matches

The other words are captured as before, and the hyphenated words are now captured whole as well.

Omitted Match 1-9 for brevity

MATCH 10
1.  [39-56] `Word-another-word`

MATCH 11
1.  [57-78] `more_words_like_these`

Omitted Match 12+ for brevity

Explanation

NODE                     EXPLANATION
----------------------------------------------------------------------
  (                        group and capture to \1:
----------------------------------------------------------------------
    [\w-]+                   any character of: word characters (a-z,
                             A-Z, 0-9, _), '-' (1 or more times
                             (matching the most amount possible))
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    [.;?!,:]                 any character of: '.', ';', '?', '!',
                             ',', ':'
----------------------------------------------------------------------
  )                        end of \1
----------------------------------------------------------------------
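As a rough sketch of how this could be wired into PHP, the snippet below wraps a hyphen-aware pattern in a hypothetical `tokenize` helper (the function name, the `$unicode` flag, and the sample strings are assumptions for illustration). The Unicode variant is one possible way to cover the accented words mentioned in the question: it swaps `\w` for `\p{L}\p{N}_` and adds the `u` modifier, and it assumes UTF-8 input.

```php
<?php
// Hypothetical helper: splits text into words (hyphens and underscores
// stay inside the word) and single punctuation marks.
function tokenize(string $text, bool $unicode = false): array
{
    // In both classes the hyphen comes last, so it is a literal hyphen.
    // The Unicode variant is an assumption: any letter, any digit,
    // underscore, or hyphen, with /u so the input is treated as UTF-8.
    $pattern = $unicode
        ? '/[\p{L}\p{N}_-]+|[.;?!,:]/u'
        : '/[\w-]+|[.;?!,:]/';

    preg_match_all($pattern, $text, $matches);

    // Index 0 holds the full matches in order of appearance.
    return $matches[0];
}

print_r(tokenize('A word, like_this, connected; with-relevant-punctuation.'));
// Tokens: A, word, ",", like_this, ",", connected, ";",
//         with-relevant-punctuation, "."

// Made-up accented sample, saved as UTF-8.
print_r(tokenize('Un café-crème, très_bien!', true));
```

One design caveat: a class such as `[\w-]+` also accepts a lone or leading hyphen as a "word". If that matters, `\w+(?:-\w+)*` is a stricter alternative that only allows hyphens between word characters.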