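I have been reading, searching, and trying out different ways of writing the regular expression, such as \p{L}, [a-z] and \w, but I can't seem to get the result I'm after.
I have an array made up of full sentences with punctuation, and I parse it with the following preg_match_all, which works well in that it retains both the words and the punctuation marks.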
preg_match_all('/(\w+|[.;?!,:])/', $match, $matches)
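However, I now have words like these:
Word-another-word
more_words_like_these
I would like to retain the integrity of these words as they are (connected), but my current preg_match_all breaks them down into individual words. I have tried: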
preg_match_all('/(\p{L}-\p{L}+|[.;?!,:])/', $match, $matches)
and
preg_match_all('/((?i)^[\p{L}0-9_-]+|[.;?!,:])/', $match, $matches)
which I found here, but neither gives me the desired result:
Array ( [0] A, [1] word, [2] like_this, [3] connected, [4] ; ,[5] with-relevant-punctuation)
Ideally, I would also like to account for special characters, since some of these words may contain accents.
Answer 0 (score: 3)
Just put the hyphen into a character class alongside \w. Be aware, though, that the hyphen needs to appear at the beginning or end of the character class; otherwise it is treated as a range operator.
([\w-]+|[.;?!,:])
Live demo
https://regex101.com/r/yI3tM4/2
Sample text
However, I now have words like these:
Word-another-word
more_words_like_these
and I would like to be able to retain the integrity of these words as they are (connected) but my current preg_match breaks them down into individual words.
Sample matches
The other words are captured as before, but the hyphenated words are now captured whole as well.
Omitted Match 1-9 for brevity
MATCH 10
1. [39-56] `Word-another-word`
MATCH 11
1. [57-78] `more_words_like_these`
Omitted Match 12+ for brevity
Explanation
NODE                     EXPLANATION
----------------------------------------------------------------------
  (                        group and capture to \1:
----------------------------------------------------------------------
    [\w-]+                   any character of: word characters (a-z,
                             A-Z, 0-9, _), '-' (1 or more times
                             (matching the most amount possible))
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    [.;?!,:]                 any character of: '.', ';', '?', '!',
                             ',', ':'
----------------------------------------------------------------------
  )                        end of \1
----------------------------------------------------------------------
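To try this from PHP, here is a minimal sketch. The helper name `tokenize` and its `$unicode` flag are hypothetical, made up for illustration. The Unicode variant is an assumption about how the accented words mentioned in the question could be handled: it swaps `\w` for `\p{L}` (letters), `\p{N}` (numbers) and `_`, and adds the `u` modifier so the subject is treated as UTF-8.

```php
<?php
// Hypothetical helper (not from the original post): splits a sentence into
// words and punctuation marks, keeping hyphenated and underscored words whole.
function tokenize(string $sentence, bool $unicode = false): array
{
    // In both patterns the hyphen is the last character of the class,
    // so it is a literal hyphen rather than a range operator.
    $pattern = $unicode
        ? '/[\p{L}\p{N}_-]+|[.;?!,:]/u'  // assumed variant for accented words
        : '/[\w-]+|[.;?!,:]/';           // ASCII word characters plus hyphen

    preg_match_all($pattern, $sentence, $matches);

    // $matches[0] holds the full matches in order of appearance.
    return $matches[0];
}

print_r(tokenize('A word, like_this, connected; with-relevant-punctuation.'));
print_r(tokenize('Un café-crème, déjà-vu!', true));
```

The first call should give `A`, `word`, `,`, `like_this`, `,`, `connected`, `;`, `with-relevant-punctuation`, `.`; the second should keep `café-crème` and `déjà-vu` intact. Note that either pattern also matches a standalone hyphen as a "word", which may or may not be what you want.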