preg_match to find words with capital letters and consecutive capitalized words

Time: 2014-10-03 19:41:57

Tags: php regex preg-match-all

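I am trying to match keywords in a string by picking out the words that meet the following criteria:

  • Words that contain an uppercase letter, such as "iPhone" or "camelCase"
  • Runs of consecutive capitalized words, such as "Pittsburgh Steelers" or "Oscar De La Hoya"
  • Combinations of the above, such as "iPhone 5" or "MIB 2" (digits should be treated like uppercase letters)
  • Any non-letter/non-digit characters should be collapsed, so that "O'Donnell's" becomes "ODonnells" and "Wi-fi ..." becomes "Wifi"

Example: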

$string = "Joe O'Donnell and Oscar De La Hoya went to a Pittsburgh Steelers game on Sunday, where Joe lost his iPhone 5, so he borrowed Oscar's iPad";

preg_match_all("/[A-Z][a-z]*/",$string,$match_words); // incorrect expression

// desired result for $match_words should be: 
// array(Joe ODonnell, Oscar De La Hoya, Pittsburgh Steelers, Sunday, Joe, iPhone 5, Oscars, iPad)

Thanks.

5 answers:

Answer 0: (score: 3)

You can use a regular expression like this:

\b((?:[A-Z]['a-z]*\s*\d*)+)\b|\b((?:[a-z]*[A-Z]['a-z]*\s*\d*)+)\b

Match information:

MATCH 1
1.  [0-14]  `Joe O'Donnell `
MATCH 2
1.  [18-35] `Oscar De La Hoya `
MATCH 3
1.  [45-65] `Pittsburgh Steelers `
MATCH 4
1.  [73-79] `Sunday`
MATCH 5
1.  [87-91] `Joe `
MATCH 6
2.  [100-108]   `iPhone 5`
MATCH 7
1.  [125-133]   `Oscar's `
MATCH 8
2.  [133-137]   `iPad`

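The regular expression consists of two patterns:

\b((?:[A-Z]['a-z]*\s*\d*)+)\b       ---> Match words like Joe O'Donnell or Oscar De La Hoya
|
\b((?:[a-z]*[A-Z]['a-z]*\s*\d*)+)\b ---> Match words like iPad or iPhone

By the way, if you look at the results, some matches end with a trailing space; you can trim the results to clean that up.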
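
The trimming step, plus the question's requirement to collapse non-alphanumeric characters, could be sketched as follows. This is only an illustration under assumptions: the variable names $pattern and $keywords are made up here, and stripping everything except ASCII letters, digits and spaces is one possible reading of the "collapse" requirement.

```php
<?php
$string = "Joe O'Donnell and Oscar De La Hoya went to a Pittsburgh Steelers game on Sunday, where Joe lost his iPhone 5, so he borrowed Oscar's iPad";

// The two-alternative pattern from this answer, wrapped in delimiters.
$pattern = "/\b((?:[A-Z]['a-z]*\s*\d*)+)\b|\b((?:[a-z]*[A-Z]['a-z]*\s*\d*)+)\b/";

preg_match_all($pattern, $string, $matches);

// $matches[0] holds the full matches. Trim each one, then drop every
// character that is not an ASCII letter, digit or space (assumption:
// this is what "collapsing" means in the question).
$keywords = array_map(function ($match) {
    return preg_replace("/[^A-Za-z0-9 ]/", "", trim($match));
}, $matches[0]);

print_r($keywords);
```

For the sample string this should give Joe ODonnell, Oscar De La Hoya, Pittsburgh Steelers, Sunday, Joe, iPhone 5, Oscars and iPad, which is the list the question asks for.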

Answer 1: (score: 3)

You can first remove all non-alphanumeric characters:

$string2 = preg_replace("/[^a-zA-Z0-9\s]/", "", $string);

Then use preg_split (rather than preg_replace) to split the string on whole runs of lowercase words.

 $match_words = preg_split("/ ([a-z]| )+ /", $string2);

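(If you don't mind $string being overwritten, you can use $string in place of $string2.)

This works for the example you provided, but consider how you want the program to behave with less clean input. For example, "Foo   Bar" (three spaces) would be split into two elements, whereas "Foo Bar" (one space) would stay as one, because the pattern needs at least one character between its two delimiting spaces. If you are not worried about speed, you can use another preg_replace to replace any run of whitespace with a single space.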
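
Putting the steps of this answer together, including the optional whitespace clean-up, might look like the sketch below. The helper name split_on_lowercase_runs is hypothetical and not part of the answer; the two patterns are the ones shown above, and the extra preg_replace for whitespace is the suggested normalization step.

```php
<?php
// Hypothetical helper: strip punctuation, normalize whitespace,
// then split on runs of lowercase words.
function split_on_lowercase_runs($string) {
    // Step 1: remove everything that is not a letter, digit or whitespace.
    $clean = preg_replace("/[^a-zA-Z0-9\s]/", "", $string);

    // Step 2: collapse any run of whitespace to one space, so the result
    // does not depend on how many spaces separate two words.
    $clean = trim(preg_replace("/\s+/", " ", $clean));

    // Step 3: split wherever a run of lowercase words sits between two spaces.
    return preg_split("/ ([a-z]| )+ /", $clean);
}

$string = "Joe O'Donnell and Oscar De La Hoya went to a Pittsburgh Steelers game on Sunday, where Joe lost his iPhone 5, so he borrowed Oscar's iPad";

print_r(split_on_lowercase_runs($string));
print_r(split_on_lowercase_runs("Foo   Bar"));
```

For the sample string the last element should come out as "Oscars iPad" rather than two separate entries, since there is no lowercase word between the two to split on. The second call should return the single element "Foo Bar".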

Answer 2: (score: 2)

You can use PHP's ctype_lower function here!

<?php

$string = "Joe O'Donnell and Oscar De La Hoya went to a Pittsburgh Steelers game on Sunday, where Joe lost his iPhone 5, so he borrowed Oscar's iPad";

$words = $temp = array();

// Loop through the string after turning it into an array (by spaces)
foreach (explode(" ", $string) as $word) {
    // Check if the word is lowercase and is not a number
    if (ctype_lower($word) && !is_numeric($word)) {
        if (empty($temp)) continue; // Don't add it if there's nothing to add

        // Add the words found up until this point (from the last point) into the words array, as a string
        $words[] = implode(" ", $temp);

        // Reset the temp array so we can look for new words and continue
        $temp = array();
        continue;
    }

    // Add this word to the temp array (the group currently being built)
    $temp[] = $word;
}

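// Flush the last group, if any (avoids an empty entry when the string ends in a lowercase word)
if (!empty($temp)) $words[] = implode(" ", $temp);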

// Print the words that have uppercase characters
printf("<pre>%s</pre>", print_r($words, true));

Output:

Array
(
    [0] => Joe O'Donnell
    [1] => Oscar De La Hoya
    [2] => Pittsburgh Steelers
    [3] => Sunday,
    [4] => Joe
    [5] => iPhone 5,
    [6] => Oscar's iPad
)
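
The output above still carries punctuation ("Sunday," and "iPhone 5,"). One way to get rid of it, sketched here as an assumption rather than as part of the answer, is to strip non-alphanumeric characters from each token before the ctype_lower check. The function name capitalized_groups is hypothetical.

```php
<?php
// Hypothetical variant of the loop above that cleans each token first.
function capitalized_groups($string) {
    $words = $temp = array();

    foreach (explode(" ", $string) as $word) {
        // Drop apostrophes, commas and any other non-alphanumeric characters.
        $word = preg_replace("/[^A-Za-z0-9]/", "", $word);
        if ($word === "") continue; // Token was punctuation only

        // An all-lowercase token ends the current group.
        if (ctype_lower($word)) {
            if (!empty($temp)) {
                $words[] = implode(" ", $temp);
                $temp = array();
            }
            continue;
        }

        // Anything else (capitalized word, camelCase, number) joins the group.
        $temp[] = $word;
    }

    // Flush the last group, if any.
    if (!empty($temp)) $words[] = implode(" ", $temp);

    return $words;
}

$string = "Joe O'Donnell and Oscar De La Hoya went to a Pittsburgh Steelers game on Sunday, where Joe lost his iPhone 5, so he borrowed Oscar's iPad";

print_r(capitalized_groups($string));
```

With the sample string this should yield Joe ODonnell, Oscar De La Hoya, Pittsburgh Steelers, Sunday, Joe, iPhone 5 and Oscars iPad. As in the original loop, "Oscars" and "iPad" stay in one group because no lowercase word separates them.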

Answer 3: (score: 2)

Adding to Fede's sweet answer, this would be your new PHP code:

$string = "Joe O'Donnell and Oscar De La Hoya went to a Pittsburgh Steelers game on Sunday, where Joe lost his iPhone 5, so he borrowed Oscar's iPad";

preg_match_all("/\b((?:[A-Z]['a-z]*\s*\d*)+)\b|\b((?:[a-z]*[A-Z]['a-z]*\s*\d*)+)\b/", $string, $matches);

print_r($matches[0]);

$matches[0] will be your array of matches.

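Answer 4: (score: 0)

In addition to the answers from Fede, Kelly and Daniel, here are two alternatives for languages with accented characters.

Using preg_split: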

$capitalized_words = preg_split("/ ([a-zàèìòùáéíóúýâêîôûãñõäëïöüÿçßøåæœ]| )+ /u", $string);

Using preg_match_all:

//with 'u' flag 
preg_match_all("/\b((?:[A-ZÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜÝÆ]['a-zàèìòùáéíóúýâêîôûãñõäëïöüÿçßøåæœ]*\s*\d*)+)\b|\b((?:[a-zàèìòùáéíóúýâêîôûãñõäëïöüÿçßøåæœ]*[A-ZÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜÝÆ]['a-zàèìòùáéíóúýâêîôûãñõäëïöüÿçßøåæœ]*\s*\d*)+)\b/u", $string, $capitalized_words);

A function that combines preg_match_all with trim:

function get_capitalized_words($string){
    $capitalized_words=array();

    //with 'u' flag 
    preg_match_all("/\b((?:[A-ZÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜÝÆ]['a-zàèìòùáéíóúýâêîôûãñõäëïöüÿçßøåæœ]*\s*\d*)+)\b|\b((?:[a-zàèìòùáéíóúýâêîôûãñõäëïöüÿçßøåæœ]*[A-ZÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜÝÆ]['a-zàèìòùáéíóúýâêîôûãñõäëïöüÿçßøåæœ]*\s*\d*)+)\b/u", $string, $matches);

    if(isset($matches[0])){
        $capitalized_words=array_map('trim',$matches[0]);
    }

    return $capitalized_words;
}
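
The character lists above only cover the accented letters that are spelled out. As a possible alternative, PCRE's Unicode property classes \p{Lu} (uppercase letter) and \p{Ll} (lowercase letter) can stand in for them when the 'u' flag is set. The sketch below is an assumption-laden variant, not the answer's own code: the function name get_capitalized_words_unicode is hypothetical, and it replaces \b with explicit lookarounds so that the word edges do not depend on how \b treats non-ASCII letters. A side effect is that matches come back without trailing spaces, so no trim is needed.

```php
<?php
// Hypothetical Unicode-aware variant; expects valid UTF-8 input.
function get_capitalized_words_unicode($string) {
    $pattern = "/(?<![\p{L}\p{N}])"                  // not preceded by a letter or digit
             . "(?:(?:\p{Lu}['\p{Ll}]*\s*\d*)+"      // runs of capitalized words
             . "|(?:\p{Ll}*\p{Lu}['\p{Ll}]*\s*\d*)+)" // or words with an inner capital
             . "(?![\p{L}\p{N}])/u";                 // not followed by a letter or digit

    // preg_match_all returns false on error (for example invalid UTF-8).
    if (!preg_match_all($pattern, $string, $matches)) {
        return array();
    }

    return $matches[0];
}

print_r(get_capitalized_words_unicode("Élodie Durand a prêté son iPhone 5 à Óscar à São Paulo"));
```

For the French/Portuguese sample sentence this should return Élodie Durand, iPhone 5, Óscar and São Paulo.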