Split a string into bi-grams, ignoring certain tags

Asked: 2012-02-10 00:30:33

Tags: php

Consider the following string:

I have had the greatest {A} {B} day yesterday {C}

I want to build an array of bi-grams from it, ignoring all tags (a tag is anything enclosed in {braces}):

[0] => I-have
[1] => have-had
[2] => had-the
[3] => the-greatest
[4] => greatest-day
[5] => day-yesterday

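What is the best way to do this in PHP? Should I use a regular expression, or explode on " " and then iterate over all the words? I'm having trouble getting started, so any help would be much appreciated :)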

2 Answers:

Answer 0: (score: 2)

This is easy to do with explode:

$string="I have had the greatest {A} {B} day yesterday {C}";

$words=explode(" ",$string);

$filtered_words=array();

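// Keep only the words that do not contain a {tag}.
// The braces are escaped so they are always read as literal characters.
foreach($words as $w)
{
  if(!preg_match('/\{.*\}/',$w))
  {
    array_push($filtered_words,$w);
  }
}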


$output=array();

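// Pair each word with its successor. A plain for loop is used because
// range(0,-1) counts down and would read invalid indexes when fewer
// than two words remain.
for($i=0;$i<count($filtered_words)-1;$i++)
{
  array_push($output,$filtered_words[$i] . "-" . $filtered_words[$i+1]);
}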

var_dump($output);

This outputs:

array(6) {
  [0]=>
  string(6) "I-have"
  [1]=>
  string(8) "have-had"
  [2]=>
  string(7) "had-the"
  [3]=>
  string(12) "the-greatest"
  [4]=>
  string(12) "greatest-day"
  [5]=>
  string(13) "day-yesterday"
}
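
The same steps can be wrapped in a reusable function. This is only a minimal sketch: the name bigrams_without_tags is hypothetical, it assumes PHP 7 or later for the type declarations, and it assumes that a tag is always a whole whitespace-separated token of the form {…}. It splits on runs of whitespace rather than on a single space, so repeated spaces do not produce empty words.

```php
<?php
// Hypothetical helper, not part of the original answer.
function bigrams_without_tags(string $text): array
{
    // Split on runs of whitespace and drop empty tokens.
    $tokens = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);

    // Keep only tokens that are not entirely a {tag}.
    $words = array_values(array_filter($tokens, function ($t) {
        return !preg_match('/^\{[^{}]*\}$/', $t);
    }));

    // Join each word with the word that follows it.
    $bigrams = array();
    for ($i = 0; $i + 1 < count($words); $i++) {
        $bigrams[] = $words[$i] . '-' . $words[$i + 1];
    }
    return $bigrams;
}

print_r(bigrams_without_tags('I have had the greatest {A} {B} day yesterday {C}'));
```

With fewer than two remaining words the function returns an empty array instead of raising a notice.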

Answer 1: (score: 1)

A slightly different approach:

$string = '{D} I have had the greatest {A} {B} day yesterday {C}';

// explode on spaces
$arr = explode(' ', $string);
$bigrams = array();

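// remove all "labels" with a regex (assuming label names consist of \w characters)
$arr = array_values(array_filter($arr, function($s){
    // \w+ so that labels longer than one character, such as {AB}, are removed too
    return !preg_match("/\{\w+\}/", $s);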
}));

// get the bigrams
$len = count($arr);
for ($i = 0; $i <= $len - 2; $i++) {
    $bigrams[] = $arr[$i] . '-' . $arr[$i+1];
}

print_r($bigrams);
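
For the regex-only route the question asks about, one possible sketch is to strip the tags with preg_replace and then collect overlapping word pairs with a lookahead. The patterns below are assumptions rather than something either answer proposes: tags are taken to contain no nested braces, and a "word" is any run of non-whitespace characters.

```php
<?php
$string = '{D} I have had the greatest {A} {B} day yesterday {C}';

// Replace every {tag} with a space (assumes no nested braces).
$clean = preg_replace('/\{[^{}]*\}/', ' ', $string);

// (?<!\S) anchors each match at the start of a word. The lookahead
// captures that word and the next one without consuming them, so the
// next match can start at the following word and the pairs overlap.
preg_match_all('/(?<!\S)(?=(\S+)\s+(\S+))/', $clean, $matches, PREG_SET_ORDER);

$bigrams = array();
foreach ($matches as $pair) {
    $bigrams[] = $pair[1] . '-' . $pair[2];
}

print_r($bigrams);
```

Unlike the explode-based versions, this also removes a tag that is attached to a word (for example day{A} becomes day), which may or may not be what you want.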