Simple data scraping with a PHP loop / foreach

Date: 2015-12-18 00:07:29

Tags: php scrape

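I have some code that scrapes the string sitting between two other strings (a "sandwich"). It works, but I need to loop over several different "sandwich" strings.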

// needle in haystack
$result = 'sandwich: Today is a nice day.
    sandwich: Today is a cloudy day.
    sandwich: Today is a rainy day.
    sandwich type 2: Yesterday I had an awesome time. 
    sandwich type 2: Yesterday I had an great time.';

$beginString = 'Today is a';  // strpos() is case-sensitive, so the case must match the text
$endString = 'day';

function extract_unit($haystack, $keyword1, $keyword2) {
    $return = array();
    $a = 0;                                           // start searching at the beginning

    // compare with false explicitly: a match at position 0 is valid but falsy
    while (($a = strpos($haystack, $keyword1, $a)) !== false) {
        $a += strlen($keyword1);                      // move the offset past $keyword1

        if (($b = strpos($haystack, $keyword2, $a)) !== false) {  // position of $keyword2
            $return[] = trim(substr($haystack, $a, $b - $a));     // text between the keywords
        }
    }
    return $return;
}

$text = $result;
$unit = extract_unit($text, $beginString, $endString);
print_r($unit);

// $unit contains: nice, cloudy, rainy

I need to loop over the different kinds of sentences/sandwiches and capture all of the adjectives (nice, cloudy, rainy, awesome, great):

// needle in haystack
$result = 'sandwich: Today is a nice day.
    sandwich: Today is a cloudy day.
    sandwich: Today is a rainy day.
    sandwich type 2: Yesterday I had an awesome time. 
    sandwich type 2: Yesterday I had an great time.';

$beginString1 = 'Today is a';
$endString1 = 'day';
$beginString2 = 'Yesterday I had an';
$endString2 = 'time';

[scraping code with loop...]
print_r($unit);

The goal is to end up with this array:

Array ( [0] => nice [1] => cloudy [2] => rainy [3] => awesome [4] => great ) 

Any ideas? Much appreciated.

1 Answer:

Answer 0: (score: 3)

You can use a regular expression to search the strings. If using arrays instead of separate strings is not a problem for you, this could serve as sample code:

$starts = array('Today is a', 'Yesterday I had an');
$ends = array('day', 'time');

$haystack = array(
    'Today is a nice day.',
    'Today is a cloudy day.',
    'Today is a rainy day.',
    'Yesterday I had an awesome time.',
    'Yesterday I had an great time.'
);

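function extract_unit($haystack, $starts, $ends){

    $return = array();  // initialise so an empty array comes back when nothing matches

    // any start marker, then the captured text (without surrounding spaces), then any end marker
    $reg = '/(?:' . implode('|', $starts) . ')\s*(.*?)\s*(?:' . implode('|', $ends) . ')/';

    foreach($haystack as $str){

        if(preg_match($reg, $str, $match)) $return[] = $match[1];

    }

    return $return;

}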

print_r (extract_unit($haystack, $starts, $ends));
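One caveat with the pattern above: the markers are pasted into the regular expression verbatim, so a marker containing characters such as `(`, `[` or `/` would change the meaning of the pattern. A minimal sketch of one way to guard against that, assuming a hypothetical helper named `build_sandwich_pattern` (not part of the answer's code) and made-up markers chosen only to show the escaping:

```php
<?php
// Hypothetical helper: builds the same kind of alternation pattern,
// but escapes every marker with preg_quote() first.
function build_sandwich_pattern(array $starts, array $ends)
{
    $quote = function ($marker) {
        return preg_quote($marker, '/');   // '/' is the delimiter used below
    };

    $startAlt = implode('|', array_map($quote, $starts));
    $endAlt   = implode('|', array_map($quote, $ends));

    // \s* around the capture group keeps surrounding spaces out of the result
    return '/(?:' . $startAlt . ')\s*(.*?)\s*(?:' . $endAlt . ')/';
}

// Illustrative markers containing regex metacharacters (assumed, not from the question)
$pattern = build_sandwich_pattern(array('Price (USD):'), array('[end]'));

if (preg_match($pattern, 'Price (USD): 42 [end]', $match)) {
    echo $match[1], PHP_EOL;   // text between the two markers
}
```

Without `preg_quote()`, the parentheses in the start marker would be read as a group and the brackets in the end marker as a character class, so the literal text would not be matched.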

Edit

Following @ven's comment, I made some changes to the code, and it is now more precise:

//---Array with all sandwiches
$between = array(
    array('hay1=', 'hay=Gold'),
    array('hay2=', 'hay=Silver')
);

$haystack = 'Data set 1: hay2= this is a bunch of hay  hay1= Gold_Needle hay=Gold
             Data Set 2: hay2=Silver_Needle hay=Silver';

function extract_unit($haystack, $between){

    $return = array();

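    foreach($between as $item){

        // $item[0] is the opening marker, $item[1] the closing marker;
        // the capture group holds the text between them, without surrounding whitespace
        $reg = '/.*?' . $item[0] . '\s*(.*?)\s*' . $item[1] . '.*?/';

        preg_match_all($reg, $haystack, $matches);

        // $matches[1] lists every captured needle for this pair
        $return = array_merge($return, $matches[1]);

    }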

    return $return;

}

print_r (extract_unit($haystack, $between));

The result will be:

Array
(
    [0] => Gold_Needle
    [1] => Silver_Needle
)
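To tie this back to the data in the question, here is a rough sketch of the same pair-based idea applied to the original sentences. The function name `extract_between_pairs` is made up for this illustration, and `preg_quote()` is added as a precaution. It is one possible way to do it, not the only one:

```php
<?php
// Text from the question (the "sandwich" prefixes are kept as-is).
$text = 'sandwich: Today is a nice day.
    sandwich: Today is a cloudy day.
    sandwich: Today is a rainy day.
    sandwich type 2: Yesterday I had an awesome time.
    sandwich type 2: Yesterday I had an great time.';

// Each entry pairs one start marker with its own end marker.
$pairs = array(
    array('Today is a', 'day'),
    array('Yesterday I had an', 'time'),
);

// Hypothetical name; same idea as the pair-based extract_unit() in the edit.
function extract_between_pairs($text, array $pairs)
{
    $results = array();

    foreach ($pairs as $pair) {
        list($start, $end) = $pair;

        // Escape the markers so they are matched literally
        $pattern = '/' . preg_quote($start, '/') . '\s*(.*?)\s*' . preg_quote($end, '/') . '/';

        if (preg_match_all($pattern, $text, $matches)) {
            $results = array_merge($results, $matches[1]);
        }
    }

    return $results;
}

print_r(extract_between_pairs($text, $pairs));
```

For this input the function returns nice, cloudy, rainy, awesome, great. Note that results are grouped by pair (all matches of the first pair, then all of the second), which only happens to coincide with the order in the text here.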

A runnable sample of this code is available on Ideone.
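If regular expressions are not wanted, the `strpos()`-based approach from the question can probably be kept and simply called once per marker pair inside a `foreach`. A minimal sketch under that assumption; `extract_between` and the shortened sample text are hypothetical and only meant for illustration:

```php
<?php
// Hypothetical helper: collects every substring found between $start and $end.
function extract_between($haystack, $start, $end)
{
    $found  = array();
    $offset = 0;

    // strpos() returns 0 for a match at the very beginning, so compare with false
    while (($from = strpos($haystack, $start, $offset)) !== false) {
        $from += strlen($start);               // first character after the start marker
        $to = strpos($haystack, $end, $from);
        if ($to === false) {
            break;                             // no closing marker left, stop searching
        }
        $found[] = trim(substr($haystack, $from, $to - $from));
        $offset  = $to + strlen($end);         // continue after the closing marker
    }

    return $found;
}

// Shortened sample text (assumed for this sketch)
$text = "Today is a nice day.\nYesterday I had an awesome time.\nToday is a rainy day.";

$pairs = array(
    array('Today is a', 'day'),
    array('Yesterday I had an', 'time'),
);

$unit = array();
foreach ($pairs as $pair) {
    // Run the same extraction once per start/end pair and collect everything
    $unit = array_merge($unit, extract_between($text, $pair[0], $pair[1]));
}

print_r($unit);
```

The explicit `!== false` comparison matters because a marker at the very start of the text sits at position 0, which a plain truthiness check would treat as "not found".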