PHP: str_replace on scraped content with a wildcard?

Time: 2018-08-17 20:23:08

Tags: php preg-replace simple-html-dom

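I'm looking for a way to strip some HTML out of a scraped HTML page. The page contains repetitive markup that I want to remove, so I tried to use preg_replace() to delete the variable parts.

The data I want to strip: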

Producent:<td class="datatable__body__item" data-title="Producent">Example
Groep:<td class="datatable__body__item" data-title="Produkt groep">Example1
Type:<td class="datatable__body__item" data-title="Produkt type">Example2
.... 
...

Afterwards it has to look like this:

Producent:Example
Groep:Example1
Type:Example2

So the chunks are identical except for the words in the data-title attribute. How can I remove this data?

I tried something like this:

$pattern = '/<td class=\"datatable__body__item\"(.*?)>/';
$tech_specs = str_replace($pattern,"", $tech_specs);

But that didn't work. Is there a solution?

3 answers:

Answer 0: (score: 0)

Assuming the string looks like this:

$string = 'Producent:<td class="datatable__body__item" data-title="Producent">Example';

You can capture the beginning and the end of the string with:

preg_match('/^(\w+:).*\>(\w+)/', $string, $matches);

echo implode([$matches[1], $matches[2]]);

In this case that prints Producent:Example. You can then store this output in another variable or array. Or, since you mentioned replacing:

$string = preg_replace('/^(\w+:).*\>(\w+)/', '$1$2', $string);

Then again, the input may span a variable number of lines, so handle each line separately:

$string = 'Producent:<td class="datatable__body__item" data-title="Producent">Example
Groep:<td class="datatable__body__item" data-title="Produkt groep">Example1
Type:<td class="datatable__body__item" data-title="Produkt type">Example2';

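// split the input into rows so that ^ applies to each row
$stringRows = explode(PHP_EOL, $string);

$pattern = '/^(\w+:).*\>(\w+)/';
$replacement = '$1$2';
foreach ($stringRows as &$stringRow) {
    $stringRow = preg_replace($pattern, $replacement, $stringRow);
}
// break the reference so later writes to $stringRow cannot alter the last row
unset($stringRow);

$string = implode(PHP_EOL, $stringRows);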

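This outputs the string the way you expect.

To explain my regex: the first group captures the first word up to and including the colon, and the second group captures the last word. I originally anchored both ends, but that does not work as expected once the input is split per line, so I kept only the start anchor.

^(\w+:) => the word at the beginning of the string, up to and including the colon
.*\>    => everything else up to the last greater-than sign (the backslash escape is optional)
(\w+)   => the word after the greater-than sign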
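
As a possible alternative to splitting the string and looping, the same pattern can be applied to all rows in a single call by adding the m (multiline) modifier. This is only a sketch under the assumption that every row follows the "Label:<td ...>Value" shape from the question; the sample rows are joined with "\n" here purely for illustration.

```php
<?php
// Same sample rows as above, joined with "\n" so the input does not
// depend on the platform's line ending.
$string = implode("\n", [
    'Producent:<td class="datatable__body__item" data-title="Producent">Example',
    'Groep:<td class="datatable__body__item" data-title="Produkt groep">Example1',
    'Type:<td class="datatable__body__item" data-title="Produkt type">Example2',
]);

// The m modifier makes ^ match at the start of every line. The dot still
// stops at "\n", so the greedy .* cannot run past the end of its own row.
$result = preg_replace('/^(\w+:).*>(\w+)/m', '$1$2', $string);

echo $result, PHP_EOL;
```

Because the label is matched with \w+, a label that contains a space would not match and that row would be left untouched.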

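Answer 1: (score: 0)

Maybe my question wasn't written well. I have a table that I need to scrape from a website. I need the information in the table, but some of the parts mentioned have to be cleaned up. The solution I eventually came up with is the one below, and it works. There is still some manual replacing to do, but that is because they use the " character to denote inches. ;-)

The solution: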

// find the tables in the source code
foreach ($techdata->find('table') as $table) {

    // walk over the rows
    foreach ($table->find('tr') as $row) {

        // take the innertext using simplehtmldom
        $tech_specs = $row->innertext;

        // strip some 'garbage'
        $tech_specs = str_replace("  \t\t\t\t\t\t\t\t\t\t\t<td class=\"datatable__body__item\">", "", $tech_specs);

        // find the first word of the string so I can use it
        $spec1 = explode('</td>', $tech_specs)[0];

        // use the found string to strip down the rest of the row
        $tech_specs = str_replace("<td class=\"datatable__body__item\" data-title=\"" . $spec1 . "\">", ":", $tech_specs);

        // manual correction because of the " used
        $tech_specs = str_replace("<td class=\"datatable__body__item\" data-title=\"tbv Montage benodigde 19\">", ":", $tech_specs);

        // manual correction because of the " used
        $tech_specs = str_replace("<td class=\"datatable__body__item\" data-title=\"19\">", ":", $tech_specs);

        // strip some 'garbage'
        $tech_specs = str_replace("\t\t\t\t\t\t\t\t\t\t", "\n", $tech_specs);
        $tech_specs = str_replace("</td>", "", $tech_specs);
        $tech_specs = str_replace("  ", "", $tech_specs);

        // put the clean row in an array ready for usage
        $specs[] = $tech_specs;
    }
}
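
A different technique that might avoid the string surgery altogether is to read the text of each cell through PHP's built-in DOM extension instead of editing the row's markup. The sketch below is not the approach used above: the sample table is hypothetical, with a row layout (a label cell followed by a value cell) inferred from the str_replace calls, and it assumes the DOM extension is enabled.

```php
<?php
// Hypothetical sample markup; the real page may be laid out differently.
$html = '<table>'
    . '<tr><td class="datatable__body__item">Producent</td>'
    . '<td class="datatable__body__item" data-title="Producent">Example</td></tr>'
    . '<tr><td class="datatable__body__item">Produkt groep</td>'
    . '<td class="datatable__body__item" data-title="Produkt groep">Example1</td></tr>'
    . '</table>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$specs = [];
foreach ($doc->getElementsByTagName('tr') as $row) {
    $cells = $row->getElementsByTagName('td');
    if ($cells->length < 2) {
        continue; // skip rows without a label/value pair
    }
    // textContent returns the cell text without any tags or attributes.
    $label = trim($cells->item(0)->textContent);
    $value = trim($cells->item(1)->textContent);
    $specs[] = $label . ':' . $value;
}

print_r($specs);
```

Since only the text content of the cells is read, the data-title values never have to be matched by hand.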

Answer 2: (score: 0)

Just use a wildcard:

$newstr = preg_replace('/<td class="datatable__body__item" data-title=".*?">/', '', $str);

.*? means: match anything, but not greedily (stop at the first position where the rest of the pattern can match).
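
To illustrate why the attempt in the question has no effect and why the non-greedy quantifier matters, here is a small sketch. The second sample string, with two cells on one line, is made up for the demonstration.

```php
<?php
$str = 'Producent:<td class="datatable__body__item" data-title="Producent">Example';
$pattern = '/<td class="datatable__body__item"(.*?)>/';

// str_replace() looks for the literal text of its first argument, slashes
// included, so a regular expression passed to it never matches anything.
$literal = str_replace($pattern, '', $str);

// preg_replace() interprets the same string as a pattern.
$cleaned = preg_replace($pattern, '', $str);

// Made-up input with two cells on one line to compare both quantifiers.
$twoCells = 'A:<td data-title="a">1 B:<td data-title="b">2';

// Greedy: .* runs to the LAST "> on the line and swallows the first value.
$greedy = preg_replace('/<td data-title=".*">/', '', $twoCells);

// Non-greedy: .*? stops at the FIRST "> so each tag is removed separately.
$lazy = preg_replace('/<td data-title=".*?">/', '', $twoCells);

var_dump($literal === $str, $cleaned, $greedy, $lazy);
```

With one cell per line both quantifiers give the same result; the difference only shows once several tags share a line.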