Question

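I am looking for a way to strip some HTML from a scraped HTML page. The page contains repeated markup that I want to remove, so I tried using preg_replace() to delete the variable parts.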

The data I want to strip:

Producent:<td class="datatable__body__item" data-title="Producent">Example
Groep:<td class="datatable__body__item" data-title="Produkt groep">Example1
Type:<td class="datatable__body__item" data-title="Produkt type">Example2
.... 
...

Afterwards it should look like this:

Producent:Example
Groep:Example1
Type:Example2

So the chunks are identical except for the words in the data-title attribute. How can I remove them?

I tried something like this:

$pattern = '/<td class=\"datatable__body__item\"(.*?)>/';
$tech_specs = str_replace($pattern,"", $tech_specs);

But that did not work. Is there a solution?

Answer 1

Assuming the string looks like this:

$string = 'Producent:<td class="datatable__body__item" data-title="Producent">Example';

You can grab the beginning and the end of the string with:

preg_match('/^(\w+:).*\>(\w+)/', $string, $matches);

echo implode([$matches[1], $matches[2]]);

In this case, this prints Producent:Example. You can then add this output to another variable or array that you want to use. Or, since you mentioned replacing:

$string = preg_replace('/^(\w+:).*\>(\w+)/', '$1$2', $string);

But then again, in case the data can span a variable number of lines:

$string = 'Producent:<td class="datatable__body__item" data-title="Producent">Example
Groep:<td class="datatable__body__item" data-title="Produkt groep">Example1
Type:<td class="datatable__body__item" data-title="Produkt type">Example2';

$stringRows = explode(PHP_EOL, $string);

$pattern = '/^(\w+:).*\>(\w+)/';
$replacement = '$1$2';
foreach ($stringRows as &$stringRow) {
    $stringRow = preg_replace($pattern, $replacement, $stringRow);
}
unset($stringRow); // break the reference left over from the loop

$string = implode(PHP_EOL, $stringRows);

This will then output the string as you expect.
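As an alternative to splitting and re-joining the rows, the same pattern can be applied in a single call with the m (multiline) modifier. This is a minimal sketch, assuming the rows are separated by "\n"; the variable name $clean is only illustrative.

```php
<?php
// Sample input copied from the question; "\n" line endings are an assumption.
$string = "Producent:<td class=\"datatable__body__item\" data-title=\"Producent\">Example\n"
        . "Groep:<td class=\"datatable__body__item\" data-title=\"Produkt groep\">Example1\n"
        . "Type:<td class=\"datatable__body__item\" data-title=\"Produkt type\">Example2";

// With the m modifier, ^ matches at the start of every line. Since . does not
// match a newline by default, .* cannot run past the end of its own line.
$clean = preg_replace('/^(\w+:).*>(\w+)/m', '$1$2', $string);

echo $clean, PHP_EOL;
```

This avoids depending on PHP_EOL, which may not match the line endings actually present in scraped content.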

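To explain my regex: the first group captures the first word up to the colon (:), and the other group captures the last word. I originally had anchors on both ends, but that does not work as expected when every line is split off, so I kept only the one at the start.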

^(\w+:) => the word at the beginning of the string, up to and including the colon
.*\>    => everything else up to the last greater-than sign (escaped with a backslash)
(\w+)   => the word after the greater-than sign
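The breakdown above can be checked by looking at what each capture group holds. The sketch below wraps the pattern in a helper; splitRow() is a hypothetical name used only for illustration, and the multi-word input is an assumed example that is not in the question.

```php
<?php
// splitRow() is a hypothetical helper name, not part of the answer.
function splitRow(string $row): ?array
{
    if (preg_match('/^(\w+:).*\>(\w+)/', $row, $matches) !== 1) {
        return null; // the row does not have the expected shape
    }
    // $matches[0] is the full match; [1] and [2] are the two groups.
    return [$matches[1], $matches[2]];
}

$row = 'Producent:<td class="datatable__body__item" data-title="Producent">Example';
[$label, $value] = splitRow($row);
echo $label . $value, PHP_EOL;

// \w does not match spaces, so a multi-word value is cut at the first space.
$multi = splitRow('Producent:<td class="datatable__body__item" data-title="Producent">Example value');
echo $multi[1] . $multi[0 + 1], PHP_EOL;
```

Because \w+ stops at whitespace, values or labels that contain spaces would need a wider character class than the one used here.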

Answer 2

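Maybe my question was not written very well. I have a table that I need to scrape from a website. I need the information in the table, but some of the parts mentioned have to be cleaned up. The solution I eventually came up with is the one below, and it works. There is still a bit of manual replacing to do, but that is because "they" use the " character to denote inches, which is a bit silly. ;-)

Solution: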

// find the table in the source code
foreach ($techdata->find('table') as $table) {

    // filter out the rows
    foreach ($table->find('tr') as $row) {

        // take the innertext using simplehtmldom
        $tech_specs = $row->innertext;

        // strip some 'garbage'
        $tech_specs = str_replace("  \t\t\t\t\t\t\t\t\t\t\t<td class=\"datatable__body__item\">","", $tech_specs);

        // find the first word of the string so I can use it
        $spec1 = explode('</td>', $tech_specs)[0];

        // use the found string to strip down the rest of the table
        $tech_specs = str_replace("<td class=\"datatable__body__item\" data-title=\"" . $spec1 . "\">",":", $tech_specs);

        // manual correction because of the " used
        $tech_specs = str_replace("<td class=\"datatable__body__item\" data-title=\"tbv Montage benodigde 19\">",":", $tech_specs);

        // manual correction because of the " used
        $tech_specs = str_replace("<td class=\"datatable__body__item\" data-title=\"19\">",":", $tech_specs);

        // strip some 'garbage'
        $tech_specs = str_replace("\t\t\t\t\t\t\t\t\t\t","\n", $tech_specs);
        $tech_specs = str_replace("</td>","", $tech_specs);
        $tech_specs = str_replace("  ","", $tech_specs);

        // put the clean row in an array ready for usage
        $specs[] = $tech_specs;
    }
}
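For comparison, the label and value can also be read from the parsed cells directly, which sidesteps the string replacements. The sketch below swaps in PHP's built-in DOMDocument instead of simplehtmldom, and the markup in $html is a hypothetical reconstruction from the snippets in this thread; the real page may be structured differently.

```php
<?php
// Hypothetical markup reconstructed from the snippets above; the real page may differ.
$html = '<table>'
      . '<tr><td class="datatable__body__item">Producent</td>'
      . '<td class="datatable__body__item" data-title="Producent">Example</td></tr>'
      . '<tr><td class="datatable__body__item">Groep</td>'
      . '<td class="datatable__body__item" data-title="Produkt groep">Example1</td></tr>'
      . '</table>';

$doc = new DOMDocument();
// Collect libxml warnings internally instead of printing them for imperfect HTML.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$specs = [];
foreach ($doc->getElementsByTagName('tr') as $row) {
    $cells = $row->getElementsByTagName('td');
    if ($cells->length < 2) {
        continue; // skip rows without a label/value pair
    }
    // textContent returns the cell text without any tags or attributes.
    $label = trim($cells->item(0)->textContent);
    $value = trim($cells->item(1)->textContent);
    $specs[] = $label . ':' . $value;
}

print_r($specs);
```

Since only the text of each cell is read, the contents of data-title no longer matter here, at least for markup shaped like the assumed sample.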

Answer 3

Just use a wildcard:

$newstr = preg_replace('/<td class="datatable__body__item" data-title=".*?">/', '', $str);

.*? means "match anything, but not greedily".
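To illustrate why the non-greedy quantifier matters, and why the str_replace() attempt in the question had no effect, here is a small sketch. Putting two cells on one line is an assumption made for the demonstration; the variable names are illustrative.

```php
<?php
// Two cells joined on a single line (an assumption for this demonstration).
$str = 'Producent:<td class="datatable__body__item" data-title="Producent">Example'
     . ' Groep:<td class="datatable__body__item" data-title="Produkt groep">Example1';

// Lazy: .*? stops at the first "> it can reach, so each tag is removed on its own.
$lazy = preg_replace('/<td class="datatable__body__item" data-title=".*?">/', '', $str);

// Greedy: .* runs on to the last "> in the line and swallows the text in between.
$greedy = preg_replace('/<td class="datatable__body__item" data-title=".*">/', '', $str);

// str_replace() searches for the pattern as a literal string, so nothing is replaced.
$literal = str_replace('/<td class="datatable__body__item" data-title=".*?">/', '', $str);

echo $lazy, PHP_EOL;
echo $greedy, PHP_EOL;
echo $literal === $str ? 'unchanged' : 'changed', PHP_EOL;
```

With the lazy version both values survive, whereas the greedy version drops "Example Groep:" along with the tags.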
