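I am looking for a way to strip some HTML from a scraped HTML page. The page contains repetitive markup that I want to remove, so I tried using preg_replace() to delete the variable part.
The data I want to strip: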
Producent:<td class="datatable__body__item" data-title="Producent">Example
Groep:<td class="datatable__body__item" data-title="Produkt groep">Example1
Type:<td class="datatable__body__item" data-title="Produkt type">Example2
....
...
Afterwards it should look like this:
Producent:Example
Groep:Example1
Type:Example2
So the chunks are identical except for the words in the data-title attribute. How can I remove this markup?
I tried something like this:
$pattern = '/<td class=\"datatable__body__item\"(.*?)>/';
$tech_specs = str_replace($pattern,"", $tech_specs);
But that did not work. Is there a solution?
Answer 0 (score: 0)
Assuming the string looks like this:
$string = 'Producent:<td class="datatable__body__item" data-title="Producent">Example';
You can grab the beginning and the end of the string with:
preg_match('/^(\w+:).*\>(\w+)/', $string, $matches);
echo implode([$matches[1], $matches[2]]);
In this case it prints Producent:Example. You can then store that output in another variable or array for later use. Or, since you mentioned replacing:
$string = preg_replace('/^(\w+:).*\>(\w+)/', '$1$2', $string);
Then again, in case the input may span a variable number of lines:
$string = 'Producent:<td class="datatable__body__item" data-title="Producent">Example
Groep:<td class="datatable__body__item" data-title="Produkt groep">Example1
Type:<td class="datatable__body__item" data-title="Produkt type">Example2';
$stringRows = explode(PHP_EOL, $string);
$pattern = '/^(\w+:).*\>(\w+)/';
$replacement = '$1$2';
foreach ($stringRows as &$stringRow) {
    $stringRow = preg_replace($pattern, $replacement, $stringRow);
}
unset($stringRow); // break the reference left over from the loop
$string = implode(PHP_EOL, $stringRows);
After that, $string holds the text in the form you expect.
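As a possible shortcut, the explode/foreach/implode steps can be folded into a single call by adding the m (multiline) modifier, which makes ^ match at the start of every line. This is a minimal sketch that assumes the lines are separated by "\n"; the variable name $clean is only illustrative.

```php
<?php
// Sample input copied from the question, one spec per line.
$string = "Producent:<td class=\"datatable__body__item\" data-title=\"Producent\">Example\n"
        . "Groep:<td class=\"datatable__body__item\" data-title=\"Produkt groep\">Example1\n"
        . "Type:<td class=\"datatable__body__item\" data-title=\"Produkt type\">Example2";

// With the m modifier, ^ matches at the start of each line.
// The dot does not match a newline by default, so every line is handled on its own.
$clean = preg_replace('/^(\w+:).*>(\w+)/m', '$1$2', $string);

echo $clean, PHP_EOL;
```

With the sample above this prints the three lines Producent:Example, Groep:Example1 and Type:Example2.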
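Explaining my regular expression:
The first group captures the first word, up to the colon, and the second group captures the last word. I had originally anchored both ends, but that does not work as intended once the text is split into lines, so I kept only the start anchor.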
^(\w+:) => the word at the start of the string, up to and including the colon
.*\> => everything else up to the last greater-than sign on the line (the backslash escape is optional here)
(\w+) => the word right after that greater-than sign
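To make the two capture groups concrete, here is a small sketch that inspects them with preg_match. The second input string ("Rack mount") is a made-up value, added only to show that \w+ stops at the first space.

```php
<?php
$pattern = '/^(\w+:).*>(\w+)/';

// Single-word value, as in the question.
$label = $value = null;
if (preg_match($pattern, 'Producent:<td class="datatable__body__item" data-title="Producent">Example', $m)) {
    $label = $m[1]; // first group: the leading word plus the colon
    $value = $m[2]; // second group: the word after the last ">"
}
echo $label . $value, PHP_EOL;

// Hypothetical multi-word value: \w does not match a space,
// so the second group only holds the first word.
$firstWord = null;
if (preg_match($pattern, 'Type:<td class="datatable__body__item" data-title="Produkt type">Rack mount', $m)) {
    $firstWord = $m[2];
}
echo $firstWord, PHP_EOL;
```

If the values can contain spaces, the second group would need a wider character class, for example (.+) in place of (\w+).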
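Answer 1 (score: 0)
Perhaps my question was poorly worded. I have a table that I need to scrape from a website. I need the information in the table, but the parts mentioned above have to be cleaned out. The solution I finally came up with is the one below, and it works. There is still a bit of manual replacing to do, but that is because they, rather foolishly, use " to denote inches. ;-)
Solution: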
$specs = [];
// find the table in the source code
foreach($techdata->find('table') as $table){
    // filter out the rows
    foreach($table->find('tr') as $row){
        // take the innertext using simplehtmldom
        $tech_specs = $row->innertext;
        // strip some 'garbage'
        $tech_specs = str_replace(" \t\t\t\t\t\t\t\t\t\t\t<td class=\"datatable__body__item\">","", $tech_specs);
        // find the first word of the string so I can use it
        $spec1 = explode('</td>', $tech_specs)[0];
        // use the found string to strip down the rest of the table
        $tech_specs = str_replace("<td class=\"datatable__body__item\" data-title=\"" . $spec1 . "\">",":", $tech_specs);
        // manual correction because of the " used
        $tech_specs = str_replace("<td class=\"datatable__body__item\" data-title=\"tbv Montage benodigde 19\">",":", $tech_specs);
        // manual correction because of the " used
        $tech_specs = str_replace("<td class=\"datatable__body__item\" data-title=\"19\">",":", $tech_specs);
        // strip some 'garbage'
        $tech_specs = str_replace("\t\t\t\t\t\t\t\t\t\t","\n", $tech_specs);
        $tech_specs = str_replace("</td>","", $tech_specs);
        $tech_specs = str_replace(" ","", $tech_specs);
        // put the clean row in an array ready for usage
        $specs[] = $tech_specs;
    }
}
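For comparison, here is a sketch of a different technique: reading the cell text through PHP's built-in DOMDocument rather than string replacement, so the td attributes never need to be stripped by hand. This is not the method used above. The sample markup in $html is a guess at the page layout (a label cell followed by a value cell in each row), and the code assumes the dom extension is available.

```php
<?php
// Hypothetical sample markup: the real page layout is an assumption.
$html = '<table>'
      . '<tr><td class="datatable__body__item">Producent</td>'
      . '<td class="datatable__body__item" data-title="Producent">Example</td></tr>'
      . '<tr><td class="datatable__body__item">Groep</td>'
      . '<td class="datatable__body__item" data-title="Produkt groep">Example1</td></tr>'
      . '</table>';

libxml_use_internal_errors(true); // keep parser warnings about imperfect HTML quiet
$doc = new DOMDocument();
$doc->loadHTML($html);

$specs = [];
foreach ($doc->getElementsByTagName('tr') as $row) {
    $cells = $row->getElementsByTagName('td');
    if ($cells->length < 2) {
        continue; // skip rows without a label cell and a value cell
    }
    // textContent holds only the text, so no tags or attributes are left to remove
    $specs[] = trim($cells->item(0)->textContent) . ':' . trim($cells->item(1)->textContent);
}

print_r($specs);
```

How well this copes with the unescaped " used for inches depends on how the parser recovers from that markup, so it would need checking against the real page.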
Answer 2 (score: 0)
Just use a wildcard:
$newstr = preg_replace('/<td class="datatable__body__item" data-title=".*?">/', '', $str);
.*? means: match any characters, but lazily rather than greedily, so the match stops at the first "> that follows.
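A short sketch of why the laziness matters, assuming two cells end up on the same line (the sample string is built from the question's data). Note that the pattern has to be passed to preg_replace; str_replace, as used in the question, treats it as literal text and finds nothing.

```php
<?php
// Two cells on one line, built from the question's sample data.
$str = 'Producent:<td class="datatable__body__item" data-title="Producent">Example'
     . ' Groep:<td class="datatable__body__item" data-title="Produkt groep">Example1';

// Lazy: each match ends at the first "> after data-title=".
$lazy = preg_replace('/<td class="datatable__body__item" data-title=".*?">/', '', $str);

// Greedy: a single match runs from the first <td to the last "> on the line.
$greedy = preg_replace('/<td class="datatable__body__item" data-title=".*">/', '', $str);

echo $lazy, PHP_EOL;   // Producent:Example Groep:Example1
echo $greedy, PHP_EOL; // Producent:Example1
```

The greedy variant swallows everything between the first and the last cell, which is why the ? after .* is needed.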