String fetched from MySQL

Time: 2017-05-17 14:27:01

Tags: php html unicode preg-replace removing-whitespace

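I'm trying to take a block of HTML, strip all the HTML tags, and put each line of text into a PHP array.

I'm testing with just one block for now (hence the WHERE ID = '2409' in my MySQL query).

The HTML for ID 2409 looks like this: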

<table class="description-table">
<tbody>
<tr><td>Saepe Encomia 2.aD NEC Mirum Populo Soluni Iis 8679-1370 Status Error Sed 9.9</td></tr>
<tr><td>Description</td></tr>
<tr><td></td>
<td><br>
<br><p></p><p></p>
<strong><br></strong> <strong><br></strong> <strong>Donec Rem </strong><br>
<br>
<strong>Animam Urgebat<br>
<br></strong> <strong><br>
<br>
Rerum Sed 8613 - 3669 8358 & 6699<br>
<br>
1.mE (magNA) QUO Ad Nominum Statum Massa<br>
ab SEM Autem Reddet Habitu Sit<br>
<br></strong> <strong> PRAEDAM ACCUMSAN PERSONARUM DENEGARE AC DUORUM</strong> <strong><br></strong> <strong><br></strong> <strong>Lius typi sit nec quo adversis cras ministri oppressa, versus class hic rem quos colubros ullo commune!economy!</strong><strong><br></strong><strong>                                                           ad Quisque Modeste</strong><strong>                                                           ac Rem Wisi</strong><strong>                                                           ex Hac Congue mus Leo</strong><strong>                                                           ab 7/92" Alias</strong><strong>                                                           ad 2/73" Adverso & Erat</strong><strong>                                                           me Personom Eget</strong><strong>                                                           ad Viribus Fuga Fuga</strong><strong>                                                           ab Louor-Sit Molles</strong><strong class="c2">                                                           3x Block-Off Plates</strong><strong class="c2">                                                           ad Facunda</strong><strong class="c2">                                                           ab Personas Diam<br>
NUNC<br>
ex Teniet te Palmam Eaque<br>
me Teniet in Versus Urna<br></strong> <strong><br></strong><br>
<strong class="c3">**CONDEMNENDUS REM CUM MAGNORUM**</strong><strong></strong><br>
</td>
</table>

Here is my PHP script, which is meant to parse it:

//connect to mysqli

$results = $mysqli->query("SELECT ID, post_content
FROM wp_posts
WHERE ID = '2409';");

while($row = $results->fetch_array()) {
    $htmlarray2 = preg_split('/<.+?>/', $row['post_content']);
    $htmlarray = array_values(array_filter(array_map('trim', $htmlarray2)));
    echo '<pre>';
        print_r($htmlarray);
    echo '</pre>';
    . . . 
}

This produces output like this:

Array
(
[0] => Saepe Encomia 2.aD NEC Mirum Populo Soluni Iis 8679-1370 Status Error Sed 9.9
[1] => Donec Rem 
[2] => Animam Urgebat
[3] => Rerum Sed 8613 - 3669 8358 & 6699
[4] => 1.mE (magNA) QUO Ad Nominum Statum Massa
[5] => ab SEM Autem Reddet Habitu Sit
[6] =>  PRAEDAM ACCUMSAN PERSONARUM DENEGARE AC DUORUM
[7] => Lius typi sit nec quo adversis cras ministri oppressa, versus class hic rem quos colubros ullo commune!
[8] =>                                                            ad Quisque Modeste
[9] =>                                                            ac Rem Wisi
[10] =>                                                            ex Hac Congue mus Leo
[11] =>                                                            ab 7/92" Alias
[12] =>                                                            ad 2/73" Adverso & Erat
[13] =>                                                            me Personom Eget
[14] =>                                                            ad Viribus Fuga Fuga
[15] =>                                                            ea Totam Poenam
[16] =>                                                            ab Louor-Sit Molles
[17] =>                                                            ad Facunda
[18] =>                                                            ab Personas Diam
[19] => NUNC
[20] => ex Teniet te Palmam Eaque
[21] => me Teniet in Versus Urna
[22] => **CONDEMNENDUS REM CUM MAGNORUM**
)

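That's fine, but now I'm having trouble removing the whitespace before and after the strings in the array.

Let's take element 8 of the array as an example: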

. . .
$arrayvalue = $htmlarray2['8'];

which echoes like this:

                                                       ad Quisque Modeste

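What I ultimately want is to trim every element of the array, but for testing I'm only using this one variable, $arrayvalue.

My problem is that trim() doesn't work on this variable fetched from MySQL. Adding trim($arrayvalue); has no effect, and the value echoes exactly as shown above.

I know this has something to do with the array coming from a query, because if I test the same string on its own in a separate PHP script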

$string = '                                                            ad Quisque Modeste  ';
echo trim($string);

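it works fine: the echoed output is just ad Quisque Modeste, without the unwanted whitespace before or after the string.

Why isn't trim() working inside my while loop? What is the trick to trimming leading and trailing whitespace from the elements?

Edit: As requested, here is my full loop. It differs a bit from the example above (I've made a lot of changes while trying to solve this myself, so it keeps changing), but this is everything I have right now: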

while($row = $results->fetch_array()) {
    $id = $row['ID'];
    echo 'ID: ' . $id;
    echo '<br  />';

    //replace &nbsp; with white space
    $converted = strtr($row['post_content'],array_flip(get_html_translation_table(HTML_ENTITIES, ENT_QUOTES))); 
    trim($converted, chr(0xC2).chr(0xA0));

    //remove html elements
    $htmlarray = preg_split('/<.+?>/', $converted);

    // remove empty array elements and re-index array
    $htmlarray2 = array_values(array_filter(array_map('trim', $htmlarray)));

    // test by getting single value from array
    $arrayvalue = $htmlarray2['9'];

    // my attempt to trim string in while loop
    trim($arrayvalue);

    // doesn't trim
    echo '<hr>' . $arrayvalue . '<hr>';

    // put this here so I can see the full array
    echo '<pre>';
        print_r($htmlarray2);
    echo '</pre>';
}

As requested, here is the result of var_export($row['post_content']);

'<table class="product-description-table">
<tbody>
<tr>
<td class="item" colspan="3">Saepe Encomia 2.aD NEC Mirum Populo Soluni Iis 8679-1370 Status Error Sed 9.9</td>
</tr>
<tr>
<td class="title" colspan="3"></td>
</tr>
<tr>
<td class="content"><br>
<br>
<p class="c1"></p>
<p class="c1"></p>
<strong><br></strong> <strong><br></strong> <strong>Donec Rem&nbsp;</strong><br>
<br>
<strong>Animam Urgebat<br>
<br></strong> <strong><br>
<br>
Rerum Sed 8613 - 3669 8358 & 6699<br>
<br>
1.mE (magNA) QUO Ad Nominum Statum Massa<br>
ab SEM Autem Reddet Habitu Sit<br>
<br></strong> <strong>&nbsp;PRAEDAM ACCUMSAN PERSONARUM DENEGARE AC DUORUM</strong> <strong><br></strong> <strong><br></strong> <strong>Lius typi sit nec quo adversis cras ministri oppressa, versus class hic rem quos colubros ullo commune!economy!</strong><strong><br></strong><strong>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;ad Quisque Modeste</strong><strong>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;ac Rem Wisi</strong><strong>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;ex Hac Congue mus Leo</strong><strong>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;ab 7/92" Alias</strong><strong>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;ad 2/73" Adverso & Erat</strong><strong>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;me Personom Eget</strong><strong>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;ad Viribus Fuga Fuga</strong><strong>&nbsp; 
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;ab Louor-Sit Molles</strong><strong class="c2">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;3x Block-Off Plates</strong><strong class="c2">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;ad Facunda</strong><strong class="c2">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;ab Personas Diam<br>
NUNC<br>
ex Teniet te Palmam Eaque<br>
me Teniet in Versus Urna<br></strong> <strong><br></strong><br>
<strong class="c3">**CONDEMNENDUS REM CUM MAGNORUM**</strong><strong>&nbsp;</strong><br></td>
<td class="product-content-border"></td>
</tr>
<tr>
<td class="gallery" colspan="3">
<table>
<tbody>
<tr>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
</tr>
</tbody>
</table>
</td>
</tr>
<tr>
<td></td>
</tr>
<tr>
<td class="spacer" colspan="3"></td>
</tr>
<tr>
<td class="product-content-border"></td>
</tr>
</tbody>
</table>
<br>
<br>
<br>
<p class="c4"></p>'

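Final edit :)

I'm posting the solution below. I won't accept my own answer.

If someone familiar with regular expressions can explain the mess behind all this, and why the pattern /[\s]+/mu, or rather $clean_htmlarray = preg_replace('/[\s]+/mu', ' ', $htmlarray);, solves the problem, I'll gladly accept that explanation as the correct answer.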

3 Answers:

Answer 0: (score: 1)

Here is the requested explanation of the regex pattern that solved your problem:

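/[\s]+/ says "find one or more whitespace characters" (including ' ', '\r', '\n', '\t', '\f', '\v'). The multi-line modifier/flag (m) is not necessary because you are not using the anchors ^ or $ in your pattern. The unicode modifier/flag (u) is absolutely critical in your case, because your HTML text contains many little devils called "NO-BREAK SPACE": the Unicode character U+00A0 (written \x{00A0} in a pattern), which is stored in UTF-8 as the two bytes 194 (0xC2) and 160 (0xA0).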

Without the u flag, the NO-BREAK SPACE characters are left in place, and additional filtering is needed to remove them.
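A minimal sketch of that difference, assuming UTF-8 text and PHP 7.0 or later (for the "\u{...}" escape). The sample string is made up for illustration and is not taken from your database:

```php
<?php
// Hypothetical sample: NO-BREAK SPACEs (U+00A0, UTF-8 bytes 0xC2 0xA0) around some text.
$nbsp  = "\u{00A0}";
$value = $nbsp . $nbsp . 'ad Quisque Modeste' . $nbsp;

// trim() strips only a fixed set of single-byte characters by default,
// so the multi-byte NBSPs survive.
$trimmed = trim($value);
var_dump($trimmed === $value);   // bool(true): nothing was removed

// Without /u the subject is handled byte by byte. In the default "C" locale
// this is expected to leave the NBSPs untouched.
$withoutU = preg_replace('/\s+/', ' ', $value);

// With /u the pattern is Unicode-aware, so \s also matches U+00A0.
// Each run of whitespace becomes one ordinary space, which trim() can remove.
$withU = trim(preg_replace('/\s+/u', ' ', $value));
var_dump($withU);                // string(18) "ad Quisque Modeste"
```

The same call works on a whole array, since preg_replace() accepts an array as its subject and returns an array.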

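Although you eventually got your code to produce the right output, I was keen to craft a leaner, single-step pattern that gets you there faster using preg_split() alone.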

while($row=$results->fetch_array()){
    $texts=preg_split('/\s*<[^>]+>\s*/u',$row['post_content'],-1,PREG_SPLIT_NO_EMPTY);
    var_export($texts);
}

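Here is a working demo.

This new split pattern still finds your tags, but it does so more efficiently: between < and > I simply ask to match every character that is "not >" using [^>]+. That is much simpler for the engine than a lazy .+?, which has to check after every character whether the > has been reached. Unlike ., the negated class also matches newlines, so a tag that spans two lines is still found.

In addition, I've added matching for your Unicode-extended whitespace characters: \s* will match zero or more whitespace characters before and after each tag.

Finally, I should explain the extra parameters on preg_split(). The third argument, the limit, is set to mean "no limit". That is the default behaviour, but a value has to be passed (-1 or 0; older PHP versions also accepted null, which is deprecated as of PHP 8.1) to hold its place so that the final parameter can be used. PREG_SPLIT_NO_EMPTY saves you the extra array_filter() step: it omits any empty elements produced by the split, so you only get the good stuff.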
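To see the pattern and flags at work without a database, here is a small self-contained sketch. The $html string is a hypothetical stand-in for $row['post_content'] and assumes the text holds real NO-BREAK SPACE characters in UTF-8 (PHP 7.0+ for the "\u{...}" escape):

```php
<?php
// Hypothetical snippet: it contains real NO-BREAK SPACEs (U+00A0)
// and one tag that spans two lines.
$html = "<tr><td>Donec Rem\u{00A0}</td></tr>\n"
      . "<strong\n class=\"c2\">\u{00A0} \u{00A0}ad Facunda</strong><br>";

// -1 keeps the default "no limit" so that the flags argument can be passed.
// PREG_SPLIT_NO_EMPTY drops the empty strings found between adjacent tags.
$texts = preg_split('/\s*<[^>]+>\s*/u', $html, -1, PREG_SPLIT_NO_EMPTY);

var_export($texts);
// $texts is ['Donec Rem', 'ad Facunda']
```

Because the surrounding whitespace is consumed by the delimiter itself, no separate trim() or array_filter() pass is needed for this input.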

I hope you find this useful/educational. Good luck with your project.

Answer 1: (score: 0)

trim() doesn't work in place. You want this:

$arrayvalue = trim($arrayvalue);

Really, that's it. trim() returns the trimmed string; it does not modify the variable.
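A short sketch of the two calls side by side, using a made-up string with ordinary ASCII spaces:

```php
<?php
// Hypothetical value with plain ASCII spaces on both sides.
$original   = "   ad Quisque Modeste  ";
$arrayvalue = $original;

trim($arrayvalue);                      // return value discarded
var_dump($arrayvalue === $original);    // bool(true): the variable is unchanged

$arrayvalue = trim($arrayvalue);        // assign the result back
var_dump($arrayvalue);                  // string(18) "ad Quisque Modeste"
```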

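Answer 2: (score: 0)

I found the solution.

I'm not sure how it works... I'm quite unfamiliar with regular expressions.

But the solution I found (maybe someone can explain it?) is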

$clean_htmlarray = preg_replace('/[\s]+/mu', ' ', $htmlarray);

The whole working script (excluding the MySQL stuff):

$converted = html_entity_decode( $row['post_content'], ENT_QUOTES);
$converted = trim($converted, chr(0xC2).chr(0xA0));

$htmlarray = preg_split('/<.+?>/', $converted);

$clean_htmlarray = preg_replace('/[\s]+/mu', ' ', $htmlarray);

$htmlarray2 = array_filter(array_map('trim', $clean_htmlarray));

$clean_htmlarray2 = array_values($htmlarray2);

echo '<pre>';
print_r($clean_htmlarray2);
echo '</pre>';

Output

Array
(
    [0] => Saepe Encomia 2.aD NEC Mirum Populo Soluni Iis 8679-1370 Status Error Sed 9.9
    [1] => Description
    [2] => Donec Rem
    [3] => Animam Urgebat
    [4] => Rerum Sed 8613 - 3669 8358 & 6699
    [5] => 1.mE (magNA) QUO Ad Nominum Statum Massa
    [6] => ab SEM Autem Reddet Habitu Sit
    [7] => PRAEDAM ACCUMSAN PERSONARUM DENEGARE AC DUORUM
    [8] => Lius typi sit nec quo adversis cras ministri oppressa, versus class hic rem quos colubros ullo commune!economy!
    [9] => ad Quisque Modeste
    [10] => ac Rem Wisi
    [11] => ex Hac Congue mus Leo
    [12] => ab 7/92" Alias
    [13] => ad 2/73" Adverso & Erat
    [14] => me Personom Eget
    [15] => ad Viribus Fuga Fuga
    [16] => ab Louor-Sit Molles
    [17] => 3x Block-Off Plates
    [18] => ad Facunda
    [19] => ab Personas Diam
    [20] => NUNC
    [21] => ex Teniet te Palmam Eaque
    [22] => me Teniet in Versus Urna
    [23] => **CONDEMNENDUS REM CUM MAGNORUM**
)

A fully trimmed array.

This also works in the while loop across all rows, i.e.:

$results = $mysqli->query("SELECT ID, post_content
FROM wp_posts
LIMIT 50;");

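In this case I get fully trimmed strings for all 50 rows.

So in the end... this was a challenge to figure out!

I just wish I understood it. I really don't feel I should get credit for answering this question, since all I did was try different things until this one finally worked.

If anyone wants to chime in and explain why $clean_htmlarray = preg_replace('/[\s]+/mu', ' ', $htmlarray);, or rather /[\s]+/mu, is what I needed in this case, I'll happily give them the accepted answer :)

For now, I'm just happy it's working. Thanks, everyone, for your help and input!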