Question

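I've been building a function that reads the title text between the <title></title> tags of a web page. I'm using the following regex code to grab the title text from the HTML page: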

 if(preg_match('#<title>([^<]+)</title>#simU', $this->html, $m1))
      $this->title = trim($m1[1]);

I'm using the following code to encode the value for the MySQL insert statement:

mysql_real_escape_string(rawurldecode($this->title))

As a result, I'm left with a database of titles that contain HTML entities (&amp;, &nbsp;, etc.) and foreign characters, for example: Dating S.o.sÂ |Â Gluten-free, Dairy-free, Sugar-free Recipes And Lifestyle Tips

The goal is to decode, strip, and clean up the titles so they look as close to proper English as possible.

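I built a function that uses the following two regexes to remove HTML entities and to strip out junk characters, respectively. It isn't ideal (it removes the HTML entities instead of preserving them), but it's the closest to clean I've been able to get.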

$string = preg_replace("/&#?[a-z0-9]+;/i","",$string);
//remove all non-normal chars
$string = preg_replace('/[^a-zA-Z0-9-\s\'\!\,\|\(\)\.\*\&\#\/\:]/', '', $string);

But the non-English characters still remain.

Can anyone help with the following:

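1. What is the best way to save these title strings to the database while trying to preserve the English intent (punctuation, apostrophes, etc.)?
2. How can I convert or remove the weird characters shown in the example title above?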

Thank you very much for your help!

Answer 1

For point 1, PHP has an html_entity_decode() function that can be used to convert HTML entities into "regular" characters.
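
A minimal sketch of that suggestion, assuming the scraped title is UTF-8. The decode_title() helper and the sample string are hypothetical and not taken from the question:

```php
<?php
// Hypothetical helper: decode HTML entities in a scraped title instead of deleting them.
function decode_title(string $title): string
{
    // ENT_QUOTES decodes both double and single quotes;
    // ENT_HTML5 selects the HTML5 entity table; 'UTF-8' is the assumed output charset.
    $decoded = html_entity_decode($title, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    return trim($decoded);
}

// Illustrative input only.
$raw = ' Tom &amp; Jerry&#039;s &quot;Best&quot; Recipes ';
echo decode_title($raw), "\n";
```

Note that &nbsp; decodes to a non-breaking space (U+00A0), not a plain space, so it may still need replacing afterwards if plain ASCII spacing is wanted.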

Answer 2

See http://www.php.net/manual/en/function.html-entity-decode.php for #1

and http://php.net/manual/en/function.mb-convert-encoding.php for #2
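
For #2, one possible reading of the sample title is that UTF-8 bytes were misread as ISO-8859-1 and then encoded to UTF-8 again, which would turn each non-breaking space into "Â" plus a non-breaking space. Under that assumption (it cannot be confirmed from the question alone), a sketch using mb_convert_encoding() could look like this. The fix_double_encoded() helper is hypothetical and the mbstring extension is assumed to be loaded:

```php
<?php
// Hypothetical helper: undo one round of "UTF-8 misread as ISO-8859-1" double encoding.
function fix_double_encoded(string $s): string
{
    // Only attempt the repair when every character fits in one ISO-8859-1 byte
    // (this also rejects input that is not valid UTF-8 to begin with).
    if (preg_match('/^[\x{00}-\x{FF}]*$/u', $s) !== 1) {
        return $s;
    }

    // Map each UTF-8 character back to a single ISO-8859-1 byte.
    $bytes = mb_convert_encoding($s, 'ISO-8859-1', 'UTF-8');

    // Keep the result only if those bytes form valid UTF-8; otherwise keep the input.
    return mb_check_encoding($bytes, 'UTF-8') ? $bytes : $s;
}

// The garbled sample written as explicit bytes: "Â" (C3 82) followed by U+00A0 (C2 A0).
$garbled = "Dating S.o.s\xC3\x82\xC2\xA0|\xC3\x82\xC2\xA0Gluten-free";

$fixed = fix_double_encoded($garbled);

// $fixed now contains real non-breaking spaces (C2 A0); swap them for plain spaces.
$clean = str_replace("\xC2\xA0", ' ', $fixed);

echo $clean, "\n";
```

This is a heuristic rather than a general fix: text that was misread as Windows-1252 instead of ISO-8859-1, or that was never double-encoded, would need different handling. Setting the correct charset when fetching the page and on the database connection avoids the problem at the source.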
