In my news page project, I have a database table news with the following structure:
- id: [integer] unique number identifying the news entry, e.g.: *1983*
- title: [string] title of the text, e.g.: *New Life in America No Longer Means a New Name*
- topic: [string] category which should be chosen by the classifier, e.g.: *Sports*
In addition, there is a table bayes, which contains information about word frequencies:
- word: [string] a word which the frequencies are given for, e.g.: *real estate*
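- topic: [string] same content as the "topic" field above, e.g.: *Economics*
- count: [integer] number of occurrences of "word" in "topic" (incremented when new documents go to "topic"), e.g.: *100*
Now I want my PHP script to classify all news entries and assign one of several possible categories (topics) to each of them.
Is this a correct implementation? Can you improve it?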
<?php
include 'mysqlLogin.php';
$get1 = "SELECT id, title FROM ".$prefix."news WHERE topic = '' LIMIT 0, 150";
$get2 = mysql_abfrage($get1);
// pTOPICS BEGIN
$pTopics1 = "SELECT topic, SUM(count) AS count FROM ".$prefix."bayes WHERE topic != '' GROUP BY topic";
$pTopics2 = mysql_abfrage($pTopics1);
$pTopics = array();
while ($pTopics3 = mysql_fetch_assoc($pTopics2)) {
    $pTopics[$pTopics3['topic']] = $pTopics3['count'];
}
// pTOPICS END
// pWORDS BEGIN
$pWords1 = "SELECT word, topic, count FROM ".$prefix."bayes";
$pWords2 = mysql_abfrage($pWords1);
$pWords = array();
while ($pWords3 = mysql_fetch_assoc($pWords2)) {
    if (!isset($pWords[$pWords3['topic']])) {
        $pWords[$pWords3['topic']] = array();
    }
    $pWords[$pWords3['topic']][$pWords3['word']] = $pWords3['count'];
}
// pWORDS END
while ($get3 = mysql_fetch_assoc($get2)) {
    $pTextInTopics = array();
    $tokens = tokenizer($get3['title']);
    foreach ($pTopics as $topic=>$documentsInTopic) {
        if (!isset($pTextInTopics[$topic])) { $pTextInTopics[$topic] = 1; }
        foreach ($tokens as $token) {
            echo '....'.$token;
            if (isset($pWords[$topic][$token])) {
                $pTextInTopics[$topic] *= $pWords[$topic][$token]/array_sum($pWords[$topic]);
            }
        }
        $pTextInTopics[$topic] *= $pTopics[$topic]/array_sum($pTopics); // #documentsInTopic / #allDocuments
    }
    asort($pTextInTopics); // pick topic with lowest value
    if ($chosenTopic = each($pTextInTopics)) {
        echo '<p>The text belongs to topic '.$chosenTopic['key'].' with a likelihood of '.$chosenTopic['value'].'</p>';
    }
}
?>
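Training is done manually and is not included in this code. If the text "You can make money if you sell real estate" is assigned to the category/topic "Economics", then all of its words (you, can, make, ...) are inserted into the table bayes with "Economics" as the topic and 1 as the initial count. If a word is already stored in combination with the same topic, its count is incremented.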
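The training step described above could be sketched roughly as follows. This is only an illustration, not the asker's actual code: `sketchTokenize()` and `trainOnText()` are hypothetical names, the counts are kept in a plain PHP array instead of the bayes table, and the tokenizer is a naive ASCII-only splitter (the question's own `tokenizer()` is not shown). PHP 7 or later is assumed.

```php
<?php
// Hypothetical stand-in for the question's tokenizer(): lowercases the text and
// splits on anything that is not an ASCII letter or digit (an assumption).
function sketchTokenize(string $text): array
{
    return preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
}

// Adds one manually labelled text to the counts. $counts[topic][word] plays the
// role of the bayes table: a new (word, topic) pair starts at 1, an existing
// pair is incremented.
function trainOnText(array $counts, string $text, string $topic): array
{
    foreach (sketchTokenize($text) as $word) {
        if (!isset($counts[$topic][$word])) {
            $counts[$topic][$word] = 1;
        } else {
            $counts[$topic][$word]++;
        }
    }
    return $counts;
}

$counts = array();
$counts = trainOnText($counts, 'You can make money if you sell real estate', 'Economics');
```

Because "you" occurs twice in the example sentence, its count for "Economics" ends up at 2, while every other word is stored with a count of 1.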
Example learning data:
word       topic       count
kaczynski  Politics    1
sony       Technology  1
bank       Economics   1
phone      Technology  1
sony       Economics   3
ericsson   Technology  2
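Example output / result:
Title of the text: Phone test Sony Ericsson Aspen - sensitive Winberry
Politics
....phone ....test ....sony ....ericsson ....aspen ....sensitive ....winberry
Technology
....phone found ....test ....sony found ....ericsson found ....aspen ....sensitive ....winberry
Economics
....phone ....test ....sony found ....ericsson ....aspen ....sensitive ....winberry
Result: The text belongs to topic Technology with a likelihood of 0.013888888888889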
Thank you very much in advance!
Answer 0 (score: 7)
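It looks like your code is correct, but there are a few easy ways to optimize it. For example, you calculate p(word|topic) on the fly for every word, while you could easily calculate these values beforehand. (I'm assuming you want to classify multiple documents here. If you are only classifying a single document, I suppose this is fine, since you don't calculate it for words that are not in the document.)
Similarly, the calculation of p(topic) can be moved outside of the loop.
Finally, you don't need to sort the entire array to find the maximum.
All small points! But that's what you asked for :)
I've written some untested PHP code below that shows how I would implement this: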
<?php
// Get word counts from database
// (mystery_sql() is a placeholder; it should return array(topic => array(word => count)))
$nWordPerTopic = mystery_sql();
// Calculate p(word|topic) = nWord / sum(nWord for every word)
$nTopics = array();
$pWordPerTopic = array();
foreach($nWordPerTopic as $topic => $wordCounts)
{
    // Get total word count in topic
    $nTopic = array_sum($wordCounts);
    // Calculate p(word|topic)
    $pWordPerTopic[$topic] = array();
    foreach($wordCounts as $word => $count)
        $pWordPerTopic[$topic][$word] = $count / $nTopic;
    // Save $nTopic for next step
    $nTopics[$topic] = $nTopic;
}
// Calculate p(topic)
$nTotal = array_sum($nTopics);
$pTopics = array();
foreach($nTopics as $topic => $nTopic)
    $pTopics[$topic] = $nTopic / $nTotal;
// Classify ($documents is a placeholder for the rows to classify)
foreach($documents as $document)
{
    $title = $document['title'];
    $tokens = tokenizer($title);
    $pMax = -1;
    $selectedTopic = null;
    foreach($pTopics as $topic => $pTopic)
    {
        $p = $pTopic;
        foreach($tokens as $word)
        {
            // Words never seen in this topic are skipped
            if (!array_key_exists($word, $pWordPerTopic[$topic]))
                continue;
            $p *= $pWordPerTopic[$topic][$word];
        }
        // Track the maximum directly instead of sorting
        if ($p > $pMax)
        {
            $selectedTopic = $topic;
            $pMax = $p;
        }
    }
    // $selectedTopic now holds the topic with the highest value for this document
}
?>
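The code above leaves `mystery_sql()` and `$documents` undefined. As a rough sketch of the shapes they are expected to have, the following builds both from the question's sample learning data. `groupCountsByTopic()` is a hypothetical helper name, and the rows are written out by hand here instead of being fetched from the bayes table.

```php
<?php
// Hypothetical helper: turns flat (word, topic, count) rows, as they would come
// from the bayes table, into the nested array(topic => array(word => count))
// that the classification code iterates over.
function groupCountsByTopic(array $rows): array
{
    $nWordPerTopic = array();
    foreach ($rows as $row) {
        $nWordPerTopic[$row['topic']][$row['word']] = (int) $row['count'];
    }
    return $nWordPerTopic;
}

// The sample learning data from the question, written out as rows.
$rows = array(
    array('word' => 'kaczynski', 'topic' => 'Politics',   'count' => 1),
    array('word' => 'sony',      'topic' => 'Technology', 'count' => 1),
    array('word' => 'bank',      'topic' => 'Economics',  'count' => 1),
    array('word' => 'phone',     'topic' => 'Technology', 'count' => 1),
    array('word' => 'sony',      'topic' => 'Economics',  'count' => 3),
    array('word' => 'ericsson',  'topic' => 'Technology', 'count' => 2),
);
$nWordPerTopic = groupCountsByTopic($rows);

// The classification loop only reads the 'title' key of each document.
$documents = array(
    array('title' => 'Phone test Sony Ericsson Aspen - sensitive Winberry'),
);
```

With this shape, `array_sum($nWordPerTopic[$topic])` gives the total word count of a topic, which is the `$nTopic` value the answer's first loop relies on.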
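As for the maths...
You are trying to maximize p(topic|words), so find
$$\arg\max_{\text{topic}} \; p(\text{topic} \mid \text{words})$$
(i.e. the argument topic for which p(topic|words) is the highest)
Bayes' theorem says
$$p(\text{topic} \mid \text{words}) = \frac{p(\text{topic}) \cdot p(\text{words} \mid \text{topic})}{p(\text{words})}$$
So you are looking for
$$\arg\max_{\text{topic}} \; \frac{p(\text{topic}) \cdot p(\text{words} \mid \text{topic})}{p(\text{words})}$$
Since p(words) of a document is the same for every topic, this is the same as finding
$$\arg\max_{\text{topic}} \; p(\text{topic}) \cdot p(\text{words} \mid \text{topic})$$
The naive Bayes assumption (which is what makes this a naive Bayes classifier) is that
$$p(\text{words} \mid \text{topic}) = p(\text{word}_1 \mid \text{topic}) \cdot p(\text{word}_2 \mid \text{topic}) \cdot \ldots$$
So using this, you need to find
$$\arg\max_{\text{topic}} \; p(\text{topic}) \cdot p(\text{word}_1 \mid \text{topic}) \cdot p(\text{word}_2 \mid \text{topic}) \cdot \ldots$$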
where
$$p(\text{topic}) = \frac{\text{number of words in topic}}{\text{number of words in total}}$$
and
$$
\begin{aligned}
p(\text{word} \mid \text{topic}) &= \frac{p(\text{word}, \text{topic})}{p(\text{topic})} = p(\text{word}, \text{topic}) \cdot \frac{1}{p(\text{topic})} \\
&= \frac{\text{number of times word occurs in topic}}{\text{number of words in total}} \cdot \frac{\text{number of words in total}}{\text{number of words in topic}} \\
&= \frac{\text{number of times word occurs in topic}}{\text{number of words in topic}}
\end{aligned}
$$
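To make the two formulas concrete, here is a small sketch that evaluates them on the question's sample learning data. The function names `topicPrior()`, `wordGivenTopic()` and `logScore()` are hypothetical, and PHP 7 or later is assumed. Summing logarithms instead of multiplying probabilities is my own addition, not part of the answer above: since the logarithm is monotonic it does not change which topic maximizes the expression, and it avoids floating-point underflow for long texts.

```php
<?php
// Word counts per topic, taken from the question's sample learning data.
$nWordPerTopic = array(
    'Politics'   => array('kaczynski' => 1),
    'Technology' => array('sony' => 1, 'phone' => 1, 'ericsson' => 2),
    'Economics'  => array('bank' => 1, 'sony' => 3),
);

// p(topic) = number of words in topic / number of words in total
function topicPrior(array $nWordPerTopic, string $topic): float
{
    $total = 0;
    foreach ($nWordPerTopic as $wordCounts) {
        $total += array_sum($wordCounts);
    }
    return array_sum($nWordPerTopic[$topic]) / $total;
}

// p(word | topic) = number of times word occurs in topic / number of words in topic
function wordGivenTopic(array $nWordPerTopic, string $word, string $topic): float
{
    $count = $nWordPerTopic[$topic][$word] ?? 0;
    return $count / array_sum($nWordPerTopic[$topic]);
}

// log(p(topic) * p(word1|topic) * p(word2|topic) * ...), skipping words that
// were never seen in the topic, which mirrors the `continue` in the answer's loop.
function logScore(array $nWordPerTopic, array $words, string $topic): float
{
    $score = log(topicPrior($nWordPerTopic, $topic));
    foreach ($words as $word) {
        $p = wordGivenTopic($nWordPerTopic, $word, $topic);
        if ($p > 0) {
            $score += log($p);
        }
    }
    return $score;
}

$pTechnology    = topicPrior($nWordPerTopic, 'Technology');            // 4/9
$pSonyEconomics = wordGivenTopic($nWordPerTopic, 'sony', 'Economics'); // 3/4
$logTech = logScore($nWordPerTopic, array('phone', 'sony', 'ericsson'), 'Technology');
```

For the three title words that are known in "Technology", `exp($logTech)` equals 4/9 * 1/4 * 1/4 * 2/4 = 1/72, which is the 0.013888888888889 shown in the question's example output.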