Question

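Here is the scenario: I'm trying to build a mechanism on top of some text comments. For example, I want to count the most frequently used words in a set of comments. Here is my code: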

function cleanWord( &$word ){
    $word = trim($word, "'\".!<>{}()-/\*&^%$#@+~ ");
}

// list of comments
$arr_str =  [
            "  this!! is     the first &test message./",
            "*Second ^message this (is) ",
            "'another\ **message*** !\"}& it is. also the favorite one (message)."
            ];

// To join array's items      
$str = implode(" ", $arr_str);

// To chop the string based on the space
$words = explode(" ",$str);

// To remove redundant character(s)
array_walk($words, 'cleanWord');

// To remove empty array elements
$words = array_filter($words);

print_r($words);

/* Output:
Array
(
    [2] => this
    [3] => is
    [8] => the
    [9] => first
    [10] => test
    [11] => message
    [12] => Second
    [13] => message
    [14] => this
    [15] => is
    [17] => another
    [18] => message
    [20] => it
    [21] => is
    [22] => also
    [23] => the
    [24] => favorite
    [25] => one
    [26] => message
)
*/

As you can see in the fiddle, $words contains an array of all the words in these comments. I also have a table in the database into which I insert the words:

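// Assumes $db is a PDO connection; the prepared statement takes care of quoting $word.
// There is a unique index on the "word" column.
$stmt = $db->prepare("INSERT INTO words (word)
                             VALUES (?)
                      ON DUPLICATE KEY UPDATE used_num = used_num + 1");
foreach( $words as $word ){
    $stmt->execute([$word]);
}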

/* Output:
// words
+----+----------+----------+
| id |   word   | used_num |
+----+----------+----------+
| 1  | this     | 2        |
| 2  | is       | 3        |
| 3  | the      | 2        |
| 4  | first    | 1        |
| 5  | test     | 1        |
| 6  | message  | 4        |
| .  | .        | .        |
| .  | .        | .        |
| .  | .        | .        |
+----+----------+----------+
*/

Then I select the most frequently used words:

SELECT * FROM words
ORDER BY used_num DESC
LIMIT $limit

So what is my problem? In reality, the array looks like this:

$arr_str =  [
               ["  this!! is     the first &test message./", "Jack", "1488905152"],
               ["*Second ^message this (is) ", "Peter", "1488901178"],
               ["'another\ **message*** !\"}& it is. also the favorite one (message).", "John", "1488895116"]
            ];

As you can see, each comment also carries its author and its posting time. Now I want to:

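Build a filtering system based on the Unix timestamp (e.g. get the most used words between times x and y).
Build a list of authors for each word (e.g. the word "message" is used 4 times in these comments; I want to access the list of authors of those comments, i.e. [Jack, Peter, John]).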

Do you have any suggestions for an algorithm to implement these?

Answer 1

You can use a regular expression to clean up the words:

$comments = [
  "  this!! is     the first &test message./",
  "*Second ^message this (is) ",
  "'another\ **message*** !\"}& it is. also the favorite one (message)."
];

$exploded = [];
foreach ($comments as $str) {
  // Collect every run of ASCII letters as one word
  preg_match_all('/[a-zA-Z]+/', $str, $matches);
  $exploded[] = $matches[0];
}

print_r($exploded);
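
Once the words are extracted, counting them takes only a few built-in calls. The following is a minimal sketch, not part of the original answer; it assumes counting should be case-insensitive (the strtolower step is that assumption) and that a flat list of words is enough at this stage:

```php
<?php
$comments = [
    "  this!! is     the first &test message./",
    "*Second ^message this (is) ",
    "'another\\ **message*** !\"}& it is. also the favorite one (message).",
];

$words = [];
foreach ($comments as $str) {
    // Collect every run of ASCII letters in the comment.
    preg_match_all('/[a-zA-Z]+/', $str, $matches);
    foreach ($matches[0] as $word) {
        // Assumption: "Second" and "second" should count as the same word.
        $words[] = strtolower($word);
    }
}

// Map each word to the number of times it occurs, most frequent first.
$counts = array_count_values($words);
arsort($counts);

print_r(array_slice($counts, 0, 2, true)); // message => 4, is => 3
```

This gives the same totals as the used_num column in the question, but only in memory; it does not yet record who wrote each word or when.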

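However, if you want to attach data to each "word", you first have to add a table. Your table has a primary key for each word, which is good, because we don't want to store redundant data.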

Now for another table (worddata):

+----+----------+-----------+
| id |  wordid  | commentid |
+----+----------+-----------+
| 1  | 1        | 2         |
+----+----------+-----------+
          |          \-> refers to the primary key of the comments table
          |
          -> refers to 'this'

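I am assuming here that you have a table that stores all comments (called comments), where each comment is linked to its posting time and has an author ID.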

Essentially, fill this table as follows. First fetch the comments:

SELECT comments_id, comments_text FROM comments

Then filter out the words and insert one row per word into the table:

INSERT INTO worddata (wordid, commentid) VALUES (?, ?)

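I would suggest using a temporary table, because every word in every comment gets its own row, which can add up to a lot of data. wd.wordid = 1 should refer to the word 'this' according to your wordlist table.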

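If the date range is known in advance, you can select only the comments between those dates and insert only the words from those comments.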

Now you can join the table data:

SELECT c.id, c.userid, c.created
FROM `comments` as c
  JOIN `worddata` as wd on wd.commentid = c.id
WHERE wd.wordid = 1

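This example should return the IDs of all comments that contain the word 'this'. If you want to filter by author, add c.userid = # to the WHERE clause. To restrict the result to comments from the past hour, use c.created > UNIX_TIMESTAMP() - 3600 if created holds a Unix timestamp (as in the question), or c.created > NOW() - INTERVAL 1 HOUR if it is a DATETIME column.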

Of course you can select more data if you need it, but this is meant as an example of a join, not as copy-paste code.
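
To make the table layout concrete, here is a rough in-memory sketch of the same idea in plain PHP. It is not the answer's code: the arrays $comments, $wordIds and $worddata and the helper authorsOfWord are hypothetical stand-ins for the comments table, the word list and the worddata table, author names are used in place of numeric user IDs, and words are lowercased by assumption.

```php
<?php
// Hypothetical stand-in for the comments table, keyed by comment id.
$comments = [
    1 => ['text' => "  this!! is     the first &test message./", 'userid' => 'Jack',  'created' => 1488905152],
    2 => ['text' => "*Second ^message this (is) ",                'userid' => 'Peter', 'created' => 1488901178],
    3 => ['text' => "'another\\ **message*** !\"}& it is. also the favorite one (message).",
          'userid' => 'John', 'created' => 1488895116],
];

$wordIds  = []; // word => wordid (plays the role of the word list table)
$worddata = []; // rows of ['wordid' => ..., 'commentid' => ...]

foreach ($comments as $commentId => $comment) {
    preg_match_all('/[a-zA-Z]+/', $comment['text'], $matches);
    foreach ($matches[0] as $word) {
        $word = strtolower($word); // assumption: case-insensitive words
        // Assign the next free id the first time a word is seen.
        if (!isset($wordIds[$word])) {
            $wordIds[$word] = count($wordIds) + 1;
        }
        // One row per occurrence of a word in a comment.
        $worddata[] = ['wordid' => $wordIds[$word], 'commentid' => $commentId];
    }
}

// Mirrors joining worddata to comments on the comment id, filtered by word id.
function authorsOfWord(string $word, array $wordIds, array $worddata, array $comments): array
{
    if (!isset($wordIds[$word])) {
        return [];
    }
    $authors = [];
    foreach ($worddata as $row) {
        if ($row['wordid'] === $wordIds[$word]) {
            $authors[] = $comments[$row['commentid']]['userid'];
        }
    }
    // A word can occur several times in one comment; list each author once.
    return array_values(array_unique($authors));
}

print_r(authorsOfWord('message', $wordIds, $worddata, $comments)); // Jack, Peter, John
```

Because each worddata row only stores two IDs, the author and posting time stay in one place (the comment), and any per-word question becomes a lookup through the comment id.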

Answer 2

A table like this might work:


+----+----------+----------+--------------+
| id |   word   | author   | timestamp    |
+----+----------+----------+--------------+
| 1  | this     | author1  |  1488905152  |
| 2  | is       | author1  |  1488905152  |
| 3  | the      | author1  |  1488905152  |
| 4  | first    | author1  |  1488905152  |
| 5  | test     | author1  |  1488905152  |
| 6  | message  | author1  |  1488905152  |
| 7  | Second   | author2  |  1488905152  |
| 8  | this     | author2  |  1488905152  |
| 9  | the      |  .       |              |
| .  |  .       |  .       |              |
| .  |  .       |  .       |              |
| .  |  .       |  .       |              |
+----+----------+----------+--------------+

To speed up queries, you can add indexes on the columns.

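Another approach is to keep your existing table and add a second table with the columns id, idWord, author, timestamp. Join the two when you need the author or timestamp data. That way you keep one small table holding only the words and their occurrence counts, plus an extended table with the details of each occurrence.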
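
As an illustration of how one row per occurrence makes the time filter easy, here is a small sketch in plain PHP. The $occurrences array and the function topWordsBetween are hypothetical and only imitate the rows of such a table; the sample rows are a subset of the question's data.

```php
<?php
// Hypothetical rows of the extended table: one row per word occurrence.
$occurrences = [
    ['word' => 'this',    'author' => 'Jack',  'timestamp' => 1488905152],
    ['word' => 'message', 'author' => 'Jack',  'timestamp' => 1488905152],
    ['word' => 'message', 'author' => 'Peter', 'timestamp' => 1488901178],
    ['word' => 'this',    'author' => 'Peter', 'timestamp' => 1488901178],
    ['word' => 'message', 'author' => 'John',  'timestamp' => 1488895116],
    ['word' => 'message', 'author' => 'John',  'timestamp' => 1488895116],
];

/**
 * Count word usage between two Unix timestamps (inclusive) and
 * return up to $limit words, most used first.
 */
function topWordsBetween(array $occurrences, int $from, int $to, int $limit): array
{
    $counts = [];
    foreach ($occurrences as $row) {
        if ($row['timestamp'] >= $from && $row['timestamp'] <= $to) {
            $counts[$row['word']] = ($counts[$row['word']] ?? 0) + 1;
        }
    }
    arsort($counts); // highest count first, keys (words) preserved
    return array_slice($counts, 0, $limit, true);
}

print_r(topWordsBetween($occurrences, 0, PHP_INT_MAX, 2)); // message => 4, this => 2
```

In SQL the same filter would be a WHERE on the timestamp column combined with GROUP BY word, which is why an index on timestamp and word helps here.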

How do I save data when ON DUPLICATE KEY executes?
