Calculating the similarity of posts

Time: 2019-12-13 09:37:06

Tags: php

I am using PHP 7.3 and I am calculating the similarity between posts.

<?php

$posts = [
    'post_count' => 3,
    'posts' => [
        [
            'ID' => 1,
            'post_content' => "Wrong do point avoid by fruit learn or in death. So passage however besides invited comfort elderly be me. Walls began of child civil am heard hoped my. Satisfied pretended mr on do determine by.",
        ],
        [
            'ID' => 2,
            'post_content' => "Lorem ipsum dolor sit"
        ],
        [
            'ID' => 3,
            'post_content' => "Months on ye at by esteem desire warmth former. Sure that that way gave any fond now. His boy middleton sir nor engrossed affection excellent."
        ],
        [
            'ID' => 4,
            'post_content' => "Lorem ipsum dolor sit"
        ],
    ]
];

print_r($posts);

function getNonSimilarTexts($posts)
{
    $similarityPercentageArr = array();

    // 'post_count' (3) does not match the four entries in 'posts',
    // so iterate over the array itself instead of counting up to it.
    foreach ($posts['posts'] as $currentPost) {
        // $posts->the_post();
        if (!is_null($currentPost['ID'])) {
            foreach ($posts['posts'] as $comparePost) {
                if (!is_null($comparePost['ID'])) {
                    similar_text(strip_tags($currentPost['post_content']), strip_tags($comparePost['post_content']), $perc);
                    // similarity is 100 if self compare
                    if ($perc != 100) {
                        array_push($similarityPercentageArr, [$currentPost['ID'], $comparePost['ID'], $perc]);
                    }
                }
            }
        }
    }
    return $similarityPercentageArr;
}

$p = getNonSimilarTexts($posts);
print_r($p);

As you can see, the output I get is an array of the form [[ID, ID, similarity_percentage], ...].

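I would like to filter this array and pick out every pair that is more than 20% similar. In addition, from each group of similar posts I want to keep only one post and drop the rest. The result I want is the post IDs 1, 2, 3.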

Any suggestions on how to filter such an array?

2 answers:

Answer 0: (score: 1)

similar_text

similar_text — Calculate the similarity between two strings

levenshtein

levenshtein — Calculate Levenshtein distance between two strings

soundex

soundex — Calculate the soundex key of a string
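To make the difference between these three built-in functions concrete, here is a minimal sketch that calls each of them on short sample strings. The sample words are chosen purely for illustration and are not taken from the question.

```php
<?php
// similar_text() returns the number of matching characters and, through the
// optional third argument (passed by reference), a similarity percentage.
$matching = similar_text('World', 'Word', $percent);
echo $matching, ' ', round($percent, 2), "\n"; // 4 88.89

// levenshtein() returns the minimum number of single-character edits
// (insert, replace, delete) needed to turn one string into the other.
$distance = levenshtein('kitten', 'sitting');
echo $distance, "\n"; // 3

// soundex() maps similar-sounding English words to the same four-character key.
$keyA = soundex('Robert');
$keyB = soundex('Rupert');
echo $keyA, ' ', $keyB, "\n"; // R163 R163
```

Only similar_text reports a percentage directly, which is probably why it fits the 20% threshold in the question most naturally.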

Regarding your question: reading it back, the title does not seem to quite match what you are actually asking!

Wouldn't simply adding one more condition be enough?

<?php

$posts = [
    'post_count' => 3,
    'posts' => [
        [
            'ID' => 1,
            'post_content' => "Wrong do point avoid by fruit learn or in death. So passage however besides invited comfort elderly be me. Walls began of child civil am heard hoped my. Satisfied pretended mr on do determine by.",
        ],
        [
            'ID' => 2,
            'post_content' => "Lorem ipsum dolor sit"
        ],
        [
            'ID' => 3,
            'post_content' => "Months on ye at by esteem desire warmth former. Sure that that way gave any fond now. His boy middleton sir nor engrossed affection excellent."
        ],
        [
            'ID' => 4,
            'post_content' => "Lorem ipsum dolor sit"
        ],
    ]
];

print_r($posts);

function getNonSimilarTexts($posts)
{
    $similarityPercentageArr = array();

    // 'post_count' (3) does not match the four entries in 'posts',
    // so iterate over the array itself instead of counting up to it.
    foreach ($posts['posts'] as $currentPost) {
        // $posts->the_post();
        if (!is_null($currentPost['ID'])) {
            foreach ($posts['posts'] as $comparePost) {
                if (!is_null($comparePost['ID'])) {
                    similar_text(strip_tags($currentPost['post_content']), strip_tags($comparePost['post_content']), $perc);
                    // skip pairs that score 100 (e.g. a post compared with itself) and keep only those above 20
                    if ($perc != 100 && $perc > 20) {
                        array_push($similarityPercentageArr, [$currentPost['ID'], $comparePost['ID'], $perc]);
                    }
                }
            }
        }
    }
    return $similarityPercentageArr;
}

$p = getNonSimilarTexts($posts);
print_r($p);

Output:

Array
(
    [0] => Array
        (
            [0] => 1
            [1] => 3
            [2] => 23.145400593472
        )

)
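If the [[ID, ID, similarity_percentage], ...] array has already been built, it can also be filtered afterwards instead of inside the loop. The following is a minimal sketch using array_filter. The rows in $pairs and the variable names are made up for illustration and are not the real output of the code above.

```php
<?php
// Hypothetical sample rows in the [ID, ID, percentage] shape; the numbers are invented.
$pairs = [
    [1, 2, 5.0],
    [1, 3, 23.1],
    [2, 4, 100.0],
    [3, 4, 12.5],
];
$threshold = 20.0;

// Keep only the rows whose percentage (index 2) exceeds the threshold.
// array_values() re-indexes the result, because array_filter() preserves keys.
$similarPairs = array_values(array_filter(
    $pairs,
    function (array $row) use ($threshold) {
        return $row[2] > $threshold;
    }
));

print_r($similarPairs); // the rows [1, 3, 23.1] and [2, 4, 100.0]
```

Filtering after the fact keeps the threshold in one place, at the cost of storing pairs that are thrown away later.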

Answer 1: (score: 1)

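You can filter right away by changing the condition if ($perc != 100) to if ($perc > 20), so that you only keep the similar posts you want to remove. You then do not even need to store the similarity at all, because you already have a list of the post IDs to remove.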

So once you have code like this:

if ($perc > 20) {
    $similarityPercentageArr[$currentPost['ID']][] = $comparePost['ID'];
}

you can then remove all the unwanted posts as follows:

$postsToRemove = [];
$postsToKeep = [];

foreach ($similarityPercentageArr as $postId => $similarPostIds) {
    // this post has already appeared as similar somewhere, so its similar posts have already been added 
    if (in_array($postId, $postsToRemove)) {
        continue;
    }

    $postsToKeep[] = $postId;
    $postsToRemove = array_merge($postsToRemove, $similarPostIds);
}

You now have the original post IDs in $postsToKeep and the IDs of their similar posts in $postsToRemove.

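I would also optimize the code a little so that similar_text is not called at all when you know you are comparing a post with itself. So instead of if (!is_null($comparePost['ID'])) you would have if (!is_null($comparePost['ID']) && $comparePost['ID'] !== $currentPost['ID']). This also prevents a post from being listed as similar to itself, which the $perc > 20 condition alone would allow, since a self-comparison scores 100.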
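Putting the pieces of this answer together, one possible end-to-end sketch looks like the following. It is a variation rather than the exact code above: the function name splitSimilarPosts, its return shape and the short demo strings are assumptions made for illustration. It walks over every post, so a post with no similar counterpart still ends up in the keep list.

```php
<?php
// Hypothetical helper: the name, parameters and return shape are illustrative only.
function splitSimilarPosts(array $posts, float $threshold = 20.0): array
{
    $keep = [];
    $remove = [];

    foreach ($posts as $current) {
        // Already flagged as similar to a post we decided to keep.
        if (in_array($current['ID'], $remove, true)) {
            continue;
        }
        $keep[] = $current['ID'];

        foreach ($posts as $candidate) {
            // Skip the post itself and anything that is already classified.
            if ($candidate['ID'] === $current['ID']
                || in_array($candidate['ID'], $keep, true)
                || in_array($candidate['ID'], $remove, true)) {
                continue;
            }
            similar_text(
                strip_tags($current['post_content']),
                strip_tags($candidate['post_content']),
                $percent
            );
            if ($percent > $threshold) {
                $remove[] = $candidate['ID'];
            }
        }
    }

    return ['keep' => $keep, 'remove' => $remove];
}

// Tiny demo data: posts 1 and 3 are identical, the others share no characters.
$demo = [
    ['ID' => 1, 'post_content' => 'aaaa'],
    ['ID' => 2, 'post_content' => 'bbbb'],
    ['ID' => 3, 'post_content' => 'aaaa'],
    ['ID' => 4, 'post_content' => 'cccc'],
];
$result = splitSimilarPosts($demo);
print_r($result['keep']);   // IDs 1, 2, 4
print_r($result['remove']); // ID 3
```

The first post of each similar group wins, so the outcome depends on the order of the input array. Sorting the posts beforehand (for example by date) would decide which one is kept.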