I have an array of 50 numbers:
54,12,79,34,66,22,78,192,54,23,55,87,23,63... (up to 50)
And I have one million more arrays of 50 numbers each:
1: 76,34,67,4,12,34... (up to 50)
2: 34,12,68,97,55,33... (up to 50)
3: 21,65,87,23,65,45... (up to 50)
4: ....
5: (up to one million)
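1) How can I store this in an optimized MySQL database?
2) Most importantly, how do I compare the first array against the others to find out how similar they are? I mean... I would like: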
Similarity to 1: 13%
Similarity to 2: 11%
Similarity to 3: 16%
...
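The similarity should be computed position by position... first element against first, second against second, ... and then averaged over the 50 elements.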
Answer 0 (score: 1)
If order doesn't matter, you can store them as sorted arrays:
1: 4,12,34,34,67,76... (up to 50)
2: 12,33,34,55,68,97... (up to 50)
3: 21,23,45,65,65,87... (up to 50)
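With sorted arrays, you can get the difference between any two sequences very simply, using an algorithm similar to the merge step for sorted arrays. The comparison itself is a single linear pass, so including the sort you get O(n * logn) time.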
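The merge-style comparison could be sketched roughly as follows. This is only an illustration, not the answerer's code: `countCommonSorted` is a hypothetical helper name, and the sketch assumes both arrays are sorted ascending and that duplicate values match one-for-one.

```php
<?php
// Hypothetical helper: counts the values shared by two ascending-sorted arrays.
// Duplicates are treated as a multiset, so each occurrence matches at most once.
function countCommonSorted(array $a, array $b): int
{
    $i = 0;
    $j = 0;
    $common = 0;
    $na = count($a);
    $nb = count($b);
    // Walk both arrays in lockstep, as in the merge step of merge sort.
    while ($i < $na && $j < $nb) {
        if ($a[$i] === $b[$j]) {
            $common++;
            $i++;
            $j++;
        } elseif ($a[$i] < $b[$j]) {
            $i++;
        } else {
            $j++;
        }
    }
    return $common;
}

// The first two sample rows from above (only the six values that are shown).
$first  = array(4, 12, 34, 34, 67, 76);
$second = array(12, 33, 34, 55, 68, 97);
sort($first);
sort($second);
$similarity = countCommonSorted($first, $second) / count($first) * 100;
```

Because each pointer only moves forward, one comparison costs at most count($a) + count($b) steps. If the rows are stored already sorted, the sort happens once per row rather than once per comparison.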
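But if you need to compare, and the values have reasonable upper and lower bounds, you can simply enumerate all the unique numbers across all sequences, i.e.: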
0 => 4
1 => 12
2 => 21
3 => 23
4 => 33
5 => 34
6 => 45
7 => 55
8 => 65
9 => 67
10 => 68
11 => 76
12 => 87
13 => 97
and store each sequence as a string of counters, one digit per enumerated number, i.e.:
1: 11000200010100
2: 01001101001001
3: 00110010200010
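So the difference is the number of differing digits divided by the total number of characters. In practice, though, I don't see any advantage to this approach, since it is also O(n * logn).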
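To make the counter idea concrete, here is a minimal sketch under stated assumptions. `toCounters` and `counterDifference` are hypothetical helper names, the vocabulary is built only from the two short sample rows below, and "difference" is read as the share of slots whose counters differ.

```php
<?php
// Hypothetical helper: turns a list of numbers into a counter vector.
// $vocabulary maps each distinct number to its slot index, e.g. array(4 => 0, 12 => 1).
function toCounters(array $numbers, array $vocabulary): array
{
    $counters = array_fill(0, count($vocabulary), 0);
    foreach ($numbers as $n) {
        $counters[$vocabulary[$n]]++;
    }
    return $counters;
}

// Hypothetical helper: percentage of slots whose counters differ.
function counterDifference(array $x, array $y): float
{
    $differing = 0;
    foreach ($x as $slot => $count) {
        if ($count !== $y[$slot]) {
            $differing++;
        }
    }
    return $differing / count($x) * 100;
}

$sequences = array(
    array(4, 12, 34, 34, 67, 76),
    array(12, 33, 34, 55, 68, 97),
);

// Enumerate every unique number in ascending order, then map number => slot.
$unique = array_unique(array_merge(...$sequences));
sort($unique);
$vocabulary = array_flip($unique);

$c1 = toCounters($sequences[0], $vocabulary);
$c2 = toCounters($sequences[1], $vocabulary);
$difference = counterDifference($c1, $c2);
```

Note that the vector length equals the number of distinct values across all rows, so with a million rows and a wide value range each stored row can become much longer than the 50 numbers it encodes.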
Answer 1 (score: 0)
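This should get you going. It is accurate as long as every array you use contains the same number of elements, as you mentioned in your initial question.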
<?php
$base = array(636,3305,705,3080,1895,3586,1879,817,3330,2884,487,1267,1016,2100,3598,2535,3894,2945,282,1182,3785,2489,3812,2829,1332,229,3577,125,2735,1126,1194,3366,430,1895,2446,2321,1480,325,3133,809,3204,3616,2071,220,1715,1669,2750,1608,613,3028);
$compare_a = array(355,3118,1293,2333,3632,2652,2677,1360,1295,1478,2742,1157,2545,2151,1593,3992,601,1913,1317,3728,581,3325,2612,1710,1430,1985,399,2731,2408,3821,1563,2759,2939,2852,1091,2570,1503,3764,3926,2794,1241,2668,3947,3782,818,1540,3774,1414,3449,1091);
$compare_b = array(1821,2179,1411,1559,193,3304,1484,2125,2722,1879,2031,2611,1142,928,1372,2140,1230,1498,1250,1362,287,3055,2933,186,3310,3397,3665,2196,691,7,3677,2508,2182,1088,66,2371,391,1546,495,3108,3421,2522,1719,563,3446,3087,2698,676,584,3944);
$compare_c = array(3354,3250,2884,1803,3844,1981,2882,1998,1196,1959,495,3514,3284,844,1848,2834,2415,459,3158,1862,1123,2334,491,3668,1136,406,4000,3854,2326,2169,2250,1680,1419,1133,3478,1262,3110,2359,3255,305,318,3745,3814,3598,589,1662,2431,2999,2116,1589);
$compare_d = array(1474,3489,2708,1704,2086,3248,2817,3403,467,3783,3208,3348,2426,595,3998,2089,2948,3546,189,2510,1723,1054,2364,3330,3480,3553,697,2268,3544,2338,374,1017,1827,3077,2717,3908,2325,1533,3310,2788,1316,2518,2135,3737,3109,2133,1826,2056,1678,2011);
$compare_e = array(2688,2677,3180,154,1614,3138,3234,3219,2160,3929,3951,2577,2157,1592,174,148,604,2921,1681,2425,1334,45,2550,2421,3833,47,716,2117,459,3702,3997,3142,2378,3177,3292,3988,2315,2525,3206,474,2453,3157,3047,610,748,3217,753,1347,2137,2430);
// Percentage of $base values that also occur somewhere in $other (matched by value, position is ignored).
function similarity(array $base, array $other): float
{
    $shared = count($base) - count(array_diff($base, $other));
    return $shared / count($base) * 100;
}

$candidates = array('a' => $compare_a, 'b' => $compare_b, 'c' => $compare_c, 'd' => $compare_d, 'e' => $compare_e);
$i = 1;
foreach ($candidates as $name => $candidate) {
    print $i++ . ') $base compared with $compare_' . $name . ' is: ' . similarity($base, $candidate) . '% similar to $base<br />';
}
?>
The code above should print:
1) $base compared with $compare_a is: 0% similar to $base
2) $base compared with $compare_b is: 2% similar to $base
3) $base compared with $compare_c is: 4% similar to $base
4) $base compared with $compare_d is: 2% similar to $base
5) $base compared with $compare_e is: 0% similar to $base
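It really depends on which algorithm you want to use to determine similarity. In your question you say you want each element compared with the element at the same position in the other array. Note that PHP's built-in array_diff(), used above, matches values regardless of where they occur; for a strict same-position comparison, array_diff_assoc() compares index and value together. A complete example would depend on how you retrieve these arrays. I could modify it to pull the data from the database and run the calculation in a loop, or whatever. But I would need more details to help you with that.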
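For a strictly position-by-position comparison, one possible sketch uses array_diff_assoc(), which compares index and value together, whereas array_diff() matches a value wherever it occurs. This is one interpretation of the question rather than a definitive answer: `positionalSimilarity` is a hypothetical helper name, "similar" is taken to mean an exact match at the same index, and the `$other` values are made up for illustration.

```php
<?php
// Hypothetical helper: percentage of positions where both arrays hold the same value.
// Assumes both arrays have the same length and plain 0..n-1 integer keys.
function positionalSimilarity(array $base, array $other): float
{
    // array_diff_assoc() keeps the entries of $base whose key/value pair is absent from $other.
    $mismatches = count(array_diff_assoc($base, $other));
    return (count($base) - $mismatches) / count($base) * 100;
}

// First five values of the question's array, against an illustrative (made-up) row.
$base  = array(54, 12, 79, 34, 66);
$other = array(76, 12, 66, 34, 79);

echo positionalSimilarity($base, $other) . "% similar\n";
```

Here 79 and 66 occur in both arrays but at different indexes, so they do not count as matches, whereas array_diff() would treat them as shared values.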