Array
(
[0] => The N2225 and N2226 SAS/SATA HBAs are low-cost, high-performance host bus adapters for high-performance connectivity between System x® servers and tape drives and RAID storage systems. The N2225 provides two x4 external mini-SAS HD connectors with eight lanes of 12 Gbps SAS. The N2226 provides four x4 external mini-SAS HD connectors with 16 lanes of 12 Gbps SAS.
[1] => The N2225 and N2226 SAS/SATA HBAs are low-cost, high-performance host bus adapters for high-performance connectivity between System x® servers and tapes drives and RAID storage systems. The N2225 provides two x4 external mini-SAS HD connectors with eight lanes of 12 Gbps SAS. The N2226 provides four x4 external mini-SAS HD connectors with 16 lanes of 12 Gbps SAS.
[2] => The N2225 and N2226 SAS/SATA HBAs support SAS data transfer rates of 3, 6, and 12 Gbps per lane and SATA transfer rates of 3 and 6 Gbps per lane, and they enable maximum connectivity and performance in a low-profile (N2225) or full-height (N2226) form factor.
[3] => Rigorous testing of the N2225 and N2226 SAS/SATA HBAs by Lenovo through the ServerProven® program ensures a high degree of confidence in storage subsystem compatibility and reliability. Providing an additional peace of mind, these controllers are covered under Lenovo warranty.
[4] => The following tables list the compatibility information for the N2225 and N2226 SAS/SATA HBAs and System x®, iDataPlex®, and NeXtScale™ servers.
[5] => For more information about the System x servers, including older servers that support the N2225 and N2226 adapters, see the following ServerProven® website:
[6] => The following table lists the external storage systems that are currently offered by Lenovo that can be used with the N2225 and N2226 SAS/SATA HBAs in storage solutions.
[7] => The following table lists the external tape backup units that are currently offered by Lenovo that can be used with the N2225 and N2226 SAS/SATA HBAs in tape backup solutions.
[8] => For more information about the specific versions and service levels that are supported and any other prerequisites, see the ServerProven website:
[9] => The N2225 and N2226 SAS/SATA HBAs carry a one-year limited warranty. When installed in a supported System x server, the adapters assume your system’s base warranty and any Lenovo warranty upgrade.
)
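array_unique does not help here, because the elements are not exactly identical: one element contains the same data as another plus a little more, sometimes only a few words.
How can I filter these out?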
Answer 0 (score: 12)
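First of all, the problem is not that simple, and it is not well defined: you do not want to remove identical elements, you want to remove similar ones, so your first task is to decide which elements count as similar.
Since the difference can occur at any point in the string, it is not enough to require that the strings start with the same characters. For example, take these two sentences (adapted from your question):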
Rigorous testing of the N2225 and N2226 SAS/SATA HBAs by Lenovo through the ServerProven® program ensures a high degree of confidence in storage subsystem compatibility and reliability. Providing an additional peace of mind, these controllers are covered under Lenovo warranty.
The rigorous testing of the N2225 and N2226 SAS/SATA HBAs by Lenovo through the ServerProven® program ensures a high degree of confidence in storage subsystem compatibility and reliability. Providing an additional peace of mind, these controllers are covered under Lenovo warranty.
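They are very similar, yet they do not start with the same string. One way to measure similarity is the Smith–Waterman algorithm, for which PHP implementations are available.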
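To illustrate the point, here is a minimal sketch using shortened versions of the two sentences above (the shortened strings and the variable names are assumptions made for brevity). A prefix check with strpos() finds nothing, while the built-in similar_text() still reports a high similarity percentage.

```php
<?php
// Shortened, hypothetical versions of the two example sentences.
$a = 'Rigorous testing of the adapters by Lenovo ensures compatibility.';
$b = 'The rigorous testing of the adapters by Lenovo ensures compatibility.';

// Prefix check: neither string starts with the other.
$sharesPrefix = strpos($a, $b) === 0 || strpos($b, $a) === 0;

// similar_text() writes the similarity percentage into its third argument.
$percent = 0.0;
similar_text($a, $b, $percent);

var_dump($sharesPrefix);               // bool(false)
printf("%.1f%% similar\n", $percent);  // well above 90%
```

A prefix-based filter would keep both sentences, whereas a percentage threshold can treat them as duplicates.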
--- Later edit ---
Here is an implementation that uses PHP's built-in similar_text():
/**
 * @param array $array input array
 * @param int $minSimilarity minimum similarity for an item to be removed (percentage)
 * @return array
 */
function applyFilter ($array, $minSimilarity = 90) {
    $result = [];
    foreach ($array as $outerValue) {
        $append = true;
        foreach ($result as $key => $innerValue) {
            $similarity = null;
            // similar_text() stores the similarity percentage in $similarity
            similar_text($innerValue, $outerValue, $similarity);
            if ($similarity >= $minSimilarity) {
                if (strlen($outerValue) > strlen($innerValue)) {
                    // always keep the longer one
                    $result[$key] = $outerValue;
                }
                $append = false;
                break;
            }
        }
        if ($append) {
            $result[] = $outerValue;
        }
    }
    return $result;
}
$test = [
'The N2225 and N2226 SAS/SATA HBAs are low-cost, high-performance host bus adapters for high-performance connectivity between System x® servers and tape drives and RAID storage systems. The N2225 provides two x4 external mini-SAS HD connectors with eight lanes of 12 Gbps SAS. The N2226 provides four x4 external mini-SAS HD connectors with 16 lanes of 12 Gbps SAS.',
'The N2225 and N2226 SAS/SATA HBAs are low-cost, high-performance host bus adapters for high-performance connectivity between System x® servers and tapes drives and RAID storage systems. The N2225 provides two x4 external mini-SAS HD connectors with eight lanes of 12 Gbps SAS. The N2226 provides four x4 external mini-SAS HD connectors with 16 lanes of 12 Gbps SAS.',
'The N2225 and N2226 SAS/SATA HBAs support SAS data transfer rates of 3, 6, and 12 Gbps per lane and SATA transfer rates of 3 and 6 Gbps per lane, and they enable maximum connectivity and performance in a low-profile (N2225) or full-height (N2226) form factor.',
'Rigorous testing of the N2225 and N2226 SAS/SATA HBAs by Lenovo through the ServerProven® program ensures a high degree of confidence in storage subsystem compatibility and reliability. Providing an additional peace of mind, these controllers are covered under Lenovo warranty.',
'The following tables list the compatibility information for the N2225 and N2226 SAS/SATA HBAs and System x®, iDataPlex®, and NeXtScale™ servers.',
'For more information about the System x servers, including older servers that support the N2225 and N2226 adapters, see the following ServerProven® website:',
'The following table lists the external storage systems that are currently offered by Lenovo that can be used with the N2225 and N2226 SAS/SATA HBAs in storage solutions.',
'The following table lists the external tape backup units that are currently offered by Lenovo that can be used with the N2225 and N2226 SAS/SATA HBAs in tape backup solutions.',
'For more information about the specific versions and service levels that are supported and any other prerequisites, see the ServerProven website:',
'The N2225 and N2226 SAS/SATA HBAs carry a one-year limited warranty. When installed in a supported System x server, the adapters assume your system’s base warranty and any Lenovo warranty upgrade.',
];
var_dump(applyFilter($test));
--- End of later edit ---
Below is complete working code for the Smith–Waterman algorithm:
class SmithWatermanGotoh
{
    private $gapValue;
    private $substitution;

    /**
     * Constructs a new Smith Waterman metric.
     *
     * @param float $gapValue
     *            a non-positive gap penalty
     * @param SmithWatermanMatchMismatch|null $substitution
     *            a substitution function
     */
    public function __construct($gapValue = -0.5, $substitution = null)
    {
        if ($gapValue > 0.0) throw new Exception("gapValue must be <= 0");
        // fall back to a default substitution function when none is given
        if (empty($substitution)) $this->substitution = new SmithWatermanMatchMismatch(1.0, -2.0);
        else $this->substitution = $substitution;
        $this->gapValue = $gapValue;
    }

    /**
     * Returns a similarity score between 0.0 and 1.0.
     */
    public function compare($a, $b)
    {
        if (empty($a) && empty($b)) {
            return 1.0;
        }
        if (empty($a) || empty($b)) {
            return 0.0;
        }
        // best possible score: every character of the shorter string matches
        $maxDistance = min(mb_strlen($a), mb_strlen($b))
            * max($this->substitution->max(), $this->gapValue);
        return $this->smithWatermanGotoh($a, $b) / $maxDistance;
    }

    private function smithWatermanGotoh($s, $t)
    {
        // $v0 holds the previous row of the score matrix, $v1 the current row
        $v0 = [];
        $v1 = [];
        $t_len = mb_strlen($t);
        $max = $v0[0] = max(0, $this->gapValue, $this->substitution->compare($s, 0, $t, 0));
        for ($j = 1; $j < $t_len; $j++) {
            $v0[$j] = max(0, $v0[$j - 1] + $this->gapValue,
                $this->substitution->compare($s, 0, $t, $j));
            $max = max($max, $v0[$j]);
        }
        // Find max
        for ($i = 1; $i < mb_strlen($s); $i++) {
            $v1[0] = max(0, $v0[0] + $this->gapValue, $this->substitution->compare($s, $i, $t, 0));
            $max = max($max, $v1[0]);
            for ($j = 1; $j < $t_len; $j++) {
                $v1[$j] = max(0, $v0[$j] + $this->gapValue, $v1[$j - 1] + $this->gapValue,
                    $v0[$j - 1] + $this->substitution->compare($s, $i, $t, $j));
                $max = max($max, $v1[$j]);
            }
            for ($j = 0; $j < $t_len; $j++) {
                $v0[$j] = $v1[$j];
            }
        }
        return $max;
    }
}

class SmithWatermanMatchMismatch
{
    private $matchValue;
    private $mismatchValue;

    /**
     * Constructs a new match-mismatch substitution function. When two
     * characters are equal a score of <code>matchValue</code> is assigned. In
     * case of a mismatch a score of <code>mismatchValue</code>. The
     * <code>matchValue</code> must be strictly greater than
     * <code>mismatchValue</code>
     *
     * @param float $matchValue
     *            value when characters are equal
     * @param float $mismatchValue
     *            value when characters are not equal
     */
    public function __construct($matchValue, $mismatchValue) {
        if ($matchValue <= $mismatchValue) throw new Exception("matchValue must be > mismatchValue");
        $this->matchValue = $matchValue;
        $this->mismatchValue = $mismatchValue;
    }

    public function compare($a, $aIndex, $b, $bIndex) {
        // note: $a[$aIndex] indexes bytes, while the caller measures lengths with mb_strlen()
        return ($a[$aIndex] === $b[$bIndex] ? $this->matchValue
            : $this->mismatchValue);
    }

    public function max() {
        return $this->matchValue;
    }

    public function min() {
        return $this->mismatchValue;
    }
}
/**
 * @param array $array input array
 * @param int $minSimilarity minimum similarity for an item to be removed (percentage)
 * @return array
 */
function applyFilter ($array, $minSimilarity = 90) {
    $swg = new SmithWatermanGotoh();
    $result = [];
    foreach ($array as $outerValue) {
        $append = true;
        foreach ($result as $key => $innerValue) {
            // compare() returns a value between 0 and 1, so scale it to a percentage
            $similarity = $swg->compare($innerValue, $outerValue) * 100;
            if ($similarity >= $minSimilarity) {
                if (strlen($outerValue) > strlen($innerValue)) {
                    // always keep the longer one
                    $result[$key] = $outerValue;
                }
                $append = false;
                break;
            }
        }
        if ($append) {
            $result[] = $outerValue;
        }
    }
    return $result;
}
$test = [
'The N2225 and N2226 SAS/SATA HBAs are low-cost, high-performance host bus adapters for high-performance connectivity between System x® servers and tape drives and RAID storage systems. The N2225 provides two x4 external mini-SAS HD connectors with eight lanes of 12 Gbps SAS. The N2226 provides four x4 external mini-SAS HD connectors with 16 lanes of 12 Gbps SAS.',
'The N2225 and N2226 SAS/SATA HBAs are low-cost, high-performance host bus adapters for high-performance connectivity between System x® servers and tapes drives and RAID storage systems. The N2225 provides two x4 external mini-SAS HD connectors with eight lanes of 12 Gbps SAS. The N2226 provides four x4 external mini-SAS HD connectors with 16 lanes of 12 Gbps SAS.',
'The N2225 and N2226 SAS/SATA HBAs support SAS data transfer rates of 3, 6, and 12 Gbps per lane and SATA transfer rates of 3 and 6 Gbps per lane, and they enable maximum connectivity and performance in a low-profile (N2225) or full-height (N2226) form factor.',
'Rigorous testing of the N2225 and N2226 SAS/SATA HBAs by Lenovo through the ServerProven® program ensures a high degree of confidence in storage subsystem compatibility and reliability. Providing an additional peace of mind, these controllers are covered under Lenovo warranty.',
'The following tables list the compatibility information for the N2225 and N2226 SAS/SATA HBAs and System x®, iDataPlex®, and NeXtScale™ servers.',
'For more information about the System x servers, including older servers that support the N2225 and N2226 adapters, see the following ServerProven® website:',
'The following table lists the external storage systems that are currently offered by Lenovo that can be used with the N2225 and N2226 SAS/SATA HBAs in storage solutions.',
'The following table lists the external tape backup units that are currently offered by Lenovo that can be used with the N2225 and N2226 SAS/SATA HBAs in tape backup solutions.',
'For more information about the specific versions and service levels that are supported and any other prerequisites, see the ServerProven website:',
'The N2225 and N2226 SAS/SATA HBAs carry a one-year limited warranty. When installed in a supported System x server, the adapters assume your system’s base warranty and any Lenovo warranty upgrade.',
];
var_dump(applyFilter($test));
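Now you only need to adjust the $minSimilarity variable to suit your needs. For example, in your case, if you keep the default of 90%, the first element is removed (it is 99.86% similar to the second). Setting a lower value (80%), however, also removes the 8th element.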
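When choosing a value for $minSimilarity, it can help to look at the pairwise scores first. The following is a minimal sketch, not part of the code above: pairwiseSimilarity() is a hypothetical helper, it uses similar_text() rather than Smith–Waterman, and the sample strings are shortened stand-ins for the real data.

```php
<?php
/**
 * Returns every pair of items with its similar_text() percentage,
 * sorted from most to least similar (hypothetical helper).
 */
function pairwiseSimilarity(array $items): array
{
    $items = array_values($items);
    $pairs = [];
    $n = count($items);
    for ($i = 0; $i < $n; $i++) {
        for ($j = $i + 1; $j < $n; $j++) {
            $percent = 0.0;
            similar_text($items[$i], $items[$j], $percent);
            $pairs[] = ['a' => $i, 'b' => $j, 'percent' => $percent];
        }
    }
    // highest similarity first
    usort($pairs, function ($x, $y) {
        return $y['percent'] <=> $x['percent'];
    });
    return $pairs;
}

$sample = [
    'tape drives and RAID storage',
    'tapes drives and RAID storage',
    'one-year limited warranty',
];
foreach (pairwiseSimilarity($sample) as $pair) {
    printf("%d vs %d: %.2f%%\n", $pair['a'], $pair['b'], $pair['percent']);
}
```

A gap between the scores of the pairs you consider duplicates and the scores of the rest is a reasonable place to put the threshold.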
Hope it helps!
Answer 1 (score: 1)
You can still use array_filter with a custom callback that uses substr_count to check whether a value occurs more than once in the array:
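$input = array("a","b","c","d","ax","cz");
// join the distinct values into one string so that substrings can be counted
$str = implode("|",array_unique($input));
$output = array_filter($input, function($var) use ($str){
    // keep a value only if it occurs exactly once, i.e. it is not contained in another value
    return substr_count($str, $var) == 1;
});
print_r($output); // keeps "b", "d", "ax" and "cz" (keys 1, 3, 4, 5)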
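The same containment idea can also be written with an explicit loop. This is only a sketch under two assumptions that go beyond the answer above: exact duplicates should collapse to a single copy, and removeContained() is a hypothetical helper name.

```php
<?php
/**
 * Drops every string that is contained in another, longer string of the list.
 * Exact duplicates are collapsed first (hypothetical helper).
 */
function removeContained(array $items): array
{
    $items = array_values(array_unique($items));
    $kept = [];
    foreach ($items as $i => $candidate) {
        $contained = false;
        foreach ($items as $j => $other) {
            // after array_unique(), a match at another index means strict containment
            if ($i !== $j && $candidate !== '' && strpos($other, $candidate) !== false) {
                $contained = true;
                break;
            }
        }
        if (!$contained) {
            $kept[] = $candidate;
        }
    }
    return $kept;
}

print_r(removeContained(["a", "b", "c", "d", "ax", "cz", "b"]));
// keeps "b", "d", "ax" and "cz"
```

Like the substr_count version, this only catches values that appear verbatim inside another value; it does not catch the "tape" versus "tapes" case from the question.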
Answer 2 (score: 1)
Assuming the shorter value always appears at the very beginning of the longer one, you can do the following:
$arr = ["Some Text.", "Some Text. And more details."];
foreach($arr as $key => $value) {
    // Look for the value in every element
    foreach($arr as $key2 => $value2) {
        // Remove element if its value appears at the beginning of another element
        if ($key !== $key2 && strpos($value2, $value) === 0) {
            unset($arr[$key]);
            continue 2;
        }
    }
}
// Re-index array
$arr = array_values($arr);
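This also works if the elements are in the opposite order.
Answer 3 (score: 1)
"sometimes only a few words."
As you said, an element may differ from another by only a few words. In programming, however, you need a precise condition to filter on.
You can set a match percentage to filter on.
Here is a basic example to give you the idea.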
<?php
$data = ["this is test","this is another test","one test","two test","this is two test"];
$percentageMatched = 100;//Here you can put your percentage matched to delete
for($i=0;$i<count($data)-1;$i++){
    $value = explode(" ",$data[$i]);
    /* check each word in another text */
    for($k=$i+1;$k<count($data);$k++){
        $nextArray = explode(" ",$data[$k]);
        $foundCount = 0;
        for($j=0;$j<count($value);$j++){
            if(in_array($value[$j],$nextArray)){
                $foundCount++;
            }
        }
        $fromLine = $i;
        $toLine = $k;
        $percentage = $foundCount/count($value)*100;
        echo "EN $fromLine matched $percentage % with EN $toLine \n";
        if($percentage >= $percentageMatched){
            // blank out the element; array_filter() below drops empty strings
            $data[$i] = "";
            break;
        }
    }
    echo ".............\n";
}
print_r(array_filter($data));
?>
If the input data is:
Array
(
[0] => this is test
[1] => this is another test
[2] => one test
[3] => two test
[4] => this is two test
)
it produces the following output with a match percentage of 100%.
Here, indexes 0 and 3 match 100% and are filtered out:
EN 0 matched 100 % with EN 1
.............
EN 1 matched 25 % with EN 2
EN 1 matched 25 % with EN 3
EN 1 matched 75 % with EN 4
.............
EN 2 matched 50 % with EN 3
EN 2 matched 50 % with EN 4
.............
EN 3 matched 100 % with EN 4
.............
Array
(
[1] => this is another test
[2] => one test
[4] => this is two test
)
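The word-overlap measure used in the loop above can be isolated into a small function. This is a sketch only, and wordOverlapPercent() is a hypothetical name. It also shows that the measure is asymmetric: it counts how many words of the first string occur in the second, not the other way round.

```php
<?php
/**
 * Percentage of the words in $a that also occur in $b (hypothetical helper).
 */
function wordOverlapPercent(string $a, string $b): float
{
    $wordsA = explode(' ', $a);
    $wordsB = explode(' ', $b);
    $found = 0;
    foreach ($wordsA as $word) {
        if (in_array($word, $wordsB, true)) {
            $found++;
        }
    }
    return 100.0 * $found / count($wordsA);
}

echo wordOverlapPercent('this is test', 'this is another test'), "\n";     // 100
echo wordOverlapPercent('this is another test', 'this is test'), "\n";     // 75
echo wordOverlapPercent('this is another test', 'this is two test'), "\n"; // 75
```

Because of this asymmetry, the order in which elements are compared decides which one is removed; in the loop above it is always the earlier element.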
Answer 4 (score: 0)
Using array_filter is a good option:
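// $temp remembers the previous element between calls of the callback
$temp = "";
function prefixmatch($x){
    global $temp;
    // compare against the previous element, then remember the current one
    $prev = $temp;
    $temp = $x;
    // do an optimistic linear search to determine if there's a prefix match
    $len = min(strlen($x), strlen($prev));
    $bool = $len > 0;
    for($i=0; $i < $len; $i++){
        $bool = $bool && ($x[$i] === $prev[$i]);
    }
    // negate the result because array_filter keeps elements for which the callback returns true
    return(!$bool);
}
// $array1 is the input array from the question
print_r(array_filter($array1, "prefixmatch"));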
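Answer 5 (score: 0)
I think stemming and lemmatisation could help in this case. If we take the first two elements of the array, the only difference is the singular 'tape' versus the plural 'tapes':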
Array
(
[0] => The N2225 and N2226 SAS/SATA HBAs are low-cost, high-performance host bus adapters for high-performance connectivity between System x® servers and tape drives and RAID storage systems. The N2225 provides two x4 external mini-SAS HD connectors with eight lanes of 12 Gbps SAS. The N2226 provides four x4 external mini-SAS HD connectors with 16 lanes of 12 Gbps SAS.
[1] => The N2225 and N2226 SAS/SATA HBAs are low-cost, high-performance host bus adapters for high-performance connectivity between System x® servers and tapes drives and RAID storage systems. The N2225 provides two x4 external mini-SAS HD connectors with eight lanes of 12 Gbps SAS. The N2226 provides four x4 external mini-SAS HD connectors with 16 lanes of 12 Gbps SAS.
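If you tokenise the strings and pass the tokens through a stemmer such as Php Stemmer, both 'tape' and 'tapes' are reduced to their stem, 'tape'. After stemming, you can compare your array elements. I believe this will remove many redundant elements.
You can also go a step further and lemmatise the strings. For example, in English, the verb 'to walk' may appear as 'walk', 'walked', 'walks' or 'walking'. The base form, 'walk', that one might look up in a dictionary, is called the lemma of the word (from Wikipedia).
I personally use Stanford NLP in Java. There is also a PHP implementation, PHP-Stanford-NLP.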
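As a rough illustration of the idea, here is a sketch that normalises strings before comparing them. It does not use Php Stemmer or Stanford NLP: naiveStem() and normalise() are hypothetical helpers, and the plural stripping is a deliberately crude stand-in for a real stemmer.

```php
<?php
/**
 * Deliberately naive stand-in for a stemmer: lower-cases the word and strips
 * one trailing "s" from words longer than three characters (hypothetical helper).
 */
function naiveStem(string $word): string
{
    $word = strtolower($word);
    if (strlen($word) > 3 && substr($word, -1) === 's' && substr($word, -2) !== 'ss') {
        return substr($word, 0, -1);
    }
    return $word;
}

/** Tokenises on whitespace and stems every token (hypothetical helper). */
function normalise(string $text): string
{
    $tokens = preg_split('/\s+/', trim($text));
    return implode(' ', array_map('naiveStem', $tokens));
}

$a = 'servers and tape drives';
$b = 'servers and tapes drives';
var_dump(normalise($a) === normalise($b)); // bool(true)
```

With the normalised form used as an array key, strings that differ only in such suffixes collapse into a single entry. A real stemmer handles far more cases than this sketch does.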
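Answer 6 (score: 0)
The solution depends on your definition of "similarity" and on your data set. It can differ considerably from one context to another.
One solution that may meet your needs is cosine similarity. Here is a code sample: Cosine similarity vs Hamming distance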
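For reference, a minimal sketch of cosine similarity over word counts follows. It is not the code from the linked sample: termFrequencies() and cosineSimilarity() are hypothetical helpers, and splitting on non-word characters is an assumption about how the text should be tokenised.

```php
<?php
/** Counts how often each lower-cased word occurs (hypothetical helper). */
function termFrequencies(string $text): array
{
    $words = preg_split('/\W+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    return array_count_values($words);
}

/**
 * Cosine of the angle between the word-count vectors of two strings:
 * 1.0 for identical word counts, 0.0 when no word is shared (hypothetical helper).
 */
function cosineSimilarity(string $a, string $b): float
{
    $fa = termFrequencies($a);
    $fb = termFrequencies($b);
    $dot = 0;
    foreach ($fa as $word => $count) {
        $dot += $count * ($fb[$word] ?? 0);
    }
    $square = function ($c) {
        return $c * $c;
    };
    $normA = sqrt(array_sum(array_map($square, $fa)));
    $normB = sqrt(array_sum(array_map($square, $fb)));
    if ($normA == 0 || $normB == 0) {
        return 0.0;
    }
    return $dot / ($normA * $normB);
}

echo cosineSimilarity('tape drives and storage', 'tapes drives and storage'), "\n"; // 0.75
```

Because whole words are compared, 'tape' and 'tapes' count as different terms here; combining this with stemming would raise the score for such pairs.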
Answer 7 (score: -4)
In PHP you can use the array_unique function to remove duplicates from an array.
Example from php.net:
<?php
$input = array("a" => "green", "red", "b" => "green", "blue", "red");
$result = array_unique($input);
print_r($result);
?>
The output is:
Array
(
[a] => green
[0] => red
[1] => blue
)
Hope this is what you are looking for.