Question

How do you write a regular expression pattern that matches a string even when the match is only about 90% accurate?

For example:

$search_string = "Two years in,&nbsp;the <a href='site.com'>company</a> has expanded to 35 cities, five of which are outside the U.S. ";

$subject = "Two years in,the company has expanded to 35 cities, five of which are outside the U.S.";

The desired end result is that $search_string matches $subject and returns true, even though the two are not 100% identical.

Answer 1

You can make some parts of the regex pattern optional. For example:

$search_string = "A tiny little bear";
// The optional group includes its trailing space, so "A little bear" matches too
$regex = "/A ([a-zA-Z]+ )?little bear/";

The ? character makes the group before it optional, and [a-zA-Z]+ means one or more letters will appear there.

So, with preg_match you can get validation that is not 100% strict.
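As a minimal sketch of the optional-group idea, the snippet below runs preg_match against a few candidate strings. The anchored pattern and the candidate list are assumptions chosen for illustration; the trailing space is folded into the optional group so the sentence still matches when the middle word is absent.

```php
<?php
// Hypothetical pattern: the middle word (plus its trailing space) is optional.
$regex = '/^A (?:[a-zA-Z]+ )?little bear$/';

// Hypothetical inputs: with the word, without it, and with a non-letter word.
$candidates = ['A tiny little bear', 'A little bear', 'A 123 little bear'];

foreach ($candidates as $candidate) {
    // preg_match() returns 1 on a match, 0 on no match, false on error.
    $matched = preg_match($regex, $candidate) === 1;
    echo $candidate, ' => ', $matched ? 'match' : 'no match', PHP_EOL;
}
```

This only tolerates the variations you anticipate in the pattern; it does not measure how similar two arbitrary strings are.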

Answer 2

In case anyone is still looking around for the right way to do this:

$search_string = "Two years in,&nbsp;the <a href='site.com'>company</a> has expanded to 35 cities, five of which are outside the U.S. ";

$subject = "Two years in,the company has expanded to 35 cities, five of which are outside the U.S.";

// similar_text() stores the similarity percentage in $sim (passed by reference)
similar_text($search_string, $subject, $sim);

echo 'text is: ' . round($sim) . '% similar';

Result:

text is: 85% similar

You can then use the result to decide what counts as a match in your particular case, like this:

similar_text($search_string, $subject, $sim);

// Treat anything at or above the chosen threshold as a match
if ($sim >= 85) {
    echo 'MATCH';
}
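If most of the difference comes from markup, one option is to normalize both strings before comparing them. The sketch below is only an illustration: normalizeText and isFuzzyMatch are hypothetical helper names, and stripping tags, decoding entities, and collapsing whitespace is an assumed preprocessing step rather than part of the answer above.

```php
<?php
// Hypothetical helper: remove markup noise so only the visible text is compared.
function normalizeText(string $text): string
{
    $text = strip_tags($text);                                           // drop <a ...>, </a>, etc.
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');  // &nbsp; becomes U+00A0
    // Collapse runs of whitespace (including non-breaking spaces) into one space.
    $text = preg_replace('/[\s\x{00A0}]+/u', ' ', $text) ?? $text;
    return trim($text);
}

// Hypothetical helper: true when the normalized strings are similar enough.
function isFuzzyMatch(string $a, string $b, float $threshold = 85.0): bool
{
    similar_text(normalizeText($a), normalizeText($b), $percent);
    return $percent >= $threshold;
}

$search_string = "Two years in,&nbsp;the <a href='site.com'>company</a> has expanded to 35 cities, five of which are outside the U.S. ";
$subject = "Two years in,the company has expanded to 35 cities, five of which are outside the U.S.";

echo isFuzzyMatch($search_string, $subject) ? 'MATCH' : 'NO MATCH', PHP_EOL;
```

The threshold stays a judgment call; normalizing first just means it is applied to the visible text rather than to the markup.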

Answer 3

Just for grins, I tried this using Perl.

All the usual caveats about parsing HTML with regular expressions apply:
(it should not be used on HTML.)

This splits the search string on HTML tags, entities, or whitespace, then joins the parts together with .*? under the (?is) modifiers.

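This is not a true partial-substring-match regex, because it requires all of the parts to be present. It does, however, get past whatever distance or content lies between them. With some work on the algorithm, it could probably be adjusted so that parts are optional, in the form of clusters.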

use strict;
use warnings;

my $search_string = "Two years in,&nbsp;the <a href='site.com'>company</a> has expanded to 35 cities, five of which are outside the U.S. ";

my $subject = "Two years in,the company has expanded to 35 cities, five of which are outside the U.S.";


## Trim leading/trailing whitespace from $search_string

  $search_string =~ s/^\s+|\s+$//g;

## Split the $search_string on html tags or entities or whitespaces ..

  my @SearchParts = split m~
    \s+
    |
    (?i)
    [&%] (?: [a-z]+ | (?: \# (?: [0-9]+ | x[0-9a-f]+ ) ) ) ;
    |
    <
    (?:
        script (?: \s+ (?: "[\S\s]*?" | '[\S\s]*?' | [^>]*? )+ )? \s* > [\S\s]*? </script \s*
      | (?: /? [\w:]+ \s* /? )
      | (?: [\w:]+ \s+ (?: (?: (?: "[\S\s]*?" ) | (?: '[\S\s]*?' ) ) | (?: [^>]*? ) )+ \s* /? )
      | \? [\S\s]*? \?
      | (?: !
          (?:
              (?: DOCTYPE [\S\s]*? )
            | (?: \[CDATA\[ [\S\s]*? \]\] )
            | (?: -- [\S\s]*? -- )
            | (?: ATTLIST [\S\s]*? )
            | (?: ENTITY [\S\s]*? )
            | (?: ELEMENT [\S\s]*? )
          )
        )
    )
    >
  ~x, $search_string;

## Drop empty parts, then escape the metacharacters in SearchParts

  @SearchParts = map { quotemeta } grep { length } @SearchParts;

## Join the SearchParts into a regex 

  my $rx = '(?si)(?:' . ( join '.*?', @SearchParts ) . ')';

## Try to match SearchParts in the $subject 

  if ( $subject =~ /$rx/ )
  {
     print "Match in subject:\n'$&' \n";
  }

Output:

Match in subject:
'Two years in,the company has expanded to 35 cities, five of which are outside the U.S.'
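For readers following along in PHP, here is a rough sketch of the same split-and-join idea. The function name buildLooseRegex is hypothetical, and the tag and entity pattern is a deliberately simplified assumption (anything between < and > counts as a tag), far less thorough than the Perl split pattern above. The same caveats about running regexes over HTML apply.

```php
<?php
// Hypothetical helper: turn a search string into a loose regex.
// Assumption: a tag is anything between < and >, and an entity looks
// like &name; or &#123; or &#x1F;.
function buildLooseRegex(string $search): string
{
    // Split on whitespace, entities, or tags, discarding empty pieces.
    $parts = preg_split(
        '/\s+|&(?:[a-z]+|#(?:[0-9]+|x[0-9a-f]+));|<[^>]*>/i',
        trim($search),
        -1,
        PREG_SPLIT_NO_EMPTY
    );

    // Escape each part (including the "/" delimiter).
    $quoted = array_map(
        static function (string $part): string {
            return preg_quote($part, '/');
        },
        $parts
    );

    // Allow any gap between parts; "s" lets . span newlines, "i" ignores case.
    return '/' . implode('.*?', $quoted) . '/is';
}

$search_string = "Two years in,&nbsp;the <a href='site.com'>company</a> has expanded to 35 cities, five of which are outside the U.S. ";
$subject = "Two years in,the company has expanded to 35 cities, five of which are outside the U.S.";

if (preg_match(buildLooseRegex($search_string), $subject, $m) === 1) {
    echo "Match in subject:\n'{$m[0]}'\n";
}
```

As with the Perl version, every part must be present in order; only the content between the parts is allowed to vary.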

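Edit
As a side note, each element of @SearchParts could be split again with // (on every character) and joined with .*?. This would get into the realm of a true partial match. Not quite, though, since every character is still required to match: the order is maintained, but each one would have to be made optional. Generally, without capture groups, there is no way to tell what percentage of the actual letters matched. If you are using Perl, however, it is easy to count with the regex code construct (?{{..}}), where a counter can be incremented. At that point, I guess, it becomes non-portable. Better to use C++.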

RegEx match expression even if it is not 100% identical
