Question

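We need to combine 3 columns in a database by concatenation. However, the 3 columns may contain overlapping parts, and those parts should not be duplicated. For example,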

  "a" + "b" + "c" => "abc"
  "abcde" + "defgh" + "ghlmn" => "abcdefghlmn"
  "abcdede" + "dedefgh" + "" => "abcdedefgh"
  "abcde" + "d" + "ghlmn" => "abcdedghlmn"
  "abcdef" + "" + "defghl" => "abcdefghl"

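Our current algorithm is pretty slow because it uses brute force to identify the overlapping part between 2 strings. Does anyone know an efficient algorithm to do this?

Say we have 2 strings A and B. The algorithm needs to find the longest common substring S such that A ends with S and B starts with S.

Our current brute-force implementation in Java is attached for reference,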

public static String concat(String s1, String s2) {
    if (s1 == null)
        return s2;
    if (s2 == null)
        return s1;
    int len = Math.min(s1.length(), s2.length());

    // Find the index for the end of overlapping part
    int index = -1;
    for (int i = len; i > 0; i--) {
        String substring = s2.substring(0, i);
        if (s1.endsWith(substring)) {
            index = i;
            break;
        }
    }
    StringBuilder sb = new StringBuilder(s1);
    if (index < 0) 
        sb.append(s2);
    else if (index <= s2.length())
        sb.append(s2.substring(index));
    return sb.toString();
}

Answer 1

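Most of the other answers have focused on constant-factor optimizations, but it's also possible to do asymptotically better. Look at your algorithm: it's O(N^2). This seems like a problem that can be solved much faster than that!

Consider Knuth-Morris-Pratt. It keeps track of how much of the pattern has been matched so far at every point in the text. With s1 as the text and s2 as the pattern, that means it knows how much of the start of s2 has been matched when it reaches the end of s1, and that's the value we're looking for! Just modify the algorithm to continue instead of returning when it matches the whole pattern early, and have it return the amount matched instead of 0 at the end.

This gives you an O(n) algorithm. Nice!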

int OverlappedStringLength(string s1, string s2) {
    // Trim s1 so it isn't longer than s2
    if (s1.Length > s2.Length) s1 = s1.Substring(s1.Length - s2.Length);
    // Nothing can overlap with an empty string (also keeps the table code safe)
    if (s1.Length == 0) return 0;

    int[] T = ComputeBackTrackTable(s2); // O(n)

    int m = 0;
    int i = 0;
    while (m + i < s1.Length) {
        if (s2[i] == s1[m + i]) {
            i += 1;
            // <-- removed the return case here, because |s1| <= |s2|
        } else {
            m += i - T[i];
            if (i > 0) i = T[i];
        }
    }

    return i; // <-- changed the return here to return characters matched
}

int[] ComputeBackTrackTable(string s) {
    var T = new int[s.Length];
    int cnd = 0;
    T[0] = -1;
    if (s.Length > 1) T[1] = 0; // guard: a one-character string has no T[1]
    int pos = 2;
    while (pos < s.Length) {
        if (s[pos - 1] == s[cnd]) {
            T[pos] = cnd + 1;
            pos += 1;
            cnd += 1;
        } else if (cnd > 0) {
            cnd = T[cnd];
        } else {
            T[pos] = 0;
            pos += 1;
        }
    }
    return T;
}

OverlappedStringLength("abcdef", "defghl") returns 3
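
For readers who prefer the prefix-function formulation of KMP, the same idea can be sketched in Python. This is a minimal sketch under my own naming (overlap_length, concat and the failure table are illustrative names, not part of the answer above): build the failure table for s2, scan s1 once, and whatever is still matched when s1 runs out is the overlap.

```python
def overlap_length(s1, s2):
    """Length of the longest suffix of s1 that is also a prefix of s2."""
    # Only the last len(s2) characters of s1 can take part in the overlap.
    if len(s1) > len(s2):
        s1 = s1[len(s1) - len(s2):]
    if not s1 or not s2:
        return 0

    # failure[i] = length of the longest proper prefix of s2[:i + 1]
    # that is also a suffix of it (the KMP prefix function).
    failure = [0] * len(s2)
    k = 0
    for i in range(1, len(s2)):
        while k > 0 and s2[i] != s2[k]:
            k = failure[k - 1]
        if s2[i] == s2[k]:
            k += 1
        failure[i] = k

    # Scan s1 once; `matched` is how much of s2's prefix matches the text
    # ending at the current character. Because len(s1) <= len(s2) here,
    # `matched` cannot reach len(s2) before the last character.
    matched = 0
    for ch in s1:
        while matched > 0 and ch != s2[matched]:
            matched = failure[matched - 1]
        if ch == s2[matched]:
            matched += 1
    return matched


def concat(s1, s2):
    """Concatenate s1 and s2 without repeating the overlapping part."""
    return s1 + s2[overlap_length(s1, s2):]
```

Both loops are linear in the string lengths, so chaining concat over the three columns stays linear as well.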

Answer 2

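You can use a DFA. For example, the string XYZ should be read by the regular expression ^((X)?Y)?Z. That regular expression matches the longest prefix of the input that is also a suffix of the string XYZ. With such a regular expression, you can either match and take the match result, or generate a DFA and use its state to indicate the proper position for the "cut".

In Scala, a first implementation (using the regex directly) might go like this: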

def toRegex(s1: String) = "^" + s1.map(_.toString).reduceLeft((a, b) => "("+a+")?"+b) r
def concatWithoutMatch(s1 : String, s2: String) = {
  val regex = toRegex(s1)
  val prefix = regex findFirstIn s2 getOrElse ""
  s1 + s2.drop(prefix length)
}

For example:

scala> concatWithoutMatch("abXabXabXac", "XabXacd")
res9: java.lang.String = abXabXabXacd

scala> concatWithoutMatch("abc", "def")
res10: java.lang.String = abcdef

scala> concatWithoutMatch(concatWithoutMatch("abcde", "defgh"), "ghlmn")
res11: java.lang.String = abcdefghlmn
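
The same construction can be sketched in Python. This is only an illustration with names of my own choosing (suffix_regex, concat_without_match); unlike the Scala version it escapes each character, so regex metacharacters in the data are matched literally. The nesting depth of the pattern grows with the length of s1, so it is only reasonable for short strings.

```python
import re
from functools import reduce


def suffix_regex(s1):
    """Build e.g. "(?:(?:a)?b)?c" for "abc"; s1 must be non-empty."""
    pattern = reduce(
        lambda acc, ch: "(?:" + acc + ")?" + ch,
        map(re.escape, s1),  # escape every character individually
    )
    return re.compile(pattern)


def concat_without_match(s1, s2):
    if not s1:
        return s2
    # re.match anchors at the start of s2, playing the role of "^".
    m = suffix_regex(s1).match(s2)
    prefix_len = m.end() if m else 0
    return s1 + s2[prefix_len:]
```

The backtracking engine tries the fully nested alternative first, so longer suffixes of s1 are attempted before shorter ones.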

Answer 3

How about (pardon the C#):

public static string OverlapConcat(string s1, string s2)
{
    // Handle nulls... never return a null
    if (string.IsNullOrEmpty(s1))
    {
        if (string.IsNullOrEmpty(s2))
            return string.Empty;
        else
            return s2;
    }
    if (string.IsNullOrEmpty(s2))
        return s1;

    // Checks above guarantee both strings have at least one character
    int len1 = s1.Length - 1;
    char last1 = s1[len1];
    char first2 = s2[0];

    // Find the first potential match, bounded by the length of s1
    int indexOfLast2 = s2.LastIndexOf(last1, Math.Min(len1, s2.Length - 1));
    while (indexOfLast2 != -1)
    {
        if (s1[len1 - indexOfLast2] == first2)
        {
            // After the quick check, do a full check
            int ix = indexOfLast2;
            while ((ix != -1) && (s1[len1 - indexOfLast2 + ix] == s2[ix]))
                ix--;
            if (ix == -1)
                return s1 + s2.Substring(indexOfLast2 + 1);
        }

        // Search for the next possible match
        indexOfLast2 = s2.LastIndexOf(last1, indexOfLast2 - 1);
    }

    // No match found, so concatenate the full strings
    return s1 + s2;
}

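This implementation will not make any string copies (partial or otherwise) until it has established what needs copying, which should help performance a lot.

Also, the match check first tests the extremities of the potentially matched area (2 single characters), which in normal English text should give a good chance of avoiding checking any other characters for mismatches.

Only once it has established the longest match it can make, or that no match is possible at all, will the two strings be concatenated. I have used simple '+' here, because I think the optimization of the rest of the algorithm has already removed most of the inefficiencies in your original. Give this a try and let me know if it is good enough for your purposes.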

Answer 4

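Here's a solution in Python. It should be faster simply because it doesn't need to build substrings in memory all the time. The work is done in the _concat function, which concatenates two strings. The concat function is a helper that concatenates any number of strings.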

def concat(*args):
    result = ''
    for arg in args:
        result = _concat(result, arg)
    return result

def _concat(a, b):
    la = len(a)
    lb = len(b)
    for i in range(la):
        j = i
        k = 0
        while j < la and k < lb and a[j] == b[k]:
            j += 1
            k += 1
        if j == la:
            n = k
            break
    else:
        n = 0
    return a + b[n:]

if __name__ == '__main__':
    assert concat('a', 'b', 'c') == 'abc'
    assert concat('abcde', 'defgh', 'ghlmn') == 'abcdefghlmn'
    assert concat('abcdede', 'dedefgh', '') == 'abcdedefgh'
    assert concat('abcde', 'd', 'ghlmn') == 'abcdedghlmn'
    assert concat('abcdef', '', 'defghl') == 'abcdefghl'

Answer 5

Or you could do it in mysql with the following stored function:

DELIMITER //

DROP FUNCTION IF EXISTS concat_with_overlap //

CREATE FUNCTION concat_with_overlap(a VARCHAR(100), b VARCHAR(100))
  RETURNS VARCHAR(200) DETERMINISTIC
BEGIN 
  DECLARE i INT;
  DECLARE al INT;
  DECLARE bl INT;
  SET al = LENGTH(a);
  SET bl = LENGTH(b);
  IF al=0 THEN 
    RETURN b;
  END IF;
  IF bl=0 THEN 
    RETURN a;
  END IF;
  IF al < bl THEN
     SET i = al;
  ELSE
     SET i = bl;
  END IF;

  search: WHILE i > 0 DO
     IF RIGHT(a,i) = LEFT(b,i) THEN
        RETURN CONCAT(a, SUBSTR(b,i+1));
     END IF;
     SET i = i - 1;
  END WHILE search;

  RETURN CONCAT(a,b);
END//

I tried it with your test data:

mysql> select a,b,c,
    -> concat_with_overlap( concat_with_overlap( a, b ), c ) as result 
    -> from testing //
+-------------+---------+--------+-------------+
| a           | b       | c      | result      |
+-------------+---------+--------+-------------+
| a           | b       | c      | abc         |
| abcde       | defgh   | ghlmn  | abcdefghlmn |
| abcdede     | dedefgh |        | abcdedefgh  |
| abcde       | d       | ghlmn  | abcdedghlmn |
| abcdef      |         | defghl | abcdefghl   |
| abXabXabXac | XabXac  |        | abXabXabXac |
+-------------+---------+--------+-------------+
6 rows in set (0.00 sec)

Answer 6

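I think this will be pretty quick:

You have two strings, string1 and string2. Look backwards (right to left) through string1 for the first character of string2. Once you have that position, determine whether there is an overlap. If there isn't, you need to keep searching. If there is, you need to determine whether there is any possibility of another match.

To do that, simply explore the shorter of the two strings for a recurrence of the overlapping characters. That is, if the location of the match in string1 leaves a short remainder of string1, repeat the initial search from the new starting point in string1. Conversely, if the unmatched portion of string2 is shorter, search it for a repeat of the overlapping characters.

Repeat as required.

Job done!

This doesn't require much in terms of memory allocation (all searching is done in place; only the resulting string buffer needs to be allocated) and requires (at most) one pass over one of the strings being overlapped.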
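
One possible reading of this description, sketched in Python. The function name concat_scan is hypothetical, and the sketch deviates from the text in one respect: it scans string1 from left to right (starting where the longest possible overlap could begin) so that the first confirmed candidate is already the longest one. The slice taken for each candidate is a simplification; an in-place character comparison would avoid it.

```python
def concat_scan(s1, s2):
    if not s1 or not s2:
        return s1 + s2
    first = s2[0]
    # An overlap cannot be longer than s2, so earlier positions are skipped.
    pos = s1.find(first, max(0, len(s1) - len(s2)))
    while pos != -1:
        # s1[pos:] is a candidate overlap; it counts only if s2 starts with it.
        if s2.startswith(s1[pos:]):
            return s1 + s2[len(s1) - pos:]
        # Otherwise move on to the next occurrence of s2's first character.
        pos = s1.find(first, pos + 1)
    return s1 + s2
```

Only positions holding the first character of s2 are examined, which is where the saving over the brute-force loop comes from on typical text.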

Answer 7

I'm trying to make this C# as pleasant to read as possible.

    public static string Concatenate(string s1, string s2)
    {
        if (string.IsNullOrEmpty(s1)) return s2;
        if (string.IsNullOrEmpty(s2)) return s1;
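        // Note: these two shortcuts return early whenever one string occurs anywhere
        // inside the other, not only at the boundary, so "abcde" + "d" yields "abcde"
        // rather than the "abcded" implied by the question's fourth example.
        if (s1.Contains(s2)) return s1;
        if (s2.Contains(s1)) return s2;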

        char endChar = s1.ToCharArray().Last();
        char startChar = s2.ToCharArray().First();

        int s1FirstIndexOfStartChar = s1.IndexOf(startChar);
        int overlapLength = s1.Length - s1FirstIndexOfStartChar;

        while (overlapLength >= 0 && s1FirstIndexOfStartChar >=0)
        {
            if (CheckOverlap(s1, s2, overlapLength))
            {
                return s1 + s2.Substring(overlapLength);
            }

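            // Continue after the current hit; searching from the same index
            // would find the same position again and never terminate.
            s1FirstIndexOfStartChar = 
                s1.IndexOf(startChar, s1FirstIndexOfStartChar + 1);
            overlapLength = s1.Length - s1FirstIndexOfStartChar;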

        }

        return s1 + s2;
    }

    private static bool CheckOverlap(string s1, string s2, int overlapLength)
    {
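        // An overlap cannot be empty or longer than s2 (Substring would throw)
        if (overlapLength <= 0 || overlapLength > s2.Length)
            return false;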

        if (s1.Substring(s1.Length - overlapLength) == 
            s2.Substring(0, overlapLength))
            return true;

        return false;            
    }

Edit: I see that this is almost the same as jerryjvl's solution. The only difference is that this will work with the "abcde", "d" case.

Answer 8

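Why not just do something like this? First get the first character or word (which is going to signify the overlap) of each of the three columns.

Then start adding the first string to a StringBuffer, one character at a time.

Each time, check whether you have reached a part that overlaps with the second or third string.

If so, start concatenating the string that also contains what is in the first string.

When done with the first string, if there was no overlap, continue with the second string and then the third string.

So in the second example in the question, I would keep d and g in two variables.

Then, as I add the first string, abc comes from the first string; then I see that d is also in the second string, so I switch to adding from the second string, and def is added from string 2; then I move on and finish with string 3.

If you are doing this in the database, why not just use a stored procedure for it?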

Answer 9

If you're doing it outside the database, try perl:

sub concat {
  my($x,$y) = @_;

  return $x if $y eq '';
  return $y if $x eq '';

  my($i) = length($x) < length($y) ?  length($x) : length($y);
  while($i > 0) {
      if( substr($x,-$i) eq substr($y,0,$i) )  {
          return $x . substr($y,$i);
      }
      $i--;
  }
  return $x . $y;
}

It's exactly the same algorithm as yours; I'm just curious whether java or perl is faster ;-)

Answer 10

This problem seems like a variation of the longest common subsequence problem, which can be solved via dynamic programming.

http://www.algorithmist.com/index.php/Longest_Common_Subsequence

Answer 11

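Here's a perl pseudo one-liner:

$_ = s1.s2;

s/([\S]+)\1/\1/;

Perl regexes are pretty efficient. You can look up which algorithm they use, but they definitely implement some type of FSM, so they should get you results in a pretty good O(..).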
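
A hedged Python variant of the backreference idea is sketched below. It differs from the one-liner in one deliberate way: a sentinel character (here "\x00", assumed never to occur in the data) is placed between the two strings so that the repeated group can only match across the boundary instead of anywhere in the joined text. The name concat_regex is illustrative. Backreference matching of this kind can take quadratic time in the worst case, so this is a convenience rather than a guaranteed speed-up.

```python
import re

SENTINEL = "\x00"  # assumption: this character never appears in the columns


def concat_regex(a, b):
    joined = a + SENTINEL + b
    # (.+) must end exactly at the sentinel and then repeat at the start of b.
    # The leftmost match corresponds to the longest overlap.
    merged = re.sub(r"(.+)\x00\1", r"\1", joined, count=1, flags=re.DOTALL)
    # If nothing matched, the sentinel is still present and is removed here.
    return merged.replace(SENTINEL, "")
```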

Answer 12

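Here's a Java implementation which finds the maximal overlap between two strings of length N and M in something like O(min(N,M)) operations ~ O(N).

I had the same idea as @sepp2k's now-deleted answer and worked a bit further on it. It seems to work fine. The idea is to iterate through the first string and start tracking once you find something that matches the start of the second string. You might need to do multiple simultaneous trackings if false and true matches overlap. At the end you choose the longest track.

I haven't worked out the absolute worst case yet, with maximal overlap between matches, but I don't expect it to spiral out of control, since I think you cannot overlap arbitrarily many matches. Normally you only track one or two matches at a time: a candidate is removed as soon as there is a mismatch.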

The implementation:

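// Needs java.util.Arrays, Collection, Comparator, Iterator and LinkedList;
// @NotNull comes from an annotations library that the answer does not name.
static class Candidate {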
    int matchLen = 0;
}

private String overlapOnce(@NotNull final String a, @NotNull final String b) {
    final int maxOverlap = Math.min(a.length(), b.length());
    final Collection<Candidate> candidates = new LinkedList<>();
    for (int i = a.length() - maxOverlap; i < a.length(); ++i) {
        if (a.charAt(i) == b.charAt(0)) {
            candidates.add(new Candidate());
        }
        for (final Iterator<Candidate> it = candidates.iterator(); it.hasNext(); ) {
            final Candidate candidate = it.next();
            if (a.charAt(i) == b.charAt(candidate.matchLen)) {
                //advance
                ++candidate.matchLen;
            } else {
                //not matching anymore, remove
                it.remove();
            }
        }

    }
    final int matchLen = candidates.isEmpty() ? 0 :
            candidates.stream().map(c -> c.matchLen).max(Comparator.comparingInt(l -> l)).get();
    return a + b.substring(matchLen);
}

private String overlapOnce(@NotNull final String... strings) {
    return Arrays.stream(strings).reduce("", this::overlapOnce);
}
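
The worst case left open above can be probed with a small experiment. The following is a Python port of the candidate-tracking idea written for illustration only (overlap_with_stats is a hypothetical name, not the author's code); besides the overlap length it reports the peak number of simultaneously live candidates.

```python
def overlap_with_stats(a, b):
    """Return (overlap length, peak number of live candidates)."""
    max_overlap = min(len(a), len(b))
    candidates = []  # each entry is the current match length of one candidate
    peak = 0
    for i in range(len(a) - max_overlap, len(a)):
        if a[i] == b[0]:
            candidates.append(0)  # a new candidate starts at position i
        # Advance candidates whose next expected character matches; drop the rest.
        candidates = [n + 1 for n in candidates if a[i] == b[n]]
        peak = max(peak, len(candidates))
    return (max(candidates) if candidates else 0), peak
```

For typical text the peak stays at one or two, as the answer says. When both strings consist of a single repeated character, however, every position starts a candidate and none is ever dropped, so the peak equals the string length and the total work grows quadratically.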

Efficient algorithm for string concatenation with overlap

12 answers: