Algorithm to determine whether a range contains a case-insensitive search phrase?

Time: 2013-07-02 06:58:32

Tags: algorithm language-agnostic range case-insensitive

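Suppose you have thousands of files organized as follows: they are first sorted by file name (case-sensitively, so that uppercase names come before lowercase ones) and then grouped into folders, each named after the first and last file it contains. For example, the folders might look like this: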

Abel -> Cain
Camel -> Sloth
Stork -> basket
basking -> sleuth
tiger -> zebra
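
The ordering matters for everything that follows, so here is a small Python sketch of it. It assumes plain code-point comparison (Python's default for `str`), under which every ASCII uppercase letter sorts before every lowercase one; the names are simply the folder bounds from the listing above.

```python
# Bounds taken from the folder listing above, deliberately shuffled.
names = ["basket", "Sloth", "tiger", "Abel", "zebra", "Stork",
         "Cain", "sleuth", "Camel", "basking"]

# Python compares str values by code point, so for ASCII letters
# "A".."Z" (65..90) all sort before "a".."z" (97..122).
ordered = sorted(names)
print(ordered)
# ['Abel', 'Cain', 'Camel', 'Sloth', 'Stork',
#  'basket', 'basking', 'sleuth', 'tiger', 'zebra']
```

This is why a folder such as Stork -> basket is a valid range: "Stork" sorts before "basket" even though "s" comes after "b" alphabetically.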

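Now, given a case-insensitive search string s, determine which folders could contain a file that matches s. You cannot, and need not, look inside the folders: the file does not actually have to exist.

Some examples: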

("Abel", "Cain")    matches s = "blue",   since it contains "Blue"
("Stork", "basket") matches s = "arctic", since it contains "arctic"
("FA", "Fb")        matches s = "foo",    since it contains "FOo"
("Fa", "Fb") does NOT match s = "foo"

Formally: given a closed range [a, b] and a lowercase string s, determine whether there is any string c in [a, b] such that lower(c) = s.

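My first hunch was to do a case-insensitive comparison against the bounds of the range. But it is easy to see from the last example that this is not correct.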
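
To make the failure concrete, here is a short Python sketch. It assumes one particular reading of that hunch, namely lowercasing both bounds and the search string before comparing; `naive_contains` is a name invented for this illustration.

```python
def naive_contains(first: str, last: str, s: str) -> bool:
    # Assumed reading of the "first hunch": lowercase everything, then compare.
    return first.lower() <= s.lower() <= last.lower()

# "FOo" is a case variant of "foo" that really does lie in ["FA", "Fb"] ...
assert "FA" <= "FOo" <= "Fb"
# ... but the lowercased comparison misses it, because "foo" > "fb".
print(naive_contains("FA", "Fb", "foo"))           # False

# The same happens for ("Stork", "basket") and "arctic".
print(naive_contains("Stork", "basket", "arctic"))  # False
```

Under this reading the check answers False for both ("FA", "Fb") and ("Fa", "Fb"), so it cannot tell the matching folder from the non-matching one.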

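The brute-force solution is to generate all potential file names. For example, the input string "abc" would generate the candidates "ABC", "ABc", "AbC", "Abc", "aBC", "aBc", "abC", "abc". Then you just test each candidate against the bounds. An example of this brute-force solution is given below. It is O(2^n).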

My question is: is there an algorithm that is both fast and correct?


Brute-force solution in Clojure:

(defn range-contains 
  [first last string]
  (and (<= (compare first string) 0)
       (>= (compare last string) 0)))

(defn generate-cases
  "Generates all lowercase/uppercase combinations of a word"
  [string]
  (if (empty? string)
    [nil]
    (for [head [(java.lang.Character/toUpperCase (first string))
                (java.lang.Character/toLowerCase (first string))]
          tail (generate-cases (rest string))]
      (cons head tail))))

(defn range-contains-insensitive 
  [first last string]
  (let [f (fn [acc candidate] (or acc (range-contains first last (apply str candidate))))]
    (reduce f false (generate-cases string))))

(fact "Range overlapping case insensitive"
  (range-contains-insensitive "A" "Z" "g") => true
  (range-contains-insensitive "FA" "Fa" "foo") => true
  (range-contains-insensitive "b" "z" "a") => false
  (range-contains-insensitive "B" "z" "a") => true)

2 Answers:

Answer 0 (score: 1)

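I think that instead of creating all the case combinations, this can be solved by checking the uppercase and then the lowercase form of each character separately, which turns 2^N into 2N.

The idea is as follows: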

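  • Keep "lowdone" and "highdone" flags, which record that s is already definitely after the lower limit (while it may still have to stay before the upper limit), and vice versa
  • Go through the string letter by letter
  • Check whether the uppercase version of the current letter can come after the corresponding lower-limit letter and, at the same time, before the upper-limit letter; then check the same for the lowercase version. If neither version satisfies both conditions, return false (if "lowdone" is true, skip the lower-limit check; if "highdone" is true, skip the upper-limit check: when comparing ABC and ACA, once we are past the second letter we no longer care about the third)
  • If a case version satisfies both conditions, check whether it is strictly after the lower-limit letter, or whether the lower limit is too short to have a corresponding letter; if so, lowdone = true
  • Similarly for highdone = true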

Sound good? Code in C# (it could probably be written more concisely):

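    public class Bracket
    {
        public readonly string Low;
        public readonly string High;

        public Bracket(string l, string u)
        {
            Low = l;
            High = u;
        }

        public bool IsMatch(string s)
        {
            string su = s.ToUpper();
            string sl = s.ToLower();

            // state[ld, hd] is true when some case choice for the prefix read so far
            // is still in range, where ld = 1 means that prefix is already strictly
            // above Low ("lowdone") and hd = 1 means it is already strictly below
            // High ("highdone"). The flags are kept as pairs so that a lowdone earned
            // by one case choice is never combined with a highdone earned by another.
            bool[,] state = new bool[2, 2];
            state[0, 0] = true;

            for (int i = 0; i < s.Length; i++)
            {
                char[] c = new char[] { su[i], sl[i] };
                bool[,] next = new bool[2, 2];
                bool possible = false;

                for (int k = 0; k < 4; k++)
                {
                    if (!state[k & 1, k >> 1])
                        continue;
                    bool lowdone = (k & 1) != 0;
                    bool highdone = (k & 2) != 0;

                    for (int j = 0; j < 2; j++)
                    {
                        // Still tied with Low: the letter must not fall below Low[i].
                        // (A candidate that outgrows Low is automatically above it.)
                        if (!lowdone && i < Low.Length && c[j] < Low[i])
                            continue;

                        // Still tied with High: the letter must not exceed High[i],
                        // and a candidate that outgrows High is above it.
                        if (!highdone && (i >= High.Length || c[j] > High[i]))
                            continue;

                        bool ld = lowdone || i >= Low.Length || c[j] > Low[i];
                        bool hd = highdone || c[j] < High[i];
                        next[ld ? 1 : 0, hd ? 1 : 0] = true;
                        possible = true;
                    }
                }

                if (!possible)
                    return false;
                state = next;
            }

            // A candidate still tied with Low is a prefix of Low, so it is in range
            // only if Low has no further characters.
            bool tiedWithLowOk = Low.Length <= s.Length;
            return state[1, 0] || state[1, 1]
                || (tiedWithLowOk && (state[0, 0] || state[0, 1]));
        }
    }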
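
For comparison, here is a minimal Python sketch of the same character-by-character idea, together with a brute-force checker in the spirit of the question's Clojure code. It is an illustration rather than the answer's own code: `may_contain` and `brute_force` are names made up here, ASCII input is assumed (so that `str.upper` and `str.lower` map one character to one character), and the two flags are tracked together as pairs, so a flag earned by the uppercase letter is never combined with one earned by the lowercase letter.

```python
from itertools import product


def may_contain(low: str, high: str, s: str) -> bool:
    """Linear scan over s, tracking reachable (tied_low, tied_high) states."""
    states = {(True, True)}  # the empty prefix is tied with both bounds
    for i, ch in enumerate(s):
        nxt = set()
        for tied_low, tied_high in states:
            for c in {ch.upper(), ch.lower()}:
                new_low, new_high = tied_low, tied_high
                if tied_low:
                    if i >= len(low):
                        new_low = False        # outgrew low, so strictly above it
                    elif c < low[i]:
                        continue               # fell below the lower bound
                    else:
                        new_low = c == low[i]
                if tied_high:
                    if i >= len(high) or c > high[i]:
                        continue               # rose above the upper bound
                    new_high = c == high[i]
                nxt.add((new_low, new_high))
        if not nxt:
            return False
        states = nxt
    # A candidate still tied with low is a prefix of low: in range only if equal.
    return any(not tied_low or len(low) == len(s) for tied_low, _ in states)


def brute_force(low: str, high: str, s: str) -> bool:
    """Reference check: try every upper/lowercase combination of s."""
    variants = product(*[{ch.upper(), ch.lower()} for ch in s])
    return any(low <= "".join(v) <= high for v in variants)


print(may_contain("FA", "Fb", "foo"))  # True: "FOo" is in range
print(may_contain("Fa", "Fb", "foo"))  # False
```

At most four states exist at any position, so the scan does a constant amount of work per character of s, whereas the brute-force checker enumerates up to 2^n candidates.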

Answer 1 (score: 0)

In the spirit of full disclosure, I suppose I should also add the algorithm I came up with myself (Java, using Guava):

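import com.google.common.base.Strings;
import com.google.common.collect.Range;

// Uses Guava 14+; older releases exposed the same factories on the Ranges class.
public static boolean inRange(String search, String first, String last) {
    int len = search.length();
    if (len == 0) {
        // Nothing left to match: the candidate is in range only if the lower
        // bound is exhausted too (a proper prefix sorts before the longer string)
        return first.isEmpty();
    }

    // Pad with NUL so that an exhausted bound compares below every real character
    char low = Strings.padEnd(first, len, (char) 0).charAt(0);
    char high = Strings.padEnd(last, len, (char) 0).charAt(0);

    char capital = Character.toUpperCase(search.charAt(0));
    char small = Character.toLowerCase(search.charAt(0));

    if (low == high) {
        if (capital == low || small == low) {
            // All letters equal - remove first letter and restart
            return inRange(search.substring(1), first.substring(1), last.substring(1));
        }
        return false;
    }

    if (containsAny(Range.open(low, high), capital, small)) {
        return true; // Definitely inside
    }

    if (!containsAny(Range.closed(low, high), capital, small)) {
        return false; // Definitely outside
    }

    // Edge case - we are on a bound and the bounds are different.
    // Both bounds can be hit at once (e.g. 'F' and 'f'), so test each of them.
    String rest = search.substring(1);
    // On the lower bound: the largest variant of the rest (all lowercase) must reach it
    boolean onLow = (capital == low || small == low)
            && Range.atLeast(first.substring(1)).contains(rest.toLowerCase());
    // On the upper bound: the smallest variant of the rest (all uppercase) must not pass it
    boolean onHigh = (capital == high || small == high)
            && Range.atMost(last.substring(1)).contains(rest.toUpperCase());
    return onLow || onHigh;
}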

private static <T extends Comparable<T>> boolean containsAny(Range<T> range, T value1, T value2) {
    return range.contains(value1) || range.contains(value2);
}