Search a list of strings for matching words in sequence

Time: 2018-04-10 23:46:51

Tags: c# arrays regex list

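I get a List of strings from an outside source, and its contents are always changing.

I want to search each string and find the words that match, in sequence, across all of the strings.

Then I want to remove those groups of words from each string, leaving only the book title.

Example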

  

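The book named The Lord of the Rings is a classic.
The book named War and Peace is a classic.
The book named The Three Musketeers is a classic.

"The book named" will be removed. "is a classic." will be removed. The sequence "The book named The" will not be removed, because War and Peace does not start with "The".

A sequence must appear in all of the strings to be removed.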

  

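The Lord of the Rings
War and Peace
The Three Musketeers

This is just an example list. I want to use this on strings other than book titles.

Such as: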

  

I went to The Home Depot.
I went to Walgreens.
I went to Best Buy.

"I went to" is removed.

  

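The basketball team Los Angeles Lakers are my favorite.
The basketball team New York Knicks are my favorite.
The basketball team Chicago Bulls are my favorite.

"The basketball team" is removed. "are my favorite." is removed.

Solution

My idea is to search the strings from the beginning, grouping matching words until a word that does not match is found. That gives the prefix.

Then do the same from the end of the strings, searching backwards, to find the suffix.

What remains in the middle is the title.

But I don't know how to go about doing it.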

C#

List<string> sentences = new List<string>() 
{ 
    "The book named The Lord of the Rings is a classic.",
    "The book named War and Peace is a classic.",
    "The book named The Three Musketeers is a classic.",
};

List<string> titles = new List<string>();


for (int i = 0; i < sentences.Count; i++)
{
    // Add each title to its own List
    titles.Add(FindTitle(sentences[i]));
}


String FindTitle(string sentence) 
{
    string title = string.Empty;

    // compare all strings in List
    // group common word sequences prefix (The book named)
    // group common word sequences suffix (is a classic.)
    // remove those word sequences from each string in List

    return title;
}

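2 Answers:

Answer 0: (score: 1)

Here is my approach. I went the performance route, though I guess it can still be optimized.

Edited: used Regex.Escape to handle the special-character case.

Used a Stopwatch to time my solution versus Rufus L's.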
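
To illustrate why the escaping matters, here is a minimal sketch. The class name EscapeDemo and the helper RemoveLiteral are made up for this illustration and are not part of the answer's code; the sentence comes from the test data with parentheses in it.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapeDemo
{
    // Hypothetical helper: removes a literal piece of text by escaping it
    // before it is used as a regex pattern.
    public static string RemoveLiteral(string input, string literal)
    {
        return Regex.Replace(input, Regex.Escape(literal), string.Empty);
    }

    public static void Main()
    {
        const string sentence = "The book named War and Peace is a classic (500 This is a test)";
        const string suffix = " is a classic (500 This is a test)";

        // Unescaped, the parentheses form a capture group, so the pattern looks for
        // " is a classic 500 This is a test" (no parentheses) and removes nothing.
        Console.WriteLine(Regex.Replace(sentence, suffix, string.Empty));

        // Escaped, the parentheses are matched literally and the suffix is removed.
        Console.WriteLine(RemoveLiteral(sentence, suffix));
    }
}
```

The first call should print the sentence unchanged, and the second should print only "The book named War and Peace".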

Usage, with Rufus's test sentences as input:

private static List<List<string>> GetTestSentences()
{
    return new List<List<string>>
    {
        new List<string>()
        {
            "The book named The Lord of the Rings is a classic.",
            "The book named War and Peace is a classic.",
            "The book named The Three Musketeers is a classic.",
        },
        new List<string>
        {
            "I went to The Home Depot.",
            "I went to Walgreens.",
            "I went to Best Buy."
        },
        new List<string>
        {
            "The basketball team Los Angeles Lakers are my favorite.",
            "The basketball team New York Knicks are my favorite.",
            "The basketball team Chicago Bulls are my favorite."
        },
        new List<string>()
        {
            "The book named Lord of the Flies is a classic (500 This is a test)",
            "The book named Wuthering Heights is a classic (500 This is a test)",
            "The book named Great Expectations is a classic (500 This is a test)",
            "The book named The Lord of the Rings is a classic (500 This is a test)",
            "The book named War and Peace is a classic (500 This is a test)"
        }
    };
}

From the Main method, do:

foreach (var sentenceList in GetTestSentences())
{
    var prefix = FindMatchingPattern(sentenceList[0], sentenceList[1], true);
    var suffix = FindMatchingPattern(sentenceList[0], sentenceList[1], false);

    if (prefix.Length > 0)
        prefix = Regex.Escape(prefix);
    if (suffix.Length > 0)
        suffix = Regex.Escape(suffix);

    foreach (var item in sentenceList)
    {
        var result = Regex.Replace(item, prefix, string.Empty);
        result = Regex.Replace(result, suffix, string.Empty);
        Console.WriteLine($"{item} --> {result}");
    }
    Console.WriteLine(new string('-', Console.WindowWidth));
}

And here is the magic method:

private static string FindMatchingPattern(string sample1, string sample2, bool forwardDirection)
{
    string shorter = string.Empty;
    string longer = string.Empty;

    if (sample1.Length <= sample2.Length)
    {
        shorter = sample1;
        longer = sample2;
    }
    else
    {
        shorter = sample2;
        longer = sample1;
    }

    StringBuilder matchingPattern = new StringBuilder();
    StringBuilder wordHolder = new StringBuilder();

    if (forwardDirection)
    {
        for (int idx = 0; idx < shorter.Length; idx++)
        {
            if (shorter[idx] == longer[idx])
                if (shorter[idx] == ' ')
                {
                    matchingPattern.Append(wordHolder + " ");
                    wordHolder.Clear();
                }
                else
                    wordHolder.Append(shorter[idx]);
            else
                break;
        }
    }
    else
    {
        while (true)
        {
            if (shorter.Length > 0 && shorter[shorter.Length - 1] == longer[longer.Length - 1])
            {
                if (shorter[shorter.Length - 1] == ' ')
                {
                    matchingPattern.Insert(0, " " + wordHolder);
                    wordHolder.Clear();
                }
                else
                    wordHolder.Insert(0, shorter[shorter.Length - 1]);

                shorter = shorter.Remove(shorter.Length - 1, 1);
                longer = longer.Remove(longer.Length - 1, 1);
            }
            else
            {
                break;
            }
        }
    }

    return matchingPattern.ToString();
}
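
One thing to watch with the Regex.Replace calls above: the patterns are not anchored, so the prefix or suffix text is removed wherever it occurs in the sentence, not only at the ends. A possible variation, sketched under the assumption that the prefix should only be stripped at the start and the suffix only at the end, is shown below. The class AnchoredTrim, the method Strip, and the sample input are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnchoredTrim
{
    // Hypothetical helper: strips prefix only at the start of input
    // and suffix only at the end of input.
    public static string Strip(string input, string prefix, string suffix)
    {
        var result = input;

        // "^" anchors the escaped prefix to the beginning of the string.
        if (prefix.Length > 0)
            result = Regex.Replace(result, "^" + Regex.Escape(prefix), string.Empty);

        // "$" anchors the escaped suffix to the end of the string.
        if (suffix.Length > 0)
            result = Regex.Replace(result, Regex.Escape(suffix) + "$", string.Empty);

        return result;
    }

    public static void Main()
    {
        // Made-up input in which the prefix text also occurs inside the title.
        Console.WriteLine(Strip("I went to I went to Mars.", "I went to ", "."));
    }
}
```

With anchoring, only the leading "I went to " and the trailing "." are removed, so the second "I went to" in the made-up title survives.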

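Answer 1: (score: 1)

Update: I modified the sample data to include different kinds of tests, and modified RemoveCommonPrefixAndSuffix to handle these new tests.

I found that comparing only the first two strings for a common prefix and suffix can give the wrong result if the first two books (or whatever the topic is) begin and/or end with the same words.

For example: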

new List<string>()
{
    "The book named Lord of the Rings 2 is a classic.",
    "The book named Lord of the Flies 2 is a classic.",
    "The book named This is pretty is a classic.",                
    "The book named War and Peace is a classic.",
    "The book named The Three Musketeers is a classic.",                
},

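Here, if we only compare the first two sentences, we determine that the common prefix is "The book named Lord of the", which is incorrect. We also determine that the common suffix is "2 is a classic.", which is also incorrect.

Here is a solution that addresses this by ensuring that all sentences share the same prefix and suffix: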
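
The prefix half of that problem can be reproduced with a small sketch. The class PrefixDemo and the helper CommonWordPrefix are names invented for this illustration; the helper assumes a non-empty list and splits on single spaces.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PrefixDemo
{
    // Hypothetical helper: returns the leading words shared by every sentence.
    // Assumes the list is not empty.
    public static string CommonWordPrefix(IReadOnlyList<string> sentences)
    {
        var words = sentences.Select(s => s.Split(' ')).ToList();
        var shortest = words.Min(w => w.Length);
        var count = 0;

        // Advance while every sentence has the same word at this position.
        while (count < shortest && words.All(w => w[count] == words[0][count]))
            count++;

        return string.Join(" ", words[0].Take(count));
    }

    public static void Main()
    {
        var sentences = new List<string>
        {
            "The book named Lord of the Rings 2 is a classic.",
            "The book named Lord of the Flies 2 is a classic.",
            "The book named This is pretty is a classic.",
            "The book named War and Peace is a classic.",
            "The book named The Three Musketeers is a classic.",
        };

        // Comparing only the first two sentences over-matches.
        Console.WriteLine(CommonWordPrefix(sentences.Take(2).ToList()));

        // Comparing all five sentences gives the intended prefix.
        Console.WriteLine(CommonWordPrefix(sentences));
    }
}
```

The first line printed should be "The book named Lord of the" and the second "The book named".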

public static List<string> RemoveCommonPrefixAndSuffix(List<string> sentences,
    int minSeqenceLength = 2)
{
    if (sentences == null) return null;

    if (sentences.Count < 2 ||
        sentences.Any(s => s.Count(c => c == ' ') < minSeqenceLength - 1))
    {
        return sentences.ToList();
    }

    if (sentences.All(s => s == sentences[0]))
    {
        return sentences.Select(s => string.Empty).ToList();
    }

    var sentenceWords = sentences.Select(s => s.Split()).ToList();
    var firstSentence = sentenceWords[0];
    var length = sentenceWords.Min(s => s.Length);
    var commonPrefix = new StringBuilder();
    var commonSuffix = new StringBuilder();
    var prefixDone = false;
    var suffixDone = false;

    for (var i = 0; i < length && !(prefixDone && suffixDone); i++)
    {
        if (!prefixDone && sentenceWords.All(s => s[i] == firstSentence[i]))
        {
            commonPrefix.Append(firstSentence[i] + " ");
        }
        else
        {
            prefixDone = true;
        }

        if (!suffixDone && sentenceWords.All(s =>
            s[s.Length - i - 1] == firstSentence[firstSentence.Length - i - 1]))
        {
            commonSuffix.Insert(0, firstSentence[firstSentence.Length - i - 1] + " ");
        }
        else
        {
            suffixDone = true;
        }
    }

    var prefix = commonPrefix.ToString().Count(c => c == ' ') >= minSeqenceLength - 1
        ? commonPrefix.ToString()
        : string.Empty;

    var suffix = commonSuffix.ToString().Count(c => c == ' ') >= minSeqenceLength - 1
        ? commonSuffix.ToString()
        : string.Empty;

    var commonLength = prefix.Length + suffix.Length;

    return sentences
        .Select(s => s.Length > commonLength
            ? s.Substring(prefix.Length, s.Length - prefix.Length - suffix.Length)
            : string.Empty)
        .ToList();
}

Here is the method that gets the test data:

private static List<List<string>> GetTestSentences()
{
    return new List<List<string>>
    {
        // Prefix-only test
        new List<string>
        {
            "I went to The Home Depot",
            "I went to Walgreens",
            "I went to Best Buy",
        },
        // Suffix-only test
        new List<string>
        {
            "Game of Thrones is a good TV series",
            "Breaking Bad is a good TV series",
            "The Office is a good TV series",
        },
        // Prefix / Suffix test
        new List<string>
        {
            "The basketball team Los Angeles Lakers are my favorite",
            "The basketball team New York Knicks are my favorite",
            "The basketball team Chicago Bulls are my favorite",
        },
        // No prefix or suffix - all sentences are different
        new List<string>
        {
            "I went to The Home Depot",
            "Game of Thrones is a good TV series",
            "The basketball team Los Angeles Lakers are my favorite",
        },
        // All sentences are the same - no "topic" between prefix and suffix
        new List<string>()
        {
            "These sentences are all the same",
            "These sentences are all the same",
            "These sentences are all the same",
        },
        // Some sentences have no content between prefix and suffix
        new List<string>()
        {
            "This sentence has no topic",
            "This sentence [topic here] has no topic",
            "This sentence has no topic",
            "This sentence [another one] has no topic",
        },
        // First two topics have common beginnings
        new List<string>()
        {
            "The book named Lord of the Rings is a classic",
            "The book named Lord of the Flies is a classic",
            "The book named This is pretty is a classic",
            "The book named War and Peace is a classic",
            "The book named The Three Musketeers is a classic",
        },
        // The first two topics have a common ending
        new List<string>
        {
            "The movie named Matrix 2 is very good",
            "The movie named Avatar 2 is very good",
            "The movie named The Sound of Music is very good",
            "The movie named Terminator 2 is very good",
        }
    };
}

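Below are sample usage and output. I've also included the results from the selected answer, along with some performance benchmarks for a speed comparison: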

private static void Main()
{
    var sentenceLists = GetTestSentences();
    var padLength = sentenceLists.Max(t => t.Max(s => s.Length)) + 2;
    Console.WriteLine("\nComparison Results\n------------------\n");

    // Rufus' solution
    var sw = Stopwatch.StartNew();
    foreach (var sentenceList in sentenceLists)
    {
        var trimmedSentences = RemoveCommonPrefixAndSuffix(sentenceList);

        for (var j = 0; j < trimmedSentences.Count; j++)
        {
            Console.WriteLine("{0} {1}", sentenceList[j].PadRight(padLength, '.'),
                trimmedSentences[j]);
        }

        Console.WriteLine();
    }
    sw.Stop();

    Console.WriteLine($"Rufus' solution took {sw.ElapsedMilliseconds} ms\n");
    Console.WriteLine(new string('-', Console.WindowWidth));

    // Prateek's solution
    sw.Restart();
    foreach (var sentenceList in sentenceLists)
    {
        var prefix = FindMatchingPattern(sentenceList[0], sentenceList[1], true);
        var suffix = FindMatchingPattern(sentenceList[0], sentenceList[1], false);

        if (prefix.Length > 0) prefix = Regex.Escape(prefix);
        if (suffix.Length > 0) suffix = Regex.Escape(suffix);

        foreach (var item in sentenceList)
        {
            var result = Regex.Replace(item, prefix, string.Empty);
            result = Regex.Replace(result, suffix, string.Empty);
            Console.WriteLine($"{item.PadRight(padLength, '.')} {result}");
        }

        Console.WriteLine();
    }
    sw.Stop();

    Console.WriteLine($"Prateek's solution took {sw.ElapsedMilliseconds} ms\n");
    Console.WriteLine(new string('-', Console.WindowWidth));

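    // GetKeyFromUser is not shown in this answer; Console.ReadKey is used in its place
    Console.Write("\nDone!! Press any key to exit...");
    Console.ReadKey();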
}

Output