LINQ: split a string into sentences while respecting quotation marks

Asked: 2016-01-28 08:06:33

Tags: c# linq text

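How can I split a text into its sentences, using periods, question marks, exclamation marks and so on as delimiters? I am trying to get the sentences one by one, but punctuation inside quotation marks should not end a sentence.

For example, split: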

Walked. Turned back. But why? And said "Hello world. Damn this string splitting things!" without a shame.

into this:

Walked. 
Turned back. 
But why? 
And said "Hello world. Damn this string splitting things!" without a shame.

I am using this code:

 private List<String> FindSentencesWhichContainsWord(string text, string word)
        {
            string[] sentences = text.Split(new char[] { '.', '?', '!' }, StringSplitOptions.RemoveEmptyEntries);

            // Define the search terms. This list could also be dynamically populated at runtime.
            string[] wordsToMatch = { word };

            // Find sentences that contain all the terms in the wordsToMatch array.
            // Note that the number of terms to match is not specified at compile time.
            var sentenceQuery = from sentence in sentences
                                let w = sentence.Split(new char[] { '.', '?', '!', ' ', ';', ':', ',' },
                                                        StringSplitOptions.RemoveEmptyEntries)
                                where w.Distinct().Intersect(wordsToMatch).Count() == wordsToMatch.Count()
                                select sentence;

            // Execute the query. Note that you can explicitly type
            // the iteration variable here even though sentenceQuery
            // was implicitly typed. 

            List<String> rtn = new List<string>();
            foreach (string str in sentenceQuery)
            {
                rtn.Add(str);
            }
            return rtn;
        }

But it gives the result below, which is not what I want.

Walked. 
Turned back. 
But why? 
And said "Hello world.
Damn this string splitting things!
" without a shame.

4 answers:

Answer 0 (score: 2)

I think this problem can be solved in two steps:

  1. Use TextFieldParser to correctly identify the quoted words

    // Requires: using System.IO; using Microsoft.VisualBasic.FileIO;
    string str = "Walked. Turned back. But why? And said \"Hello world. Damn this string splitting things!\" without a shame.";
    string[] words = null;
    using (TextFieldParser parser = new TextFieldParser(new StringReader(str))){
        parser.Delimiters = new string[] { " " };
        parser.HasFieldsEnclosedInQuotes = true;
        words = parser.ReadFields();                
    }    
    
  2. Use the earlier result to build a new list of strings according to the special behavior you need.

    List<string> newWords = new List<string>();
    string accWord = "";
    foreach (string word in words) {
        if (word.Contains(" ")) // a field containing spaces was a quoted phrase; restore its quotes
            accWord += (accWord.Length > 0 ? " " : "") + "\"" + word + "\"";
        else {
            accWord += (accWord.Length > 0 ? " " : "") + word;
            if (word.EndsWith(".") || word.EndsWith("!") || word.EndsWith("?")) {
                newWords.Add(accWord); // sentence complete
                accWord = "";
            }
        }
    }
    if (accWord.Length > 0) // keep trailing text that lacks final punctuation
        newWords.Add(accWord);
    
  3. The result, newWords:

    [2016-01-28 08:29:48.534 UTC] Walked.
    [2016-01-28 08:29:48.536 UTC] Turned back.
    [2016-01-28 08:29:48.536 UTC] But why?
    [2016-01-28 08:29:48.536 UTC] And said "Hello world. Damn this string splitting things!" without a shame.
    

If needed, you can simply wrap these two steps in a single method that returns List<string>.
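The two steps above could be wrapped roughly as follows. This is only a sketch: the class name QuoteAwareSplitter and the method name SplitSentences are made up for illustration, and it assumes the text is a single line, because ReadFields parses one line per call. Input the parser cannot handle in this format makes TextFieldParser throw MalformedLineException, and a quoted single word (no space inside the quotes) loses its quotes, as in the original step 2.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Microsoft.VisualBasic.FileIO; // TextFieldParser lives in this namespace

public static class QuoteAwareSplitter
{
    // Hypothetical helper combining step 1 (parse) and step 2 (accumulate).
    public static List<string> SplitSentences(string text)
    {
        string[] words;
        using (var parser = new TextFieldParser(new StringReader(text)))
        {
            parser.Delimiters = new[] { " " };
            parser.HasFieldsEnclosedInQuotes = true;
            // ReadFields returns null when there is nothing to read.
            words = parser.ReadFields() ?? new string[0];
        }

        var sentences = new List<string>();
        string acc = "";
        foreach (string word in words)
        {
            // A field that still contains spaces can only come from a quoted phrase.
            bool wasQuoted = word.Contains(" ");
            acc += (acc.Length > 0 ? " " : "") + (wasQuoted ? "\"" + word + "\"" : word);

            bool endsSentence = word.EndsWith(".") || word.EndsWith("!") || word.EndsWith("?");
            if (!wasQuoted && endsSentence)
            {
                sentences.Add(acc);
                acc = "";
            }
        }

        // Keep trailing text that has no final punctuation.
        if (acc.Length > 0) sentences.Add(acc);
        return sentences;
    }
}
```

Calling QuoteAwareSplitter.SplitSentences(str) with the sample string from step 1 should give the same four entries as the newWords listing.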

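Answer 1 (score: 1)

You are looking for something called a "sentence splitter". This is not a trivial problem...

If you are interested in how to solve this properly, I recommend the book "Foundations of Statistical Natural Language Processing" by Manning and Schütze.

To give you an idea of how complicated this is, I will briefly describe the sentence splitter we use at Nubilosoft as part of our search component.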

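  • First we split the text into paragraphs. Doing so eliminates some obvious errors and gives us smaller pieces of text to work with. Most file formats, such as MS Word DOC(X) and HTML, already provide paragraph markers, which makes this a natural first step.
  • Next, we run feature extraction on the text. Features include punctuation, some common abbreviations (e.g. 'dr.') and some contextual information.
  • We determine the candidate split points. Split points are punctuation marks and characters where the case changes (people sometimes forget punctuation).
  • Finally, we feed all of this into a perceptron neural network, which decides whether a given position is a "split" location.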
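To make the candidate and feature steps more concrete, here is a minimal sketch. It is not the Nubilosoft implementation: the class SplitPointFeatures, the Candidate fields and the tiny abbreviation list are all assumptions made for illustration, and the classifier that would consume these features is left out entirely.

```csharp
using System;
using System.Collections.Generic;

public static class SplitPointFeatures
{
    // One candidate split point with a few illustrative features.
    public sealed class Candidate
    {
        public int Index;                   // position of the character in the paragraph
        public bool IsPunctuation;          // '.', '?' or '!'
        public bool IsCaseChange;           // lower-case letter directly followed by upper-case
        public bool FollowedByUppercase;    // next non-space character is upper-case
        public bool PrecededByAbbreviation; // word before the mark is in the abbreviation list
    }

    // Tiny hypothetical abbreviation list; a real one would be much larger.
    private static readonly HashSet<string> Abbreviations =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "dr", "mr", "mrs", "prof", "etc" };

    public static List<Candidate> Find(string paragraph)
    {
        var candidates = new List<Candidate>();
        for (int i = 0; i < paragraph.Length; i++)
        {
            char c = paragraph[i];
            bool punctuation = c == '.' || c == '?' || c == '!';
            bool caseChange = i > 0 && char.IsLower(paragraph[i - 1]) && char.IsUpper(c);
            if (!punctuation && !caseChange) continue;

            // Word made of letters immediately before this position.
            int start = i;
            while (start > 0 && char.IsLetter(paragraph[start - 1])) start--;
            string previousWord = paragraph.Substring(start, i - start);

            // First non-whitespace character after this position.
            int next = i + 1;
            while (next < paragraph.Length && char.IsWhiteSpace(paragraph[next])) next++;

            candidates.Add(new Candidate
            {
                Index = i,
                IsPunctuation = punctuation,
                IsCaseChange = caseChange,
                FollowedByUppercase = next < paragraph.Length && char.IsUpper(paragraph[next]),
                PrecededByAbbreviation = punctuation && Abbreviations.Contains(previousWord)
            });
        }
        return candidates;
    }
}
```

For "Dr. Smith left. He ran" this yields two candidates, both followed by an upper-case letter; only the first is marked as following an abbreviation, which is exactly the kind of signal a trained classifier would use to reject it as a split.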

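Everything here is trained and tested on a manually annotated corpus; I don't remember the exact number, but it contains quite a lot of sentences.

Done this way it is about 99% correct, which is "good enough" for our purposes.

Note that licensing corpora is a very tricky business... In the past I found that the easiest way to get yourself a working sentence splitter was simply to buy one that had already been trained.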

Answer 2 (score: 1)

It is not a bulletproof solution, but it can be implemented like this. I did the sentence and quote recognition by hand.

void Main()
{
    var text = "Walked. Turned back. But why? And said \"Hello world. Damn this string splitting things!\" without a shame.";
    var result = SplitText(text);
}

private static List<String> SplitText(string text)
{
    var result = new List<string>();

    var sentenceEndings = new HashSet<char> { '.', '?', '!' };

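    var startIndex = 0;

    var isQuote = false;
    for (var i = 0; i < text.Length; i++)
    {
        var c = text[i];
        if (c == '"')
        {
            // Toggle the quote state; punctuation inside quotes never ends a sentence.
            isQuote = !isQuote;
            continue;
        }

        // Split only on the last mark of a run such as "..." or "?!".
        var lastOfRun = i + 1 == text.Length || !sentenceEndings.Contains(text[i + 1]);
        if (!isQuote && sentenceEndings.Contains(c) && lastOfRun)
        {
            // Trim instead of assuming exactly one space follows the punctuation.
            var part = text.Substring(startIndex, i + 1 - startIndex).Trim();
            if (part.Length > 0) result.Add(part);
            startIndex = i + 1;
        }
    }

    // Keep trailing text that has no terminating punctuation.
    var rest = text.Substring(startIndex).Trim();
    if (rest.Length > 0) result.Add(rest);
    return result;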
}

Answer 3 (score: 1)

I used TakeWhile: keep taking characters as long as the character is not a separator, or as long as it is inside quotes.

var seperator = new[] {'.', '?', '!'};

string str =
    @"Walked. Turned back. But why? And said ""Hello world. Damn this string splitting things!"" without a shame.";

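List<string> result = new List<string>();
int index = 0;
bool quotes = false;
while (index < str.Length)
{
    char? terminator = null; // separator that ended this sentence, if any
    var word = str.Skip(index).TakeWhile(ch =>
    {
        index++;
        if (ch == '"') quotes = !quotes;
        bool keep = quotes || !seperator.Contains(ch);
        if (!keep) terminator = ch; // TakeWhile drops this character, so remember it
        return keep;
    });

    // Enumerating the lazy query here is what advances index and sets terminator.
    string sentence = string.Concat(word);
    sentence = (sentence + terminator).Trim();
    if (sentence.Length > 0) result.Add(sentence);
}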