Splitting a long Urdu sentence into smaller sentences based on conjunctions in C#

Time: 2014-03-28 13:46:13

Tags: c#

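This is what I have done so far. The problem is that if a conjunction occurs twice in a sentence, the code does not handle its second occurrence. Could an expert help?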

    private void SplitSentence_Click(object sender, EventArgs e)
    {
        richTextBox2.Text = "";
        richTextBox3.Text = "";
        string[] keywords = { " or ", " and ", " hence", "so that", "however", " because" };
        string[] sentences = SentenceTokenizer(richTextBox1.Text);
        string remSentence;

        foreach (string sentence in sentences)
        {
            remSentence = sentence;
            richTextBox3.Text = remSentence;
            for (int i = 0; i < keywords.Length; i++)
            {
                if (remSentence.Contains(keywords[i])) // || (remSentence.IndexOf(keywords[i]) > 0))
                {
                    richTextBox2.Text += remSentence.Substring(0, remSentence.IndexOf(keywords[i])) + '\n' + keywords[i] + '\n';
                    remSentence = remSentence.Substring(remSentence.IndexOf(keywords[i]) + keywords[i].Length);
                }
            }
            richTextBox2.Text += remSentence;
        }
    }

    public static string[] SentenceTokenizer(string text)
    {
        char[] sentdelimiters = new char[] { '.', '?', '۔', '؟', '\r', ':', '-' }; //    '{ ',' }', '( ', ' )', ' [', ']', '>', '<','-', '_', '= ', '+','|', '\\', ':', ';', ' ', '\'', ',', '.', '/', '?', '~', '!','@', '#', '$', '%', '^', '&', '*', ' ', '\r', '\n', '\t'};
        // text.Remove('\n');
        return text.Split(sentdelimiters, StringSplitOptions.RemoveEmptyEntries);
    }

1 answer:

Answer 0 (score: 1)

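You can use a regular expression to handle this instead of doing it by hand. I'll use English in my example so that I don't accidentally butcher the poor Urdu.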

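    using System.Text.RegularExpressions;

    // Verbatim string (@"...") so that \b is a regex word boundary, not a backspace character.
    // `sentence` is assumed to be a string that already holds the text to split.
    Regex r = new Regex(@"\b(and|or|hence)");
    sentence = r.Replace(sentence, "|");     // Just something unlikely to be normal.
    string[] phrases = sentence.Split('|');  // Each piece between conjunctions.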

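You may need to adjust for case (?) and for the possibility that a conjunction is part of another word (I used a leading space, or rather the word boundary suggested by @Drahcir, to get started). For using backreferences in the .NET flavor, see this answer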
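
If you also want to keep the conjunctions themselves (as the code in the question does) and handle the case and part-of-another-word caveats, one possible variation is `Regex.Split` with a capturing group, since .NET includes captured text in the result array. This is only a rough sketch: `ConjunctionSplitter` and `SplitOnConjunctions` are hypothetical names of my own, the sample sentence is English, and I have not tried it on Urdu text, so the word-boundary behavior there is an assumption you would need to check.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ConjunctionSplitter
{
    // Hypothetical helper (not from the original answer): splits a sentence on
    // every occurrence of any conjunction and keeps the conjunctions in the output.
    public static List<string> SplitOnConjunctions(string sentence, string[] conjunctions)
    {
        // Escape each keyword so it is matched literally, then build one alternation.
        string alternation = string.Join("|", Array.ConvertAll(conjunctions, c => Regex.Escape(c)));

        // \b on both sides keeps "or" from matching inside a word such as "order".
        // The capturing group makes Regex.Split return the matched conjunctions too.
        string pattern = @"\b(" + alternation + @")\b";

        var pieces = new List<string>();
        foreach (string piece in Regex.Split(sentence, pattern, RegexOptions.IgnoreCase))
        {
            string trimmed = piece.Trim();
            if (trimmed.Length > 0)
            {
                pieces.Add(trimmed); // skip empty fragments between adjacent matches
            }
        }
        return pieces;
    }

    public static void Main()
    {
        string[] conjunctions = { "and", "or", "hence", "so that", "however", "because" };
        string sentence = "I came and I saw and I left because it was late";

        // Prints each phrase and each conjunction on its own line.
        foreach (string piece in SplitOnConjunctions(sentence, conjunctions))
        {
            Console.WriteLine(piece);
        }
    }
}
```

Because the regex engine finds every match in a single pass, a conjunction that appears twice is split both times, which is the case the loop in the question misses.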