Question

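I need to remove words from a text together with the separators next to them. I have already managed to remove the words, but I don't know how to remove the separators at the same time. Any suggestions?
At the moment, I have: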

static void Main(string[] args)
{
    Program p = new Program();
    string text = "";
    text = p.ReadText("Duomenys.txt", text);
    string[] wordsToDelete = { "Hello", "Thanks", "kinda" };
    char[] separators = { ' ', '.', ',', '!', '?', ':', ';', '(', ')', '\t' };
    p.DeleteWordsFromText(text, wordsToDelete, separators);
}

public string ReadText(string file, string text)
{
    text = File.ReadAllText(file);
    return text;
}

public void DeleteWordsFromText(string text, string[] wordsToDelete, char[] separators)
{
    Console.WriteLine(text);
    for (int i = 0; i < wordsToDelete.Length; i++)
    {
        text = Regex.Replace(text, wordsToDelete[i], String.Empty);
    }
    Console.WriteLine("-------------------------------------------");
    Console.WriteLine(text);
}

The result should be:

how are you?
I am  good.

What I get:

, how are you?
, I am . good.

Duomenys.txt

Hello, how are you? 
Thanks, I am kinda. good.

Answer 1

You can build the regular expression like this:

var regex = new Regex(@"\b(" 
    + string.Join("|", wordsToDelete.Select(Regex.Escape)) + ")(" 
    + string.Join("|", separators.Select(c => Regex.Escape(new string(c, 1)))) + ")?");

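Explanation:

The leading \b matches a word boundary, in case you get something like "XYZThanks".
The next part builds a regex construct that matches any of the wordsToDelete.
The last part builds a regex construct that matches any of the separators; the trailing "?" is there because you said the word should also be replaced when no separator follows it.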
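
As a rough illustration of how this regex behaves on the sample lines, here is a minimal sketch. The class name WordRemover and the helper BuildRegex are hypothetical wrappers added for the example; only the pattern construction comes from the answer. Note that the optional group consumes at most one separator, so the space after a removed "Hello," stays in the output.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class WordRemover
{
    // Hypothetical helper: builds \b(word1|word2|...)(sep1|sep2|...)? from the two arrays.
    public static Regex BuildRegex(string[] wordsToDelete, char[] separators)
    {
        string words = string.Join("|", wordsToDelete.Select(Regex.Escape));
        string seps = string.Join("|", separators.Select(c => Regex.Escape(c.ToString())));
        return new Regex(@"\b(" + words + ")(" + seps + ")?");
    }

    public static void Main()
    {
        string[] wordsToDelete = { "Hello", "Thanks", "kinda" };
        char[] separators = { ' ', '.', ',', '!', '?', ':', ';', '(', ')', '\t' };
        Regex regex = BuildRegex(wordsToDelete, separators);

        // Only one separator is removed after each word, so a leading space remains.
        Console.WriteLine("[" + regex.Replace("Hello, how are you?", string.Empty) + "]");
        // prints: [ how are you?]
        Console.WriteLine("[" + regex.Replace("Thanks, I am kinda. good.", string.Empty) + "]");
        // prints: [ I am  good.]
    }
}
```

If the leftover whitespace is unwanted, calling Trim() on the result, or matching a run of separators instead of a single optional one, are two possible adjustments.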

Answer 2

You can build a regular expression like this:

\b(?:Hello|Thanks|kinda)\b[ .,!?:;()    ]*

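where \b(?:Hello|Thanks|kinda)\b matches any of the words to delete as whole words, and [ .,!?:;() ]* matches all the separators, zero or more of them, that follow the word to delete.

C# solution: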

char[] separators = { ' ', '.', ',', '!', '?', ':', ';', '(', ')', '\t' };
string[] wordsToDelete = { "Hello", "Thanks", "kinda" };
string SepPattern = new String(separators).Replace(@"\", @"\\").Replace("^", @"\^").Replace("-", @"\-").Replace("]", @"\]");
var pattern = $@"\b(?:{string.Join("|", wordsToDelete.Select(Regex.Escape))})\b[{SepPattern}]*";
// => \b(?:Hello|Thanks|kinda)\b[ .,!?:;()  ]*
Regex rx = new Regex(pattern, RegexOptions.Compiled);
// RegexOptions.IgnoreCase can be added to the above flags for case insensitive matching: RegexOptions.IgnoreCase | RegexOptions.Compiled
DeleteWordsFromText("Hello, how are you?", rx);
DeleteWordsFromText("Thanks, I am kinda. good.", rx);

Here is the DeleteWordsFromText method:

public static void DeleteWordsFromText(string text, Regex p)
{
    Console.WriteLine($"---- {text} ----");
    Console.WriteLine(p.Replace(text, ""));
}

Output:

---- Hello, how are you? ----
how are you?
---- Thanks, I am kinda. good. ----
I am good.

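Notes:

string SepPattern = new String(separators).Replace(@"\", @"\\").Replace("^", @"\^").Replace("-", @"\-").Replace("]", @"\]"); - this is the separator pattern that will be used inside a character class. Since only the ^, -, \ and ] characters need escaping inside a character class, only those characters are escaped.
$@"\b(?:{string.Join("|", wordsToDelete.Select(Regex.Escape))})\b" - this builds the alternation of the words to delete and matches them only as whole words.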
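
To make the character-class escaping concrete, here is a small sketch with a separator set that actually contains the special characters. The helper name EscapeForCharClass and the sample separators are assumptions made for this example; the chain of Replace calls is the one from the answer. The backslash is replaced first so that the backslashes added by the later replacements are not doubled.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SeparatorClassDemo
{
    // Hypothetical helper: escapes only the characters that are special inside [...].
    public static string EscapeForCharClass(char[] separators)
    {
        return new string(separators)
            .Replace(@"\", @"\\")   // backslash first, so later escapes are not doubled
            .Replace("^", @"\^")
            .Replace("-", @"\-")
            .Replace("]", @"\]");
    }

    public static void Main()
    {
        // Assumed separator set for illustration: hyphen, space, ], ^ and backslash.
        char[] separators = { '-', ' ', ']', '^', '\\' };
        string cls = EscapeForCharClass(separators);
        Console.WriteLine("[" + cls + "]*");
        // prints: [\- \]\^\\]*

        var rx = new Regex(@"\bHello\b[" + cls + "]*");
        Console.WriteLine(rx.Replace("Hello-^] world", ""));
        // prints: world
    }
}
```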

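Pattern details

\b - a word boundary
(?: - start of a non-capturing group:
- Hello - the word Hello
- | - or
- Thanks - the word Thanks
- | - or
- kinda - the word kinda
) - end of the group
\b - a word boundary
[ .,!?:;() ]* - zero or more of the characters in the character class.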
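
The effect of the second \b can be seen by comparing the pattern with and without it. The following is a sketch only; the class name, method names, and the sample input "Hellos, kinda. nice" are made up for the illustration, and the tab is written as \t inside the character class.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WholeWordDemo
{
    // With the trailing \b: only whole words are removed.
    public static string RemoveWholeWords(string text)
    {
        return Regex.Replace(text, @"\b(?:Hello|Thanks|kinda)\b[ .,!?:;()\t]*", "");
    }

    // Without the trailing \b: a listed word is also removed from the start of a longer word.
    public static string RemovePrefixes(string text)
    {
        return Regex.Replace(text, @"\b(?:Hello|Thanks|kinda)[ .,!?:;()\t]*", "");
    }

    public static void Main()
    {
        string input = "Hellos, kinda. nice";
        Console.WriteLine(RemoveWholeWords(input)); // prints: Hellos, nice
        Console.WriteLine(RemovePrefixes(input));   // prints: s, nice
    }
}
```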

See the regex demo.

Answer 3

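I wouldn't use a regex. Three months from now you will no longer understand the regex, and fixing bugs in it will be hard.

I would use a simple loop. Everyone will understand it: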
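
One way such a loop could look is sketched below. This is not the answerer's original code; the class name LoopWordRemover, the method DeleteWords, and the decision to treat line breaks as word boundaries that are never deleted are all assumptions made for this sketch. It walks the text once, collects each run of non-separator characters as a word, and when the word is in the delete list it also skips the separators that follow.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class LoopWordRemover
{
    // Assumption: line breaks end a word but are never removed as separators.
    private static bool IsBreak(char c, HashSet<char> seps)
    {
        return seps.Contains(c) || c == '\r' || c == '\n';
    }

    public static string DeleteWords(string text, string[] wordsToDelete, char[] separators)
    {
        var toDelete = new HashSet<string>(wordsToDelete);
        var seps = new HashSet<char>(separators);
        var result = new StringBuilder();
        int i = 0;

        while (i < text.Length)
        {
            if (IsBreak(text[i], seps))
            {
                // Separators and line breaks that do not follow a deleted word are kept.
                result.Append(text[i]);
                i++;
                continue;
            }

            // Collect one word: a run of characters that are neither separators nor line breaks.
            int start = i;
            while (i < text.Length && !IsBreak(text[i], seps)) i++;
            string word = text.Substring(start, i - start);

            if (toDelete.Contains(word))
            {
                // Drop the word and every separator directly after it.
                while (i < text.Length && seps.Contains(text[i])) i++;
            }
            else
            {
                result.Append(word);
            }
        }
        return result.ToString();
    }

    public static void Main()
    {
        string[] wordsToDelete = { "Hello", "Thanks", "kinda" };
        char[] separators = { ' ', '.', ',', '!', '?', ':', ';', '(', ')', '\t' };
        Console.WriteLine(DeleteWords("Hello, how are you?", wordsToDelete, separators));
        // prints: how are you?
        Console.WriteLine(DeleteWords("Thanks, I am kinda. good.", wordsToDelete, separators));
        // prints: I am good.
    }
}
```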
