Question

I have a large string in which several specific words (text followed by a single colon, like "test:") can each appear more than once. For example:

word:
TEST:
word:

TEST:
TEST: // random text

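"word" appears twice and "TEST" three times, but the counts can vary. The words also do not have to appear in the same order, and there can be more text on the same line as a word (as the last "TEST" example shows). What I need to do is append the occurrence number to each word, so the output string must be: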

word_ONE:
TEST_ONE:
word_TWO:

TEST_TWO:
TEST_THREE: // random text

The RegEx I wrote to fetch these words is ^\b[A-Za-z0-9_]{4,}\b:. However, I don't know how to do the rest quickly. Any ideas?

Answer 1

Regex is a good fit for this job: use Replace with a match evaluator.

This example has not been tested or compiled:

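using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public class Fix
{
    public static String Execute(string largeText)
    {
        // Verbatim string so \w is not read as a C# escape; Multiline so ^
        // matches at the start of every line, not only at the start of the text.
        return Regex.Replace(largeText, @"^(\w{4,}):", new Fix().Evaluator,
            RegexOptions.Multiline);
    }

    // Occurrences seen so far for each word.
    private Dictionary<String, int> counters = new Dictionary<String, int>();
    // Extend this list to cover the highest count you expect.
    private static String[] numbers = {"ONE", "TWO", "THREE" /* ... */};

    public String Evaluator(Match m)
    {
        String word = m.Groups[1].Value;
        int count;
        if (!counters.TryGetValue(word, out count))
          count = 0;
        count++;
        counters[word] = count;

        return word + "_" + numbers[count-1] + ":";
    }
}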

This should return what you asked for when you call:

result = Fix.Execute(largeText);
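
One detail worth checking with any regex-based approach here: in .NET, ^ anchors to the start of the whole input unless RegexOptions.Multiline is passed. The following is a minimal sketch using the sample from the question and the asker's own pattern; the MultilineDemo class and its member names are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MultilineDemo
{
    // Sample text from the question, with '\n' line endings (an assumption).
    public const string Sample =
        "word:\nTEST:\nword:\n\nTEST:\nTEST: // random text\n";

    // The pattern from the question, unchanged.
    public const string Pattern = @"^\b[A-Za-z0-9_]{4,}\b:";

    // Without Multiline, ^ matches only at the very start of the input.
    public static int CountDefault()
    {
        return Regex.Matches(Sample, Pattern).Count;
    }

    // With Multiline, ^ matches at the start of every line.
    public static int CountMultiline()
    {
        return Regex.Matches(Sample, Pattern, RegexOptions.Multiline).Count;
    }

    public static void Main()
    {
        Console.WriteLine(CountDefault());   // 1
        Console.WriteLine(CountMultiline()); // 5
    }
}
```

If the input uses "\r\n" line endings the pattern still matches, since the "\r" comes after the colon.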

Answer 2

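If I understand correctly, there is no need to use a regex here.

You can split the large string on the ':' character. You may also need to read it line by line (split on '\n'). After that, you just create a dictionary (IDictionary<string, int>) that counts the occurrences of each word. Every time you find word x, you increment its counter in the dictionary.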

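Edit

Read the file line by line, or split the string on '\n'.

Check whether your delimiter is present, either by splitting on ':' or by using a regex.

Take the first item of the split array, or the first regex match.

Use a dictionary to count your occurrences.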

if (dictionary.ContainsKey(key)) dictionary[key]++;
else dictionary.Add(key, 1);

If you need words instead of numbers, create another dictionary for those, so that a count of 1 maps to one, and so on. Maybe there is another solution as well.
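
The steps above could be put together roughly as follows. This is only a sketch under some assumptions: lines are separated by '\n', any line with a colon after a non-empty prefix starts with a key, and counts beyond the small Names lookup fall back to digits. The SplitCounter class, Annotate method, and Names array are hypothetical names, not part of the original answer.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class SplitCounter
{
    // Hypothetical lookup for number words; extend as needed.
    private static readonly string[] Names = { "ONE", "TWO", "THREE" };

    public static string Annotate(string text)
    {
        var counts = new Dictionary<string, int>();
        string[] lines = text.Split('\n');
        var output = new StringBuilder();

        for (int i = 0; i < lines.Length; i++)
        {
            string line = lines[i];
            int colon = line.IndexOf(':');

            // Assumption: everything before the first colon is the key.
            if (colon > 0)
            {
                string key = line.Substring(0, colon);

                int seen;
                counts.TryGetValue(key, out seen); // seen stays 0 for a new key
                seen++;
                counts[key] = seen;

                string suffix = seen <= Names.Length
                    ? Names[seen - 1]
                    : seen.ToString();

                // Keep the colon and any trailing text on the line.
                line = key + "_" + suffix + line.Substring(colon);
            }

            output.Append(line);
            if (i < lines.Length - 1) output.Append('\n');
        }

        return output.ToString();
    }

    public static void Main()
    {
        string input = "word:\nTEST:\nword:\n\nTEST:\nTEST: // random text";
        Console.WriteLine(Annotate(input));
    }
}
```

Unlike the regex in the question, this sketch does not enforce a minimum key length or restrict the key's characters, so a line such as "note to self: ..." would also be counted; add a check on the key if that matters.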

Answer 3

I think you can do this with Regex.Replace(string, string, MatchEvaluator) and a dictionary.

using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

class Program
{
   // Occurrences seen so far, keyed by the word without its colon.
   static readonly Dictionary<string, int> wordCount = new Dictionary<string, int>();

   static string AppendIndex(Match m)
   {
      string word = m.Groups[1].Value;
      if (wordCount.ContainsKey(word))
         wordCount[word] = wordCount[word] + 1;
      else
         wordCount.Add(word, 1);
      return word + "_" + wordCount[word] + ":"; // in the format: word_1:, word_2:
   }

   static void Main()
   {
      string text = "....";
      // The group captures the word without the colon; Multiline makes ^
      // match at the start of every line.
      string result = Regex.Replace(text, @"^\b([A-Za-z0-9_]{4,})\b:",
         new MatchEvaluator(AppendIndex), RegexOptions.Multiline);
      Console.WriteLine(result);
   }
}

See this: http://msdn.microsoft.com/en-US/library/cft8645c(v=VS.80).aspx

Answer 4

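Take a look at this example (I know it is not perfect, nor very nice). Setting aside the exact arguments for the Split function, I think it can help: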

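using System;
using System.Linq;
using System.Text.RegularExpressions;

class Program
{
  static void Main(string[] args)
  {
    string a = "word:word:test:-1+234=567:test:test:";
    string[] tks = a.Split(':');
    Regex re = new Regex(@"^\b[A-Za-z0-9_]{4,}\b");
    // Pair each token with how many times it has appeared up to and including
    // its own position, so the suffix is a running count rather than the total.
    var res = tks
      .Select((x, i) => new { Token = x, Nth = tks.Take(i + 1).Count(y => y.Equals(x)) })
      .Where(t => re.IsMatch(t.Token))
      .Select(t => t.Token + DecodeNO(t.Nth));
    foreach (var item in res)
    {
      Console.WriteLine(item);
    }
    Console.ReadLine();
  }

  // Maps a count to its suffix; counts above three get no suffix.
  private static string DecodeNO(int n)
  {
    switch (n)
    {
      case 1:
        return "_one";
      case 2:
        return "_two";
      case 3:
        return "_three";
    }
    return "";
  }
}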

Finding the occurrences of a specifically formatted string in a given text
