Question

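Task:

Write a program that counts phrases in a text file. Any sequence of characters may be given as a phrase to count, even a sequence that contains separators. For example, in the text "I am a student in Sofia", the phrases "s", "stu", "a" and "I am" are found 2, 1, 3 and 1 times respectively.

I know there are solutions using string.IndexOf or LINQ, or an algorithm such as Aho-Corasick. I want to do the same thing with Regex.

This is what I have done so far: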

using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

namespace CountThePhrasesInATextFile
{
    class Program
    {
        static void Main(string[] args)
        {
            string input = ReadInput("file.txt");
            input.ToLower();
            List<string> phrases = new List<string>();
            using (StreamReader reader = new StreamReader("words.txt"))
            {
                string line = reader.ReadLine();
                while (line != null)
                {
                    phrases.Add(line.Trim());
                    line = reader.ReadLine();
                }
            }
            foreach (string phrase in phrases)
            {
                Regex regex = new Regex(String.Format(".*" + phrase.ToLower() + ".*"));
                int mathes = regex.Matches(input).Count;
                Console.WriteLine(phrase + " ----> " + mathes);
            }
        }

        private static string ReadInput(string fileName)
        {
            string output;
            using (StreamReader reader = new StreamReader(fileName))
            {
                output  = reader.ReadToEnd();
            }
            return output;
        }
    }
}

I know my regex is not correct, but I don't know what to change.

Output:

Word ----> 2
S ----> 2
MissingWord ----> 0
DS ----> 2
aa ----> 0

Correct output:

Word --> 9
S --> 13
MissingWord --> 0
DS --> 2
aa --> 3

file.txt contains:

Word? We have few words: first word, second word, third word.
Some passwords: PASSWORD123, @PaSsWoRd!456, AAaA, !PASSWORD

words.txt contains:

Word
S
MissingWord
DS
aa

Answer 1

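Here is what happened. I will use Word as the example.

The regex you build for "word" is ".*word.*". It tells the regex engine to match something that starts with anything, contains "word", and ends with anything.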

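In your input, the first line

Word? We have few words: first word, second word, third word.

is one match: it can be read as starting with "Word? We have few words: first ", containing "word", and ending with ", second word, third word." (Because .* is greedy, the engine actually anchors on the last "word" in the line, but either way the whole line is consumed as a single match. Since . does not match a newline, a match never continues onto the next line.)

The second line is likewise a single match: it starts with "Some pass", contains "word", and ends with "s: PASSWORD123, @PaSsWoRd!456, AAaA, !PASSWORD".

So the count is 2.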

The regex you want is simple: the string "word" on its own is enough.
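To make the difference concrete, here is a minimal sketch that runs both patterns against the sample text from the question. The class name GreedyPatternDemo and the Count helper are made up for illustration, and the text is embedded as a constant instead of being read from file.txt. Like the original program, it does not lower-case the input, so only lowercase "word" is found.

```csharp
using System;
using System.Text.RegularExpressions;

static class GreedyPatternDemo
{
    // Sample text copied from file.txt in the question (hypothetical constant).
    public const string Sample =
        "Word? We have few words: first word, second word, third word.\n" +
        "Some passwords: PASSWORD123, @PaSsWoRd!456, AAaA, !PASSWORD";

    // Counts the matches of a raw pattern, with no escaping and no options.
    public static int Count(string pattern)
    {
        return Regex.Matches(Sample, pattern).Count;
    }

    static void Main()
    {
        // ".*" swallows the rest of the line, so each line yields at most one match.
        Console.WriteLine(Count(".*word.*")); // 2
        // The bare phrase matches every non-overlapping lowercase "word".
        Console.WriteLine(Count("word"));     // 5
    }
}
```

The second count is still below the expected 9 because the match is case-sensitive and "Word", "PASSWORD123", "@PaSsWoRd!456" and "!PASSWORD" are skipped.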

Update

To ignore case, try the pattern "(?i)word".

For multiple (overlapping) matches inside AAaA, try "(?i)(?<=a)a".

(?<=...) is a zero-width positive lookbehind assertion.
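As a quick check of the two patterns, here is a small sketch run against the string "AAaA" from the sample file. The class name InlineOptionDemo and the Count helper are only illustrative names.

```csharp
using System;
using System.Text.RegularExpressions;

static class InlineOptionDemo
{
    // Counts the matches of a pattern in the given text.
    public static int Count(string text, string pattern)
    {
        return Regex.Matches(text, pattern).Count;
    }

    static void Main()
    {
        string text = "AAaA";
        // Each match consumes two characters, so matches cannot overlap.
        Console.WriteLine(Count(text, "(?i)aa"));      // 2
        // The lookbehind consumes nothing: every "a" preceded by an "a" is counted.
        Console.WriteLine(Count(text, "(?i)(?<=a)a")); // 3
    }
}
```

Note that the lookbehind pattern is written by hand for the phrase "aa"; for other phrases it would have to be split the same way (everything but the last character inside the lookbehind).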

Answer 2

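You need to post the contents of file.txt first; otherwise it is hard to verify whether the regex works correctly.

That said, take a look at the regex answer to "Finding ALL positions of a substring in a large string in C#" and see whether it helps with your code.

Edit:

So there is a simple solution: wrap each phrase in "(?=(" and "))". This is a lookahead assertion in regex. Because a lookahead is zero-width, the engine advances only one character after each match, so overlapping occurrences are counted too. The following code does what you want.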

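        foreach (string phrase in phrases) {
            // Regex.Escape keeps metacharacters in the phrase (such as '?' or '.') literal.
            string matchPhrase = "(?=(" + Regex.Escape(phrase.ToLower()) + "))";
            // A lookahead consumes no characters, so overlapping occurrences are all counted.
            int matches = Regex.Matches(input, matchPhrase).Count;
            Console.WriteLine(phrase + " ----> " + matches);
        }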

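There is also a problem with this line:

input.ToLower();

It should be

input = input.ToLower();

because strings in C# are immutable: ToLower returns a new string and leaves the original unchanged. Overall, your code should be: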

    static void Main(string[] args) {
        string input = ReadInput("file.txt");
        // Strings are immutable, so the lower-cased copy has to be assigned back.
        input = input.ToLower();
        List<string> phrases = new List<string>();
        using (StreamReader reader = new StreamReader("words.txt")) {
            string line = reader.ReadLine();
            while (line != null) {
                phrases.Add(line.Trim());
                line = reader.ReadLine();
            }
        }
        foreach (string phrase in phrases) {
            // Escape the phrase so regex metacharacters in it are matched literally.
            string matchPhrase = "(?=(" + Regex.Escape(phrase.ToLower()) + "))";
            int matches = Regex.Matches(input, matchPhrase).Count;
            Console.WriteLine(phrase + " ----> " + matches);
        }
        // Keep the console window open for a while (fully qualified, no extra using needed).
        System.Threading.Thread.Sleep(50000);
    }

    private static string ReadInput(string fileName) {
        string output;
        using (StreamReader reader = new StreamReader(fileName)) {
            output = reader.ReadToEnd();
        }
        return output;
    }
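The lookahead approach can also be packaged as a small helper. The following is only a sketch, not part of the code above: PhraseCounter and CountOverlapping are hypothetical names, the sample text is embedded instead of being read from file.txt, and it uses RegexOptions.IgnoreCase in place of lower-casing both strings. The extra phrase "d?" is my own addition to show why escaping matters when a phrase contains a separator.

```csharp
using System;
using System.Text.RegularExpressions;

static class PhraseCounter
{
    // Counts overlapping, case-insensitive occurrences of a literal phrase.
    public static int CountOverlapping(string text, string phrase)
    {
        // An empty lookahead would match at every position, so treat it as "not found".
        if (string.IsNullOrEmpty(phrase)) return 0;
        // Regex.Escape turns metacharacters such as '?' into literals.
        string pattern = "(?=" + Regex.Escape(phrase) + ")";
        return Regex.Matches(text, pattern, RegexOptions.IgnoreCase).Count;
    }

    static void Main()
    {
        // Sample text copied from file.txt in the question.
        string text =
            "Word? We have few words: first word, second word, third word.\n" +
            "Some passwords: PASSWORD123, @PaSsWoRd!456, AAaA, !PASSWORD";
        foreach (string phrase in new[] { "Word", "S", "MissingWord", "DS", "aa", "d?" })
            Console.WriteLine(phrase + " --> " + CountOverlapping(text, phrase));
    }
}
```

For the five phrases from words.txt this reproduces the expected counts listed in the question (9, 13, 0, 2, 3). Without Regex.Escape, the phrase "d?" would be read as "an optional d", and the lookahead would succeed at every position in the text.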

Search for some phrases in a text file using Regex in C#
