Question

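The code below reads a list of words from a file called WORDS, looks for those words in a file called CONTENT, replaces each occurrence in CONTENT with ######, and writes the result to a new file called FINAL. The words file contains around 16k lines of words and the content file contains around 16k lines, totalling around 8 million words. When I ran it, it took over 1000 minutes and had still not finished, so I eventually gave up.

Is there any way to speed this up, or a more efficient approach? Each word in WORDS starts with \b and ends with \b. The code does work: I tested it on a smaller content file.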

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.IO;
using System.Text.RegularExpressions;


namespace ConsoleApp10
{
    class Program
    {
        static void Main(string[] args)
        {
            string SAR_CONTACTS = @"C:\Users\root\Desktop\WORDS.csv";
            string SAR_CONTENT = @"C:\Users\root\Desktop\CONTENT.csv";

            // Despite its name, READ_SAR_CONTACTS holds the full text of CONTENT.csv.
            string READ_SAR_CONTACTS;
            using (StreamReader streamReader = new StreamReader(SAR_CONTENT, Encoding.UTF8))
                READ_SAR_CONTACTS = streamReader.ReadToEnd();

            // Join the lines of WORDS.csv with "|" to form one large alternation pattern.
            string SAR_CONTACTS_FILE = File.ReadAllText(SAR_CONTACTS);
            string SAR_CONTENT_FILE = SAR_CONTACTS_FILE.Replace("\r\n", "|");
            // Drop the trailing "|" left by the final line break (assumes the file ends with one).
            SAR_CONTENT_FILE = SAR_CONTENT_FILE.Remove(SAR_CONTENT_FILE.Length - 1);

            // Replace every match, ignoring case, and write the censored text to FINAL.csv.
            string SAR_CONTENT_CENSORED = Regex.Replace(READ_SAR_CONTACTS, SAR_CONTENT_FILE, "######", RegexOptions.IgnoreCase);
            File.WriteAllText(@"C:\Users\root\Desktop\FINAL.csv", SAR_CONTENT_CENSORED);
        }
    }
}

Answer 1

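Generally speaking, I would leave Regex out of this, because with files this large it can get complicated quickly. In your contacts file, instead of \b I would probably use a string of delimiter characters, such as £&% (this breaks if that exact delimiter string ever appears in the data itself).

This is how I would write it. Note that it is probably not the most efficient approach, but it will work. Note also that I used the VB version of Replace so that case is ignored, because the C# string.Replace has no such overload on .NET Framework (newer .NET versions add a string.Replace overload that takes a StringComparison, and you could also write an extension method).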

using Microsoft.VisualBasic;
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

namespace ConsoleApp5
{
    class Program
    {
        static void Main(string[] args)
        {
            string contacts = @"contacts.csv";
            string content = @"content.csv";
            string[] delimiter = { "£&%" };
            string read_contents;

            using (StreamReader streamReader = new StreamReader(content, Encoding.UTF8))
                read_contents = streamReader.ReadToEnd();

            // Split the contacts file on the delimiter to get the words to censor.
            string sar_contacts = File.ReadAllText(contacts);
            List<string> contactsToReplace = sar_contacts.Split(delimiter, StringSplitOptions.RemoveEmptyEntries).ToList();

            int i = 0;
            foreach (var wordToCensor in contactsToReplace)
            {
                // VB's Strings.Replace with vbTextCompare performs a case-insensitive replacement.
                read_contents = Strings.Replace(read_contents, wordToCensor, "######", 1, -1, Constants.vbTextCompare);
                Console.WriteLine(++i); // so we know where we are
            }

            File.WriteAllText(@"filtered.csv", read_contents);
        }
    }
}
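
The loop above rescans the whole content once per word, so the work still grows with the number of words. A different technique, not part of the code above, is to walk the content once and look each token up in a hash set. The following is only a minimal sketch under some assumptions: every entry in the words file is a single word made of word characters, and the \b markers have already been stripped from each entry. The names Censor and Apply and the sample data are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class Censor
{
    // One cheap pattern that finds every maximal run of word characters.
    private static readonly Regex Token = new Regex(@"\w+", RegexOptions.Compiled);

    // Replaces every token that appears in 'words' (compared case-insensitively) with "######".
    public static string Apply(string content, IEnumerable<string> words)
    {
        var lookup = new HashSet<string>(words, StringComparer.OrdinalIgnoreCase);
        return Token.Replace(content, m => lookup.Contains(m.Value) ? "######" : m.Value);
    }
}

class Program
{
    static void Main()
    {
        // Hypothetical sample data; a real run would read the words and content
        // files and strip the \b markers from each word first.
        var words = new[] { "alice", "Bob" };
        string content = "Alice met bob and Bobby.";
        Console.WriteLine(Censor.Apply(content, words)); // prints "###### met ###### and Bobby."
    }
}
```

Because each token costs one hash lookup, the running time depends mainly on the size of the content rather than on the number of words. Entries that contain spaces or punctuation would not match with this token-based approach and would need separate handling.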
