Question

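I am working with very large data files extracted from a database, and I need to remove the duplicate entries from them. If duplicates exist, they occur across different files, never within the same file. The files contain entries that look like this: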

File1

 623898/bn-oopi-990iu/I Like Potato
 982347/ki-jkhi-767ho/Let's go to Sesame Street
 ....


File2

 568798/jj-ytut-786hh/Hello Mike
 982347/ki-jkhi-767ho/Let's go to Sesame Street
 ....

Answer 1

Expanding on your original idea:

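sort * | uniq -cd | awk '{print $2}' | grep -Ff- *

That is: format the output so that only the duplicated strings are printed, then search all the files for them, taking the list of things to search for from - (i.e. stdin) and matching them literally (-F).

Note that awk '{print $2}' keeps only the first whitespace-delimited word of each duplicated line, so for entries that contain spaces the search matches on that prefix rather than on the whole line. sort * | uniq -d | grep -Fxf- * compares whole lines instead.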
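
To see what such a pipeline reports, here is a small sketch that runs it on the two sample files from the question inside a scratch directory. The scratch directory and the variable names are assumptions made for illustration, and the sketch uses the whole-line form (uniq -d with grep -Fx) rather than the awk step.

```shell
#!/usr/bin/env bash
set -eu

# Scratch directory holding the two sample files from the question
# (the directory and variable names are illustrative assumptions).
workdir=$(mktemp -d)
cd "$workdir"
printf '%s\n' "623898/bn-oopi-990iu/I Like Potato" \
              "982347/ki-jkhi-767ho/Let's go to Sesame Street" > File1
printf '%s\n' "568798/jj-ytut-786hh/Hello Mike" \
              "982347/ki-jkhi-767ho/Let's go to Sesame Street" > File2

# uniq -d prints one copy of every line that occurs more than once in the
# sorted input. grep -F treats each pattern as a fixed string, -x requires
# a whole-line match, and -f - reads the pattern list from stdin.
report=$(sort File1 File2 | uniq -d | grep -Fxf - File1 File2)
printf '%s\n' "$report"
```

With more than one file argument, grep prefixes each hit with the name of the file it was found in, so the report shows where every copy of a duplicated entry lives. Nothing is removed at this stage.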

Answer 2

Something along these lines might be useful:

awk '!seen[$0] { print > (FILENAME ".new") } { seen[$0] = 1 }' file1 file2 file3 ...
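
As a quick check of what this one-liner produces, the sketch below runs it on the two sample files from the question. The scratch directory and the lowercase file names are assumptions, and the output target is parenthesized because an unparenthesized concatenation after > is not portable across awk implementations.

```shell
#!/usr/bin/env bash
set -eu

# Scratch copies of the sample files from the question (names are illustrative).
workdir=$(mktemp -d)
cd "$workdir"
printf '%s\n' "623898/bn-oopi-990iu/I Like Potato" \
              "982347/ki-jkhi-767ho/Let's go to Sesame Street" > file1
printf '%s\n' "568798/jj-ytut-786hh/Hello Mike" \
              "982347/ki-jkhi-767ho/Let's go to Sesame Street" > file2

# A line is written to "<input name>.new" only the first time it is seen
# across all inputs; later copies in any file are skipped.
awk '!seen[$0] { print > (FILENAME ".new") } { seen[$0] = 1 }' file1 file2

cat file1.new
cat file2.new
```

Here file1.new keeps both of its lines, while file2.new keeps only the "Hello Mike" entry. An input whose lines are all duplicates of earlier ones gets no .new file at all, because awk only creates the output when it first prints to it.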

Answer 3

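twalberg's solution works well, but if your files are very large it may exhaust the available memory, because it creates an entry in the associative array for every unique record it encounters. If that happens, you can try a similar approach that keeps only one entry per duplicated record (I assume you have GNU awk and that your files are named *.txt):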

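sort *.txt | uniq -d > dup
awk 'BEGIN { while ((getline line < "dup") > 0) dup[line] = 1 }  # load the duplicated records
!($0 in dup) { print > (FILENAME ".new"); next }                 # not a duplicate: always keep
dup[$0] == 1 { print > (FILENAME ".new"); dup[$0] = 0 }          # duplicate: keep the first copy only
' *.txt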

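Note that if you have many duplicates, this too can exhaust the available memory. You can work around that by splitting the dup file into smaller chunks and running the awk script on each chunk.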
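
One possible reading of the chunking suggestion is sketched below; it is not necessarily what the answer's author had in mind. Each pass loads a single chunk of dup and rewrites the files produced by the previous pass, so only one chunk is held in memory at a time, at the cost of one full pass over the data per chunk. The sample data, the dup.part. prefix, the .tmp suffix, and the chunk size of one line (chosen only to force several passes) are all assumptions.

```shell
#!/usr/bin/env bash
set -eu

# Tiny stand-in data set (illustrative only).
workdir=$(mktemp -d)
cd "$workdir"
printf '%s\n' a b c > one.txt
printf '%s\n' b d c > two.txt
printf '%s\n' c e b > three.txt

sort *.txt | uniq -d > dup
split -l 1 dup dup.part.    # one line per chunk, purely to demonstrate multiple passes

# Start from plain copies; every pass then filters the previous result.
for f in *.txt; do cp "$f" "$f.new"; done

for chunk in dup.part.*; do
    for f in *.txt; do
        mv "$f.new" "$f.tmp"
        : > "$f.new"        # make sure the output exists even if nothing is kept
    done
    awk -v chunk="$chunk" '
        BEGIN { while ((getline line < chunk) > 0) dup[line] = 1 }
        # On entering a new input file, derive its output name and close the old one.
        FNR == 1 { if (out != "") close(out); out = FILENAME; sub(/\.tmp$/, ".new", out) }
        !($0 in dup) { print > out; next }               # not in this chunk: keep
        dup[$0] == 1 { print > out; dup[$0] = 0 }        # first copy of a duplicate: keep once
    ' *.txt.tmp
    rm -f -- *.txt.tmp
done

cat one.txt.new two.txt.new three.txt.new
```

Because the shell expands the glob alphabetically, the surviving copy of each duplicated record ends up in whichever file sorts first, which is one.txt in this example.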

Finding duplicate entries in very large text files in bash
