Merge two text files and remove duplicates

Time: 2016-06-28 14:20:10

Tags: c#

I have 2 text files that look like this (the large numbers such as 1466786391 are unique timestamps):

--- 10.0.0.6 ping statistics ---
50 packets transmitted, 49 packets received, 2% packet loss
round-trip min/avg/max = 20.917/70.216/147.258 ms
1466786342
PING 10.0.0.6 (10.0.0.6): 56 data bytes

....

--- 10.0.0.6 ping statistics ---
50 packets transmitted, 50 packets received, 0% packet loss
round-trip min/avg/max = 29.535/65.768/126.983 ms
1466786391

and this:

--- 10.0.0.6 ping statistics ---
50 packets transmitted, 49 packets received, 2% packet loss
round-trip min/avg/max = 20.917/70.216/147.258 ms
1466786342
PING 10.0.0.6 (10.0.0.6): 56 data bytes

--- 10.0.0.6 ping statistics ---
50 packets transmitted, 50 packets received, 0% packet loss
round-trip min/avg/max = 29.535/65.768/126.983 ms
1466786391
PING 10.0.0.6 (10.0.0.6): 56 data bytes

--- 10.0.0.6 ping statistics ---
50 packets transmitted, 44 packets received, 12% packet loss
round-trip min/avg/max = 30.238/62.772/102.959 ms
1466786442
PING 10.0.0.6 (10.0.0.6): 56 data bytes
....

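So the first file ends with timestamp 1466786391, and the second file contains the same block of data somewhere in the middle, followed by more data. Everything in the second file up to that timestamp is exactly the same as in the first file.

So the output I want is: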

--- 10.0.0.6 ping statistics ---
50 packets transmitted, 49 packets received, 2% packet loss
round-trip min/avg/max = 20.917/70.216/147.258 ms
1466786342
PING 10.0.0.6 (10.0.0.6): 56 data bytes

....

--- 10.0.0.6 ping statistics ---
50 packets transmitted, 50 packets received, 0% packet loss
round-trip min/avg/max = 29.535/65.768/126.983 ms
1466786391

--- 10.0.0.6 ping statistics ---
50 packets transmitted, 44 packets received, 12% packet loss
round-trip min/avg/max = 30.238/62.772/102.959 ms
1466786442
PING 10.0.0.6 (10.0.0.6): 56 data bytes
....

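That is, concatenate the two files into a third file, leaving out the duplicates from the second file (the text blocks that already exist in the first file). Here is my code: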

public static void UnionFiles()
{ 

    string folderPath = Path.Combine(Path.GetDirectoryName(Assembly.GetEntryAssembly().Location), "http");
    string outputFilePath = Path.Combine(Path.GetDirectoryName(Assembly.GetEntryAssembly().Location), "http\\union.dat");
    var union = Enumerable.Empty<string>();

    foreach (string filePath in Directory
                .EnumerateFiles(folderPath, "*.txt")
                .OrderBy(x => Path.GetFileNameWithoutExtension(x)))
    {
        union = union.Union(File.ReadAllLines(filePath));
    }
    File.WriteAllLines(outputFilePath, union);
}

This is the wrong output I get (the file structure is broken):

--- 10.0.0.6 ping statistics ---
50 packets transmitted, 49 packets received, 2% packet loss
round-trip min/avg/max = 20.917/70.216/147.258 ms
1466786342
PING 10.0.0.6 (10.0.0.6): 56 data bytes

--- 10.0.0.6 ping statistics ---
50 packets transmitted, 50 packets received, 0% packet loss
round-trip min/avg/max = 29.535/65.768/126.983 ms
1466786391
round-trip min/avg/max = 30.238/62.772/102.959 ms
1466786442
round-trip min/avg/max = 5.475/40.986/96.964 ms
1466786492
round-trip min/avg/max = 5.276/61.309/112.530 ms
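
Edit: this code was written to handle multiple files, but I would be happy even if it worked correctly for just 2 files.

However, it does not remove the text blocks as it should. Instead it removes several useful lines and makes the output completely useless. I'm stuck.

How can I achieve this? Thanks.

3 Answers:

Answer 0: (score: 3)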

I think you want to compare whole blocks, not individual lines.
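To illustrate why the line-by-line approach breaks the file structure, here is a minimal sketch. `LineUnionDemo` and `MergeByLine` are hypothetical names used only for this illustration; the method does the same thing as the `Union` call in the question, applied to two arrays of lines.

```csharp
using System.Linq;

public static class LineUnionDemo
{
    // Line-wise merge as in the question: Enumerable.Union is a set union,
    // so only the first occurrence of each distinct line survives.
    public static string[] MergeByLine(string[] first, string[] second)
    {
        return first.Union(second).ToArray();
    }
}
```

If both inputs contain a block that starts with `--- 10.0.0.6 ping statistics ---`, the result contains that header only once, so every later block loses its header. The same happens to any other line whose text repeats, such as the `PING 10.0.0.6 (10.0.0.6): 56 data bytes` line.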

Something like the following should work:

public static void UnionFiles()
{
    var firstFilePath = "log1.txt";
    var secondFilePath = "log2.txt";

    var firstLogBlocks = ReadFileAsLogBlocks(firstFilePath);
    var secondLogBlocks = ReadFileAsLogBlocks(secondFilePath);

    var cleanLogBlock = firstLogBlocks.Union(secondLogBlocks);

    var cleanLog = new StringBuilder();
    foreach (var block in cleanLogBlock)
    {
        cleanLog.Append(block);
    }

    File.WriteAllText("cleanLog.txt", cleanLog.ToString());
}

private static List<LogBlock> ReadFileAsLogBlocks(string filePath)
{
    var allLinesLog = File.ReadAllLines(filePath);

    var logBlocks = new List<LogBlock>();
    var currentBlock = new List<string>();

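    // Assumes every block has exactly 5 non-empty lines
    // (statistics header, packets, round-trip, timestamp, PING).
    // A shorter trailing block is never added to the result.
    var i = 0;
    foreach (var line in allLinesLog)
    {
        if (!string.IsNullOrEmpty(line))
        {
            currentBlock.Add(line);
            if (i == 4)
            {
                logBlocks.Add(new LogBlock(currentBlock.ToArray()));
                currentBlock.Clear();
                i = 0;
            }
            else
            {
                i++;
            }
        }
    }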

    return logBlocks;
}

With the log block defined as follows:

public class LogBlock
{
    private readonly string[] _logs;

    public LogBlock(string[] logs)
    {
        _logs = logs;
    }

    public override string ToString()
    {
        var logBlock = new StringBuilder();
        foreach (var log in _logs)
        {
            logBlock.AppendLine(log);
        }

        return logBlock.ToString();
    }

    public override bool Equals(object obj)
    {
        return obj is LogBlock && Equals((LogBlock)obj);
    }

    private bool Equals(LogBlock other)
    {
        return _logs.SequenceEqual(other._logs);
    }

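    public override int GetHashCode()
    {
        // unchecked: overflow wraps around even if the project is compiled with checked arithmetic.
        unchecked
        {
            var hashCode = 17;
            foreach (var log in _logs)
            {
                hashCode = hashCode * 31 + log.GetHashCode();
            }
            return hashCode;
        }
    }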
}

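Note that LogBlock overrides Equals and provides a consistent GetHashCode implementation, because Union uses both (through the default equality comparer) to decide whether two elements are the same.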
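As a variation on the block idea above, the sketch below does not rely on a fixed block length. It assumes (this is not stated in the original answer) that every block starts with a line beginning with `---` and that the timestamp is the first line in the block made up only of digits. `PingLogMerger`, `SplitIntoBlocks`, `KeyOf` and `Merge` are hypothetical names. Blocks are compared by timestamp, so a block that ends a file without its trailing `PING` line still matches its longer copy in the next file.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PingLogMerger
{
    // Splits lines into blocks; a new block starts at every line beginning with "---".
    // Lines that appear before the first header form a block of their own.
    public static List<List<string>> SplitIntoBlocks(IEnumerable<string> lines)
    {
        var blocks = new List<List<string>>();
        List<string> current = null;
        foreach (var line in lines)
        {
            if (string.IsNullOrWhiteSpace(line)) continue;
            if (current == null || line.StartsWith("---", StringComparison.Ordinal))
            {
                current = new List<string>();
                blocks.Add(current);
            }
            current.Add(line);
        }
        return blocks;
    }

    // Assumption: the timestamp is the first all-digit line of the block.
    // Blocks without such a line fall back to their full text as the key.
    public static string KeyOf(List<string> block)
    {
        var timestamp = block.FirstOrDefault(l => l.Length > 0 && l.All(char.IsDigit));
        return timestamp ?? string.Join("\n", block);
    }

    // Walks the files in order and keeps the first block seen for each key,
    // separating the kept blocks with an empty line.
    public static List<string> Merge(IEnumerable<IEnumerable<string>> files)
    {
        var seen = new HashSet<string>();
        var output = new List<string>();
        foreach (var file in files)
        {
            foreach (var block in SplitIntoBlocks(file))
            {
                if (!seen.Add(KeyOf(block))) continue;
                if (output.Count > 0) output.Add("");
                output.AddRange(block);
            }
        }
        return output;
    }
}
```

A possible call would be `File.WriteAllLines("cleanLog.txt", PingLogMerger.Merge(new[] { File.ReadAllLines("log1.txt"), File.ReadAllLines("log2.txt") }))`. Because `Merge` takes a sequence of files, the same sketch also covers the multi-file case mentioned in the question.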

Answer 1: (score: 1)

A rather hacky solution using regular expressions:

var logBlockPattern = new Regex(@"(^---.*ping statistics ---$)\s+"
                              + @"(^.+packets transmitted.+packets received.+packet loss$)\s+"
                              + @"(^round-trip min/avg/max.+$)\s+"
                              + @"(^\d+$)\s*"
                              + @"(^PING.+$)?",
                                RegexOptions.Multiline);

var logBlocks1 = logBlockPattern.Matches(FileContent1).Cast<Match>().ToList();
var logBlocks2 = logBlockPattern.Matches(FileContent2).Cast<Match>().ToList();

var mergedLogBlocks = logBlocks1.Concat(logBlocks2.Where(lb2 => 
    logBlocks1.All(lb1 => lb1.Groups[4].Value != lb2.Groups[4].Value)));

var mergedLogContents = string.Join("\n\n", mergedLogBlocks);

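The Groups collection of a regex Match contains each line of a log block (because every line is wrapped in parentheses () in the pattern), plus the complete match at index 0. Therefore the match group at index 4 is the timestamp, which we can use to compare log blocks.

Working example: https://dotnetfiddle.net/kAkGll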
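The group numbering can be checked in isolation with a small sketch. It reuses the pattern from the answer unchanged; `GroupIndexDemo` and `TimestampOf` are hypothetical names. It assumes the input uses `\n` line endings: with `RegexOptions.Multiline`, `$` matches only directly before `\n`, so files with `\r\n` endings would likely need to be normalized first.

```csharp
using System.Text.RegularExpressions;

public static class GroupIndexDemo
{
    // Same pattern as in the answer; each parenthesized part is one capture group.
    private static readonly Regex LogBlockPattern = new Regex(
        @"(^---.*ping statistics ---$)\s+"
      + @"(^.+packets transmitted.+packets received.+packet loss$)\s+"
      + @"(^round-trip min/avg/max.+$)\s+"
      + @"(^\d+$)\s*"
      + @"(^PING.+$)?",
        RegexOptions.Multiline);

    // Returns the timestamp (group 4) of the first log block in the text,
    // or null if the text contains no complete block.
    public static string TimestampOf(string text)
    {
        var match = LogBlockPattern.Match(text);
        return match.Success ? match.Groups[4].Value : null;
    }
}
```

Groups 1 to 3 hold the header, packet and round-trip lines, and group 5 holds the optional `PING` line, which is empty when a block ends the file.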

Answer 2: (score: -2)

There is a problem in how the unique records are found. Could you check the code below?

public static void UnionFiles()
{ 

    string folderPath = Path.Combine(Path.GetDirectoryName(Assembly.GetEntryAssembly().Location), "http");
    string outputFilePath = Path.Combine(Path.GetDirectoryName(Assembly.GetEntryAssembly().Location), "http\\union.dat");
    var union = new List<string>();

    foreach (string filePath in Directory
            .EnumerateFiles(folderPath, "*.txt")
            .OrderBy(x => Path.GetFileNameWithoutExtension(x)))
    {
        var filter = File.ReadAllLines(filePath).Where(x => !union.Contains(x)).ToList();
        union.AddRange(filter);
    }
    File.WriteAllLines(outputFilePath, union);
}