Advice on choosing a data structure for a complex sort

Time: 2013-02-28 03:24:55

Tags: algorithm perl sorting

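I am looking for some help writing Perl code to sort a log file.

I am relatively new to both coding and Perl!

I need to write my code using core Perl modules as far as possible, but if that turns out to be impossible I am open to CPAN modules. The log file contains a list of logged messages that need to be put back in order. It should be simple, but there are a number of gotchas, and they are giving me trouble with how to design my data structure. The input file is CSV. The output needs to be the same messages in timestamp order, with the parts of a concatenated message grouped together at the position of its first part.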

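Gotchas

  1. Messages need to be sorted by timestamp.
  2. If a message has been split across multiple lines, the last field contains something like "(part 1 of 3 of message reference 1)". For a given message reference, all parts need to be in order: part 1, part 2, part 3, and so on.
  3. The hex digits at the start of this field tell me whether it is an 8-bit or a 16-bit reference. An 8-bit reference does not match a 16-bit reference with the same number (as a duplicate), so I need to take that into account.
  4. Message parts may be missing, so we might only get parts 1 and 2 of 3.
  5. Message reference numbers may be duplicated, so each message reference needs to be tied to the From field to give it a unique identity.
  6. Even with the unique identity from (3), a reference can still repeat over time (there are only so many message reference numbers before they reset), so I need to check the time of the last part seen for a repeated message reference. If the time between message parts is more than 3 days, I can treat it as a new message.
  7. Finally, the log file may contain hundreds of thousands of lines to reorder, so loading it all into memory may not be an option.

It is probably best if I just show some sample input data and then how it should come out.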

Input data

    #message uniqueID,From,To,Time,flag,content,IP,concatenation info   
    1,"+1231231234","+15125562100","7 Sep 2012 22:08:33","","abcdefghijklmnopqrstuvwxyz",,
    2,"+1231231234","+15125562100","7 Sep 2012 22:08:37","","abcdefghijklmnopqrstuvwxyz",,
    3,"+1231231234","+15125562100","7 Sep 2012 22:08:41","","abcdefghijklmnopqrstuvwxyz",,
    4,"+8888888888","+15125562100","7 Sep 2012 22:09:01","","SHORTUDH: Thus I sat engaged in guessing, but no syllable expressing To the fowl, whose fiery eyes now burned into my bosoms core; This and more I sat divining, wi",,"BQADAQMB  (part 1 of 3 of message reference 1)"
    5,"+8888888888","+15125562100","7 Sep 2012 22:09:04","","h my head at ease reclining On the cushions velvet lining that the lamplight gloated oer, But whose velvet violet lining with the lamplight gloating oer She shall ",,"BQADAQMC  (part 2 of 3 of message reference 1)"
    6,"+8888888888","+15125562100","7 Sep 2012 22:09:05","","ress, ah, nevermore!",,"BQADAQMD  (part 3 of 3 of message reference 1)"
    7,"+8888888888","+15125562100","7 Sep 2012 22:09:06","","LONGUDH: Thus I sat engaged in guessing, but no syllable expressing To the fowl, whose fiery eyes now burned into my bosoms core; This and more I sat divining, wit",,"BggEAAIDAQ==  (part 1 of 3 of message reference 2)"
    8,"+8888888888","+15125562100","7 Sep 2012 22:09:07",""," my head at ease reclining On the cushions velvet lining that the lamplight gloated oer, But whose velvet violet lining with the lamplight gloating oer She shall p",,"BggEAAIDAg==  (part 2 of 3 of message reference 2)"
    10,"+1231231234","+15125562100","7 Sep 2012 22:09:46","","abcdefghijklmnopqrstuvwxyz",,
    11,"+1231231234","+15125562100","7 Sep 2012 22:09:50","","abcdefghijklmnopqrstuvwxyz",,
    12,"+1231231234","+15125562100","7 Sep 2012 22:09:55","","abcdefghijklmnopqrstuvwxyz",,
    13,"+8888888888","+15125562100","13 Sep 2012 22:10:36","","SHORTUDH: Thus I sat engaged in guessing, but no syllable expressing To the fowl, whose fiery eyes now burned into my bosoms core; This and more I sat divining, wi",,"BQADAQMB  (part 1 of 3 of message reference 1)"
    14,"+8888888888","+15125562100","13 Sep 2012 22:10:38","","h my head at ease reclining On the cushions velvet lining that the lamplight gloated oer, But whose velvet violet lining with the lamplight gloating oer She shall ",,"BQADAQMC  (part 2 of 3 of message reference 1)"
    15,"+8888888888","+15125562100","13 Sep 2012 22:10:39","","ress, ah, nevermore!",,"BQADAQMD  (part 3 of 3 of message reference 1)"
    16,"+8888888889","+15125562100","7 Sep 2012 22:09:06","","LONGUDH: Thus I sat engaged in guessing, but no syllable expressing To the fowl, whose fiery eyes now burned into my bosoms core; This and more I sat divining, wit",,"BggEAAIDAQ==  (part 1 of 3 of message reference 2)"
    17,"+8888888889","+15125562100","7 Sep 2012 22:10:42",""," my head at ease reclining On the cushions velvet lining that the lamplight gloated oer, But whose velvet violet lining with the lamplight gloating oer She shall p",,"BggEAAIDAg==  (part 2 of 3 of message reference 2)"
    18,"+8888888889","+15125562100","7 Sep 2012 22:10:43","","ess, ah, nevermore!",,"BggEAAIDAw==  (part 3 of 3 of message reference 2)"
    19,"+1231231234","+15125562100","13 Sep 2012 20:12:52","","Deposit SMS with readreceiptrequest = false #0",,
    20,"+1231231234","+15125562100","13 Sep 2012 20:12:53","","Deposit SMS with readreceiptrequest = false #1",,
    21,"+1231231234","+15125562100","13 Sep 2012 20:12:54","","Deposit SMS with readreceiptrequest = false #2",,
    22,"+8888888888","+15125562100","13 Sep 2012 20:12:55","","Deposit SMS with readreceiptrequest = false #0: Thus I sat engaged in guessing, but no syllable expressing To the fowl, whose fiery eyes now burned into my bosoms ",,"BQADAAMB  (part 1 of 3 of message reference 0)"
    23,"+8888888888","+15125562100","13 Sep 2012 20:12:57","","ore; This and more I sat divining, with my head at ease reclining On the cushions velvet lining that the lamplight gloated oer, But whose velvet violet lining with",,"BQADAAMC  (part 2 of 3 of message reference 0)"
    24,"+8888888888","+15125562100","13 Sep 2012 20:12:58","","the lamplight gloating oer She shall press, ah, nevermore!",,"BQADAAMD  (part 3 of 3 of message reference 0)"
    25,"+8888888888","+15125562100","7 Sep 2012 22:10:40","","LONGUDH: Thus I sat engaged in guessing, but no syllable expressing To the fowl, whose fiery eyes now burned into my bosoms core; This and more I sat divining, wit",,"BggEAAIEAQ==  (part 1 of 2 of message reference 3)"
    26,"+8888888888","+15125562100","7 Sep 2012 22:10:42","","LONGUDH: Thus I sat engaged in guessing, but no syllable expressing To the fowl, whose fiery eyes now burned into my bosoms core; This and more I sat divining, wit",,"BggEAAIEAQ==  (part 1 of 2 of message reference 3)"
    27,"+8888888888","+15125562100","7 Sep 2012 22:10:43","","ess, ah, nevermore!",,"BggEAAIEAw==  (part 2 of 2 of message reference 3)"
    28,"+8888888888","+15125562100","13 Sep 2012 20:13:02","","Deposit SMS with readreceiptrequest = false #2: Thus I sat engaged in guessing, but no syllable expressing To the fowl, whose fiery eyes now burned into my bosoms ",,"BQADAgMB  (part 1 of 3 of message reference 2)"
    29,"+8888888888","+15125562100","13 Sep 2012 20:13:03","","ore; This and more I sat divining, with my head at ease reclining On the cushions velvet lining that the lamplight gloated oer, But whose velvet violet lining with",,"BQADAgMC  (part 2 of 3 of message reference 2)"
    30,"+8888888888","+15125562100","13 Sep 2012 20:13:04","","the lamplight gloating oer She shall press, ah, nevermore!",,"BQADAgMD  (part 3 of 3 of message reference 2)"
    31,"+1231231234","+15125562100","13 Sep 2012 20:13:08","","Deposit SMS with readreceiptrequest = true #0",  
    

Output data

    #message uniqueID,From,To,Time,flag,content,IP,concatenation info   
    1,"+1231231234","+15125562100","7 Sep 2012 22:08:33","","abcdefghijklmnopqrstuvwxyz",,
    2,"+1231231234","+15125562100","7 Sep 2012 22:08:37","","abcdefghijklmnopqrstuvwxyz",,
    3,"+1231231234","+15125562100","7 Sep 2012 22:08:41","","abcdefghijklmnopqrstuvwxyz",,
    4,"+8888888888","+15125562100","7 Sep 2012 22:09:01","","SHORTUDH: Thus I sat engaged in guessing, but no syllable expressing To the fowl, whose fiery eyes now burned into my bosoms core; This and more I sat divining, wi",,"BQADAQMB  (part 1 of 3 of message reference 1)"
    5,"+8888888888","+15125562100","7 Sep 2012 22:09:04","","h my head at ease reclining On the cushions velvet lining that the lamplight gloated oer, But whose velvet violet lining with the lamplight gloating oer She shall ",,"BQADAQMC  (part 2 of 3 of message reference 1)"
    6,"+8888888888","+15125562100","7 Sep 2012 22:09:05","","ress, ah, nevermore!",,"BQADAQMD  (part 3 of 3 of message reference 1)"
    16,"+8888888889","+15125562100","7 Sep 2012 22:09:06","","LONGUDH: Thus I sat engaged in guessing, but no syllable expressing To the fowl, whose fiery eyes now burned into my bosoms core; This and more I sat divining, wit",,"BggEAAIDAQ==  (part 1 of 3 of message reference 2)"
    17,"+8888888889","+15125562100","7 Sep 2012 22:10:42",""," my head at ease reclining On the cushions velvet lining that the lamplight gloated oer, But whose velvet violet lining with the lamplight gloating oer She shall p",,"BggEAAIDAg==  (part 2 of 3 of message reference 2)"
    18,"+8888888889","+15125562100","7 Sep 2012 22:10:43","","ess, ah, nevermore!",,"BggEAAIDAw==  (part 3 of 3 of message reference 2)"
    7,"+8888888888","+15125562100","7 Sep 2012 22:09:06","","LONGUDH: Thus I sat engaged in guessing, but no syllable expressing To the fowl, whose fiery eyes now burned into my bosoms core; This and more I sat divining, wit",,"BggEAAIDAQ==  (part 1 of 3 of message reference 2)"
    8,"+8888888888","+15125562100","7 Sep 2012 22:09:07",""," my head at ease reclining On the cushions velvet lining that the lamplight gloated oer, But whose velvet violet lining with the lamplight gloating oer She shall p",,"BggEAAIDAg==  (part 2 of 3 of message reference 2)"
    10,"+1231231234","+15125562100","7 Sep 2012 22:09:46","","abcdefghijklmnopqrstuvwxyz",,
    11,"+1231231234","+15125562100","7 Sep 2012 22:09:50","","abcdefghijklmnopqrstuvwxyz",,
    12,"+1231231234","+15125562100","7 Sep 2012 22:09:55","","abcdefghijklmnopqrstuvwxyz",,
    25,"+8888888888","+15125562100","7 Sep 2012 22:10:40","","LONGUDH: Thus I sat engaged in guessing, but no syllable expressing To the fowl, whose fiery eyes now burned into my bosoms core; This and more I sat divining, wit",,"BggEAAIEAQ==  (part 1 of 2 of message reference 3)"
    26,"+8888888888","+15125562100","7 Sep 2012 22:10:42","","LONGUDH: Thus I sat engaged in guessing, but no syllable expressing To the fowl, whose fiery eyes now burned into my bosoms core; This and more I sat divining, wit",,"BggEAAIEAQ==  (part 1 of 2 of message reference 3)"
    27,"+8888888888","+15125562100","7 Sep 2012 22:10:43","","ess, ah, nevermore!",,"BggEAAIEAw==  (part 2 of 2 of message reference 3)"
    19,"+1231231234","+15125562100","13 Sep 2012 20:12:52","","Deposit SMS with readreceiptrequest = false #0",,
    20,"+1231231234","+15125562100","13 Sep 2012 20:12:53","","Deposit SMS with readreceiptrequest = false #1",,
    21,"+1231231234","+15125562100","13 Sep 2012 20:12:54","","Deposit SMS with readreceiptrequest = false #2",,
    22,"+8888888888","+15125562100","13 Sep 2012 20:12:55","","Deposit SMS with readreceiptrequest = false #0: Thus I sat engaged in guessing, but no syllable expressing To the fowl, whose fiery eyes now burned into my bosoms ",,"BQADAAMB  (part 1 of 3 of message reference 0)"
    23,"+8888888888","+15125562100","13 Sep 2012 20:12:57","","ore; This and more I sat divining, with my head at ease reclining On the cushions velvet lining that the lamplight gloated oer, But whose velvet violet lining with",,"BQADAAMC  (part 2 of 3 of message reference 0)"
    24,"+8888888888","+15125562100","13 Sep 2012 20:12:58","","the lamplight gloating oer She shall press, ah, nevermore!",,"BQADAAMD  (part 3 of 3 of message reference 0)"
    28,"+8888888888","+15125562100","13 Sep 2012 20:13:02","","Deposit SMS with readreceiptrequest = false #2: Thus I sat engaged in guessing, but no syllable expressing To the fowl, whose fiery eyes now burned into my bosoms ",,"BQADAgMB  (part 1 of 3 of message reference 2)"
    29,"+8888888888","+15125562100","13 Sep 2012 20:13:03","","ore; This and more I sat divining, with my head at ease reclining On the cushions velvet lining that the lamplight gloated oer, But whose velvet violet lining with",,"BQADAgMC  (part 2 of 3 of message reference 2)"
    30,"+8888888888","+15125562100","13 Sep 2012 20:13:04","","the lamplight gloating oer She shall press, ah, nevermore!",,"BQADAgMD  (part 3 of 3 of message reference 2)"
    31,"+1231231234","+15125562100","13 Sep 2012 20:13:08","","Deposit SMS with readreceiptrequest = true #0",
    13,"+8888888888","+15125562100","13 Sep 2012 22:10:36","","SHORTUDH: Thus I sat engaged in guessing, but no syllable expressing To the fowl, whose fiery eyes now burned into my bosoms core; This and more I sat divining, wi",,"BQADAQMB  (part 1 of 3 of message reference 1)"
    14,"+8888888888","+15125562100","13 Sep 2012 22:10:38","","h my head at ease reclining On the cushions velvet lining that the lamplight gloated oer, But whose velvet violet lining with the lamplight gloating oer She shall ",,"BQADAQMC  (part 2 of 3 of message reference 1)"
    15,"+8888888888","+15125562100","13 Sep 2012 22:10:39","","ress, ah, nevermore!",,"BQADAQMD  (part 3 of 3 of message reference 1)"
    

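What I have done so far:

  1. Converted the time field to epoch time to make any comparison easier.
  2. I can read the file in (and write a file out).
  3. I can parse all the CSV columns.
  4. I can split the concatenation info into its parts: 8-bit or 16-bit reference, part number, total parts and reference ID.

Now I am stuck on the best way to filter and sort the data efficiently. I have tried using a hash and loading the file into memory first so that I can sort on a particular message reference, but I am not sure that will work for large files.

Then I thought about reading it line by line, but I could hit the case where the second line contains the first part of a concatenated SMS and the following parts do not turn up until the very end of the file, so I think that may not be a good idea either.

I also thought about a database, but I think it would be too complicated to set up on the systems this needs to run on. Another option would be to write a package and store the complex structure as objects. Maybe I am overcomplicating this? My brain is definitely getting fried!

Anyway, any ideas or guidance would be much appreciated.

I hope the above is clear, but if you have any questions, please let me know.

Thanks, Will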

2 answers:

Answer 0 (score: 2)

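I don't think this problem is too complex once it is broken down properly.

As I see it, your sorting program will consist of the following stages:

  1. Extract the relevant information from each line (timestamp and concatenation info).
  2. Group the lines by message reference; this can be done memory-efficiently with a cache.
  3. Sort the groups by timestamp.
  4. Flatten the groups back into the original lines.

The Schwartzian transform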

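The Schwartzian transform is a common pattern when sorting in Perl. It speeds up sorts where the sort key has to be extracted from the data being sorted, by extracting the key once per item rather than once per comparison. It can also be described as decorate-sort-undecorate.

Example: sorting strings by length. Note that in this case the naive implementation would be better.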

    my @words = qw( aaa b cccc );
    my @sorted_words = 
        map  { $_->[1]             } # flatten
        sort { $a->[0] <=> $b->[0] } # sort by first field (length)
        map  { [ length $_, $_ ]   } # decorate: return arrayref with key and data
        @words;
    print "[@sorted_words]\n"; # prints "[b aaa cccc]"
    

It would be good to keep this pattern in mind for your task.
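As a rough illustration of how the pattern carries over to the log lines from the question, here is a sketch that decorates each line with its epoch time before sorting. The helper name line_epoch and the regex for the Time column are my own assumptions based on the sample data, and it uses the core module Time::Piece (which treats the timestamps as UTC) rather than anything the question confirmed.

```perl
use strict;
use warnings;
use Time::Piece;

# Hypothetical helper: pull the quoted Time column out of one log line
# and convert it to epoch seconds (interpreted as UTC).
sub line_epoch {
    my ($line) = @_;
    my ($time) = $line =~ /"(\d{1,2} \w{3} \d{4} \d\d:\d\d:\d\d)"/
        or die "No timestamp found in: $line";
    return Time::Piece->strptime($time, '%d %b %Y %H:%M:%S')->epoch;
}

# Three rows taken from the sample input, deliberately out of order.
my @lines = (
    '11,"+1231231234","+15125562100","7 Sep 2012 22:09:50","","abcdefghijklmnopqrstuvwxyz",,',
    '19,"+1231231234","+15125562100","13 Sep 2012 20:12:52","","Deposit SMS with readreceiptrequest = false #0",,',
    '1,"+1231231234","+15125562100","7 Sep 2012 22:08:33","","abcdefghijklmnopqrstuvwxyz",,',
);

my @sorted =
    map  { $_->[1] }                   # undecorate: keep only the line
    sort { $a->[0] <=> $b->[0] }       # compare the precomputed epochs
    map  { [ line_epoch($_), $_ ] }    # decorate: one strptime call per line
    @lines;

print "$_\n" for @sorted;
```

The point of the decorate step is that strptime runs once per line instead of twice per comparison, which matters with hundreds of thousands of lines.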

1. Extraction

You have already got this working. For each line, we output an array reference (or similar) with these fields:

    0: timestamp (in epoch)
    1: part no            \
    2: total parts        | these are undef if no concat info is present
    3: message reference  /
    4: The unmodified line
    

For the CSV extraction you should use Text::CSV. To compute the epoch, you should look at DateTime.
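If you need to stay with core modules, the extraction step could look roughly like the sketch below. Everything in it is an assumption drawn from the sample data: the helper name extract, the regexes (which only work while the From, To and Time columns contain no embedded quotes, so Text::CSV remains the safer choice for real data), and the reading of the leading token of the last field as Base64-encoded header bytes whose second byte is 0x00 for an 8-bit reference and 0x08 for a 16-bit one. The grouping key combines From, reference width and reference number, as the question's gotchas require.

```perl
use strict;
use warnings;
use Time::Piece;
use MIME::Base64 qw(decode_base64);

# Hypothetical helper: turn one raw data line into
# [ epoch, part no, total parts, grouping key, unmodified line ].
# The "#message uniqueID,..." header line must be skipped before calling it.
sub extract {
    my ($line) = @_;
    my ($from, $time) = $line =~ /^\d+,"([^"]*)","[^"]*","([^"]*)",/
        or die "Unexpected line format: $line";
    my $epoch = Time::Piece->strptime($time, '%d %b %Y %H:%M:%S')->epoch;

    my ($udh, $part, $total, $ref) = $line =~
        /"([A-Za-z0-9+\/=]+)\s+\(part (\d+) of (\d+) of message reference (\d+)\)"\s*$/;
    return [ $epoch, undef, undef, undef, $line ] unless defined $udh;

    # Assumption: the second decoded byte identifies the reference width,
    # 0x08 meaning 16-bit and anything else treated as 8-bit.
    my $iei  = ord(substr(decode_base64($udh), 1, 1));
    my $bits = $iei == 0x08 ? 16 : 8;
    return [ $epoch, $part, $total, "$from|$bits|$ref", $line ];
}

# Two rows from the sample input, with the content column abbreviated.
my $short = extract('4,"+8888888888","+15125562100","7 Sep 2012 22:09:01","","SHORTUDH: ...",,"BQADAQMB  (part 1 of 3 of message reference 1)"');
my $long  = extract('7,"+8888888888","+15125562100","7 Sep 2012 22:09:06","","LONGUDH: ...",,"BggEAAIDAQ==  (part 1 of 3 of message reference 2)"');

print "$short->[3]\n";    # prints "+8888888888|8|1"
print "$long->[3]\n";     # prints "+8888888888|16|2"
```

Because the width is part of the key, an 8-bit reference 2 and a 16-bit reference 2 from the same sender end up in different groups.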

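2. Grouping

We define the cache as a hash, with the message reference as the key and the group as the value. A group is an arrayref in the extraction format specified above, but it can hold further lines at position 5 and onward (so each decorated line is already a group).

For each decorated line we receive, we carry out the following steps: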

    # pseudocode
    # this is how I understood your requirements,
    # but it may be wrong. The general principle still holds
    # (you may need to choose a different key)
    IF the line doesn't have part information, THEN
        pass it on immediately.
    ELSE
        IF the hash has an entry for our message reference, THEN
            IF the timestamp of the present group is too old, THEN
                pass on the existing group.
                Add our line for this key.
            ELSE
                Update the group with our line,
                adding the original line (at position 3 + part no),
                but not the metadata to the group.
                IF the group is made complete, THEN
                    pass it on immediately,
                    delete this entry from the hash.
        ELSE
            Add the line as a group.
            Make sure the content is at position 3 + part no, to allow easy updating.
    

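Once there are no more lines, we pass every value still left in the hash on to the next stage.

The important thing to realise is that you do not have to keep all the lines in memory here, only the incomplete groups.

Perl functions of interest are exists $hash{element} and delete $hash{element}. The delete may be important for saving memory.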
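The grouping stage described above could be sketched in Perl roughly as follows. This is only one possible reading of the pseudocode: make_grouper, the emit callback and the hash-based group layout are hypothetical names and choices of mine (I store parts in a list rather than at fixed array positions so that duplicate parts survive), and the key is assumed to already combine From, reference width and reference number.

```perl
use strict;
use warnings;

my $MAX_GAP = 3 * 24 * 60 * 60;    # three days in seconds

# Hypothetical helper. $emit is called with [ epoch, line, line, ... ] for
# every finished group. Returns one closure to feed records in and one to
# flush whatever is still cached at the end of the input.
sub make_grouper {
    my ($emit) = @_;
    my %cache;    # key => { first, last, total, seen => {}, parts => [] }

    my $pass_on = sub {
        my $group = delete $cache{ $_[0] };    # delete frees the cache slot
        my @lines = map { $_->[2] }
            sort { $a->[0] <=> $b->[0] || $a->[1] <=> $b->[1] }
            @{ $group->{parts} };              # by part no, then arrival order
        $emit->([ $group->{first}, @lines ]);
    };

    my $add = sub {
        my ($epoch, $part, $total, $key, $line) = @{ $_[0] };
        return $emit->([ $epoch, $line ]) unless defined $part;

        # A reference seen again after more than three days is a new message.
        $pass_on->($key)
            if exists $cache{$key} && abs($epoch - $cache{$key}{last}) > $MAX_GAP;

        my $group = $cache{$key}
            //= { first => $epoch, total => $total, seen => {}, parts => [] };
        $group->{first} = $epoch if $epoch < $group->{first};
        $group->{last}  = $epoch;
        $group->{seen}{$part} = 1;
        push @{ $group->{parts} }, [ $part, scalar @{ $group->{parts} }, $line ];
        $pass_on->($key)
            if scalar(keys %{ $group->{seen} }) == $group->{total};
    };

    my $flush = sub { $pass_on->($_) for sort keys %cache };
    return ($add, $flush);
}

my @groups;
my ($add, $flush) = make_grouper(sub { push @groups, $_[0] });
$add->($_) for (
    [ 100, undef, undef, undef,   'plain' ],
    [ 205, 2,     2,     'A|8|1', 'second part' ],
    [ 201, 1,     2,     'A|8|1', 'first part' ],
    [ 300, 1,     3,     'B|8|7', 'orphan part' ],
);
$flush->();
print join(' / ', @$_), "\n" for @groups;
```

Each emitted group carries the earliest epoch of its parts in position 0, which is the value the sorting stage would compare. Only incomplete groups stay in %cache, so memory use depends on how many messages are in flight rather than on the file size.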

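3. Sorting

We simply sort the elements by timestamp. If the total data is too much for the system to handle, we can use a trick:

  1. Sort smaller chunks of the data and save each one to a file.
  2. Open each file.
  3. Load the first item from each file.
  4. While at least one file has items left:
      1. Sort all loaded items.
      2. Pass on the first element of the result.
      3. Load the next item from the file that element came from.
  5. Pass on the remaining (already loaded) items in the correct order.

This is time-consuming, however.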
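The chunk-and-merge trick above can be sketched with the core module File::Temp. The helper names write_chunk, merge_chunks and epoch_of are hypothetical, and the sketch assumes each record has already been serialised as a single line of the form "epoch<TAB>payload", which is my own simplification and not something the steps above prescribe.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Each record is one line of the form "epoch<TAB>payload".
sub epoch_of {
    my ($epoch) = split /\t/, $_[0], 2;
    return $epoch;
}

# Sort one in-memory chunk and save it to a temporary file.
sub write_chunk {
    my (@records) = @_;
    my ($fh, $name) = tempfile(UNLINK => 1);
    print {$fh} "$_\n" for sort { epoch_of($a) <=> epoch_of($b) } @records;
    close $fh or die "Cannot close $name: $!";
    return $name;
}

# Merge sorted chunk files, handing records to $emit in global order.
sub merge_chunks {
    my ($emit, @files) = @_;
    my @heads;    # one [ epoch, record, filehandle ] per non-empty file
    for my $file (@files) {
        open my $fh, '<', $file or die "Cannot open $file: $!";
        my $record = <$fh>;
        next unless defined $record;
        chomp $record;
        push @heads, [ epoch_of($record), $record, $fh ];
    }
    while (@heads) {
        @heads = sort { $a->[0] <=> $b->[0] } @heads;
        my $first = $heads[0];
        $emit->($first->[1]);                  # smallest loaded item
        my $next = readline($first->[2]);      # refill from the same file
        if (defined $next) {
            chomp $next;
            @{$first}[0, 1] = (epoch_of($next), $next);
        }
        else {
            close $first->[2];
            shift @heads;                      # this file is exhausted
        }
    }
    return;
}

my @merged;
merge_chunks(
    sub { push @merged, $_[0] },
    write_chunk("30\tc", "10\ta", "50\te"),
    write_chunk("40\td", "20\tb"),
);
print "$_\n" for @merged;
```

Only one record per chunk file is held in memory at a time. Re-sorting the loaded items on every step follows the description literally; with many chunk files a priority queue would probably be a better fit.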

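4. Flattening

At this point we only receive items that are already sorted and grouped. All we have to do is output the lines they contain in the correct order.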

Answer 1 (score: 0)

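I would do this in two stages: combining the message parts, and sorting. That should simplify the problem somewhat.

First, I would use an external sort utility (for example, the GNU sort tool) to sort by message number. That will at least group all the parts with the same message number together. A simple sort <inputfile >outputfile will do what you need. What you are really interested in is getting all the lines that start with, for example, 371,"... next to each other.

Then you can write a Perl program that reads that output and accumulates the lines with the same message number. When you see a different message number, process the lines you have accumulated to combine the message from its parts, and write that record to a file. You will probably want to write the output in a form that is easier to sort, perhaps by putting the fields you sort on at the front of the record, zero-padded where necessary, to simplify sorting.

When that is done, you have a file with one record per line, and if you built the records correctly, another sort <inputfile >outputfile will put the data in the order you want.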
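To illustrate the zero-padded key idea, here is a small sketch. The record layout (a 10-digit epoch and a 3-digit part number in front of the line) and the helper names add_sort_key and strip_sort_key are hypothetical, and both parts of a message are given the same epoch on the assumption that a group is ordered by the time of its first part.

```perl
use strict;
use warnings;

# Hypothetical record layout: a fixed-width, zero-padded epoch and part
# number in front of the original line, so a plain text sort orders it.
sub add_sort_key {
    my ($epoch, $part, $line) = @_;
    return sprintf '%010d,%03d,%s', $epoch, $part // 0, $line;
}

# Remove the prefix again once the external sort has run.
sub strip_sort_key {
    my ($record) = @_;
    $record =~ s/^\d{10},\d{3},//;
    return $record;
}

my @records = (
    add_sort_key(1347055746, 2,     'second part'),
    add_sort_key(1347055746, 1,     'first part'),
    add_sort_key(999999999,  undef, 'older single message'),
);

# Perl's default sort compares strings, standing in for the external sort.
print strip_sort_key($_), "\n" for sort @records;
```

Without the padding, the 9-digit epoch would sort after the 10-digit ones as text. When using the external tool, running it as LC_ALL=C sort should give the same plain byte-wise ordering regardless of locale.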

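This also simplifies your programming: you do not have to worry about writing a custom sort for your data. Instead, you write a relatively simple Perl program that transforms the data so that it is easier to sort with existing tools.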