Grouping lines of text: if A = B and B = C, then A = C

Time: 2014-10-23 19:38:41

Tags: python sorting text merge

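I am currently parsing the output of the dig command. For a given name, it prints the chain of canonical names (CNAME records), and the last record holds the actual IP address.

For example, resolving mail.yahoo.com with dig produces the following: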

borrajax@borrajax.kom /tmp/ $ dig @8.8.8.8 @4.2.2.2 +nocomments \
     +noquestion +noauthority +noadditional \
     +nostats +nocmd mail.yahoo.com

mail.yahoo.com.     0   IN  CNAME   login.yahoo.com.
login.yahoo.com.    0   IN  CNAME   ats.login.lgg1.b.yahoo.com.
ats.login.lgg1.b.yahoo.com. 0   IN  CNAME   ats.member.g02.yahoodns.net.
ats.member.g02.yahoodns.net. 0  IN  CNAME   any-ats.member.a02.yahoodns.net.
any-ats.member.a02.yahoodns.net. 49 IN  A   98.139.21.169

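So I would like to be able to say that mail.yahoo.com resolves to 98.139.21.169. To do that, I need to "merge" mail.yahoo.com into login.yahoo.com, then login.yahoo.com into ats.login.lgg1.b.yahoo.com, and so on, until I reach the final A record.

Thanks to another question, I already have a nice regular expression for parsing dig's output, so I can clean up those lines and store them in a list: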

[
    ('mail.yahoo.com', 'CNAME', 'login.yahoo.com'),
    ('login.yahoo.com', 'CNAME', 'ats.login.lgg1.b.yahoo.com'),
    ('ats.login.lgg1.b.yahoo.com', 'CNAME', 'ats.member.g02.yahoodns.net'),
    ('ats.member.g02.yahoodns.net', 'CNAME', 'any-ats.member.a02.yahoodns.net'),
    ('any-ats.member.a02.yahoodns.net', 'A', '98.139.21.169')
]
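The regular expression from that other question is not reproduced here. As a rough stand-in, the sketch below assumes that each answer line has the five whitespace-separated fields shown in the dig output above (name, TTL, class, type, value) and that only those lines are passed in. The name parse_dig_output is hypothetical.

```python
def parse_dig_output(text):
    """Turn dig answer lines into (name, record_type, value) tuples."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and dig comment lines, which start with ';'.
        if not line or line.startswith(';'):
            continue
        fields = line.split()
        # Assumed layout: name, TTL, class, type, value.
        if len(fields) < 5:
            continue
        name, _ttl, _cls, record_type = fields[:4]
        value = ' '.join(fields[4:])
        # Drop the trailing dot that dig prints on fully qualified names.
        records.append((name.rstrip('.'), record_type, value.rstrip('.')))
    return records


sample = """\
mail.yahoo.com.     0   IN  CNAME   login.yahoo.com.
login.yahoo.com.    0   IN  CNAME   ats.login.lgg1.b.yahoo.com.
any-ats.member.a02.yahoodns.net. 49 IN  A   98.139.21.169
"""
records = parse_dig_output(sample)
```

Splitting on whitespace is less strict than a regular expression, so a shell prompt or other stray line with five or more fields would need to be filtered out beforehand.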

The question is: how can I do this efficiently and in a general way, so that it still works if some unrelated lines are interleaved with the CNAMEs:

[
    ('mail.yahoo.com', 'CNAME', 'login.yahoo.com'),
    ('foo.com', 'CNAME', 'baz.com'),    # Wooops, watch out!
    ('login.yahoo.com', 'CNAME', 'ats.login.lgg1.b.yahoo.com'),
    ('ats.login.lgg1.b.yahoo.com', 'CNAME', 'ats.member.g02.yahoodns.net'),
    ('baz.com', 'A', '204.236.134.199'), # Wooops, watch out!
    ('ats.member.g02.yahoodns.net', 'CNAME', 'any-ats.member.a02.yahoodns.net'),
    ('any-ats.member.a02.yahoodns.net', 'A', '98.139.21.169')
]

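The desired output is:

  • mail.yahoo.com resolves to 98.139.21.169
  • foo.com resolves to 204.236.134.199

Of course, I could go through all the CNAMEs and, each time I find one, rescan the list to see what it actually resolves to, but that would be O(n^2)... and it would be... ugly.

I'm sure there's a better way, but I can't think of one. Thanks in advance for any ideas.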
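For contrast, the brute-force idea might look roughly like the sketch below. The function name resolve_naive and the hop limit are assumptions made for illustration. Every hop rescans the whole list, so a chain of k hops over n records costs about n·k comparisons, which approaches n^2 when most of the list is one long chain.

```python
records = [
    ('mail.yahoo.com', 'CNAME', 'login.yahoo.com'),
    ('foo.com', 'CNAME', 'baz.com'),
    ('login.yahoo.com', 'CNAME', 'ats.login.lgg1.b.yahoo.com'),
    ('ats.login.lgg1.b.yahoo.com', 'CNAME', 'ats.member.g02.yahoodns.net'),
    ('baz.com', 'A', '204.236.134.199'),
    ('ats.member.g02.yahoodns.net', 'CNAME', 'any-ats.member.a02.yahoodns.net'),
    ('any-ats.member.a02.yahoodns.net', 'A', '98.139.21.169'),
]


def resolve_naive(records, host):
    """Follow CNAMEs by rescanning the whole list at every hop."""
    hops = 0
    # A loop-free chain never needs more hops than there are records.
    while hops <= len(records):
        for name, record_type, value in records:
            if name == host:
                if record_type == 'A':
                    return value
                host = value  # CNAME: continue from its target
                break
        else:
            return None  # host does not appear in the list
        hops += 1
    return None  # hop limit exceeded, so the CNAMEs form a loop
```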

2 answers:

Answer 0 (score: 1)

I would build a dict and resolve the chain from there:

data = [
    ('mail.yahoo.com', 'CNAME', 'login.yahoo.com'),
    ('foo.com', 'CNAME', 'baz.com'),    # Wooops, watch out!
    ('login.yahoo.com', 'CNAME', 'ats.login.lgg1.b.yahoo.com'),
    ('ats.login.lgg1.b.yahoo.com', 'CNAME', 'ats.member.g02.yahoodns.net'),
    ('baz.com', 'A', '204.236.134.199'), # Wooops, watch out!
    ('ats.member.g02.yahoodns.net', 'CNAME', 'any-ats.member.a02.yahoodns.net'),
    ('any-ats.member.a02.yahoodns.net', 'A', '98.139.21.169')
]

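# Index the records by hostname: name -> (record_type, value)
data = { t[0]:t[1:] for t in data }

def lookup(host):
    # Follow CNAME records until an A record is reached.
    # Raises KeyError if a name in the chain is missing from data,
    # and loops forever if the CNAMEs form a cycle.
    record_type = None
    while record_type != 'A':
        record_type, host = data[host]
    return host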

assert lookup('mail.yahoo.com') == '98.139.21.169'
assert lookup('foo.com') == lookup('baz.com') == '204.236.134.199'
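If the input may contain dangling CNAMEs or CNAME loops, a more defensive variant of the same dict-based idea could look like the sketch below. The names build_index, lookup_safe, and resolve_roots are hypothetical, and the sketch assumes one record per hostname (with a plain dict, the last record for a name wins). It also reports only the chain heads, meaning names that no CNAME points at, which matches the desired output in the question.

```python
records = [
    ('mail.yahoo.com', 'CNAME', 'login.yahoo.com'),
    ('foo.com', 'CNAME', 'baz.com'),
    ('login.yahoo.com', 'CNAME', 'ats.login.lgg1.b.yahoo.com'),
    ('ats.login.lgg1.b.yahoo.com', 'CNAME', 'ats.member.g02.yahoodns.net'),
    ('baz.com', 'A', '204.236.134.199'),
    ('ats.member.g02.yahoodns.net', 'CNAME', 'any-ats.member.a02.yahoodns.net'),
    ('any-ats.member.a02.yahoodns.net', 'A', '98.139.21.169'),
]


def build_index(records):
    """Map each hostname to its (record_type, value) pair."""
    return {name: (record_type, value) for name, record_type, value in records}


def lookup_safe(index, host):
    """Return the IP for host, or None on a missing name or a CNAME loop."""
    seen = set()
    while host in index and host not in seen:
        seen.add(host)
        record_type, value = index[host]
        if record_type == 'A':
            return value
        host = value
    return None


def resolve_roots(records):
    """Resolve only the names that are not the target of any CNAME."""
    index = build_index(records)
    targets = {value for _, record_type, value in records
               if record_type == 'CNAME'}
    return {name: lookup_safe(index, name)
            for name in index if name not in targets}


roots = resolve_roots(records)
```

Building the index and the set of CNAME targets each take one pass over the list, and every lookup then costs one dict access per hop.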

Answer 1 (score: 0)

Here is my solution (see the comments for more on the algorithm):

import copy

def resolve(arr):
    # create an index for easy access to the hostnames
    index = {item[0]: item[2] for item in arr}
    # copy the index 
    mapping = copy.copy(index)

    # loop through the index
    for index_key in index: 
        # get the current value
        value = index[index_key]
        # keep following the chain until a value that is not a key is reached,
        # i.e. the final IP address or a hostname that an earlier pass already merged
        while value in mapping:
            # remember the new key (so it can be deleted afterwards)
            key = value
            # get the new value
            value = mapping[key]
            # save the found value as the new value (for later use)
            # this reduces the complexity (-> better performance)
            mapping[index_key] = value
            # delete the intermediate entry from the mapping
            # so that later items do not have to follow it again
            # (its final target has already been found)
            del mapping[key]

    return mapping

With this script, you can see that it produces the same output no matter how the list is ordered:

import random

data = [
    ('mail.yahoo.com', 'CNAME', 'login.yahoo.com'),
    ('foo.com', 'CNAME', 'baz.com'),    # Wooops, watch out!
    ('login.yahoo.com', 'CNAME', 'ats.login.lgg1.b.yahoo.com'),
    ('ats.login.lgg1.b.yahoo.com', 'CNAME', 'ats.member.g02.yahoodns.net'),
    ('baz.com', 'A', '204.236.134.199'), # Wooops, watch out!
    ('ats.member.g02.yahoodns.net', 'CNAME', 'any-ats.member.a02.yahoodns.net'),
    ('any-ats.member.a02.yahoodns.net', 'A', '98.139.21.169')
]

# test 50 times
for _ in range(50):
    # shuffle the data list in place
    random.shuffle(data)

    print(resolve(data))
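The caching idea in the comments above (store a found value so later items do not have to search again) can also be sketched without deleting entries. The variant below is not the answer's own code: resolve_all is a hypothetical name, and it keeps a result for every hostname rather than only the chain heads. Each name is resolved at most once, so the total work is linear in the number of records.

```python
records = [
    ('mail.yahoo.com', 'CNAME', 'login.yahoo.com'),
    ('foo.com', 'CNAME', 'baz.com'),
    ('login.yahoo.com', 'CNAME', 'ats.login.lgg1.b.yahoo.com'),
    ('ats.login.lgg1.b.yahoo.com', 'CNAME', 'ats.member.g02.yahoodns.net'),
    ('baz.com', 'A', '204.236.134.199'),
    ('ats.member.g02.yahoodns.net', 'CNAME', 'any-ats.member.a02.yahoodns.net'),
    ('any-ats.member.a02.yahoodns.net', 'A', '98.139.21.169'),
]


def resolve_all(records):
    """Map every hostname to its final IP (None if it cannot be resolved)."""
    index = {name: (record_type, value) for name, record_type, value in records}
    resolved = {}
    for start in index:
        path, on_path = [], set()
        host, answer = start, None
        # Walk until a cached name, an A record, an unknown name, or a loop.
        while host in index and host not in on_path:
            if host in resolved:
                answer = resolved[host]
                break
            path.append(host)
            on_path.add(host)
            record_type, value = index[host]
            if record_type == 'A':
                answer = value
                break
            host = value
        # Cache the answer for every name visited on this walk.
        for name in path:
            resolved[name] = answer
    return resolved


resolved = resolve_all(records)
```

Because the result is keyed by hostname, the order of the input list does not affect it.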