I have a question I could use some help with. I have a Python list like this:
fail = [
    ['da39a3ee5e6b4b0d3255bfef95601890afd80709', 'ron\\b\\include', 'Test.java'],
    ['b5cc17d3a35877ca8b76f0b2e07497039c250696', 'ron\\c', 'apa1.txt'],
    ['95d1543adea47e88923c3d4ad56e9f65c2b40c76', 'ron\\c', 'knark.txt'],
    ['da39a3ee5e6b4b0d3255bfef95601890afd80709', 'ron\\d', 'Sourcecheck.py'],
    ['da39a3ee5e6b4b0d3255bfef95601890afd80709', 'ron\\a\\include', 'svin.txt'],
    ['b5cc17d3a35877ca8b76f0b2e07497039c250696', 'ron\\a', 'apa2.txt'],
    ['95d1543adea47e88923c3d4ad56e9f65c2b40c76', 'ron\\c', 'apa.txt'],
]
# each entry: sha1 value, directory, filename
What I want is to split these entries into two different lists based on the sha1 value and the directory. For example:
['95d1543adea47e88923c3d4ad56e9f65c2b40c76', 'ron\\c', 'apa.txt']
['95d1543adea47e88923c3d4ad56e9f65c2b40c76', 'ron\\c', 'knark.txt']
I want to add these to a list duplicate = [], because they sit in the same directory with the same sha1 value (and only that directory). The remaining entries I want added to another list, say diff = [], because their sha1 values are the same but the directories differ.
I'm a bit lost on the logic here, so any help I can get would be much appreciated!
Edit: Fixed a typo. In some cases the last value (the filename) was a one-element list, which was 100% incorrect; thanks to SilentGhost for pointing this out.
Answer 0 (score: 3):
duplicate = []
# Sort the list so we can compare adjacent values
fail.sort()
# If you didn't want to modify the list in place, you could instead use:
#   sortedFail = sorted(fail)
# and then use sortedFail in the rest of the code instead of fail
for i, x in enumerate(fail):
    if i + 1 == len(fail):
        # end of the list
        break
    if x[:2] == fail[i + 1][:2]:
        if x not in duplicate:
            duplicate.append(x)
        if fail[i + 1] not in duplicate:
            duplicate.append(fail[i + 1])
# diff is just anything not in duplicate, as far as I can tell from the explanation
diff = [d for d in fail if d not in duplicate]
Using your example input, this gives:
duplicate: [
['95d1543adea47e88923c3d4ad56e9f65c2b40c76', 'ron\\c', 'apa.txt'],
['95d1543adea47e88923c3d4ad56e9f65c2b40c76', 'ron\\c', 'knark.txt']
]
diff: [
['b5cc17d3a35877ca8b76f0b2e07497039c250696', 'ron\\a', 'apa2.txt'],
['b5cc17d3a35877ca8b76f0b2e07497039c250696', 'ron\\c', 'apa1.txt'],
['da39a3ee5e6b4b0d3255bfef95601890afd80709', 'ron\\a\\include', 'svin.txt'],
['da39a3ee5e6b4b0d3255bfef95601890afd80709', 'ron\\b\\include', 'Test.java'],
['da39a3ee5e6b4b0d3255bfef95601890afd80709', 'ron\\d', 'Sourcecheck.py']
]
So maybe I'm missing something, but I think this is what you were asking for.
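The sort-then-compare-adjacent idea above can also be expressed with the standard library's itertools.groupby, which collects consecutive entries that share a key. A minimal sketch against the example data (the variable and key names here are my own, not from the answer):

```python
from itertools import groupby

fail = [
    ['da39a3ee5e6b4b0d3255bfef95601890afd80709', 'ron\\b\\include', 'Test.java'],
    ['b5cc17d3a35877ca8b76f0b2e07497039c250696', 'ron\\c', 'apa1.txt'],
    ['95d1543adea47e88923c3d4ad56e9f65c2b40c76', 'ron\\c', 'knark.txt'],
    ['da39a3ee5e6b4b0d3255bfef95601890afd80709', 'ron\\d', 'Sourcecheck.py'],
    ['da39a3ee5e6b4b0d3255bfef95601890afd80709', 'ron\\a\\include', 'svin.txt'],
    ['b5cc17d3a35877ca8b76f0b2e07497039c250696', 'ron\\a', 'apa2.txt'],
    ['95d1543adea47e88923c3d4ad56e9f65c2b40c76', 'ron\\c', 'apa.txt'],
]

duplicate, diff = [], []
# groupby only groups consecutive items, so the input must be sorted
# by the same (sha1, directory) key it is grouped on
for key, group in groupby(sorted(fail), key=lambda e: (e[0], e[1])):
    entries = list(group)
    # more than one entry with the same (sha1, directory) pair -> duplicates
    if len(entries) > 1:
        duplicate.extend(entries)
    else:
        diff.extend(entries)
```

This avoids the manual index arithmetic and the `not in duplicate` membership checks.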
Answer 1 (score: 1):
You could simply loop through all the values with an outer loop, then use an inner loop to compare directories; if the directories match, compare the sha1 values and assign the entries to the appropriate lists. That gives you a decent n^2 algorithm to solve it.
Maybe something like this untested code:
for i in range(len(fail) - 1):
    directory = fail[i][1]
    sha1 = fail[i][0]
    for j in range(i + 1, len(fail)):
        if directory == fail[j][1]:  # is this how you compare strings?
            if sha1 == fail[j][0]:
                # remove from fail, add to duplicate, and add the others to diff
                pass
Again, the code is untested.
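A tested version of this n^2 idea can avoid mutating fail mid-loop by asking, for each entry, whether any *other* entry shares its sha1 and directory. This is my own completion of the sketch, not the answerer's code:

```python
fail = [
    ['da39a3ee5e6b4b0d3255bfef95601890afd80709', 'ron\\b\\include', 'Test.java'],
    ['b5cc17d3a35877ca8b76f0b2e07497039c250696', 'ron\\c', 'apa1.txt'],
    ['95d1543adea47e88923c3d4ad56e9f65c2b40c76', 'ron\\c', 'knark.txt'],
    ['da39a3ee5e6b4b0d3255bfef95601890afd80709', 'ron\\d', 'Sourcecheck.py'],
    ['da39a3ee5e6b4b0d3255bfef95601890afd80709', 'ron\\a\\include', 'svin.txt'],
    ['b5cc17d3a35877ca8b76f0b2e07497039c250696', 'ron\\a', 'apa2.txt'],
    ['95d1543adea47e88923c3d4ad56e9f65c2b40c76', 'ron\\c', 'apa.txt'],
]

duplicate, diff = [], []
for i, entry in enumerate(fail):
    sha1, directory = entry[0], entry[1]
    # an entry is a duplicate if any other entry shares its sha1 and directory
    is_dup = any(other[0] == sha1 and other[1] == directory
                 for j, other in enumerate(fail) if j != i)
    (duplicate if is_dup else diff).append(entry)
```

Quadratic, but fine for small lists, and it never modifies the input.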
Answer 2 (score: 1):
In the code sample below, I use a key based on the SHA-1 value and the directory name to detect unique and duplicate entries, with auxiliary dictionaries for bookkeeping.
# Test dataset
fail = [
['da39a3ee5e6b4b0d3255bfef95601890afd80709', 'ron\\b\\include', 'Test.java'],
['b5cc17d3a35877ca8b76f0b2e07497039c250696', 'ron\\c', 'apa1.txt'],
['95d1543adea47e88923c3d4ad56e9f65c2b40c76', 'ron\\c', 'knark.txt'],
['da39a3ee5e6b4b0d3255bfef95601890afd80709', 'ron\\d', 'Sourcecheck.py'],
['da39a3ee5e6b4b0d3255bfef95601890afd80709', 'ron\\a\\include', 'svin.txt'],
['b5cc17d3a35877ca8b76f0b2e07497039c250696', 'ron\\a', 'apa2.txt'],
['95d1543adea47e88923c3d4ad56e9f65c2b40c76', 'ron\\c', 'apa.txt'],
]
def sort_duplicates(filelist):
    """Return a tuple whose first element is a list of unique files,
    and whose second element is a list of duplicate files.
    """
    diff = []
    diff_d = {}
    duplicate = []
    duplicate_d = {}
    for entry in filelist:
        # Make an immutable key based on the SHA-1 and directory strings
        key = (entry[0], entry[1])
        # If this entry is a known duplicate, add it to the duplicate list
        if key in duplicate_d:
            duplicate.append(entry)
        # If this entry is a new duplicate, add it to the duplicate list
        elif key in diff_d:
            duplicate.append(entry)
            duplicate_d[key] = entry
            # And relocate the matching entry to the duplicate list
            matching_entry = diff_d[key]
            duplicate.append(matching_entry)
            duplicate_d[key] = matching_entry
            del diff_d[key]
            diff.remove(matching_entry)
        # Otherwise add this entry to the different list
        else:
            diff.append(entry)
            diff_d[key] = entry
    return (diff, duplicate)

def test():
    diff, dups = sort_duplicates(fail)
    print("Diff:", diff)
    print("Dups:", dups)

test()
Answer 3 (score: 0):
Here's another approach that uses dictionaries to group by sha and directory. This also gets rid of the stray lists in the filenames.
new_fail = {}  # {sha: {dir: [filenames]}}
for item in fail:
    # split the data into its parts
    sha, directory, filename = item
    # make sure the correct elements exist in the data structure
    if sha not in new_fail:
        new_fail[sha] = {}
    if directory not in new_fail[sha]:
        new_fail[sha][directory] = []
    # this is where the stray lists are removed from the file names
    if isinstance(filename, list):
        filename = filename[0]
    new_fail[sha][directory].append(filename)

diff = []
dup = []
# loop through the data, analyzing it
for sha, val in new_fail.items():
    for directory, filenames in val.items():
        # check to see if the sha/dir combo has more than one file name
        if len(filenames) > 1:
            for filename in filenames:
                dup.append([sha, directory, filename])
        else:
            diff.append([sha, directory, filenames[0]])
To print it:
print('diff:')
for i in diff:
    print(i)
print('\ndup:')
for i in dup:
    print(i)
The sample data comes out like this:
diff:
['da39a3ee5e6b4b0d3255bfef95601890afd80709', 'ron\\d', 'Sourcecheck.py']
['da39a3ee5e6b4b0d3255bfef95601890afd80709', 'ron\\b\\include', 'Test.java']
['da39a3ee5e6b4b0d3255bfef95601890afd80709', 'ron\\a\\include', 'svin.txt']
['b5cc17d3a35877ca8b76f0b2e07497039c250696', 'ron\\a', 'apa2.txt']
['b5cc17d3a35877ca8b76f0b2e07497039c250696', 'ron\\c', 'apa1.txt']

dup:
['95d1543adea47e88923c3d4ad56e9f65c2b40c76', 'ron\\c', 'knark.txt']
['95d1543adea47e88923c3d4ad56e9f65c2b40c76', 'ron\\c', 'apa.txt']
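The grouping step in this answer can be condensed with collections.defaultdict, which creates the missing inner containers on first access instead of checking for them explicitly. A sketch of the same technique (one deliberately nested filename is included to show the unwrapping; the names here are mine):

```python
from collections import defaultdict

fail = [
    ['da39a3ee5e6b4b0d3255bfef95601890afd80709', 'ron\\b\\include', 'Test.java'],
    ['b5cc17d3a35877ca8b76f0b2e07497039c250696', 'ron\\c', 'apa1.txt'],
    ['95d1543adea47e88923c3d4ad56e9f65c2b40c76', 'ron\\c', 'knark.txt'],
    ['da39a3ee5e6b4b0d3255bfef95601890afd80709', 'ron\\d', 'Sourcecheck.py'],
    ['da39a3ee5e6b4b0d3255bfef95601890afd80709', 'ron\\a\\include', 'svin.txt'],
    ['b5cc17d3a35877ca8b76f0b2e07497039c250696', 'ron\\a', 'apa2.txt'],
    ['95d1543adea47e88923c3d4ad56e9f65c2b40c76', 'ron\\c', ['apa.txt']],
]

groups = defaultdict(list)  # (sha1, directory) -> [filenames]
for sha, directory, filename in fail:
    # unwrap the stray one-element lists mentioned in the question's edit
    if isinstance(filename, list):
        filename = filename[0]
    groups[(sha, directory)].append(filename)

diff, dup = [], []
for (sha, directory), filenames in groups.items():
    target = dup if len(filenames) > 1 else diff
    for filename in filenames:
        target.append([sha, directory, filename])
```

Same result as the explicit `if sha not in new_fail` checks, with less bookkeeping.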
Answer 4 (score: 0):
I believe the accepted answer will be a bit more efficient (Python's internal sort should be faster than my dictionary walk), but since I already came up with this, I might as well post it. :-)
This technique uses a multi-level dictionary to avoid both sorting and explicit comparisons.
hashes = {}
diff = []
dupe = []
# build the dictionary
for sha, path, files in fail:
    try:
        hashes[sha][path].append(files)
    except KeyError:
        try:
            hashes[sha][path] = [files]
        except KeyError:
            hashes[sha] = {path: [files]}

for sha, paths in hashes.items():
    if len(paths) > 1:
        for path, files in paths.items():
            for filename in files:
                diff.append([sha, path, filename])
    for path, files in paths.items():
        if len(files) > 1:
            for filename in files:
                dupe.append([sha, path, filename])
The result will be:
diff = [
['da39a3ee5e6b4b0d3255bfef95601890afd80709', 'ron\\d', 'Sourcecheck.py'],
['da39a3ee5e6b4b0d3255bfef95601890afd80709', 'ron\\b\\include', 'Test.java'],
['da39a3ee5e6b4b0d3255bfef95601890afd80709', 'ron\\a\\include', 'svin.txt'],
['b5cc17d3a35877ca8b76f0b2e07497039c250696', 'ron\\a', 'apa2.txt'],
['b5cc17d3a35877ca8b76f0b2e07497039c250696', 'ron\\c', 'apa1.txt']
]
dupe = [
['95d1543adea47e88923c3d4ad56e9f65c2b40c76', 'ron\\c', 'knark.txt'],
['95d1543adea47e88923c3d4ad56e9f65c2b40c76', 'ron\\c', 'apa.txt']
]
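The nested try/except dictionary build in this answer can also be written with dict.setdefault, which returns the existing inner container or installs a new one in a single call. A sketch of that variant (my rewrite, not the answerer's code; the second phase is unchanged):

```python
fail = [
    ['da39a3ee5e6b4b0d3255bfef95601890afd80709', 'ron\\b\\include', 'Test.java'],
    ['b5cc17d3a35877ca8b76f0b2e07497039c250696', 'ron\\c', 'apa1.txt'],
    ['95d1543adea47e88923c3d4ad56e9f65c2b40c76', 'ron\\c', 'knark.txt'],
    ['da39a3ee5e6b4b0d3255bfef95601890afd80709', 'ron\\d', 'Sourcecheck.py'],
    ['da39a3ee5e6b4b0d3255bfef95601890afd80709', 'ron\\a\\include', 'svin.txt'],
    ['b5cc17d3a35877ca8b76f0b2e07497039c250696', 'ron\\a', 'apa2.txt'],
    ['95d1543adea47e88923c3d4ad56e9f65c2b40c76', 'ron\\c', 'apa.txt'],
]

hashes = {}
# build the same {sha: {path: [filenames]}} structure without try/except
for sha, path, filename in fail:
    hashes.setdefault(sha, {}).setdefault(path, []).append(filename)

diff, dupe = [], []
for sha, paths in hashes.items():
    # one sha spread over several directories -> same content, different place
    if len(paths) > 1:
        for path, files in paths.items():
            for f in files:
                diff.append([sha, path, f])
    # several files under one (sha, path) -> duplicates in the same directory
    for path, files in paths.items():
        if len(files) > 1:
            for f in files:
                dupe.append([sha, path, f])
```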