Question

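I have two lists, both containing PDF file paths. The first list contains PDFs with unique file names. The second list contains files whose names start with those same unique names and need to be matched to the first list, although several PDFs in the second list may match a single entry in the first. This is a one-to-many relationship from ListA to ListB. Here is an example.

ListA: C:\FolderA\A.pdf, C:\FolderA\B.pdf, C:\FolderA\C.pdf

ListB: C:\FolderB\A_1.pdf, C:\FolderB\B_1.pdf, C:\FolderB\C_1.pdf, C:\FolderB\C_2.pdf

I need a way to iterate over both lists and merge the PDFs by matching on the unique file name. If I can work out how to iterate and match the files, I think I can handle combining the PDFs myself. Below is the code I have so far.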

import os

folderA = r"C:\FolderA"
ListA = []
for root, dirs, filenames in os.walk(folderA):
  for filename in filenames:
    ListA.append(str(filename))
    filepath = os.path.join(root, filename)
    ListA.append(str(filepath))

folderB = r"C:\FolderB"
ListB = []
for root, dirs, filenames in os.walk(folderB):
  for filename in filenames:
    filepath = os.path.join(root, filename)
    ListB.append(str(filepath))

#Split ListB to file name only without the "_#" so it can be matched to the PDFs in ListA.
for pdfValue in ListB:
  pdfsplit = pdfValue.split(".")[0]
  pdfsplit1 = pdfsplit.split("\\")[-1]
  pdfsplit2 = pdfsplit1.rsplit("_", 1)[0]
  for pdfValue2 in ListA:
    if pdfsplit2 in ListA:
      #combine PDF code
      pass

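I have verified everything up to the last if statement. From there on I don't know what to do. I know how to search for a substring within a string, but I can't get it to work properly with lists. However I code it, I either end up in an infinite loop or fail to get any match.

Any ideas on how to make this work, if it is possible at all?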

Answer 1

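It is better to collect all the information in a single data structure rather than in separate lists. That should also let you reduce your code to a single function.

Completely untested, but something like this should work.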

import os
from collections import defaultdict

# base name -> folder -> list of file names found in that folder
pdfs = defaultdict(lambda: defaultdict(list))

def find_pdfs(pdfs, folder, split=False):
    for root, dirs, filenames in os.walk(folder):
        for filename in filenames:
            basename, ext = os.path.splitext(filename)
            if ext.lower() == '.pdf':
                if split:
                    # drop the "_#" suffix so "C_1" and "C_2" both map to "C"
                    basename = basename.partition('_')[0]
                pdfs[basename][root].append(filename)

find_pdfs(pdfs, folderA)        # folderA and folderB as defined in the question
find_pdfs(pdfs, folderB, True)

This should produce a data structure like this:

pdfs = {
    'A':
        {'C:\\FolderA': ['A.pdf'],
         'C:\\FolderB': ['A_1.pdf']},
    'B':
        {'C:\\FolderA': ['B.pdf'],
         'C:\\FolderB': ['B_1.pdf']},
    'C':
        {'C:\\FolderA': ['C.pdf'],
         'C:\\FolderB': ['C_1.pdf', 'C_2.pdf']},
    }
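
To get from a nested dictionary of this shape to something you can feed into the merge step, one option is to flatten each entry into a pair of "original path" and "list of matching paths". The following is only a sketch: `merge_groups` is a hypothetical helper name, the sample dictionary is hand-written to mirror the structure above, and it assumes both folders are flat (no subfolders). It uses `ntpath` so the Windows-style paths join the same way on any OS; on Windows, `os.path` is that same module.

```python
import ntpath  # Windows path rules on every platform

# Hand-written sample mirroring the structure shown above (an assumption,
# not the output of a real directory walk).
pdfs = {
    'A': {r'C:\FolderA': ['A.pdf'], r'C:\FolderB': ['A_1.pdf']},
    'B': {r'C:\FolderA': ['B.pdf'], r'C:\FolderB': ['B_1.pdf']},
    'C': {r'C:\FolderA': ['C.pdf'], r'C:\FolderB': ['C_1.pdf', 'C_2.pdf']},
}

def merge_groups(pdfs, folder_a, folder_b):
    """Return a list of (original path, [matching paths]) tuples."""
    groups = []
    for basename in sorted(pdfs):
        by_folder = pdfs[basename]
        originals = by_folder.get(folder_a, [])
        if len(originals) != 1:
            # Skip base names that have no single original in folder_a.
            continue
        source = ntpath.join(folder_a, originals[0])
        extras = [ntpath.join(folder_b, name)
                  for name in sorted(by_folder.get(folder_b, []))]
        groups.append((source, extras))
    return groups

groups = merge_groups(pdfs, r'C:\FolderA', r'C:\FolderB')
for source, extras in groups:
    print(source, '<-', extras)
```

Each tuple then holds everything needed to combine one set of PDFs.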

Answer 2

I think what you want is to create a collections.defaultdict and set it up to hold lists of matching names.

import collections
matching_files = collections.defaultdict(list)

Then you can strip the file names in folder B down to their base names and put the paths into the dict:

matching_files[pdfsplit2].append(pdfValue)

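Now you have the PDF files from folder B grouped by base name. Go back to folder A and do the same thing (strip the path and extension, use the result as the key, and append the full path to the list). You will end up with lists of files that share a base name.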

for key,file_list in matching_files.items(): #use .iteritems() for py-2.x
    print("Files with base name '%s':"%key)
    print(' ', '\n  '.join(file_list))
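
Put together, the steps above might look roughly like the sketch below. The helper names `base_name` and `group_by_base_name` are made up for illustration, the sample lists are copied from the question, and folder A is processed first (the reverse of the order described above) purely so that the original file ends up first in each list. `ntpath` is used so the Windows-style paths split the same way on any OS.

```python
import collections
import ntpath  # Windows path rules on every platform

def base_name(path, strip_suffix=False):
    """Return the file name without folder or extension.

    With strip_suffix=True the trailing "_#" part is removed as well.
    """
    stem = ntpath.splitext(ntpath.basename(path))[0]
    if strip_suffix:
        stem = stem.rsplit("_", 1)[0]
    return stem

def group_by_base_name(list_a, list_b):
    matching_files = collections.defaultdict(list)
    for path in list_a:
        matching_files[base_name(path)].append(path)
    for path in list_b:
        matching_files[base_name(path, strip_suffix=True)].append(path)
    return matching_files

list_a = [r'C:\FolderA\A.pdf', r'C:\FolderA\B.pdf', r'C:\FolderA\C.pdf']
list_b = [r'C:\FolderB\A_1.pdf', r'C:\FolderB\B_1.pdf',
          r'C:\FolderB\C_1.pdf', r'C:\FolderB\C_2.pdf']

groups = group_by_base_name(list_a, list_b)
for key, file_list in groups.items():
    print(key, file_list)
```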

Answer 3

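To compare the two file names, instead of splitting on '_', you could try the str.startswith() method:

A.startswith(B) returns True if string A begins with string B.

In your case, the code would be: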

match = {}                              # dictionary that will store the matching names

for pdfValue in ListA:
    match[pdfValue] = []                # create an entry in the dictionary for this ListA path
    A = pdfValue.split("\\")[-1]        # keep just the filename part
    A = A.rsplit(".", 1)[0]             # drop the extension, otherwise "A_1.pdf" never starts with "A.pdf"

    for pdfValue2 in ListB:
        B = pdfValue2.split("\\")[-1]

        if B.startswith(A + "_"):       # B has the same unique name as A, followed by the "_#" suffix

            match[pdfValue].append(pdfValue2)  # so associate it with A in the dictionary
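
One thing to watch with prefix matching: if one unique name is itself the start of another (say A and AB), a bare prefix test will also pick up the other name's files. The sketch below contrasts that with an exact comparison on the stem. Both function names are hypothetical, and the second one assumes every ListB file ends in a single "_#" suffix as in the question.

```python
def plain_prefix_match(unique_name, filename):
    # True whenever the file name merely begins with the unique name.
    return filename.startswith(unique_name)

def exact_stem_match(unique_name, filename):
    stem = filename.rsplit(".", 1)[0]             # "AB_1.pdf" -> "AB_1"
    return stem.rsplit("_", 1)[0] == unique_name  # "AB_1" -> "AB"

print(plain_prefix_match("A", "AB_1.pdf"))  # prefix test accepts the wrong file
print(exact_stem_match("A", "AB_1.pdf"))
print(exact_stem_match("A", "A_1.pdf"))
```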

I hope this works for you.

Answer 4

Here is one more solution.

from collections import defaultdict

# raw strings keep the backslashes in the Windows paths literal
lista = [r'C:\FolderA\A.pdf', r'C:\FolderA\B.pdf', r'C:\FolderA\C.pdf']
listb = [r'C:\FolderB\A_1.pdf', r'C:\FolderB\B_1.pdf', r'C:\FolderB\C_1.pdf', r'C:\FolderB\C_2.pdf']

# get the filenames (without folder or extension) for folder a and folder b
lista_filenames = [l.split('\\')[-1].split('.')[0] for l in lista]
listb_filenames = [l.split('\\')[-1].split('.')[0] for l in listb]

# create a dictionary to store lists of mappings
data_structure = defaultdict(list)

for i in lista_filenames:
    for j in listb_filenames:
        # substring test: fine for this sample, but "A" would also match "BA_1"
        if i in j:
            data_structure['C:\\FolderA\\' + i + '.pdf'].append('C:\\FolderB\\' + j + '.pdf')

# this is what the mapping dictionary looks like
print(data_structure)

Result (Python 3):

defaultdict(<class 'list'>, {'C:\\FolderA\\A.pdf': ['C:\\FolderB\\A_1.pdf'], 'C:\\FolderA\\B.pdf': ['C:\\FolderB\\B_1.pdf'], 'C:\\FolderA\\C.pdf': ['C:\\FolderB\\C_1.pdf', 'C:\\FolderB\\C_2.pdf']})

Python: searching for substrings within a list using unique values