How do I search sub-folders and sub-sub-folders in Google Drive?

Time: 2017-01-19 12:12:03

Tags: google-drive-api

This is a commonly asked question.

The scenario is:

folderA____ folderA1____folderA1a
       \____folderA2____folderA2a
                    \___folderA2b

...and the question is how do I list all the files in all of the folders under the root folderA.

4 Answers:

Answer 0: (score: 18)

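The first thing to understand is that in Google Drive, a Folder is not a folder!

We are all used to the idea of folders (aka directories) in Windows/nix etc. In the real world, a folder is a container into which documents are placed. It is also possible to place smaller folders inside bigger folders. Thus the big folder can be thought of as containing all of the documents inside its smaller child folders.

However, in Google Drive, a Folder is NOT a container, so much so that in the first release of Google Drive they weren't even called Folders; they were called Collections. A Folder is simply a File with (a) no contents and (b) a special mime type (application/vnd.google-apps.folder). Folders are used in exactly the same way that tags (aka labels) are used. The best way to understand this is to consider GMail. If you look at the top of an open mail item, you see two icons: a folder with the tooltip "Move to" and a label with the tooltip "Labels". Click on either of these and the same dialogue box appears, and it is all about labels. Your labels are listed down the left-hand side, in a tree display that looks a lot like folders. Importantly, a mail item can have multiple labels, or you could say, a mail item can be in multiple folders. Google Drive's Folders work in exactly the same way that GMail labels work.

Having established that a Folder is simply a label, there is nothing stopping you from organising your labels in a hierarchy that resembles a folder tree; in fact, this is the most common way of doing so.

It should now be clear that a file (let's call it MyFile) in folderA2b is NOT a child or grandchild of folderA. It is simply a file with a label (confusingly called a Parent) of "folderA2b".

OK, so how do I get all the files "under" folderA?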

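Alternative 1. Recursion

The temptation is to list the children of folderA and, for any children that are folders, recursively list their children, rinse, repeat. In a very small number of cases this might be the best approach, but for most it has the following problem:

  • It is woefully time consuming to do a server round trip for each sub-folder. This does of course depend on the size of your tree, so if you can guarantee that your tree is small, it could be OK.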

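Alternative 2. The common parent

This works best if all of the files are being created by your app (i.e. you are using the drive.file scope). As well as the folder hierarchy above, create a dummy parent folder called, say, "MyAppCommonParent". As you create each file as a child of its particular Folder, you also make it a child of MyAppCommonParent. This becomes a lot more intuitive if you remember to think of Folders as labels. You can now easily retrieve all descendants by simply querying 'MyAppCommonParent' in parents.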

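Alternative 3. Folders first

Start by getting all folders. Yep, all of them. Once you have them all in memory, you can crawl through their parents properties and build your tree structure and a list of Folder IDs. You can then do a single files.list?q='folderA' in parents or 'folderA1' in parents or 'folderA1a' in parents... Using this technique you can get everything in two http calls (more if either result set is paginated).

The pseudocode for option 3 is a bit like...

// get all folders from Drive:
//   files.list?q=mimeType='application/vnd.google-apps.folder' and trashed=false&fields=files(id,name,parents)
// store in a Map, keyed by ID
// find the entry for folderA and note the ID
// find any entries where the ID is in the parents, note their IDs
// for each such entry, repeat recursively
// use all of the IDs noted above to construct a ...
//   files.list?q='folderA-ID' in parents or 'folderA1-ID' in parents or 'folderA1a-ID' in parents...

Alternative 2 is the most efficient, but only works if you have control of file creation. Alternative 3 is generally more efficient than Alternative 1, but there may be certain small tree sizes where 1 is best.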
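
The in-memory part of Alternative 3 can be sketched in Python as follows. This is only an illustration under stated assumptions: `folders` is taken to be the already-downloaded list of folder metadata, each a dictionary with an `id` and an optional `parents` list, and the function names `descendant_folder_ids` and `parents_query` are made up for this example.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, List


def descendant_folder_ids(root_id: str, folders: Iterable[Dict]) -> List[str]:
    """Return root_id followed by the IDs of every folder beneath it."""
    # Index the flat folder list by parent so each lookup is O(1).
    children = defaultdict(list)
    for folder in folders:
        for parent_id in folder.get("parents", []):
            children[parent_id].append(folder["id"])

    # Breadth-first crawl; `seen` guards against visiting a folder twice
    # if it is reachable through more than one parent.
    found, seen, queue = [], {root_id}, deque([root_id])
    while queue:
        current = queue.popleft()
        found.append(current)
        for child_id in children[current]:
            if child_id not in seen:
                seen.add(child_id)
                queue.append(child_id)
    return found


def parents_query(folder_ids: Iterable[str]) -> str:
    """Build the "'<id>' in parents or ..." clause for the second files.list call."""
    return " or ".join(f"'{fid}' in parents" for fid in folder_ids)
```

The string returned by `parents_query` is what would go into the `q` parameter of the second call, typically combined with `trashed = false`.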

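Answer 1: (score: 2)

Sharing a Python solution to the excellent Alternative 3 by @pinoyyid, above, in case it's useful to anyone. I'm not a developer so it's probably hopelessly un-pythonic... but it works, only makes 2 API calls (not counting pagination), and is pretty quick.

  1. Get a master list of all the folders in a drive.
  2. Test whether the folder-to-search is a parent (i.e. it has subfolders).
  3. Iterate through the subfolders of the folder-to-search, testing whether they too are parents.
  4. Build a Google Drive file query with one '<folder-id>' in parents segment per subfolder found.

Interestingly, Google Drive seems to have a hard limit of 599 '<folder-id>' in parents segments per query, so if your folder-to-search has more subfolders than this, you need to chunk the list.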

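# Assumption: drive_api_ref is an authorized Drive v3 service object created elsewhere,
# for example with googleapiclient.discovery.build('drive', 'v3', credentials=...).
FOLDER_TO_SEARCH = '123456789'  # ID of folder to search
DRIVE_ID = '654321'  # ID of shared drive in which it lives
MAX_PARENTS = 500  # Limit set safely below the observed Google max of 599 parents per query.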


def get_all_folders_in_drive():
    """
    Return a dictionary of all the folder IDs in a drive mapped to their parent folder IDs (or to the
    drive itself if a top-level folder). That is, flatten the entire folder structure.
    """
    folders_in_drive_dict = {}
    page_token = None
    max_allowed_page_size = 1000
    just_folders = "trashed = false and mimeType = 'application/vnd.google-apps.folder'"
    while True:
        results = drive_api_ref.files().list(
            pageSize=max_allowed_page_size,
            fields="nextPageToken, files(id, name, mimeType, parents)",
            includeItemsFromAllDrives=True, supportsAllDrives=True,
            corpora='drive',
            driveId=DRIVE_ID,
            pageToken=page_token,
            q=just_folders).execute()
        folders = results.get('files', [])
        page_token = results.get('nextPageToken', None)
        for folder in folders:
            parents = folder.get('parents')  # Skip any item returned without a parent
            if parents:
                folders_in_drive_dict[folder['id']] = parents[0]
        if page_token is None:
            break
    return folders_in_drive_dict


def get_subfolders_of_folder(folder_to_search, all_folders):
    """
    Yield subfolders of the folder-to-search, and then subsubfolders etc. Must be called by an iterator.
    :param all_folders: The dictionary returned by :meth:`get_all_folders_in_drive`.
    """
    temp_list = [k for k, v in all_folders.items() if v == folder_to_search]  # Get all subfolders
    for sub_folder in temp_list:  # For each subfolder...
        yield sub_folder  # Return it
        yield from get_subfolders_of_folder(sub_folder, all_folders)  # Get subsubfolders etc


def get_relevant_files(relevant_folders):
    """
    Get files under the folder-to-search and all its subfolders.
    """
    relevant_files = {}
    chunked_relevant_folders_list = [relevant_folders[i:i + MAX_PARENTS] for i in
                                     range(0, len(relevant_folders), MAX_PARENTS)]
    for folder_list in chunked_relevant_folders_list:
        query_term = ' in parents or '.join('"{0}"'.format(f) for f in folder_list) + ' in parents'
        relevant_files.update(get_all_files_in_folders(query_term))
    return relevant_files


def get_all_files_in_folders(parent_folders):
    """
    Return a dictionary of file IDs mapped to file names for the specified parent folders.
    """
    files_under_folder_dict = {}
    page_token = None
    max_allowed_page_size = 1000
    just_files = f"mimeType != 'application/vnd.google-apps.folder' and trashed = false and ({parent_folders})"
    while True:
        results = drive_api_ref.files().list(
            pageSize=max_allowed_page_size,
            fields="nextPageToken, files(id, name, mimeType, parents)",
            includeItemsFromAllDrives=True, supportsAllDrives=True,
            corpora='drive',
            driveId=DRIVE_ID,
            pageToken=page_token,
            q=just_files).execute()
        files = results.get('files', [])
        page_token = results.get('nextPageToken', None)
        for file in files:
            files_under_folder_dict[file['id']] = file['name']
        if page_token is None:
            break
    return files_under_folder_dict


if __name__ == "__main__":
    all_folders_dict = get_all_folders_in_drive()  # Flatten folder structure
    relevant_folders_list = [FOLDER_TO_SEARCH]  # Start with the folder-to-search
    for folder in get_subfolders_of_folder(FOLDER_TO_SEARCH, all_folders_dict):
        relevant_folders_list.append(folder)  # Recursively search for subfolders
    relevant_files_dict = get_relevant_files(relevant_folders_list)  # Get the files

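Answer 2: (score: 1)

Sharing a JavaScript solution that uses recursion to build an array of folders, starting with the first-level folder and moving down the hierarchy. The array is composed by recursively following the parent IDs of the file in question.

The extract below makes 3 separate queries to the gapi:

  1. Get the root folder ID
  2. Get a list of folders
  3. Get a list of files

The code iterates through the list of files and creates an array of folder names for each one.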

const { google } = require('googleapis')
const gOAuth =  require('./googleOAuth')

// resolve the promises for getting G files and folders
const getGFilePaths = async () => {
  // getGfiles() resolves to [files, folders, rootLevelFolders]; call it once
  const [gFiles, gFolders, gRootLevelFolders] = await getGfiles()
  // any folder directly under the root carries the root folder ID as its parent
  const gRootFolder = gRootLevelFolders[0]['parents'][0]
  // add a `path` key (array of folder names) to every file that has a parent
  return gFiles
                      .filter((file) => {return file.hasOwnProperty('parents')})
                      .map((file) => ({...file, path: makePathArray(gFolders, file['parents'][0], gRootFolder)}))
}

// recursive function to build an array of the file paths top -> bottom
let makePathArray = (folders, fileParent, rootFolder) => {
  if(fileParent === rootFolder){return []}
  else {
    let filteredFolders = folders.filter((f) => {return f.id === fileParent})
    if(filteredFolders.length >= 1 && filteredFolders[0].hasOwnProperty('parents')) {
      let path = makePathArray(folders, filteredFolders[0]['parents'][0], rootFolder)
      path.push(filteredFolders[0]['name'])
      return path
    }
    else {return []}
  }
}

// get meta-data list of files from gDrive, with query parameters
const getGfiles = () => {
  try {
    let getRootFolder = getGdriveList({corpora: 'user', includeItemsFromAllDrives: false,
    fields: 'files(name, parents)', 
    q: "'root' in parents and trashed = false and mimeType = 'application/vnd.google-apps.folder'"})
  
    let getFolders = getGdriveList({corpora: 'user', includeItemsFromAllDrives: false,
    fields: 'files(id,name,parents), nextPageToken', 
    q: "trashed = false and mimeType = 'application/vnd.google-apps.folder'"})
  
    let getFiles = getGdriveList({corpora: 'user', includeItemsFromAllDrives: false,
    fields: 'files(id,name,parents, mimeType, fullFileExtension, webContentLink, exportLinks, modifiedTime), nextPageToken', 
    q: "trashed = false and mimeType != 'application/vnd.google-apps.folder'"})
  
    return Promise.all([getFiles, getFolders, getRootFolder])
  }
  catch(error) {
    throw new Error(`Error in retrieving a file response from Google Drive: ${error}`)
  }
}

// call gDrive to get file meta-data; results arrive in pages and are collected into a single array
const getGdriveList = async (params) => {
  const gKeys = await gOAuth.get()
  const drive = google.drive({version: 'v3', auth: gKeys})
  let list = []
  let nextPgToken
  do {
    let res = await drive.files.list(params)
    list.push(...res.data.files)
    nextPgToken = res.data.nextPageToken
    params.pageToken = nextPgToken
  }
  while (nextPgToken)
  return list
}
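
For comparison, the same path-building idea can be sketched in Python as a loop instead of recursion. This is a hedged illustration, not a port of the code above: it assumes the folders have already been loaded into a dictionary keyed by folder ID (`folders_by_id`, a name made up here), which avoids scanning the whole folder list at every level.

```python
from typing import Dict, List


def folder_path(parent_id: str, folders_by_id: Dict[str, Dict], root_id: str) -> List[str]:
    """Return the folder names from the top level down to parent_id."""
    path = []
    seen = set()  # protects against malformed data that loops back on itself
    current = parent_id
    while current != root_id and current in folders_by_id and current not in seen:
        seen.add(current)
        folder = folders_by_id[current]
        path.append(folder["name"])
        parents = folder.get("parents")
        if not parents:
            break  # reached a folder with no parent before hitting the root
        current = parents[0]  # follow the first parent only, as the JavaScript does
    path.reverse()  # names were collected bottom-up
    return path
```

Calling it with a file's first parent ID gives the list of folder names leading to that file; a file sitting directly in the root gets an empty list.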

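Answer 3: (score: 0)

The following works very well but requires an additional call to the API.

Share the root folder of the search (Folder A) with any email address. Add this additional term to your query: "'sharedEmailAddress' in readers". This limits the results to everything in your folder and its subfolders.

Example: share Folder A with an email address, then search using this query.

"'sharedEmailAddress' in readers and fullText contains 'text to search'"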
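
A small Python helper for assembling that query string might look like the sketch below. The function names are invented for this example, and it only handles the single-quote escaping (`\'`) that the Drive search syntax documents; other special characters are left untouched.

```python
def escape_query_value(value: str) -> str:
    """Escape single quotes so the value can sit inside a '...' query literal."""
    return value.replace("'", "\\'")


def shared_folder_search_query(shared_email: str, text: str) -> str:
    """Build "'<email>' in readers and fullText contains '<text>'"."""
    return (f"'{escape_query_value(shared_email)}' in readers "
            f"and fullText contains '{escape_query_value(text)}'")
```

The result is passed as the `q` parameter of files.list, with the email address being the one Folder A was shared with.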