Question

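I am trying to download bounding box files (stored as gzipped tar archives) from image-net.org. When I print(resp.read()), instead of a byte stream representing the archive, I get the HTML b'<meta http-equiv="refresh" content="0;url=/downloads/bbox/bbox/[wnid].tar.gz" />\n', where [wnid] stands for a specific WordNet identifier string. This leads to the error tarfile.ReadError: file could not be opened successfully. Any thoughts on what exactly the problem is and/or how to fix it? The code is below (images is a pandas DataFrame).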

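import asyncio
import io
import tarfile
from concurrent.futures import ThreadPoolExecutor

import aiohttp
import pandas as pd

def get_boxes(images, nthreads=1000):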

    def parse_xml(xml):
        return 0

    def read_tar(data, wnid):
        buf = io.BytesIO(data)  # avoid shadowing the built-in name `bytes`
        tar = tarfile.open(fileobj=buf)
        return 0

    async def fetch_boxes(wnid, client):
        url = ('http://www.image-net.org/api/download/imagenet.bbox.'
            'synset?wnid={}').format(wnid)
        async with client.get(url) as resp:
            res = await loop.run_in_executor(executor, read_tar,
                await resp.read(), wnid)
            return res

    async def main():
        async with aiohttp.ClientSession(loop=loop) as client:
            tasks = [asyncio.ensure_future(fetch_boxes(wnid, client))
                for wnid in images['wnid'].unique()]
            return await asyncio.gather(*tasks)

    loop = asyncio.get_event_loop()
    executor = ThreadPoolExecutor(nthreads)
    shapes, boxes = zip(*loop.run_until_complete(main()))
    return pd.concat(shapes, axis=0), pd.concat(boxes, axis=0)

Edit: I now know that this is a meta refresh used as a redirect. Would this be considered a "bug" in aiohttp?

Answer 1

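This is expected behavior.

Some services redirect from a user-friendly web page to an archive file. Sometimes the redirect is implemented with an HTTP status code (301 or 302, see the example below), and sometimes with a page whose meta tag contains the redirect, as in your case.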

HTTP/1.1 302 Found
Location: http://www.iana.org/domains/example/

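aiohttp handles the first case automatically (allow_redirects=True is the default).
In the second case, however, the library simply retrieves plain HTML and cannot follow the redirect automatically.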
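
For the meta refresh case, one possible workaround is to parse the returned HTML yourself and issue a second request. The following is only a sketch under some assumptions: MetaRefreshParser, meta_refresh_target, and fetch_following_meta_refresh are hypothetical names (not part of aiohttp), the page is assumed to contain a meta refresh tag whose content looks like "0;url=...", and only a single hop is followed.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class MetaRefreshParser(HTMLParser):
    """Remembers the target of the first <meta http-equiv="refresh"> tag."""

    def __init__(self):
        super().__init__()
        self.target = None

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <meta ... />.
        if tag != "meta" or self.target is not None:
            return
        attrs = dict(attrs)
        if (attrs.get("http-equiv") or "").lower() != "refresh":
            return
        # Assumed content format: "0;url=/downloads/bbox/bbox/<wnid>.tar.gz"
        match = re.search(r"url\s*=\s*['\"]?([^'\";]+)",
                          attrs.get("content") or "", re.IGNORECASE)
        if match:
            self.target = match.group(1).strip()


def meta_refresh_target(body, base_url):
    """Return the absolute redirect URL found in `body` (bytes), or None."""
    parser = MetaRefreshParser()
    parser.feed(body.decode("utf-8", errors="replace"))
    if parser.target is None:
        return None
    # The target may be relative, so resolve it against the requested URL.
    return urljoin(base_url, parser.target)


async def fetch_following_meta_refresh(client, url):
    """Fetch `url`; if the body is a meta refresh page, follow it once.

    `client` is assumed to be an aiohttp.ClientSession.
    """
    async with client.get(url) as resp:
        body = await resp.read()
        target = meta_refresh_target(body, str(resp.url))
    if target is None:
        return body  # no meta refresh found, return the body as received
    async with client.get(target) as resp:
        return await resp.read()
```

In the question's fetch_boxes, the bytes returned by a helper like this could be passed to read_tar in place of await resp.read().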

Answer 2

I ran into the same problem when I tried to download with wget using the same URL as you: http://www.image-net.org/api/download/imagenet.bbox.synset?wnid=n01729322

However, entering this URL directly works: www.image-net.org/downloads/bbox/bbox/n01729322.tar.gz

P.S. n01729322 is the wnid.
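
If the direct path keeps working, the URL could be built from the wnid and the response checked before handing it to tarfile. This is a minimal sketch: direct_bbox_url and looks_like_gzip are hypothetical helper names, and the path template is simply copied from the URL above, so it is an assumption that may stop being valid if the site changes its layout.

```python
# Path copied from the direct URL above; an assumption, not a documented API.
BBOX_URL_TEMPLATE = "http://www.image-net.org/downloads/bbox/bbox/{}.tar.gz"


def direct_bbox_url(wnid):
    """Build the direct archive URL for a WordNet ID such as 'n01729322'."""
    return BBOX_URL_TEMPLATE.format(wnid)


def looks_like_gzip(data):
    """True if `data` starts with the gzip magic bytes 0x1f 0x8b."""
    return data[:2] == b"\x1f\x8b"
```

Checking looks_like_gzip on the downloaded bytes before calling tarfile.open makes an HTML response show up as a clear, early failure rather than a tarfile.ReadError.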

aiohttp: client.get() returns HTML markup instead of the file
