How to read only the first row of a CSV from Google Cloud Storage?

Asked: 2018-12-06 17:53:55

Tags: python google-cloud-platform google-cloud-storage

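I have already seen this question: How to read first 2 rows of csv from Google Cloud Storage

In my case, though, I don't want to load the entire CSV blob into memory, because it may be very large. Is there a way to open it as an iterable (or a file-like object) and read only the bytes of the first few rows?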

3 answers:

Answer 0 (score: 2)

The API for google.cloud.storage.blob.Blob specifies that the download_as_string method accepts start and end keyword arguments for providing a byte range:

  

https://googleapis.github.io/google-cloud-python/latest/storage/blobs.html#google.cloud.storage.blob.Blob
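
As a minimal sketch of that approach: the helper below requests only the first few kilobytes and parses the first line from them. The function name read_csv_header and the max_bytes default are my own choices for illustration, not part of the library. It assumes the header fits within max_bytes and contains no quoted newlines.

```python
import csv


def read_csv_header(blob, max_bytes=4096, encoding="utf-8"):
    """Return the first CSV row of `blob`, downloading at most `max_bytes` bytes.

    `blob` is expected to be a google.cloud.storage.blob.Blob (or any object
    with a compatible download_as_string(start=..., end=...) method).
    """
    # `end` is inclusive, so this requests bytes 0 .. max_bytes - 1.
    data = blob.download_as_string(start=0, end=max_bytes - 1)
    first_line, newline, _ = data.partition(b"\n")
    if not newline and len(data) == max_bytes:
        # The header is longer than the requested range (assumption violated).
        raise ValueError("no newline in the first %d bytes" % max_bytes)
    # Parse with the csv module so quoted commas are handled correctly.
    text = first_line.decode(encoding).rstrip("\r")
    return next(csv.reader([text]), [])
```

With a real client, the blob could be obtained with something like storage.Client().bucket('my-bucket').blob('my-file.txt') and passed straight in. Newer releases of google-cloud-storage also offer download_as_bytes, which takes the same start and end arguments.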

Answer 1 (score: 2)

If gsutil is installed on your machine:

import subprocess
uri = 'gs://my-bucket/my-file.txt'
input_file_columns = subprocess.getoutput(f'gsutil cp {uri} - | head -1')

You can also use a tool called gcsfs (pip install gcsfs):

>>> import gcsfs
>>> fs = gcsfs.GCSFileSystem(project='my-google-project')
>>> fs.ls('my-bucket')
['my-file.txt']
>>> fs.read_block('gs://my-bucket/my-file.txt', offset=1000, length=10, delimiter=b'\n')
b'A whole line of text\n'

GCSFS also has a head method. https://gcsfs.readthedocs.io/en/latest/
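
Since the question asks for a file-like object, here is a small sketch of that route with gcsfs. The helper name first_line is hypothetical. It assumes fs is a gcsfs.GCSFileSystem, whose open method returns a buffered file object that fetches data in blocks as you read, so only the first block should be transferred rather than the whole file.

```python
def first_line(fs, path, encoding="utf-8"):
    """Read a single line from `path` through a file-like object.

    `fs` is assumed to be a gcsfs.GCSFileSystem (any object whose
    open(path, "rb") returns a binary file-like object will work).
    """
    with fs.open(path, "rb") as f:
        # readline() stops at the first b"\n"; the rest is never read.
        return f.readline().decode(encoding).rstrip("\r\n")


# Example call (requires credentials and network access, so left commented out):
# import gcsfs
# fs = gcsfs.GCSFileSystem(project='my-google-project')
# print(first_line(fs, 'my-bucket/my-file.txt'))
```

If the default block is larger than you want for a short header, fs.open also accepts a block_size argument.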

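Answer 2 (score: 0)

I wanted to expand on the byte-range answer above with an example of how to build an iterable when the size of the CSV header is not known in advance. It may also be useful for reading a CSV from storage line by line: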

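import csv
from typing import Generator

from google.cloud import storage


def get_csv_header(blob):
    # Return the first parsed row; only the chunks needed for it are downloaded.
    for line in csv.reader(blob_lines(blob)):
        return line


# Number of bytes to download per request.
# Selected experimentally. If there is a more optimal value, please update.
BLOB_CHUNK_SIZE = 2000


def blob_lines(blob: storage.blob.Blob) -> Generator[str, None, None]:
    position = 0
    buff = b''  # bytes of the current line that has not been terminated yet
    while True:
        # The byte range is a closed interval, so `end` is the last byte included.
        chunk = blob.download_as_string(start=position, end=position + BLOB_CHUNK_SIZE - 1)
        position += len(chunk)
        # Split on raw bytes and decode whole lines only, so a multi-byte
        # character cut by a chunk boundary is never decoded in pieces.
        *lines, buff = (buff + chunk).split(b'\n')
        for line in lines:
            yield line.decode()
        # A short chunk means the end of the blob. blob.size is only set when
        # the blob's metadata was loaded (e.g. via bucket.get_blob); otherwise
        # a blob whose size is an exact multiple of the chunk size triggers
        # one extra request past the end.
        if len(chunk) < BLOB_CHUNK_SIZE or position == blob.size:
            if buff:
                yield buff.decode()
            return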