Creating a large dataset from multiple CSVs without running out of memory

Asked: 2018-03-11 02:50:12

Tags: python-3.x pandas memory

I'm trying to create a pandas DataFrame from a combination of several CSV files. The problem is that when I try to load everything into a single DataFrame, I run into memory issues. I've looked into the "chunksize" parameter for loading, but every application of it I've found assumes the CSVs have the same number of rows, which isn't the case in my project.

The features are spread across multiple CSVs, and each feature relates to an observation through a corresponding index number. However, not every CSV has a row for every observation. I want to combine all the CSVs into a single DataFrame (or a series of DataFrames that each contain all of the columns) by matching on the index column.

Example:

  • CSV1 has 2 million rows.
  • CSV2 has 1.5 million rows, 1 million of which have index numbers that match CSV1.
  • These two CSVs should combine into a single DataFrame with 2.5 million rows: the 1 million shared, the 1 million unique to CSV1, and the 0.5 million unique to CSV2 (see the sketch below this list).
  • The process is then repeated for the remaining 9 CSVs.
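
Roughly the behaviour I'm after, as a minimal sketch with a placeholder index column named "idx" (not my real column name):

import pandas as pd

# Two toy frames standing in for CSV1 and CSV2
csv1 = pd.DataFrame({"idx": [1, 2, 3], "feat_a": [10, 20, 30]})
csv2 = pd.DataFrame({"idx": [2, 3, 4], "feat_b": [0.2, 0.3, 0.4]})

# An outer merge keeps the shared rows plus the rows unique to either file,
# which matches the 1 million shared + 1 million + 0.5 million description above
combined = pd.merge(csv1, csv2, on="idx", how="outer")
print(combined)   # 4 rows: keys 1, 2, 3, 4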

Does anyone have suggestions on how to do this? I'd like the end result to be a set of DataFrames that all have the same number of columns but different numbers of rows. Thanks.

Edit: So I've kept working on this and have a better understanding of the problem now. It may not necessarily be a size issue, but more likely an issue with my code and memory allocation.

The CSV files are nowhere near gigabytes in size; the folder containing all of them is only about 100 Mb. I think it is mostly the object data that causes the large expansion when it's converted into a pandas DataFrame. I followed a tutorial on reducing DataFrame size and still started hitting problems at the same point.
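
For reference, this is roughly how I've been checking where the memory goes per dtype (a small sketch; the path is just one of my CSVs, as in the code below):

import pandas as pd

df = pd.read_csv("Data/2016/one.csv", low_memory=False)

# Deep memory usage summed per dtype, to see how much of the footprint
# comes from the object columns versus the numeric ones
for dtype in ["object", "int64", "float64"]:
    selected = df.select_dtypes(include=[dtype])
    mb = selected.memory_usage(deep=True, index=False).sum() / 1024 ** 2
    print(dtype, round(mb, 1), "MB")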

First, here's my code...

import os
import urllib
import pandas as pd
import numpy as np

FARS_PATH = "Data/2016"

# Function to reduce integer and float data types
def number_downcast(int_columns, float_columns):
    converted_int = int_columns.apply(pd.to_numeric, downcast='unsigned')
    converted_float = float_columns.apply(pd.to_numeric, downcast='float')

    return converted_int, converted_float

# Function to reduce objects to categories
def object_to_category(object_columns):
    converted_categories = pd.DataFrame()

    for col in object_columns.columns:
        num_unique_values = len(object_columns[col].unique())
        num_total_values = len(object_columns[col])
        if num_unique_values / num_total_values < 0.5:
            converted_categories[col] = object_columns[col].astype('category')
        else:
            converted_categories[col] = object_columns[col]

    return converted_categories

# Function to reduce whole dataframe using above functions
def optimize_dataframe(dataset):
    int_columns = dataset.select_dtypes(include=['int'])
    float_columns = dataset.select_dtypes(include=['float'])
    object_columns = dataset.select_dtypes(include=['object']).copy()

    converted_ints, converted_floats = number_downcast(int_columns, float_columns)
    converted_categories = object_to_category(object_columns)

    optimized_dataset = dataset.copy()

    optimized_dataset[converted_ints.columns] = converted_ints
    optimized_dataset[converted_floats.columns] = converted_floats
    optimized_dataset[converted_categories.columns] = converted_categories

    return optimized_dataset

# Indexing column is "ST_CASE"
def load_the_data(data_path=FARS_PATH):
    FIRST_csv_path = os.path.join(data_path, "one.csv")
    SECOND_csv_path = os.path.join(data_path, "two.csv")
    THIRD_csv_path = os.path.join(data_path, "three.csv")
    FOURTH_csv_path = os.path.join(data_path, "four.csv")
    FIFTH_csv_path = os.path.join(data_path, "five.csv")
    SIXTH_csv_path = os.path.join(data_path, "six.csv")
    SEVENTH_csv_path = os.path.join(data_path, "seven.csv")


    # FIRST data has 34,439 rows and 52 columns
    # FIRST data before optimization: float64(2), int64(47), object(3), 18.5 Mb
    # FIRST data after optimization: category(3), float32(2), uint16(3), uint32(2), uint8(42), 4.8 Mb
    FIRST_data = pd.read_csv(FIRST_csv_path, low_memory=False)

    # SECOND data has 52,231 rows and 105 columns
    # SECOND Data before optimization: int64(87), object(18), 94.3 Mb 
    # SECOND Data after optimization: category(17), object(1), uint16(13), uint32(3), uint8(71), 10.7 Mb
    SECOND_data = pd.read_csv(SECOND_csv_path, low_memory=False)
    merged_data = pd.merge(FIRST_data, SECOND_data, left_on="ST_CASE", right_on="ST_CASE", how="left")
    del FIRST_data, SECOND_data   # Remove excess dataframes from memory
    merged_data = optimize_dataframe(merged_data)

    # THIRD data has 85,469 rows and 68 columns
    # THIRD data before optimization: float64(10), int64(58), 44.4 Mb
    # THIRD data after optimization: float32(10), uint16(9), uint32(1), uint8(48), 9 Mb
    THIRD_data = pd.read_csv(THIRD_csv_path, low_memory=False)
    THIRD_data = optimize_dataframe(THIRD_data)
    merged_data = pd.merge(merged_data, THIRD_data, left_on="ST_CASE", right_on="ST_CASE", how="left")
    del THIRD_data   # Remove excess dataframes from memory

    # FOURTH data has 1,367 rows and 60 columns
    # FOURTH data before optimization: int64(43), object(17), 1.9 Mb
    # FOURTH data after optimization: category(16), object(1), uint16(3), uint32(2), uint64(1), uint8(37), 262.9 Kb
    FOURTH_data = pd.read_csv(FOURTH_csv_path, low_memory=False)
    merged_data = pd.merge(merged_data, FOURTH_data, left_on="ST_CASE", right_on="ST_CASE", how="left")
    del FOURTH_data   # Remove excess dataframes from memory
    merged_data = optimize_dataframe(merged_data)

    # FIFTH data has 7,448 rows and 24 columns
    # FIFTH data before optimization: int64(23), object(1), 1.8 Mb
    # FIFTH data after optimization: category(1), uint16(5), uint32(1), uint8(17), 236.1 Kb
    FIFTH_data = pd.read_csv(FIFTH_csv_path, low_memory=False)
    merged_data = pd.merge(merged_data, FIFTH_data, left_on="ST_CASE", right_on="ST_CASE", how="left")
    del FIFTH_data   # Remove excess dataframes from memory
    merged_data = optimize_dataframe(merged_data)

    # SIXTH data has 102,861 rows and 8 columns
    # SIXTH data before optimization: int64(8), 6.3 Mb
    # SIXTH data after optimization: uint16(1), uint32(1), uint8(6), 1.2 Mb
    SIXTH_data = pd.read_csv(SIXTH_csv_path, low_memory=False)
    merged_data = pd.merge(merged_data, SIXTH_data, left_on="ST_CASE", right_on="ST_CASE", how="left")
    del SIXTH_data   # Remove excess dataframes from memory
    merged_data = optimize_dataframe(merged_data)

    # SEVENTH data has 122,022 rows and 10 columns
    # SEVENTH data before optimization: int64(10), 9.3 Mb
    # SEVENTH data after optimization: uint16(1), uint32(1), uint8(8), 1.6 Mb
    SEVENTH_data = pd.read_csv(SEVENTH_csv_path, low_memory=False)
    merged_data = pd.merge(merged_data, SEVENTH_data, left_on="ST_CASE", right_on="ST_CASE", how="left")
    del SEVENTH_data   # Remove excess dataframes from memory
    merged_data = optimize_dataframe(merged_data)

    return merged_data

Then I run the following to get information about the returned DataFrame:

data_test_set = load_the_data()
data_test_set.info(memory_usage='deep')

The memory error happens at the seventh CSV file. However, I tried running the info command after each CSV and noticed that even after only the third CSV, info returns:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 162518 entries, 0 to 162517
Columns: 223 entries, STATE_x to LOCATION
dtypes: category(20), float32(12), object(1), uint16(25), uint32(4), uint8(161)
memory usage: 61.8 MB

This leads me to believe I may not be combining the DataFrames correctly, since the first three combined already have far too many rows.

Edit #2: It's possible that indexing this dataset is a bit more complicated than just matching on that one column, because there can be 3-4 different observations per index, which is why the file is expanding so quickly.
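
A toy sketch of what I think is happening when ST_CASE is not unique in either frame (made-up values, not my real data):

import pandas as pd

# Each ST_CASE can appear several times per file, so a merge on that key alone
# produces one row for every left/right combination within the key
left = pd.DataFrame({"ST_CASE": [1, 1, 1], "a": [10, 11, 12]})
right = pd.DataFrame({"ST_CASE": [1, 1, 1, 1], "b": [20, 21, 22, 23]})

merged = pd.merge(left, right, on="ST_CASE", how="left")
print(len(merged))   # 12 rows: 3 x 4 combinations for the single key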

2 answers:

Answer 0 (score: 0)

Here is my workflow when a CSV file is too large: I chunk it into a database and then query it with only the columns I need. Also, if you need to do calculations, you can process it chunk by chunk in a for loop and store the final results back in the database.

import sqlalchemy as sa
import pandas as pd
import psycopg2

count = 0
con = sa.create_engine('postgresql://postgres:pwd@localhost:00001/r')
#con = sa.create_engine('sqlite:///XXXXX.db') SQLite
chunks = pd.read_csv('yourfile', chunksize=10000, encoding="ISO-8859-1",
                     sep=',', error_bad_lines=False, index_col=False, dtype='unicode')

Depending on your file size, you are better off tuning the chunksize. For example, I always use the total number of rows divided by 30, but it all depends on your data types.

for chunk in chunks:
    chunk.to_sql(name='Table', if_exists='append', con=con)
    count += 1
    print(count)
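
Reading back only the columns you need could then look something like this (a sketch that reuses the con engine created above; the table and column names are placeholders):

# Pull back just the columns needed for a given analysis instead of the whole table
subset = pd.read_sql('SELECT "col_a", "col_b" FROM "Table"', con=con)
print(subset.head())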

Answer 1 (score: 0)

So yes, the answer is that I needed to pay closer attention to my large dataset. The index column only allows around 50,000 distinct values in total, so I need to find some other way to combine the files. Perhaps I'll write different functions that combine only the CSVs needed for a given application.
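
A sketch of the kind of helper I have in mind, reusing the optimize_dataframe function from the question above (the file list, key, and left-join choice are just placeholders, not settled code):

import os
import pandas as pd

FARS_PATH = "Data/2016"

def combine_for_application(csv_names, data_path=FARS_PATH, key="ST_CASE"):
    # Merge only the CSVs a given application actually needs, one at a time,
    # so the intermediate frames stay small
    merged = None
    for name in csv_names:
        df = pd.read_csv(os.path.join(data_path, name), low_memory=False)
        df = optimize_dataframe(df)   # downcasting helper from the question
        merged = df if merged is None else pd.merge(merged, df, on=key, how="left")
    return merged

# e.g. combine_for_application(["one.csv", "six.csv"])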