I am trying to create a pandas dataframe object from a combination of several CSV files. The problem is that when I try to load everything into one dataframe, I run into memory issues. I have looked at the chunksize parameter for loading, but every use of it I have found assumes the CSVs have the same number of rows, which is not the case in my project.
The features are spread across multiple CSVs, and each feature ties to an observation through a corresponding index #. However, not every CSV has a row for every observation. I would like to combine all the CSVs into one dataframe (or a series of dataframes, each containing all of the columns) by matching on the index column.
Example:
Does anyone have suggestions on how to do this? I would like the end result to be a set of dataframes that all have the same number of columns but different numbers of rows. Thanks.
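To illustrate the kind of combine I mean, here is a minimal sketch (the file names are hypothetical; the real index column, ST_CASE, appears in my code further down):

from functools import reduce

import pandas as pd

# Hypothetical file names; the real index column in my data is ST_CASE.
paths = ["one.csv", "two.csv", "three.csv"]
frames = [pd.read_csv(p) for p in paths]

# An outer merge keeps every observation even when a CSV has no row for it.
combined = reduce(
    lambda left, right: pd.merge(left, right, on="ST_CASE", how="outer"),
    frames,
)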
Edit: So I have kept working on this and have learned more about the problem; it may not necessarily be a size issue so much as my programming and memory allocation.
The CSV files are nowhere near gigabytes in size; the folder containing all of them is only about 100 MB. I think it is mainly the object data that expands heavily when converted into a pandas dataframe. I followed a tutorial on reducing dataframe size and started running into problems at the same point.
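As a quick sanity check of that suspicion, this is roughly how I look at memory by dtype (the file name is just a placeholder for one of my CSVs):

import pandas as pd

df = pd.read_csv("one.csv", low_memory=False)  # placeholder for one of the CSVs

# Deep memory usage per column, totalled by dtype; object columns usually dominate.
per_column = df.memory_usage(deep=True).drop("Index")
print(per_column.groupby(df.dtypes).sum())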
First, here is my code...
import os
import urllib
import pandas as pd
import numpy as np

FARS_PATH = "Data/2016"

# Function to reduce integer and float data types
def number_downcast(int_columns, float_columns):
    converted_int = int_columns.apply(pd.to_numeric, downcast='unsigned')
    converted_float = float_columns.apply(pd.to_numeric, downcast='float')
    return converted_int, converted_float

# Function to reduce objects to categories
def object_to_category(object_columns):
    converted_categories = pd.DataFrame()
    for col in object_columns.columns:
        num_unique_values = len(object_columns[col].unique())
        num_total_values = len(object_columns[col])
        if num_unique_values / num_total_values < 0.5:
            converted_categories.loc[:, col] = object_columns[col].astype('category')
        else:
            converted_categories.loc[:, col] = object_columns[col]
    return converted_categories

# Function to reduce a whole dataframe using the functions above
def optimize_dataframe(dataset):
    int_columns = dataset.select_dtypes(include=['int'])
    float_columns = dataset.select_dtypes(include=['float'])
    object_columns = dataset.select_dtypes(include=['object']).copy()
    converted_ints, converted_floats = number_downcast(int_columns, float_columns)
    converted_categories = object_to_category(object_columns)
    optimized_dataset = dataset.copy()
    optimized_dataset[converted_ints.columns] = converted_ints
    optimized_dataset[converted_floats.columns] = converted_floats
    optimized_dataset[converted_categories.columns] = converted_categories
    return optimized_dataset

# Indexing column is "ST_CASE"
def load_the_data(data_path=FARS_PATH):
    FIRST_csv_path = os.path.join(data_path, "one.csv")
    SECOND_csv_path = os.path.join(data_path, "two.csv")
    THIRD_csv_path = os.path.join(data_path, "three.csv")
    FOURTH_csv_path = os.path.join(data_path, "four.csv")
    FIFTH_csv_path = os.path.join(data_path, "five.csv")
    SIXTH_csv_path = os.path.join(data_path, "six.csv")
    SEVENTH_csv_path = os.path.join(data_path, "seven.csv")

    # FIRST data has 34,439 rows and 52 columns
    # FIRST data before optimization: float64(2), int64(47), object(3), 18.5 Mb
    # FIRST data after optimization: category(3), float32(2), uint16(3), uint32(2), uint8(42), 4.8 Mb
    FIRST_data = pd.read_csv(FIRST_csv_path, low_memory=False)

    # SECOND data has 52,231 rows and 105 columns
    # SECOND data before optimization: int64(87), object(18), 94.3 Mb
    # SECOND data after optimization: category(17), object(1), uint16(13), uint32(3), uint8(71), 10.7 Mb
    SECOND_data = pd.read_csv(SECOND_csv_path, low_memory=False)

    merged_data = pd.merge(FIRST_data, SECOND_data, left_on="ST_CASE", right_on="ST_CASE", how="left")
    del FIRST_data, SECOND_data  # Remove excess dataframes from memory
    merged_data = optimize_dataframe(merged_data)

    # THIRD data has 85,469 rows and 68 columns
    # THIRD data before optimization: float64(10), int64(58), 44.4 Mb
    # THIRD data after optimization: float32(10), uint16(9), uint32(1), uint8(48), 9 Mb
    THIRD_data = pd.read_csv(THIRD_csv_path, low_memory=False)
    THIRD_data = optimize_dataframe(THIRD_data)
    merged_data = pd.merge(merged_data, THIRD_data, left_on="ST_CASE", right_on="ST_CASE", how="left")
    del THIRD_data  # Remove excess dataframes from memory

    # FOURTH data has 1,367 rows and 60 columns
    # FOURTH data before optimization: int64(43), object(17), 1.9 Mb
    # FOURTH data after optimization: category(16), object(1), uint16(3), uint32(2), uint64(1), uint8(37), 262.9 Kb
    FOURTH_data = pd.read_csv(FOURTH_csv_path, low_memory=False)
    merged_data = pd.merge(merged_data, FOURTH_data, left_on="ST_CASE", right_on="ST_CASE", how="left")
    del FOURTH_data  # Remove excess dataframes from memory
    merged_data = optimize_dataframe(merged_data)

    # FIFTH data has 7,448 rows and 24 columns
    # FIFTH data before optimization: int64(23), object(1), 1.8 Mb
    # FIFTH data after optimization: category(1), uint16(5), uint32(1), uint8(17), 236.1 Kb
    FIFTH_data = pd.read_csv(FIFTH_csv_path, low_memory=False)
    merged_data = pd.merge(merged_data, FIFTH_data, left_on="ST_CASE", right_on="ST_CASE", how="left")
    del FIFTH_data  # Remove excess dataframes from memory
    merged_data = optimize_dataframe(merged_data)

    # SIXTH data has 102,861 rows and 8 columns
    # SIXTH data before optimization: int64(8), 6.3 Mb
    # SIXTH data after optimization: uint16(1), uint32(1), uint8(6), 1.2 Mb
    SIXTH_data = pd.read_csv(SIXTH_csv_path, low_memory=False)
    merged_data = pd.merge(merged_data, SIXTH_data, left_on="ST_CASE", right_on="ST_CASE", how="left")
    del SIXTH_data  # Remove excess dataframes from memory
    merged_data = optimize_dataframe(merged_data)

    # SEVENTH data has 122,022 rows and 10 columns
    # SEVENTH data before optimization: int64(10), 9.3 Mb
    # SEVENTH data after optimization: uint16(1), uint32(1), uint8(8), 1.6 Mb
    SEVENTH_data = pd.read_csv(SEVENTH_csv_path, low_memory=False)
    merged_data = pd.merge(merged_data, SEVENTH_data, left_on="ST_CASE", right_on="ST_CASE", how="left")
    del SEVENTH_data  # Remove excess dataframes from memory
    merged_data = optimize_dataframe(merged_data)

    return merged_data
Then I run this to get info about the returned dataframe:
data_test_set = load_the_data()
data_test_set.info(memory_usage='deep')
The memory error occurs at the seventh CSV file. But I tried running the info command after each CSV and noticed that even after just the third CSV, info returns:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 162518 entries, 0 to 162517
Columns: 223 entries, STATE_x to LOCATION
dtypes: category(20), float32(12), object(1), uint16(25), uint32(4), uint8(161)
memory usage: 61.8 MB
This leads me to believe that I may be combining the dataframes incorrectly, because the first three combined already have far too many rows.
Edit #2: It is possible that the indexing of this dataset is a bit more complicated than just matching on that one column, since each index may have 3-4 different observations, which would be why the file is expanding so quickly.
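A toy example with made-up numbers of what I think is happening: when the key is repeated on both sides of a merge, the result has one row per pair, so the row count multiplies rather than just lining up:

import pandas as pd

# Made-up data: a single ST_CASE value with 3 rows on the left and 4 on the right.
left = pd.DataFrame({"ST_CASE": [1, 1, 1], "a": [10, 11, 12]})
right = pd.DataFrame({"ST_CASE": [1, 1, 1, 1], "b": [20, 21, 22, 23]})

# Every left row pairs with every matching right row: 3 * 4 = 12 rows, not 4.
print(len(pd.merge(left, right, on="ST_CASE", how="left")))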
Answer 0 (score: 0)
Here is my workflow when a csv file is too large: I load it into a database and then query it for just the columns I need. If you also need to do calculations, you can process it chunk by chunk in a for loop and store the final result back in the database.
import sqlalchemy as sa
import pandas as pd
import psycopg2
count = 0
con = sa.create_engine('postgresql://postgres:pwd@localhost:00001/r')
#con = sa.create_engine('sqlite:///XXXXX.db') SQLite
chunks = pd.read_csv('yourfile', chunksize=10000, encoding="ISO-8859-1",
                     sep=',', error_bad_lines=False, index_col=False, dtype='unicode')
Depending on your file size, you should tune the chunksize; for example, I always use the total number of rows divided by 30, but it all depends on your data types.
for chunk in chunks:
    chunk.to_sql(name='Table', if_exists='append', con=con)
    count += 1
    print(count)
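Once everything is in the database, you pull back only the columns you need with a light query (the table and column names below are just placeholders for your own):

import pandas as pd
import sqlalchemy as sa

con = sa.create_engine('postgresql://postgres:pwd@localhost:00001/r')

# Read back only the columns required for the analysis; names are placeholders.
needed = pd.read_sql('SELECT "ST_CASE", "STATE" FROM "Table"', con=con)
print(needed.shape)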
Answer 1 (score: 0)
So yes, the answer was that I needed to pay more attention to my huge dataset. The indexing column only allows 50,000 distinct values in total, so I need to find some other way of combining the files, perhaps by writing various functions that combine just the CSVs needed for a given application.
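As a rough sketch of the direction I mean (file names and non-key columns are placeholders; ST_CASE is the real index column), collapsing a many-rows-per-case file down to one row per ST_CASE before merging keeps the join from multiplying rows:

import pandas as pd

# Placeholder file names and columns; only ST_CASE comes from my real data.
crashes = pd.read_csv("one.csv", usecols=["ST_CASE", "STATE"], low_memory=False)
persons = pd.read_csv("three.csv", low_memory=False)

# One row per ST_CASE: here just a count, but any per-case aggregation works.
per_case = persons.groupby("ST_CASE").size().reset_index(name="n_rows")

combined = crashes.merge(per_case, on="ST_CASE", how="left")
print(combined.shape)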