I am currently learning how to code, and I have run into a challenge I have been struggling with for the past few days.
I have 2000+ CSV files that I want to import into a specific PostgreSQL table all at once, rather than using the Import Data feature in pgAdmin 4, which only imports one CSV file at a time. How should I go about this? I am using the Windows operating system.
Answer 0 (score: 0)
The simple way is to use Cygwin or the built-in Ubuntu shell (WSL) to run this script:
# List the files explicitly, or use a glob such as "$dir_name"/*.csv
all_files=("file_1.csv" "file_2.csv")
dir_name=<path_to_files>
export PGUSER=<username_here>
export PGPASSWORD=<password_here>
export PGHOST=localhost
export PGPORT=5432
db_name=<dbname_here>
table_name=<table_name_here>
echo "writing to db"
for file in "${all_files[@]}"; do
    # \copy runs COPY client-side, so the files only need to be readable locally.
    # CSV HEADER assumes each file starts with a header row; drop HEADER if not.
    psql -d "$db_name" -c "\copy $table_name FROM '$dir_name/$file' CSV HEADER" >/dev/null
done
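To run this on Windows, save the script under a name such as import_csvs.sh (a hypothetical filename) and execute it from the Cygwin or WSL shell with bash import_csvs.sh.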
Answer 1 (score: 0)
If you want to do this purely in Python, one approach is given below. You may not need to chunk the list at all (you might be able to hold all the files in memory at once rather than processing them in batches). The files may also vary considerably in size, in which case you would need something more sophisticated than plain batching to avoid building an in-memory file object that exceeds your RAM. Alternatively, you could do this in 2000 separate transactions, but I suspect some form of batching will be faster (untested).
import csv
import io
import os

import psycopg2

CSV_DIR = 'the_csv_folder/'  # Relative path here, might need to be an absolute path


def chunks(l, n):
    """
    Split a list into chunks of at most n items.
    https://stackoverflow.com/questions/312443/how-do-you-split-a-list-into-evenly-sized-chunks
    """
    n = max(1, n)
    return [l[i:i + n] for i in range(0, len(l), n)]


# Get a list of all the CSV files in the directory
all_files = os.listdir(CSV_DIR)

# Chunk the list of files. Let's go with 100 files per chunk; this can be changed
chunked_file_list = chunks(all_files, 100)

# Iterate the chunks and aggregate the files in each chunk into a single
# in-memory file
for chunk in chunked_file_list:
    # This is the file to aggregate into
    string_buffer = io.StringIO()
    csv_writer = csv.writer(string_buffer)

    for file in chunk:
        with open(CSV_DIR + file) as infile:
            reader = csv.reader(infile)
            # Transfer the read rows to the aggregated file
            csv_writer.writerows(reader)

    # Rewind the buffer so copy_from reads it from the start
    string_buffer.seek(0)

    # Now that the chunk is aggregated, copy it to Postgres
    with psycopg2.connect(dbname='the_database_name',
                          user='the_user_name',
                          password='the_password',
                          host='the_host') as conn:
        c = conn.cursor()

        # Headers need to be the table field names, in the order they appear in
        # the csv
        headers = ['first_name', 'last_name', ...]
        # Now upload the data as though it were a file
        c.copy_from(string_buffer, 'the_table_name', sep=',', columns=headers)
        conn.commit()
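For comparison, the per-file alternative mentioned above (one COPY and one commit per CSV, with no chunking) would look roughly like the sketch below. It reuses the same hypothetical connection details, table name, and placeholder column names as the snippet above:

import os

import psycopg2

CSV_DIR = 'the_csv_folder/'  # same hypothetical folder as above

with psycopg2.connect(dbname='the_database_name',
                      user='the_user_name',
                      password='the_password',
                      host='the_host') as conn:
    c = conn.cursor()
    # Placeholder column names; replace with the real table field names
    headers = ['first_name', 'last_name']

    for file in os.listdir(CSV_DIR):
        with open(CSV_DIR + file) as infile:
            # One COPY and one commit per file: simpler, but likely slower
            # than batching because of the extra round trips per file
            c.copy_from(infile, 'the_table_name', sep=',', columns=headers)
            conn.commit()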