Question

I have a zip file containing several CSV documents, which I have extracted into a folder called "staging". The files are encoded in Windows CP1252. I would like to read each CSV in as its own dataframe, remove all of the null values, and then overwrite the old file with UTF-8 encoding. Alternatively, instead of rewriting the CSVs as UTF-8, I could encode the database strictly from the pandas dataframes that are produced. Any help would be greatly appreciated. I have browsed Stack Overflow, and the main topic seems to be concatenating multiple CSVs into a single dataframe, whereas I need a separate dataframe for each CSV. I also have to remove N/A values, but in the CSVs they have seemingly random numbers attached to them (e.g. N/A (3) or N/A(1)).

Here is the code I am working with:

import os

# Create the staging directory
staging_dir = "staging"
os.mkdir(staging_dir)

# Confirm the staging directory path
os.path.isdir(staging_dir)

# Machine independent path to create files
zip_file = os.path.join(staging_dir, "Hospital_Revised_Flatfiles.zip")

# Write the files to the computer
zf = open(zip_file,"wb")
zf.write(r.content)
zf.close()

# Program to unzip the files
import zipfile

z = zipfile.ZipFile(zip_file,"r")
z.extractall(staging_dir)
z.close()

#Create the dataframes

import io
import glob
import pandas as pd

files = glob.glob(os.path.join(staging_dir, "*.csv"))

# OS independent reading of files
for file in files:
    dfs = pd.read_csv(file, header = 0, encoding = 'cp1252')

Answer 1

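Add

dfs.dropna().to_csv(file, encoding='utf-8')

to your final loop. It drops every row that contains a null value and then saves the dataframe, overwriting the old file.

Then remove the first opening parenthesis in your last line: you open two but close only one. That is where the EOF error comes from.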
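
One caveat: dropna() only removes values that pandas already recognises as missing, so placeholders such as N/A (3) or N/A(1) would survive as ordinary strings. A minimal sketch of one way to handle them follows. The regular expression is an assumption about how the placeholders look, and clean_csv is a hypothetical helper name, not part of pandas. Passing index=False keeps the row index from being written back as an extra column.

```python
import numpy as np
import pandas as pd

# Assumed placeholder shapes: "N/A", "N/A (3)", "N/A(1)".
# Adjust the pattern if the real files use other variants.
NA_PATTERN = r"^\s*N/A\s*(\(\d+\))?\s*$"


def clean_csv(path):
    """Read a cp1252 CSV, drop rows with missing values, rewrite it as UTF-8."""
    df = pd.read_csv(path, header=0, encoding="cp1252")
    # Turn the numbered N/A placeholders into real missing values.
    df = df.replace(NA_PATTERN, np.nan, regex=True)
    # Drop every row that still contains a missing value.
    df = df.dropna()
    # Overwrite the original file, this time encoded as UTF-8.
    df.to_csv(path, encoding="utf-8", index=False)
    return df
```

Calling clean_csv(file) inside the loop over files would replace the read_csv line and the to_csv line in one step.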

Answer 2

I believe P.Tillmann's solution should work. Alternatively, you can load all of the dataframes first and then write them back.

files = glob.glob(os.path.join("staging", "*.csv"))

dict_ = {}
for file in files:
    dict_[file] = pd.read_csv(file, header=0, encoding='cp1252').dropna()

for file in dict_:
    dict_[file].to_csv(file, encoding='utf-8')
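
Since the goal is a separate dataframe for each CSV, it may be more convenient to key the dictionary by file name rather than by full path. The following is a rough sketch of that variation, not a drop-in replacement: load_frames and write_frames are hypothetical helper names, and the "staging" default is taken from the question.

```python
from pathlib import Path

import pandas as pd


def load_frames(staging_dir="staging"):
    """Return one dataframe per CSV, keyed by file name without the extension."""
    frames = {}
    for path in sorted(Path(staging_dir).glob("*.csv")):
        # Each file becomes its own dataframe; rows with missing values are dropped.
        frames[path.stem] = pd.read_csv(path, header=0, encoding="cp1252").dropna()
    return frames


def write_frames(frames, out_dir="staging"):
    """Write every dataframe back as UTF-8, one file per key."""
    for name, df in frames.items():
        df.to_csv(Path(out_dir) / f"{name}.csv", encoding="utf-8", index=False)
```

After frames = load_frames(), each table can be looked up by name, for example frames["some_file"] for staging/some_file.csv, and write_frames(frames) overwrites the originals in UTF-8.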

Python: Multiple dataframes from multiple CSV, encoding cp1252 to utf8