I have a zip file containing several CSV documents, which I have extracted into a folder called "staging". The files are encoded in Windows CP1252. I would like to read each CSV file into its own dataframe, remove all of the null values, and then overwrite the old files with UTF-8 encoded versions. Alternatively, instead of rewriting the CSVs as UTF-8, I could build the database directly from the dataframes that pandas produces. Any help would be greatly appreciated. I have browsed Stack Overflow, and the main topic seems to be concatenating multiple CSVs into a single dataframe, whereas I need a separate dataframe for each CSV. I also have to remove the N/A values, but in the CSVs they have numbers attached to them (e.g. N/A (3) or N/A(1)).
Here is the code I am working with:
import os
# Create the staging directory
staging_dir = "staging"
os.mkdir(staging_dir)
# Confirm the staging directory path
os.path.isdir(staging_dir)
# Machine independent path to create files
zip_file = os.path.join(staging_dir, "Hospital_Revised_Flatfiles.zip")
# Write the files to the computer
# (r is assumed to be the HTTP response from a download step not shown here)
with open(zip_file, "wb") as zf:
    zf.write(r.content)
# Program to unzip the files
import zipfile
z = zipfile.ZipFile(zip_file,"r")
z.extractall(staging_dir)
z.close()
#Create the dataframes
import io
import glob
import pandas as pd
files = glob.glob(os.path.join(staging_dir, "*.csv"))
# OS independent reading of files
for file in files:
    dfs = pd.read_csv(file, header=0, encoding='cp1252')
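Answer 0 (score: 0)
Add
dfs.dropna().to_csv(file, encoding='utf-8')
to your last loop. It drops every row that contains a null value and then saves the dataframe, overwriting the old version of the file.
Then remove the first parenthesis in the last line: you open two but only close one. That is where the EOF error comes from.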
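Note that dropna() only removes values pandas already treats as missing, so strings such as "N/A (3)" or "N/A(1)" from the question would survive it. A minimal sketch of one way to handle them follows, assuming the markers always look like "N/A" optionally followed by a number in parentheses. The helper name clean_na_markers and the exact pattern are illustrative assumptions, not something taken from the data.

```python
import numpy as np
import pandas as pd

# Assumed marker format: "N/A", "N/A (3)", "N/A(1)", with optional surrounding spaces.
NA_MARKER = r"^\s*N/A\s*(?:\(\d+\))?\s*$"

def clean_na_markers(df):
    """Turn N/A-style marker strings into real missing values, then drop those rows."""
    # regex=True makes replace() treat NA_MARKER as a pattern on string cells;
    # numeric columns are left untouched.
    cleaned = df.replace(NA_MARKER, np.nan, regex=True)
    return cleaned.dropna()
```

Inside the loop this could be used as clean_na_markers(dfs).to_csv(file, encoding='utf-8', index=False). Passing index=False keeps to_csv from writing the row index as an extra unnamed column each time the file is overwritten.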
Answer 1 (score: 0)
I believe P.Tillmann's solution should work. Alternatively, you can load all of the dataframes first and then write them back.
files = glob.glob(os.path.join("staging", "*.csv"))
dict_ = {}
for file in files:
    dict_[file] = pd.read_csv(file, header=0, encoding='cp1252').dropna()
for file in dict_:
    dict_[file].to_csv(file, encoding='utf-8')
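The same load-then-write idea can be sketched as two small functions, which keeps one dataframe per CSV available by name between the two steps. The function names load_frames and write_frames_utf8 are hypothetical, and keying the dictionary by file name without the extension is just one possible choice.

```python
import glob
import os

import pandas as pd

def load_frames(folder, encoding="cp1252"):
    """Read every CSV in `folder` into its own dataframe, keyed by file name without extension."""
    frames = {}
    for path in sorted(glob.glob(os.path.join(folder, "*.csv"))):
        name = os.path.splitext(os.path.basename(path))[0]
        frames[name] = pd.read_csv(path, header=0, encoding=encoding)
    return frames

def write_frames_utf8(frames, folder):
    """Write each dataframe back to `folder` as UTF-8, overwriting the original file."""
    for name, df in frames.items():
        # index=False avoids adding the row index as an extra column.
        df.to_csv(os.path.join(folder, name + ".csv"), encoding="utf-8", index=False)
```

A call such as frames = load_frames("staging") returns the separate dataframes, which can be cleaned individually before write_frames_utf8(frames, "staging") overwrites the files.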