Question

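I have many dbf files (~4,550) spread across multiple folders and sub-directories (~400), separated by state. The data is given to me as dbf files every week, split by state.

For example: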

"Datafiles\DAT_01_APRIL_2019\DAT_01_APRIL_2019\FL\DATA5393.DBF"
"Datafiles\DAT_01_APRIL_2019\DAT_01_APRIL_2019\FL\DATA5414.DBF"
"Datafiles\DAT_01_APRIL_2019\DAT_01_APRIL_2019\NJ\DATA890.DBF"
"Datafiles\DAT_01_APRIL_2019\DAT_01_APRIL_2019\NJ\DATA1071.DBF"

"Datafiles\DAT_01_JUly_2019\DAT_01_JUlY_2019\FL\DATA5393.DBF"
"Datafiles\DAT_01_JUly_2019\DAT_01_JUlY_2019\FL\DATA5414.DBF"
"Datafiles\DAT_01_JUly_2019\DAT_01_JUlY_2019\NJ\DATA890.DBF"
"Datafiles\DAT_01_JUly_2019\DAT_01_JUlY_2019\NJ\DATA1071.DBF"

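For each state, how do I convert and merge all of its dbf files into a single csv, keeping the states separate (for regional data analysis)?

I am currently using Python 3 and Jupyter notebooks on Windows 10.

This problem seems solvable with Python. I have tried experimenting with dbf2csv and other dbf and csv functions.

The code below shows some good starting points. The research was done through many posts and my own experiments. I am still getting started with using Python to work with files, and I am not sure how to code around this tedious task.

I typically use the functions below to convert to csv, followed by a single line at the command prompt to combine all the csv files into one.

The function below converts one specific dbf to csv: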

import csv
from dbfread import DBF

def dbf_to_csv(dbf_table_pth):#Input a dbf, output a csv, same name, same path, except extension
    csv_fn = dbf_table_pth[:-4]+ ".csv" #Set the csv file name
    table = DBF(dbf_table_pth)# table variable is a DBF object
    with open(csv_fn, 'w', newline = '') as f:# create a csv file, fill it with dbf content
        writer = csv.writer(f)
        writer.writerow(table.field_names)# write the column name
        for record in table:# write the rows
            writer.writerow(list(record.values()))
    return csv_fn# return the csv name

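The script below converts all dbf files in a given folder to csv format. It works well, but it does not take subfolders and sub-directories into account.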

import fnmatch
import os
import csv
import time
import datetime
import sys
from dbfread import DBF, FieldParser, InvalidValue          
# pip install dbfread if needed

class MyFieldParser(FieldParser):
    def parse(self, field, data):
        try:
            return FieldParser.parse(self, field, data)
        except ValueError:
            return InvalidValue(data)


debugmode=0         # Set to 1 to catch all the errors.            

for infile in os.listdir('.'):
    if fnmatch.fnmatch(infile, '*.dbf'):
        outfile = infile[:-4] + ".csv"
        print("Converting " + infile + " to " + outfile + ". Each period represents 2,000 records.")
        counter = 0
        starttime=time.perf_counter()   # time.clock() was removed in Python 3.8
        with open(outfile, 'w', newline='') as csvfile:   # newline='' avoids blank rows on Windows
            table = DBF(infile, parserclass=MyFieldParser, ignore_missing_memofile=True)
            writer = csv.writer(csvfile)
            writer.writerow(table.field_names)
            for i, record in enumerate(table):
                for name, value in record.items():
                    if isinstance(value, InvalidValue):
                        if debugmode == 1:
                            print('records[{}][{!r}] == {!r}'.format(i, name, value))
                writer.writerow(list(record.values()))
                counter +=1
                if counter%100000==0:
                    sys.stdout.write('!' + '\r\n')
                    endtime=time.perf_counter()
#                     print (str("{:,}".format(counter))) + " records in " + #str(endtime-starttime) + " seconds."
                elif counter%2000==0:
                    sys.stdout.write('.')
                else:
                    pass
        print("")
        endtime=time.perf_counter()
        print ("Processed " + str("{:,}".format(counter)) + " records in " + str(endtime-starttime) + " seconds (" + str((endtime-starttime)/60) + " minutes.)")
        print (str(counter / (endtime-starttime)) + " records per second.")
        print("")

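But this process is far too tedious given that there are over 400 sub-folders.

After converting, I type copy *.csv combine.csv at the command prompt, but this could also be done with Python. I am currently experimenting with os.walk, but have not made any major progress yet.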
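A minimal sketch of how os.walk could be used to reach every sub-directory. The function name find_dbf_files and the top-level folder name in the usage comment are assumptions made for illustration; the extension is compared in lower case because the sample files end in .DBF.

```python
import os


def find_dbf_files(root):
    """Yield the path of every .dbf file under root, at any depth."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            # Compare in lower case so DATA1.DBF and data1.dbf both match.
            if name.lower().endswith(".dbf"):
                yield os.path.join(dirpath, name)


# Hypothetical usage, assuming "Datafiles" is the top-level folder and
# dbf_to_csv is the conversion function shown earlier:
# for path in find_dbf_files("Datafiles"):
#     dbf_to_csv(path)
```

Because dbf_to_csv writes each csv next to its dbf file, running it over these paths would leave one csv per dbf inside every state folder.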

Ideally, the output would be one csv file per state, containing all of that state's combined data.

For example:

"\Datafiles\FL.csv"
"\Datafiles\NJ.csv"

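It would also be fine if the output for each state went into a pandas dataframe.

UPDATE Edit: I was able to convert all the dbf files to csv using os.walk. os.walk has also helped by giving me a list of the directories that contain the dbf and csv files. For example: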

fl_dirs= ['\Datafiles\\01_APRIL_2019\\01_APRIL_2019\\FL',
 '\Datafiles\\01_JUly_2019\\01_JUlY_2019\\FL',
 '\Datafiles\\03_JUNE_2019\\03_JUNE_2019\\FL',
 '\Datafiles\\04_MARCH_2019\\04_MARCH_2019\\FL']

I simply want to access the identically named csv files in those directories and combine them into one csv file with Python.
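One way this could be done with only the standard library is sketched below. It assumes every csv in the listed directories has the same columns in the same order, and that the header row should appear only once in the result (a plain copy *.csv would repeat it for every file). The name merge_csv_dirs is made up for this example.

```python
import csv
import glob
import os


def merge_csv_dirs(dirs, out_csv):
    """Append every .csv found directly inside dirs to out_csv, with one header row."""
    header_written = False
    out_abs = os.path.abspath(out_csv)
    with open(out_csv, "w", newline="") as out:
        writer = csv.writer(out)
        for folder in dirs:
            for path in sorted(glob.glob(os.path.join(folder, "*.csv"))):
                if os.path.abspath(path) == out_abs:
                    continue  # never read the file that is being written
                with open(path, newline="") as src:
                    reader = csv.reader(src)
                    header = next(reader, None)
                    if header is None:
                        continue  # skip empty files
                    if not header_written:
                        writer.writerow(header)
                        header_written = True
                    # Copy the remaining (data) rows unchanged.
                    writer.writerows(reader)
    return out_csv
```

With a directory list such as fl_dirs, a call like merge_csv_dirs(fl_dirs, "FL.csv") would produce the combined file for that state.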

UPDATE: SOLVED! I wrote a script that does everything I needed!

Answer 1

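This problem can be simplified using os.walk (https://docs.python.org/3/library/os.html#os.listdir).

The sub-directories can be traversed, and the absolute path of each dbf file can be appended to a separate list according to its state.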
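A rough sketch of that grouping step, which avoids hard-coding the state names. It assumes the folder that directly contains the dbf files is named after the state (FL, NJ, ...), as in the example paths in the question; group_dbf_by_state is an illustrative name.

```python
import os
from collections import defaultdict


def group_dbf_by_state(root):
    """Map each state folder name (e.g. 'FL') to the .dbf paths found inside it."""
    by_state = defaultdict(list)
    for dirpath, dirnames, filenames in os.walk(root):
        # Assumption: the innermost folder name is the state code.
        state = os.path.basename(dirpath).upper()
        for name in filenames:
            if name.lower().endswith(".dbf"):
                by_state[state].append(os.path.join(dirpath, name))
    return dict(by_state)
```

The result is a dictionary with one list of paths per state, so new states show up automatically when their folders appear in a weekly delivery.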

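The files can then be converted to csv using the function dbf_to_csv, and combined using the concat feature included in pandas (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html).

Edit: The following code might help. It is untested, though.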

import pandas as pd
import os

# basepath here
base_path="" 
#output dir here
output_path=""


#Create dictionary to store all absolute path
path_dict={"FL":[],"NJ":[]}

#recursively look up into base path
for abs_path,curr_dir,file_list in os.walk(base_path):
    # keep only the csv files; the state folders also contain the original dbf files
    csv_files=[os.path.join(abs_path,file) for file in file_list if file.lower().endswith(".csv")]
    if abs_path.endswith("FL"):
        path_dict["FL"].extend(csv_files)
    elif abs_path.endswith("NJ"):
        path_dict["NJ"].extend(csv_files)

for paths in path_dict:
    if not path_dict[paths]:
        continue  # pd.concat raises ValueError on an empty list
    df=pd.concat(
        [pd.read_csv(i) for i in set(path_dict[paths])],
        ignore_index=True
    )
    df.to_csv(os.path.join(output_path,paths+".csv"),index=False)
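If the intermediate csv files are not needed, the records could also be streamed straight from the dbf files into one csv per state, without pandas. The sketch below is only one possible variant, not the approach described above. It assumes all dbf files of a state share the same fields; the names read_dbf, write_state_csv and the read_table parameter are invented for this example, and dbfread is imported lazily so the writer itself depends only on the standard library.

```python
import csv


def read_dbf(path):
    """Return (field_names, records) for one dbf file, using dbfread."""
    from dbfread import DBF  # third-party: pip install dbfread
    table = DBF(path)
    return table.field_names, table


def write_state_csv(dbf_paths, out_csv, read_table=read_dbf):
    """Stream all records from dbf_paths into a single csv with one header row."""
    writer = None
    with open(out_csv, "w", newline="") as f:
        for path in dbf_paths:
            field_names, records = read_table(path)
            if writer is None:
                # Assumption: the first file's fields define the output columns.
                # Missing fields are left empty and unknown fields are dropped.
                writer = csv.DictWriter(f, fieldnames=field_names,
                                        restval="", extrasaction="ignore")
                writer.writeheader()
            for record in records:
                writer.writerow(record)
    return out_csv
```

Given a list of dbf paths for one state, a call such as write_state_csv(fl_paths, "FL.csv") would write that state's combined file, where fl_paths is a hypothetical list holding the FL dbf paths.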

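How do I 1. convert 4,550 dbf files to csv files, 2. concatenate the files based on their names, and 3. combine all the csv files into one big csv for analysis?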
