Python 3.x: concatenating zipped CSV files into one uncompressed CSV file

Time: 2017-09-17 21:34:32

Tags: python-3.x csv

This is my Python 3 code:

import zipfile
import os
import time
from timeit import default_timer as timer
import re
import glob
import pandas as pd


# local variables
# pc version
# the_dir = r'c:\ImpExpData'
# linux version
the_dir = '/home/ralph/Documents/lulumcusb/ImpExpData/Exports/92-95'


def main():
    """
    this is the function that controls the processing
    """
    start_time = timer()
    for root, dirs, files in os.walk(the_dir):
        for file in files:
            if file.endswith(".zip"):
                print("working dir is ...", the_dir)
                zipPath = os.path.join(root, file)
                z = zipfile.ZipFile(zipPath, "r")
                for filename in z.namelist():
                    if filename.endswith(".csv"):
                        # print filename
                        if re.match(r'^Trade-Geo.*\.csv$', filename):
                            pass  # do something with geo file
                        # print " Geo data:  " , filename
                        elif re.match(r'^Trade-Metadata.*\.csv$', filename):
                            pass  # do something with metadata file
                        # print "Metadata:    ", filename
                        else:
                            try:
                                with z.open(filename) as f:
                                    # EmptyDataError: No columns to parse from file -- how to deal with this error
                                    train_df = pd.read_csv(f, index_col=None, header=0, skiprows=1, encoding="cp1252")
                                    list_ = []
                                    list_.append(train_df)
                                    frame = pd.concat(list_, ignore_index=True)
                                    frame.to_csv('/home/ralph/Documents/lulumcusb/ImpExpData/Exports/concat_test.csv', encoding='cp1252')   # works
                            except:  # catches EmptyDataError: No columns to parse from file
                                print("EmptyDataError....", filename, "...", zipPath)

#    GetSubDirList(the_dir)
    end_time = timer()
    print("Elapsed time was %g seconds" % (end_time - start_time))


if __name__ == '__main__':
    main()

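It mostly works, except that it does not concatenate all the zipped CSV files into one. One of the files is empty; all the CSV files have the same field structure, and the number of rows varies from file to file.

This is what Spyder reports when I run it: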

runfile('/home/ralph/Documents/lulumcusb/Sep15_cocncatCSV.py', wdir='/home/ralph/Documents/lulumcusb')

working dir is ... /home/ralph/Documents/lulumcusb/ImpExpData/Exports/92-95

EmptyDataError.... Trade-Exports-Chp-77.csv ... /home/ralph/Documents/lulumcusb/ImpExpData/Exports/92-95/Trade-Exports-Yr1992-1995.zip

/home/ralph/anaconda3/lib/python3.6/site-packages/spyder/utils/site/sitecustomize.py:688: DtypeWarning: Columns (1) have mixed types. Specify dtype option on import or set low_memory=False.
  execfile(filename, namespace)

Elapsed time was 104.857 seconds

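The final CSV file contains only the last zipped CSV file processed; the output file changes size while the files are being processed.

The zip file contains 99 CSV files that I want to concatenate into one uncompressed CSV file.

The field (column) names are: colmNames = ["hs_code", "uom", "country", "state", "prov", "value", "quatity", "year", "month"]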

The CSV files are named chp01.csv, chp02.csv, and so on through chp99.csv. The "uom" (unit of measure) field is empty, an integer, or a string, depending on the hs_code.

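Question: how do I concatenate the zipped CSV files into one large (estimated 100 MB uncompressed) CSV file?

Added details: I am trying to avoid unzipping the CSV files, because I would then have to delete them. I need to concatenate the files because I have additional processing to do. Extracting the zipped CSV files is a viable option, but I would prefer not to have to do that.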

1 Answer:

Answer 0 (score: 0)

Is there any reason you don't want to do this with your shell?

Assuming the order in which you concatenate doesn't matter:

cd "/home/ralph/Documents/lulumcusb/ImpExpData/Exports/92-95"
unzip "Trade-Exports-Yr1992-1995.zip" -d unzipped && cd unzipped
for f in Trade-Exports-Chp*.csv; do tail --lines=+2 "$f" >> concat.csv; done

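This strips the first line (the column names) from each CSV file before appending it to concat.csv.

If you had just done:

tail --lines=+2 Trade-Exports-Chp*.csv > concat.csv

you would end up with a file name banner before each file's contents: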

==> Trade-Exports-Chp-1.csv <==
...

==> Trade-Exports-Chp-10.csv <==
...

==> Trade-Exports-Chp-2.csv <==
...

etc.

If you care about the order, rename Trade-Exports-Chp-1.csv .. Trade-Exports-Chp-9.csv to Trade-Exports-Chp-01.csv .. Trade-Exports-Chp-09.csv, so that the glob's alphabetical order matches the chapter order.
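If renaming is inconvenient, the same ordering can be obtained by sorting on the chapter number instead of the raw name. A minimal Python sketch, assuming the member names follow the Trade-Exports-Chp-N.csv pattern seen in the output above (the helper name chapter_number is made up for this illustration):

```python
import re


def chapter_number(name):
    """Return the chapter number embedded in a name such as 'Trade-Exports-Chp-10.csv'."""
    match = re.search(r"Chp-?(\d+)\.csv$", name)
    # Names without a recognizable chapter number sort last.
    return int(match.group(1)) if match else float("inf")


names = [
    "Trade-Exports-Chp-1.csv",
    "Trade-Exports-Chp-10.csv",
    "Trade-Exports-Chp-2.csv",
]
ordered = sorted(names, key=chapter_number)
print(ordered)
```

Sorting with this key puts Chp-2 before Chp-10, which plain alphabetical ordering does not.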

Although this is doable in Python, I don't think Python is the right tool for the job in this case.
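For anyone who would still rather stay in pandas: the script in the question creates list_ and calls to_csv inside the loop, so every pass overwrites the output with a single file's data, which would explain why only the last file survives. A minimal sketch of one way around that, under a few assumptions: the function name concat_zip_to_csv and its prefix parameter are illustrative and not from the question, dtype={"uom": str} is a guess at what would quiet the DtypeWarning for the mixed-type "uom" column, and index=False is a choice made here to keep the row index out of the output.

```python
import zipfile

import pandas as pd


def concat_zip_to_csv(zip_path, out_path, prefix="Trade-Exports-Chp"):
    """Read matching CSV members straight from the zip and write one combined CSV."""
    frames = []  # created once, outside the loop, so earlier files are kept
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if not (name.startswith(prefix) and name.endswith(".csv")):
                continue  # skips the Geo and Metadata members
            with archive.open(name) as member:
                try:
                    frames.append(
                        pd.read_csv(
                            member,
                            header=0,
                            skiprows=1,  # same setting as in the question
                            encoding="cp1252",
                            dtype={"uom": str},  # assumption: treat uom as text
                        )
                    )
                except pd.errors.EmptyDataError:
                    print("skipping empty file:", name)
    combined = pd.concat(frames, ignore_index=True)
    # Written once, after every member has been read.
    combined.to_csv(out_path, index=False, encoding="cp1252")
    return combined
```

This holds every frame in memory before writing, which should be acceptable for roughly 100 MB of data but would need rethinking for much larger archives.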

If you want to do the job without actually extracting the zip file:

for i in {1..99}; do
  unzip -p "Trade-Exports-Yr1992-1995.zip" "Trade-Exports-Chp-$i.csv" | tail --lines=+2 >> concat.csv
done
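
The same streaming idea can be sketched in Python with only the standard library, for cases where a shell is not available. This is an illustration rather than a drop-in replacement: the function name concat_zipped_csvs and the prefix and header_lines parameters are assumptions introduced here. Like the loop above, it drops each member's header and writes no header to the output.

```python
import shutil
import zipfile


def concat_zipped_csvs(zip_path, out_path, prefix="Trade-Exports-Chp", header_lines=1):
    """Append every matching CSV member to out_path, skipping each member's header lines.

    Returns the number of members that were processed.
    """
    written = 0
    with zipfile.ZipFile(zip_path) as archive, open(out_path, "wb") as out:
        for name in archive.namelist():
            if not (name.startswith(prefix) and name.endswith(".csv")):
                continue
            with archive.open(name) as member:
                for _ in range(header_lines):
                    member.readline()  # discard a header line; empty members yield b""
                shutil.copyfileobj(member, out)  # stream the remaining bytes
            written += 1
    return written
```

Nothing is extracted to disk and an empty member simply contributes no rows. If a member does not end with a newline, its last row will run into the next member's first row, so that case would need extra handling.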