How can I combine large .text files into a single .csv?

Date: 2019-12-18 17:38:15

Tags: python csv export-to-csv

I have several large .text files that I would like to combine into a single .csv file. However, each file is too large to import into Excel on its own, let alone all of them together.

I want to create a pandas dataframe and use it to analyze the data, but I don't know how to get all of the files into one place first.

How can I read the data directly into Python, or into a .csv file that Excel can work with?

The data in question is the 2019-2020 Contributions by individuals file on the FEC website.

*Also, I'm using a PC, not a Mac.

3 Answers:

Answer 0 (score: 1)

If I understand correctly, you want to download all of the .zip files from the URL, combine them into a single dataframe, and then save it to csv (this example uses BeautifulSoup to get the URLs of all the .zip files):

import pandas
import requests
from io import BytesIO
from zipfile import ZipFile
from bs4 import BeautifulSoup

url = 'https://www.fec.gov/data/browse-data/?tab=bulk-data'

names = [
'CAND_ID',
'CAND_NAME',
'CAND_ICI',
'PTY_CD',
'CAND_PTY_AFFILIATION',
'TTL_RECEIPTS',
'TRANS_FROM_AUTH',
'TTL_DISB',
'TRANS_TO_AUTH',
'COH_BOP',
'COH_COP',
'CAND_CONTRIB',
'CAND_LOANS',
'OTHER_LOANS',
'CAND_LOAN_REPAY',
'OTHER_LOAN_REPAY',
'DEBTS_OWED_BY',
'TTL_INDIV_CONTRIB',
'CAND_OFFICE_ST',
'CAND_OFFICE_DISTRICT',
'SPEC_ELECTION',
'PRIM_ELECTION',
'RUN_ELECTION',
'GEN_ELECTION',
'GEN_ELECTION_PRECENT',
'OTHER_POL_CMTE_CONTRIB',
'POL_PTY_CONTRIB',
'CVG_END_DT',
'INDIV_REFUNDS',
'CMTE_REFUNDS'
]

soup = BeautifulSoup(requests.get(url).text, 'html5lib')

df = pandas.DataFrame([], columns=names)

# each link under the "All candidates" button points to one per-cycle .zip archive
for a in soup.select_one('button:contains("All candidates")').find_next('ul').select('a'):
    zipfile_url = 'https://www.fec.gov' + a['href']
    zf = ZipFile(BytesIO(requests.get(zipfile_url).content))
    for item in zf.namelist():
        print("File in zip: " + item)
        if '.txt' in item:
            # the FEC bulk files are pipe-delimited and have no header row
            in_df = pandas.read_csv(zf.open(item), sep='|', header=None, names=names)
            df = df.append(in_df, ignore_index=True)
            print(df)

# `df` now includes 56928 rows of data, save it to csv
df.to_csv('candidates.csv', index=False)

# ...or perform other operations on this dataframe

This prints:

File in zip: weball80.txt
         CAND_ID                CAND_NAME CAND_ICI PTY_CD CAND_PTY_AFFILIATION  TTL_RECEIPTS  ...  GEN_ELECTION_PRECENT  OTHER_POL_CMTE_CONTRIB  POL_PTY_CONTRIB  CVG_END_DT  INDIV_REFUNDS  CMTE_REFUNDS
0      H8AK00132           SHEIN, DIMITRI        C      1                  DEM          0.00  ...                   NaN                    0.00              0.0  09/30/2019           0.00           0.0
1      H6AK00045          YOUNG, DONALD E        I      2                  REP     571389.12  ...                   NaN               263194.63              0.0  09/30/2019           0.00        2000.0
2      H8AK01031      NELSON, THOMAS JOHN        C      2                  REP          0.00  ...                   NaN                    0.00              0.0  03/31/2019           0.00           0.0
3      H8AK00140            GALVIN, ALYSE        C      3                  IND     497774.71  ...                   NaN                  500.00              0.0  09/30/2019        1038.19           0.0
4      H0AL01097          AVERHART, JAMES        O      1                  DEM      22725.13  ...                   NaN                    0.00              0.0  09/30/2019           0.00           0.0
...          ...                      ...      ...    ...                  ...           ...  ...                   ...                     ...              ...         ...            ...           ...
56923  S2WY00018      HANSEN, CLIFFORD P.        C      2                  REP          0.00  ...                   NaN                    0.00              0.0  03/31/1979           0.00           0.0
56924  S6WY00043          WALLOP, MALCOLM        I      2                  REP      36352.00  ...                   NaN                    0.00              0.0  12/31/1980           0.00           0.0
56925  S8WY00015         BINFORD, HUGH L.        C      2                  REP     262047.00  ...                   NaN                    0.00              0.0  04/11/1980           0.00           0.0
56926  S8WY00023       SIMPSON, ALAN KOOI        I      2                  REP     150447.00  ...                   NaN                    0.00              0.0  12/31/1980           0.00           0.0
56927  S8WY00056  BARROWS, GORDON HENSLEY        C      2                  REP          0.00  ...                   NaN                    0.00              0.0  06/30/1979           0.00           0.0

[56928 rows x 30 columns]

and saves the data to candidates.csv.
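
A side note: calling df.append inside the loop copies the whole frame on every iteration, so it slows down as the data grows (and append has since been removed from newer pandas versions). A minimal sketch of the same loop that collects the pieces in a list and concatenates once at the end, reusing the names, soup and imports from the snippet above:

frames = []
for a in soup.select_one('button:contains("All candidates")').find_next('ul').select('a'):
    zf = ZipFile(BytesIO(requests.get('https://www.fec.gov' + a['href']).content))
    for item in zf.namelist():
        if '.txt' in item:
            frames.append(pandas.read_csv(zf.open(item), sep='|', header=None, names=names))

# one concatenation at the end instead of repeated copies
df = pandas.concat(frames, ignore_index=True)
df.to_csv('candidates.csv', index=False)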


EDIT: After rereading your question, this snippet loads only the 2019-2020 contributions and stores them in a single .csv file:

import pandas
import requests
from io import BytesIO
from zipfile import ZipFile
from bs4 import BeautifulSoup

url = 'https://www.fec.gov/data/browse-data/?tab=bulk-data'

names = ['CMTE_ID','AMNDT_IND','RPT_TP','TRANSACTION_PGI','IMAGE_NUM','TRANSACTION_TP','ENTITY_TP','NAME','CITY','STATE','ZIP_CODE','EMPLOYER','OCCUPATION','TRANSACTION_DT','TRANSACTION_AMT','OTHER_ID','TRAN_ID','FILE_NUM','MEMO_CD','MEMO_TEXT','SUB_ID']

soup = BeautifulSoup(requests.get(url).text, 'html5lib')

# write an empty dataframe first so contributions.csv starts with only the header row
df = pandas.DataFrame([], columns=names)
df.to_csv('contributions.csv', mode='w', index=False)

for a in soup.select_one('button:contains("Contributions by individuals")').find_next('ul').select('a:contains("2019–2020")'):
    zipfile_url = 'https://www.fec.gov' + a['href']
    zf = ZipFile(BytesIO(requests.get(zipfile_url).content))
    for item in zf.namelist():
        print("File in zip: " + item)
        if '.txt' in item:
            in_df = pandas.read_csv(zf.open(item), sep='|', header=None, names=names, low_memory=False)
            # append each file to the csv on disk instead of keeping everything in memory
            in_df.to_csv('contributions.csv', mode='a', header=False, index=False)
            print(in_df)

The result is 14,978,701 rows in the file contributions.csv.

After that, I loaded the data into pandas (it only just fit; my PC has 16 GB of RAM):

import pandas

df = pandas.read_csv('contributions.csv')
print(df)

This prints:

sys:1: DtypeWarning: Columns (3,5,10,11,12,13,14,15,16,17,18,19,20) have mixed types. Specify dtype option on import or set low_memory=False.
            CMTE_ID AMNDT_IND RPT_TP TRANSACTION_PGI           IMAGE_NUM TRANSACTION_TP  ...   OTHER_ID         TRAN_ID FILE_NUM MEMO_CD                                MEMO_TEXT               SUB_ID
0         C00432906         T    TER           P2018  201901219143901218            22Y  ...        NaN     SB20A.55756  1305860     NaN                                      NaN  4021320191639407455
1         C00432906         T    TER           P2018  201901219143901218            22Y  ...        NaN     SB20A.55755  1305860     NaN                                      NaN  4021320191639407453
2         C00638478         T    TER           P2018  201901289144040159            15C  ...  H8CA39133         3703295  1307800     NaN  CONVERTING PRIMARY LOAN TO CONTRIBUTION  4021220191639267648
3         C00640870         T    TER           P2018  201901259144002482            15C  ...  H8FL07054     VTQYWHKD8W6  1307204     NaN         CONTRIBUTION FOR DEBT RETIREMENT  4021320191639532337
4         C00638478         T    TER           P2018  201901289144040158             15  ...        NaN         3703278  1307800     NaN                               CHECK LOST  4021220191639267645
...             ...       ...    ...             ...                 ...            ...  ...        ...             ...      ...     ...                                      ...                  ...
14939961  C00437244         N     M3               P  201903080300269078             15  ...        NaN  SA031819907833  1319643     NaN                                      NaN  2031820191645160755
14939962  C00365973         N     Q1               P  201904160300273926             15  ...        NaN   SA04191939261  1327732     NaN                                      NaN  2042220191647061196
14939963  C00365973         N     Q1               P  201904160300273926             15  ...        NaN   SA04191939262  1327732     NaN                                      NaN  2042220191647061197
14939964  C00365973         N     Q1               P  201904160300273926             15  ...        NaN   SA04191939263  1327732     NaN                                      NaN  2042220191647061198
14939965  C00365973         N     Q1               P  201904160300273927             15  ...        NaN   SA04191939274  1327732     NaN                                      NaN  2042220191647061199

[14939966 rows x 21 columns]
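
If the full file does not fit comfortably in memory, a rough sketch of processing it in chunks instead (the chunk size and the per-state row count are only examples):

import pandas

total_by_state = {}
# read contributions.csv one million rows at a time rather than all at once
for chunk in pandas.read_csv('contributions.csv', chunksize=1_000_000):
    counts = chunk['STATE'].value_counts()
    for state, count in counts.items():
        total_by_state[state] = total_by_state.get(state, 0) + count

print(total_by_state)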

Answer 1 (score: 0)

This is really simple with Unix. I would suggest installing a Unix-like environment on your Windows PC and then using the cat command:

cat file1.text file2.text file99.text > new_file.csv

Or, combined with downloading the data:

wget URL
unzip archive.zip
cd folder
cat *.text > all.csv
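
If you would rather not install a Unix-like environment, roughly the same concatenation can be done in plain Python (a minimal sketch, assuming the .text files sit in the current folder and all share the same column layout):

import glob

with open('all.csv', 'w', encoding='utf-8') as out:
    # append each text file to the output, in name order
    for path in sorted(glob.glob('*.text')):
        with open(path, encoding='utf-8') as src:
            for line in src:
                out.write(line)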

Answer 2 (score: 0)

You can convert each file to csv first, and then concatenate them all to form one final csv.

import os
import glob
import zipfile
import urllib.request
import pandas as pd

csv_path = 'pathtonewcsvfolder'                    # use your path
text_path = 'path/to/textfiles'                    # folder to extract the text files into

# download the archive and unpack the text files
urllib.request.urlretrieve("url/to/files", "all_files.zip")
with zipfile.ZipFile("all_files.zip", 'r') as zip_ref:
    zip_ref.extractall(text_path)

# convert each text file into its own intermediate csv
for x, filename in enumerate(os.listdir(text_path)):
    df = pd.read_fwf(os.path.join(text_path, filename))
    df.to_csv(os.path.join(csv_path, 'log' + str(x) + '.csv'), index=False)

# concatenate the intermediate csv files into one final csv
all_csv_files = glob.iglob(os.path.join(csv_path, "*.csv"))
converted_df = pd.concat((pd.read_csv(f) for f in all_csv_files), ignore_index=True)
converted_df.to_csv('final.csv', index=False)
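
One caveat: the FEC bulk data files are pipe-delimited rather than fixed-width, so (assuming that is the input here) reading each one with something like pd.read_csv(os.path.join(text_path, filename), sep='|', header=None) instead of pd.read_fwf will probably split the columns more reliably. The intermediate per-file csvs could also be skipped by collecting the dataframes in a list and writing a single pd.concat result.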