I have several large .text files that I want to combine into a single .csv file. However, each file is too large to import into Excel on its own, let alone all of them together.
I would like to use pandas to analyze the data, but I don't know how to get all the files into one place first.
How can I read the data into Python, or into a .csv file, directly?
The data in question is the 2019-2020 Contributions by individuals file on the FEC website.
*Also, I'm using a PC, not a Mac.
Answer 0 (score: 1)
If I understand correctly, you want to download all the .zip files from the URL, combine them into one dataframe, and then save it to csv (this example uses BeautifulSoup to get all the URLs of the .zip files):
import pandas
import requests
from io import BytesIO
from zipfile import ZipFile
from bs4 import BeautifulSoup
url = 'https://www.fec.gov/data/browse-data/?tab=bulk-data'
names = [
    'CAND_ID', 'CAND_NAME', 'CAND_ICI', 'PTY_CD', 'CAND_PTY_AFFILIATION',
    'TTL_RECEIPTS', 'TRANS_FROM_AUTH', 'TTL_DISB', 'TRANS_TO_AUTH',
    'COH_BOP', 'COH_COP', 'CAND_CONTRIB', 'CAND_LOANS', 'OTHER_LOANS',
    'CAND_LOAN_REPAY', 'OTHER_LOAN_REPAY', 'DEBTS_OWED_BY',
    'TTL_INDIV_CONTRIB', 'CAND_OFFICE_ST', 'CAND_OFFICE_DISTRICT',
    'SPEC_ELECTION', 'PRIM_ELECTION', 'RUN_ELECTION', 'GEN_ELECTION',
    'GEN_ELECTION_PRECENT', 'OTHER_POL_CMTE_CONTRIB', 'POL_PTY_CONTRIB',
    'CVG_END_DT', 'INDIV_REFUNDS', 'CMTE_REFUNDS'
]
soup = BeautifulSoup(requests.get(url).text, 'html5lib')
df = pandas.DataFrame([], columns=names)
for a in soup.select_one('button:contains("All candidates")').find_next('ul').select('a'):
    zipfile_url = 'https://www.fec.gov' + a['href']
    zf = ZipFile(BytesIO(requests.get(zipfile_url).content))
    for item in zf.namelist():
        print("File in zip: " + item)
        if '.txt' in item:
            in_df = pandas.read_csv(zf.open(item), sep='|', header=None, names=names)
            # DataFrame.append was removed in pandas 2.0; concat is the equivalent
            df = pandas.concat([df, in_df], ignore_index=True)
print(df)
# `df` now includes 56928 rows of data, save it to csv
df.to_csv('candidates.csv', index=False)
# ...or make other operations on this dataframe
This prints:
File in zip: weball80.txt
CAND_ID CAND_NAME CAND_ICI PTY_CD CAND_PTY_AFFILIATION TTL_RECEIPTS ... GEN_ELECTION_PRECENT OTHER_POL_CMTE_CONTRIB POL_PTY_CONTRIB CVG_END_DT INDIV_REFUNDS CMTE_REFUNDS
0 H8AK00132 SHEIN, DIMITRI C 1 DEM 0.00 ... NaN 0.00 0.0 09/30/2019 0.00 0.0
1 H6AK00045 YOUNG, DONALD E I 2 REP 571389.12 ... NaN 263194.63 0.0 09/30/2019 0.00 2000.0
2 H8AK01031 NELSON, THOMAS JOHN C 2 REP 0.00 ... NaN 0.00 0.0 03/31/2019 0.00 0.0
3 H8AK00140 GALVIN, ALYSE C 3 IND 497774.71 ... NaN 500.00 0.0 09/30/2019 1038.19 0.0
4 H0AL01097 AVERHART, JAMES O 1 DEM 22725.13 ... NaN 0.00 0.0 09/30/2019 0.00 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ...
56923 S2WY00018 HANSEN, CLIFFORD P. C 2 REP 0.00 ... NaN 0.00 0.0 03/31/1979 0.00 0.0
56924 S6WY00043 WALLOP, MALCOLM I 2 REP 36352.00 ... NaN 0.00 0.0 12/31/1980 0.00 0.0
56925 S8WY00015 BINFORD, HUGH L. C 2 REP 262047.00 ... NaN 0.00 0.0 04/11/1980 0.00 0.0
56926 S8WY00023 SIMPSON, ALAN KOOI I 2 REP 150447.00 ... NaN 0.00 0.0 12/31/1980 0.00 0.0
56927 S8WY00056 BARROWS, GORDON HENSLEY C 2 REP 0.00 ... NaN 0.00 0.0 06/30/1979 0.00 0.0
[56928 rows x 30 columns]
and saves the data to candidates.csv.
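If you then want the kind of follow-up operation the last comment alludes to, the saved csv loads straight back into pandas; a minimal sketch (the column names come from the names list above, and the top-10 query is only an illustration):

import pandas

df = pandas.read_csv('candidates.csv')
# e.g. the ten campaigns with the largest total receipts
top = df.nlargest(10, 'TTL_RECEIPTS')[['CAND_NAME', 'CAND_PTY_AFFILIATION', 'TTL_RECEIPTS']]
print(top)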
EDIT: After reading your question again, here is a snippet that loads only the 2019-2020 contributions and stores them in one big .csv file:
import pandas
import requests
from io import BytesIO
from zipfile import ZipFile
from bs4 import BeautifulSoup
url = 'https://www.fec.gov/data/browse-data/?tab=bulk-data'
names = ['CMTE_ID','AMNDT_IND','RPT_TP','TRANSACTION_PGI','IMAGE_NUM','TRANSACTION_TP','ENTITY_TP','NAME','CITY','STATE','ZIP_CODE','EMPLOYER','OCCUPATION','TRANSACTION_DT','TRANSACTION_AMT','OTHER_ID','TRAN_ID','FILE_NUM','MEMO_CD','MEMO_TEXT','SUB_ID']
soup = BeautifulSoup(requests.get(url).text, 'html5lib')
df = pandas.DataFrame([], columns=names)
df.to_csv('contributions.csv', mode='w', index=False)
for a in soup.select_one('button:contains("Contributions by individuals")').find_next('ul').select('a:contains("2019–2020")'):
    zipfile_url = 'https://www.fec.gov' + a['href']
    zf = ZipFile(BytesIO(requests.get(zipfile_url).content))
    for item in zf.namelist():
        print("File in zip: " + item)
        if '.txt' in item:
            in_df = pandas.read_csv(zf.open(item), sep='|', header=None, names=names, low_memory=False)
            in_df.to_csv('contributions.csv', mode='a', header=False, index=False)
            print(in_df)
The result is the file contributions.csv with 14978701 rows.
After that, I loaded the data back into pandas (it was close, though; my PC has 16GB of RAM):
import pandas
df = pandas.read_csv('contributions.csv')
print(df)
This prints:
sys:1: DtypeWarning: Columns (3,5,10,11,12,13,14,15,16,17,18,19,20) have mixed types. Specify dtype option on import or set low_memory=False.
CMTE_ID AMNDT_IND RPT_TP TRANSACTION_PGI IMAGE_NUM TRANSACTION_TP ... OTHER_ID TRAN_ID FILE_NUM MEMO_CD MEMO_TEXT SUB_ID
0 C00432906 T TER P2018 201901219143901218 22Y ... NaN SB20A.55756 1305860 NaN NaN 4021320191639407455
1 C00432906 T TER P2018 201901219143901218 22Y ... NaN SB20A.55755 1305860 NaN NaN 4021320191639407453
2 C00638478 T TER P2018 201901289144040159 15C ... H8CA39133 3703295 1307800 NaN CONVERTING PRIMARY LOAN TO CONTRIBUTION 4021220191639267648
3 C00640870 T TER P2018 201901259144002482 15C ... H8FL07054 VTQYWHKD8W6 1307204 NaN CONTRIBUTION FOR DEBT RETIREMENT 4021320191639532337
4 C00638478 T TER P2018 201901289144040158 15 ... NaN 3703278 1307800 NaN CHECK LOST 4021220191639267645
... ... ... ... ... ... ... ... ... ... ... ... ... ...
14939961 C00437244 N M3 P 201903080300269078 15 ... NaN SA031819907833 1319643 NaN NaN 2031820191645160755
14939962 C00365973 N Q1 P 201904160300273926 15 ... NaN SA04191939261 1327732 NaN NaN 2042220191647061196
14939963 C00365973 N Q1 P 201904160300273926 15 ... NaN SA04191939262 1327732 NaN NaN 2042220191647061197
14939964 C00365973 N Q1 P 201904160300273926 15 ... NaN SA04191939263 1327732 NaN NaN 2042220191647061198
14939965 C00365973 N Q1 P 201904160300273927 15 ... NaN SA04191939274 1327732 NaN NaN 2042220191647061199
[14939966 rows x 21 columns]
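If the whole file does not fit comfortably in memory, reading it back in chunks keeps the footprint bounded; a minimal sketch (the chunksize value is arbitrary, and the ZIP_CODE dtype is just one example of pinning the mixed-type columns the DtypeWarning lists):

import pandas

total = 0.0
# aggregate the ~15M rows piece by piece instead of loading them all at once;
# an explicit dtype silences the mixed-type warning for that column
for chunk in pandas.read_csv('contributions.csv', chunksize=1_000_000, dtype={'ZIP_CODE': str}):
    total += chunk['TRANSACTION_AMT'].sum()
print(total)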
Answer 1 (score: 0)
This is really simple with Unix. I suggest installing a Unix-like environment on your Windows PC and then using the cat command:
cat file1.text file2.text file99.text > new_file.csv
Or, combined with the download:
wget URL
unzip archive.zip
cd folder
cat *.text > all.csv
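Note that the FEC dumps are pipe-delimited (see the sep='|' in the pandas answer above), so the concatenated file keeps '|' as its separator despite the .csv extension; pandas reads it fine if you pass the separator explicitly. A minimal sketch, assuming the all.csv produced above:

import pandas as pd

# the file written by `cat` still uses the original '|' separators
df = pd.read_csv('all.csv', sep='|', header=None)
print(df.shape)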
Answer 2 (score: 0)
You can convert each file to csv first, then concatenate them to form one final csv.
import glob
import os
import urllib.request
import zipfile

import pandas as pd

csv_path = 'pathtonewcsvfolder'  # use your path
text_path = 'path/to/textfiles'  # where the extracted text files will live

# download the archive and extract the text files from it
urllib.request.urlretrieve("url/to/files", "all_files.zip")
with zipfile.ZipFile("all_files.zip", 'r') as zip_ref:
    zip_ref.extractall(text_path)

# convert each text file to its own csv
for x, filename in enumerate(os.listdir(text_path)):
    # read_fwf assumes fixed-width fields; for the pipe-delimited FEC files,
    # pd.read_csv(..., sep='|') works too
    df = pd.read_fwf(os.path.join(text_path, filename))
    df.to_csv(os.path.join(csv_path, 'log' + str(x) + '.csv'), index=False)

# concatenate all the intermediate csv files into one final csv
all_csv_files = glob.iglob(os.path.join(csv_path, "*.csv"))
converted_df = pd.concat((pd.read_csv(f) for f in all_csv_files), ignore_index=True)
converted_df.to_csv('final.csv', index=False)