My code is not storing the results correctly in the CSV file I created.
I need to scrape the number, sponsor, and party of every bill from the U.S. Congress website.
When I run the code in the interpreter, it works fine and gives me the results I need. However, in the CSV file I created, I get one of the following problems:
SPONS PARTY NBILL
Name D 7402
Name D 7401
...
Interestingly, the name I find (Grijalva, Raul) corresponds to Bill 7302.
As shown above, the sponsors and parties differ, but the bill number only changes every 100 sponsor/party pairs, in steps of 100 (7402 for the first 100 pairs, 7302 for the second 100, and so on).
EDIT: If I put Congress=[-]+[-]+[-] at the end of the code, I get the first case.
with open('115congress.csv', 'w') as f:
    fwriter = csv.writer(f, delimiter=';')
    fwriter.writerow(['SPONS', 'PARTY', 'NBILL'])
    BillN = []
    Spons = []
    Party = []
    for j in range(1, 114):
        hrurl = 'https://www.congress.gov/search?q=%7B%22source%22%3A%22legislation%22%2C%22congress%22%3A%22115%22%2C%22type%22%3A%22bills%22%7D&page=' + str(j)
        hrpage = requests.get(hrurl, headers=headers)
        data = hrpage.text
        soup = BeautifulSoup(data, 'lxml')
        for q in soup.findAll('span', {'class': 'result-item'}):
            for a in q.findAll('a', href=True, text=True, target='_blank'):
                secondindex = secondindex + 1
                if (secondindex / 2).is_integer():
                    continue
                Spons = a.text
                print(Spons)
                SPONS = Spons
                if 'R' in Spons:
                    Party = 'Republican'
                if 'D' in Spons:
                    Party = 'Democratic'
                print(Party)
                PARTY = Party
                Congress115 = [SPONS] + [PARTY]
                fwriter.writerow(Congress115)
        for r in soup.findAll('span', {'class': 'result-heading'}):
            index = index + 1
            if (index / 2).is_integer():
                continue
            Bill = r.findNext('a')
            BillN = Bill.text
            print(BillN)
            NBILL = BillN
            Congress115 = [SPONS] + [PARTY] + [NBILL]
            fwriter.writerow(Congress115)
f.close()
How can I fix the code that writes to the CSV so that these problems no longer occur?
Answer 0 (score: 1)
I don't fully understand all your questions about the code, since I cannot reproduce your error. However, I think there are some problems with your code, and I would like to show you a possible alternative approach.
I think one of your main errors is that you write the variables to the csv file multiple times. Furthermore, if you only look for a single character in a string that contains both the party abbreviation and the name, you will get many wrong party entries.
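The problem with single-character matching can be seen in a concrete case. A minimal sketch, using a hypothetical sponsor string in the same format as the data in the question:

```python
# Hypothetical sponsor string in the format shown in the question's data.
spons = 'Rep. Grijalva, Raul M. [D-AZ-3]'

# The original check: a single character matches almost anywhere in the string.
naive_republican = 'R' in spons   # True -- matches the 'R' in 'Rep.' and 'Raul'
naive_democrat = 'D' in spons     # also True -- matches the 'D' in '[D-AZ-3]'

# Checking for the bracketed party code is unambiguous.
robust_republican = '[R' in spons
robust_democrat = '[D' in spons
```

Here both naive checks come back True for a Democratic sponsor, so whichever assignment runs last wins; the bracketed code only matches the actual party marker.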
Assuming you want to extract bill_nr, spons, and party from each entry, you could do the following (see the comments in the code):
import csv
import requests
from bs4 import BeautifulSoup

for j in range(1, 114):
    hrurl = f'https://www.congress.gov/search?q=%7B%22source%22%3A%22legislation%22%2C%22congress%22%3A%22115%22%2C%22type%22%3A%22bills%22%7D&page={j}'
    hrpage = requests.get(hrurl)
    data = hrpage.text
    soup = BeautifulSoup(data, 'html5lib')

    # get the main div, that contains all entries on the page
    main_div = soup.find('div', {'id': 'main'})
    # every entry is within a <li> element
    all_li = main_div.findAll('li', {'class': 'expanded'})

    # iterate over <li>-elements
    for li in all_li:
        # get BILL_NR
        bill_nr_raw = li.find('span', {'class': 'result-heading'}).text
        # I assume only the first part is the Nr, so you could extract it with the following
        bill_nr = bill_nr_raw.split()[0]

        # get SPONS
        spons_raw = li.find('span', {'class': 'result-item'})
        spons = spons_raw.find('a').text

        # get PARTY
        # check if the string starts with one of the following to ensure you pick the right party
        if spons.startswith('Rep'):
            party = 'Republican'
        elif spons.startswith('Dem'):
            party = 'Democratic'

        # put all the information you extracted from this single entry (=<li>-element)
        # into a list and write that list (=one row) to the csv file
        entry = [bill_nr, spons, party]
        with open('output.csv', 'a') as out_file:
            out = csv.writer(out_file)
            out.writerow(entry)
Note that f-strings (used at the start of the main loop) are only supported in Python 3.6 and later.
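On older Python versions the same URL can be built without f-strings; a minimal sketch using str.format (the page number 1 here is just an example value):

```python
# Same search URL as in the answer, with the page number filled in by
# str.format instead of an f-string (works on Python < 3.6 as well).
base = ('https://www.congress.gov/search?q=%7B%22source%22%3A%22legislation%22'
        '%2C%22congress%22%3A%22115%22%2C%22type%22%3A%22bills%22%7D&page={}')
hrurl = base.format(1)
```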
Answer 1 (score: 1)
A better approach is to iterate over a different element, such as <li>, and then find the elements you need inside it.
To get the cosponsors, you first need to test whether there are any by checking the number. If it is not 0, first get the link to the subpage. Request this subpage with a separate BeautifulSoup object. The table containing the cosponsors can then be parsed and all cosponsors added to a list. You could add extra processing here if needed. The list is then joined into a single string so it can be stored in a single column of the CSV file.
from bs4 import BeautifulSoup
import csv
import requests
import string

headers = None

with open('115congress.csv', 'w', newline='') as f:
    fwriter = csv.writer(f, delimiter=';')
    fwriter.writerow(['SPONS', 'PARTY', 'NBILL', 'TITLE', 'COSPONSORS'])

    for j in range(1, 3):  # 114):
        print(f'Getting page {j}')
        hrurl = 'https://www.congress.gov/search?q=%7B%22source%22%3A%22legislation%22%2C%22congress%22%3A%22115%22%2C%22type%22%3A%22bills%22%7D&page=' + str(j)
        hrpage = requests.get(hrurl, headers=headers)
        soup = BeautifulSoup(hrpage.content, 'lxml')

        for li in soup.find_all('li', class_='expanded'):
            bill_or_law = li.span.text
            sponsor = li.find('span', class_='result-item').a.text
            title = li.find('span', class_='result-title').text
            nbill = li.find('a').text.strip(string.ascii_uppercase + ' .')

            if '[R' in sponsor:
                party = 'Republican'
            elif '[D' in sponsor:
                party = 'Democratic'
            else:
                party = 'Unknown'

            # Any cosponsors?
            cosponsor_link = li.find_all('a')[2]

            if cosponsor_link.text == '0':
                cosponsors = "No cosponsors"
            else:
                print(f'Getting cosponsors for {sponsor}')
                # Get the subpage containing the cosponsors
                hr_cosponsors = requests.get(cosponsor_link['href'], headers=headers)
                soup_cosponsors = BeautifulSoup(hr_cosponsors.content, 'lxml')
                table = soup_cosponsors.find('table', class_="item_table")

                # Create a list of the cosponsors
                cosponsor_list = []

                for tr in table.tbody.find_all('tr'):
                    cosponsor_list.append(tr.td.a.text)

                # Join them together into a single string
                cosponsors = ' - '.join(cosponsor_list)

            fwriter.writerow([sponsor, party, nbill, f'{bill_or_law} - {title}', cosponsors])
This gives you an output CSV file starting with:
SPONS;PARTY;NBILL;TITLE;COSPONSORS
Rep. Ellison, Keith [D-MN-5];Democratic;7401;BILL - Strengthening Refugee Resettlement Act;No cosponsors
Rep. Wild, Susan [D-PA-15];Democratic;7400;BILL - Making continuing appropriations for the Coast Guard.;No cosponsors
Rep. Scanlon, Mary Gay [D-PA-7];Democratic;7399;BILL - Inaugural Fund Integrity Act;No cosponsors
Rep. Foster, Bill [D-IL-11];Democratic;7398;BILL - SPA Act;No cosponsors
Rep. Hoyer, Steny H. [D-MD-5];Democratic;7397;BILL - To provide further additional continuing appropriations for fiscal year 2019, and for other purposes.;No cosponsors
Rep. Torres, Norma J. [D-CA-35];Democratic;7396;BILL - Border Security and Child Safety Act;Rep. Vargas, Juan [D-CA-51]* - Rep. McGovern, James P. [D-MA-2]*
Rep. Meadows, Mark [R-NC-11];Republican;7395;BILL - To direct the Secretary of Health and Human Services to allow delivery of medical supplies by unmanned aerial systems, and for other purposes.;No cosponsors
Rep. Luetkemeyer, Blaine [R-MO-3];Republican;7394;"BILL - To prohibit the Federal financial regulators from requiring compliance with the accounting standards update of the Financial Accounting Standards Board related to current expected credit loss (""CECL""), to require the Securities and Exchange Commission to take certain impacts of a proposed accounting principle into consideration before accepting the principle, and for other purposes.";Rep. Budd, Ted [R-NC-13]*
Rep. Faso, John J. [R-NY-19];Republican;7393;BILL - Medicaid Quality Care Act;No cosponsors
Rep. Babin, Brian [R-TX-36];Republican;7392;BILL - TRACED Act;No cosponsors
Rep. Arrington, Jodey C. [R-TX-19];Republican;7391;BILL - Rural Hospital Freedom and Flexibility Act of 2018;No cosponsors
Rep. Jackson Lee, Sheila [D-TX-18];Democratic;7390;BILL - Violence Against Women Extension Act of 2018;Rep. Hoyer, Steny H. [D-MD-5] - Rep. Clyburn, James E. [D-SC-6]
When using csv.writer(), you should always open the file with the newline='' parameter. This avoids getting double-spaced rows in the CSV file.
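A small sketch of that point, assuming a throwaway file path: with newline='' the csv module controls the line endings itself, so no blank rows appear between records (without it, the '\r\n' row endings written by the csv module can be translated to '\r\r\n' on Windows):

```python
import csv
import os
import tempfile

# Throwaway file path for the demonstration.
path = os.path.join(tempfile.mkdtemp(), 'demo.csv')

# newline='' hands line-ending control to the csv module.
with open(path, 'w', newline='') as f:
    writer = csv.writer(f, delimiter=';')
    writer.writerow(['SPONS', 'PARTY', 'NBILL'])
    writer.writerow(['Rep. Example [D-XX-1]', 'Democratic', '7400'])

# Read it back: exactly two rows, no blank lines in between.
with open(path, newline='') as f:
    rows = list(csv.reader(f, delimiter=';'))
```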
I would suggest searching the text for [D or [R, since the rest of the text may already contain a D or an R.