Unable to store information in CSV format (Python web scraping)

Asked: 2019-07-09 14:06:13

Tags: python csv web-scraping beautifulsoup

My code is not storing the results correctly in the CSV file I created.

I need to extract the number, sponsor, and party of each bill from the U.S. Congress website.

When I run the code in the interpreter, it works fine and gives me the results I want. However, the CSV file I create has one of the following problems:

  • The same sponsor for every bill (the correct bill numbers, but everyone shares the same sponsor):
SPONS  PARTY NBILL
Name   D     7402
Name   D     7401
...

Interestingly, the name I am getting (Grijalva, Raul) corresponds to Bill 7302.

  • The correct sponsors, but only every 100th bill number, i.e. I get 7402 for 100 sponsors in a row, then 7302, and so on.

As above, the sponsors and parties differ, but the bill number only changes once every 100 sponsor/party pairs, going down by 100 each time (7402 for the first 100 pairs, 7302 for the next 100, and so on).

  • The correct sponsors but no bill numbers at all, which is what happens with the code below.

EDIT: if I put Congress115=[SPONS]+[PARTY]+[NBILL] at the end of the code, I get the first case.

    import csv
    import requests
    from bs4 import BeautifulSoup

    # Assumed setup: the original post does not show these definitions.
    headers = {'User-Agent': 'Mozilla/5.0'}
    index = 0
    secondindex = 0

    with open('115congress.csv', 'w') as f:
        fwriter = csv.writer(f, delimiter=';')
        fwriter.writerow(['SPONS', 'PARTY', 'NBILL'])
        for j in range(1, 114):
            hrurl = 'https://www.congress.gov/search?q=%7B%22source%22%3A%22legislation%22%2C%22congress%22%3A%22115%22%2C%22type%22%3A%22bills%22%7D&page='+str(j)
            hrpage = requests.get(hrurl, headers=headers)
            data = hrpage.text
            soup = BeautifulSoup(data, 'lxml')
            # First pass: writes one (sponsor, party) row per sponsor link
            for q in soup.findAll('span', {'class': 'result-item'}):
                for a in q.findAll('a', href=True, text=True, target='_blank'):
                    secondindex = secondindex + 1
                    if (secondindex / 2).is_integer():
                        continue
                    Spons = a.text
                    print(Spons)
                    SPONS = Spons
                    if 'R' in Spons:
                        Party = 'Republican'
                    if 'D' in Spons:
                        Party = 'Democratic'
                    print(Party)
                    PARTY = Party
                    Congress115 = [SPONS] + [PARTY]
                    fwriter.writerow(Congress115)
            # Second pass: writes one (sponsor, party, bill) row per heading,
            # reusing whatever SPONS/PARTY were left over from the first pass
            for r in soup.findAll('span', {'class': 'result-heading'}):
                index = index + 1
                if (index / 2).is_integer():
                    continue
                Bill = r.findNext('a')
                BillN = Bill.text
                print(BillN)
                NBILL = BillN
                Congress115 = [SPONS] + [PARTY] + [NBILL]
                fwriter.writerow(Congress115)

How can I fix my CSV-writing code so that these problems don't occur?

2 Answers:

Answer 0 (score: 1)

I don't understand all of the issues with your code, since I can't reproduce your errors. However, I think there are a few problems with it, and I'd like to show you a possible alternative approach.

I think one of your main mistakes is that you write the variables to the CSV file multiple times. Also, if you only look for a single character in a string that contains both the party abbreviation and the sponsor's name, you will get many wrong party entries.
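For example, here is a minimal illustration of how single-character matching goes wrong (the sponsor string is hypothetical, in the format shown in the question's output):

spons = 'Rep. Grijalva, Raul [D-AZ-3]'  # hypothetical sponsor string

print('R' in spons)   # True -- matches the 'R' in 'Rep.' and 'Raul',
                      # even though this sponsor is a Democrat
print('[R' in spons)  # False
print('[D' in spons)  # True -- the bracketed party tag is unambiguous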

Assuming you want to extract bill_nr, spons and party from each entry, you could do the following (see the comments in the code):

import csv
import requests
from bs4 import BeautifulSoup

for j in range(1, 114):
  hrurl = f'https://www.congress.gov/search?q=%7B%22source%22%3A%22legislation%22%2C%22congress%22%3A%22115%22%2C%22type%22%3A%22bills%22%7D&page={j}'
  hrpage = requests.get(hrurl)
  data = hrpage.text
  soup = BeautifulSoup(data, 'html5lib')

  # get the main div that contains all entries on the page
  main_div = soup.find('div', {'id': 'main'})
  # every entry is within a <li> element
  all_li = main_div.findAll('li', {'class': 'expanded'})

  # iterate over the <li> elements
  for li in all_li:
    # get BILL_NR
    bill_nr_raw = li.find('span', {'class': 'result-heading'}).text
    # assuming only the first part is the number, extract it like this
    bill_nr = bill_nr_raw.split()[0]

    # get SPONS
    spons_raw = li.find('span', {'class': 'result-item'})
    spons = spons_raw.find('a').text

    # get PARTY
    # match the bracketed party tag (e.g. [D-MN-5]) rather than a single
    # letter, so an 'R' or 'D' in the name cannot cause a wrong match
    if '[R' in spons:
      party = 'Republican'
    elif '[D' in spons:
      party = 'Democratic'
    else:
      party = 'Unknown'

    # put all the information extracted from this single entry (= <li> element)
    # into a list and write that list (= one row) to the csv file
    entry = [bill_nr, spons, party]
    with open('output.csv', 'a', newline='') as out_file:
      out = csv.writer(out_file)
      out.writerow(entry)

Note that f-strings (used at the start of the main loop) require Python 3.6 or later.
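A side note on the file handling (my suggestion, not part of the original answer): reopening output.csv in append mode for every single row works, but it is slow and never writes a header. A minimal sketch of the more common pattern, opening the file once before the loop:

import csv

# Sketch: open the output file once, write a header, and reuse the writer
# inside the scraping loops; newline='' avoids blank lines on Windows.
with open('output.csv', 'w', newline='') as out_file:
    out = csv.writer(out_file)
    out.writerow(['BILL_NR', 'SPONS', 'PARTY'])  # column names are illustrative
    # ... the page loop and <li> loop from above go here, each entry
    # written with out.writerow([bill_nr, spons, party])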

Answer 1 (score: 1)

A better approach would be to iterate over a different element, such as the <li> elements, and then find the elements you need inside each one.

To get the cosponsors, you first need to test whether there are any by checking the count. If it is not 0, first get the link to the subpage. Request this subpage and parse it with a separate BeautifulSoup object. The table containing the cosponsors can then be parsed, adding each cosponsor to a list. You could add extra processing here if needed. The list is then joined into a single string so it can be saved in a single column of the CSV file.

from bs4 import BeautifulSoup
import csv
import requests
import string

headers = None  # supply real request headers here if needed

with open('115congress.csv', 'w', newline='') as f:
    fwriter = csv.writer(f, delimiter=';')
    fwriter.writerow(['SPONS', 'PARTY', 'NBILL', 'TITLE', 'COSPONSORS'])

    for j in range(1, 3):  #114):
        print(f'Getting page {j}')

        hrurl = 'https://www.congress.gov/search?q=%7B%22source%22%3A%22legislation%22%2C%22congress%22%3A%22115%22%2C%22type%22%3A%22bills%22%7D&page='+str(j)
        hrpage = requests.get(hrurl, headers=headers)
        soup = BeautifulSoup(hrpage.content, 'lxml')

        for li in soup.find_all('li', class_='expanded'):
            bill_or_law = li.span.text
            sponsor = li.find('span', class_='result-item').a.text
            title = li.find('span', class_='result-title').text
            nbill = li.find('a').text.strip(string.ascii_uppercase + ' .')

            if '[R' in sponsor:
                party = 'Republican'
            elif '[D' in sponsor:
                party = 'Democratic'
            else:
                party = 'Unknown'

            # Any cosponsors?
            cosponsor_link = li.find_all('a')[2]

            if cosponsor_link.text == '0':
                cosponsors = "No cosponsors"
            else:
                print(f'Getting cosponsors for {sponsor}')
                # Get the subpage containing the cosponsors
                hr_cosponsors = requests.get(cosponsor_link['href'], headers=headers)
                soup_cosponsors = BeautifulSoup(hr_cosponsors.content, 'lxml')
                table = soup_cosponsors.find('table', class_="item_table")

                # Create a list of the cosponsors
                cosponsor_list = []

                for tr in table.tbody.find_all('tr'):
                    cosponsor_list.append(tr.td.a.text)

                # Join them together into a single string
                cosponsors = ' - '.join(cosponsor_list)

            fwriter.writerow([sponsor, party, nbill, f'{bill_or_law} - {title}', cosponsors])

This gives you the start of an output CSV file:

SPONS;PARTY;NBILL;TITLE;COSPONSORS
Rep. Ellison, Keith [D-MN-5];Democratic;7401;BILL - Strengthening Refugee Resettlement Act;No cosponsors
Rep. Wild, Susan [D-PA-15];Democratic;7400;BILL - Making continuing appropriations for the Coast Guard.;No cosponsors
Rep. Scanlon, Mary Gay [D-PA-7];Democratic;7399;BILL - Inaugural Fund Integrity Act;No cosponsors
Rep. Foster, Bill [D-IL-11];Democratic;7398;BILL - SPA Act;No cosponsors
Rep. Hoyer, Steny H. [D-MD-5];Democratic;7397;BILL - To provide further additional continuing appropriations for fiscal year 2019, and for other purposes.;No cosponsors
Rep. Torres, Norma J. [D-CA-35];Democratic;7396;BILL - Border Security and Child Safety Act;Rep. Vargas, Juan [D-CA-51]* - Rep. McGovern, James P. [D-MA-2]*
Rep. Meadows, Mark [R-NC-11];Republican;7395;BILL - To direct the Secretary of Health and Human Services to allow delivery of medical supplies by unmanned aerial systems, and for other purposes.;No cosponsors
Rep. Luetkemeyer, Blaine [R-MO-3];Republican;7394;"BILL - To prohibit the Federal financial regulators from requiring compliance with the accounting standards update of the Financial Accounting Standards Board related to current expected credit loss (""CECL""), to require the Securities and Exchange Commission to take certain impacts of a proposed accounting principle into consideration before accepting the principle, and for other purposes.";Rep. Budd, Ted [R-NC-13]*
Rep. Faso, John J. [R-NY-19];Republican;7393;BILL - Medicaid Quality Care Act;No cosponsors
Rep. Babin, Brian [R-TX-36];Republican;7392;BILL - TRACED Act;No cosponsors
Rep. Arrington, Jodey C. [R-TX-19];Republican;7391;BILL - Rural Hospital Freedom and Flexibility Act of 2018;No cosponsors
Rep. Jackson Lee, Sheila [D-TX-18];Democratic;7390;BILL - Violence Against Women Extension Act of 2018;Rep. Hoyer, Steny H. [D-MD-5] - Rep. Clyburn, James E. [D-SC-6]

When using csv.writer(), you should always open the file with the newline='' parameter. This avoids getting double-spaced rows in the CSV file.

I would recommend searching the text for [D and [R, since a D or R could already be present elsewhere in the text.
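If you want to be stricter still, here is a small sketch (the helper party_of is mine, not from the original answer) that pulls the party letter out of the bracketed tag shown in the output above:

import re

# Sketch: extract the party letter from the bracketed tag,
# e.g. 'Rep. Ellison, Keith [D-MN-5]' -> 'Democratic'
def party_of(sponsor: str) -> str:
    m = re.search(r'\[([DRI])-', sponsor)
    if not m:
        return 'Unknown'
    return {'D': 'Democratic', 'R': 'Republican', 'I': 'Independent'}[m.group(1)]

print(party_of('Rep. Ellison, Keith [D-MN-5]'))  # Democratic
print(party_of('Rep. Meadows, Mark [R-NC-11]'))  # Republican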