Web scraping: looping over result pages and table rows

Date: 2016-12-21 05:27:42

Tags: python python-3.x csv web-scraping beautifulsoup

I appreciate all the existing questions and answers about Python / BeautifulSoup / scraping, but I haven't seen much that covers this situation and I'm stuck. Currently, my code successfully loops through the pages of search results and creates a CSV file, but for each individual table it only copies the first row before moving on to the next results page.

For example, take this page. Currently, my output looks like this:

Brian Benoit,25-Jun-16,Conservative,12-May-16,25-Jun-16,Medicine Hat--Cardston--Warner,b'Medicine Hat--Cardston--Warner',Nikolai Punko

It should instead look like this:

Brian Benoit,25-Jun-16,Conservative,12-May-16,25-Jun-16,Medicine Hat--Cardston--Warner,b'Medicine Hat--Cardston--Warner',Nikolai Punko
Paul Hinman,25-Jun-16,Conservative,12-May-16,25-Jun-16,Medicine Hat--Cardston--Warner,b'Welling, Alberta',Robert B. Barfuss
Michael Jones,25-Jun-16,Conservative,12-May-16,25-Jun-16,Medicine Hat--Cardston--Warner,b'Raymond, Alberta',Dawn M. Hamon 

(and so on, for every row in the table.)

My question is: how do I get it to loop through and scrape every row of the table before moving on to the next results page? Thanks.

Here is my code:

from bs4 import BeautifulSoup
import requests
import re
import csv


url = "http://www.elections.ca/WPAPPS/WPR/EN/NC?province=-1&distyear=2013&district=-1&party=-1&pageno={}&totalpages=55&totalcount=1368&secondaryaction=prev25"

with open('scrapeAllRows.csv', 'w', newline='') as f_output:
    csv_output = csv.writer(f_output)

    for i in range(1, 56):
        print(i)
        r  = requests.get(url.format(i))
        data = r.text
        soup = BeautifulSoup(data, "html.parser")
        links = []

        for link in soup.find_all('a', href=re.compile('selectedid=')):
            links.append("http://www.elections.ca" + link.get('href'))

        for link in links:
            r  = requests.get(link)
            data = r.text
            cat = BeautifulSoup(data, "html.parser")
            header = cat.find_all('span')
            tables = cat.find_all("table")[0].find_all("td")        

            row = [
                #"name": 
                re.sub("[\n\r/]", "", cat.find_all("table")[0].find_all("td", headers="name/1")[0].contents[0]).strip(),
                #"date": 
                header[2].contents[0],
                #"party": 
                re.sub("[\n\r/]", "", cat.find("legend").contents[2]).strip(),
                #"start_date": 
                header[3].contents[0],
                #"end_date": 
                header[5].contents[0],
                #"electoral district": 
                re.sub("[\n\r/]", "", cat.find_all('div', class_="group")[2].contents[2]).strip(),
                #"registered association": 
                re.sub("[\n\r/]", "", cat.find_all('div', class_="group")[2].contents[2]).strip().encode('latin-1'),
                #"elected": 
                re.sub("[\n\r/]", "", cat.find_all("table")[0].find_all("td", headers="elected/1")[0].contents[0]).strip(),
                #"address": 
                re.sub("[\n\r/]", "", cat.find_all("table")[0].find_all("td", headers="address/1")[0].contents[0]).strip(),
                #"financial_agent": 
                re.sub("[\n\r/]", "", cat.find_all("table")[0].find_all("td", headers="fa/1")[0].contents[0]).strip()]


            csv_output.writerow(row)

2 Answers:

Answer 0 (score: 1)

I think you've almost got it; you just need to find all the tr elements inside the table and loop over them:

from bs4 import BeautifulSoup
import requests
import re
import csv


url = "http://www.elections.ca/WPAPPS/WPR/EN/NC?province=-1&distyear=2013&district=-1&party=-1&pageno={}&totalpages=55&totalcount=1368&secondaryaction=prev25"

with open('scrapeAllRows.csv', 'w', newline='') as f_output:
    csv_output = csv.writer(f_output)

    for i in range(1, 56):
        print(i)
        r  = requests.get(url.format(i))
        data = r.text
        soup = BeautifulSoup(data, "html.parser")
        links = []

        for link in soup.find_all('a', href=re.compile('selectedid=')):
            links.append("http://www.elections.ca" + link.get('href'))

        for link in links:
            r  = requests.get(link)
            data = r.text
            cat = BeautifulSoup(data, "html.parser")
            header = cat.find_all('span')
            table = cat.find("table")

            trs = table.find_all('tr')
            for tr in trs[1:]: #skip first row (table header)
                row = [
                    #"name": 
                    re.sub("[\n\r/]", "", tr.find("td", headers="name/1").contents[0]).strip(),
                    #"date": 
                    header[2].contents[0],
                    #"party": 
                    re.sub("[\n\r/]", "", cat.find("legend").contents[2]).strip(),
                    #"start_date": 
                    header[3].contents[0],
                    #"end_date": 
                    header[5].contents[0],
                    #"electoral district": 
                    re.sub("[\n\r/]", "", cat.find_all('div', class_="group")[2].contents[2]).strip(),
                    #"registered association": 
                    re.sub("[\n\r/]", "", cat.find_all('div', class_="group")[2].contents[2]).strip().encode('latin-1'),
                    #"elected": 
                    re.sub("[\n\r/]", "", tr.find("td", headers="elected/1").contents[0]).strip(),
                    #"address": 
                    re.sub("[\n\r/]", "", tr.find("td", headers="address/1").contents[0]).strip(),
                    #"financial_agent": 
                    re.sub("[\n\r/]", "", tr.find("td", headers="fa/1").contents[0]).strip()
                ]

                csv_output.writerow(row)

Note

trs = table.find_all('tr')
for tr in trs[1:]: #skip first row (table header)

I also used find instead of find_all("...")[0] because it's more readable, IMO. You may want to add some try/except blocks to make sure certain elements exist, and perhaps define a new function to handle the parsing part, but other than that it should work fine.
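For instance, a minimal sketch of that idea (the helper name parse_contestant_row and the AttributeError handling are my own additions, not something the answer tested):

def parse_contestant_row(tr):
    """Return the name, elected, address and financial-agent cells of one <tr>, or None if a cell is missing."""
    try:
        return [
            re.sub("[\n\r/]", "", tr.find("td", headers=h).contents[0]).strip()
            for h in ("name/1", "elected/1", "address/1", "fa/1")
        ]
    except AttributeError:  # tr.find(...) returned None: not a contestant row
        return None

The per-page fields (date, party, electoral district, and so on) would still come from cat and header as above, and rows where parse_contestant_row returns None would simply be skipped.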

Answer 1 (score: 0)

This partially solves your problem: everything you need is in the list.

from bs4 import BeautifulSoup
import requests

# Fetch one contest detail page and collect the non-empty lines of the fieldset text.
a = requests.get("http://www.elections.ca/WPAPPS/WPR/EN/NC/Details?province=-1&distyear=2013&district=-1&party=-1&selectedid=8561").content
soup = BeautifulSoup(a, "html.parser")
c = []
for b in [line.strip() for line in soup.find("fieldset").text.split('\n') if line]:
    if b:
        c.append(b)
print(c)

Output:

[u'June 25, 2016', u'/', u'Conservative', u'Nomination contest report submitted by the registered party', u'Nomination contest dates (start - end):', u'May 12, 2016', u'to', u'June 25, 2016', u'Electoral district:', u'Medicine Hat--Cardston--Warner', u'Registered association:', u'Contestants:', u'Name', u'Address', u'Financial Agent', u'Brian Benoit', u'Medicine Hat, Alberta', u'T1B 3C6', u'Nikolai Punko', u'Medicine Hat, Alberta', u'T1A 2V4', u'Paul Hinman', u'Welling, Alberta', u'T0K 2N0', u'Robert B. Barfuss', u'Cardston, Alberta', u'T0K 0K0', u'Michael Jones', u'Raymond, Alberta', u'T0K 2S0', u'Dawn M. Hamon', u'Raymond, Alberta', u'T0K 2S0', u'Glen Motz', u'Medicine Hat, Alberta', u'T1B 0A7', u'Milvia Bauman', u'Medicine Hat, Alberta', u'T1C 1S4', u'Gregory Ranger', u'Raymond, Alberta', u'T0K 2S0', u'Stephen G. Archibald', u'Raymond, Alberta', u'T0K 2S0', u'Joseph Schow', u'Redcliff, Alberta', u'T0J 2P2', u'Daniel Schow', u'Sherwood Park, Alberta', u'T8A 1C6', u'Indicates the contestant who won this nomination contest.', u'Top of page']
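If you take this route, you still need to rebuild the rows from the flat list yourself. A rough sketch of one possible way (my own addition, assuming the layout shown above: the contestant entries start right after the 'Financial Agent' label and come in groups of six, name, city, postal code, then the same three fields for the financial agent):

# Slice the contestant block out of the flat list `c` printed above and group it.
start = c.index('Financial Agent') + 1
end = c.index('Indicates the contestant who won this nomination contest.')
entries = c[start:end]
for i in range(0, len(entries), 6):
    name, city, postal, agent, agent_city, agent_postal = entries[i:i + 6]
    print(name, city, agent)

If a page's fieldset is laid out differently, the group size of six would need to change, so treat this only as a starting point.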