Web scraping a table with Python

Time: 2017-12-29 03:14:32

Tags: python beautifulsoup

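I have been trying to extract a table from Wikipedia's list of Nobel laureates. The table has some None values (empty cells), and I don't know how to handle them. How can I include the table's None values when looping over the cells? The link to the Wikipedia page is: https://en.wikipedia.org/wiki/List_of_Nobel_laureates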

import requests
from bs4 import BeautifulSoup
import pandas as pd
r=requests.get('https://en.wikipedia.org/wiki/List_of_Nobel_laureates')
soup=BeautifulSoup(r.text, 'html.parser')
table = soup.find('table', class_='wikitable')

rows = table.find_all('tr')
del rows[0]


for row in rows:
    cells=row.find_all('td')
    records=[]
    print(cells)


    year = cells[0].text
    print("contents",cells[1].contents[1].text)
    physics_winner = cells[1].contents[1].text
    physics_url = cells[1].find('a')['href']  

2 Answers:

Answer 0: (score: 0)

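As long as you know which exception a None value in the table will raise, you can use try and except. I'm not sure what output you want, but I do see that you are importing pandas, so take a look at the replace function for a better alternative to the try/excepts.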

    import requests
    from bs4 import BeautifulSoup
    import pandas as pd

    r = requests.get('https://en.wikipedia.org/wiki/List_of_Nobel_laureates')
    soup = BeautifulSoup(r.text, 'html.parser')
    table = soup.find('table', class_='wikitable')

    rows = table.find_all('tr')
    del rows[0]


    for row in rows[:20]:
        cells = row.find_all('td')
        records = []

        try:
            year = cells[0].text
        except IndexError:
            year = 'n/a'

        try: 
            print("contents", cells[1].contents[1].text)
        except IndexError:
            print("contents", 'n/a')

        try: 
            physics_winner = cells[1].contents[1].text
        except IndexError:
            physics_winner = 'n/a'

        try:
            # cells[1] may be missing (IndexError) or contain no link (TypeError)
            physics_url = cells[1].find('a')['href']
        except (IndexError, TypeError):
            physics_url = 'n/a'
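
The pandas route mentioned in this answer might look roughly like the sketch below. It assumes the scraped values have already been collected into a list of dicts; the `records` content here is made-up placeholder data, not real output from the Wikipedia page. It uses `DataFrame.replace` for empty strings and `DataFrame.fillna` for None values.

```python
import pandas as pd

# Hypothetical rows standing in for scraped results: an empty string marks
# a cell with no text, and None marks a cell with no link.
records = [
    {"year": "1901", "physics_winner": "Example Laureate", "physics_url": "/wiki/Example_Laureate"},
    {"year": "1916", "physics_winner": "", "physics_url": None},
]

df = pd.DataFrame(records)
df = df.replace("", "n/a")  # empty strings become the placeholder
df = df.fillna("n/a")       # None / NaN become the placeholder
print(df)
```

Doing the cleanup once on the DataFrame keeps the scraping loop free of per-field try/except blocks, at the cost of having to store something (an empty string or None) for every missing cell.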

Answer 1: (score: 0)

To iterate over the cells:

    import requests
    from bs4 import BeautifulSoup
    import pandas as pd

    r = requests.get('https://en.wikipedia.org/wiki/List_of_Nobel_laureates')
    soup = BeautifulSoup(r.text, 'html.parser')
    table = soup.find('table', class_='wikitable')

    rows = table.find_all('tr')
    del rows[0]


    for row in rows[:15]:
        cells = row.find_all('td')
        for cell in cells:
            print(cell.text)
            # print(cells)
            # records = []


            try:
                year = cell.text
            except IndexError:
                year = 'n/a'

            try: 
                print("contents", cell.contents[1].text)
            except IndexError:
                print("contents", 'n/a')

            try: 
                physics_winner = cell.contents[1].text
            except IndexError:
                physics_winner = 'n/a'

            try:
                physics_url = cell.find('a')['href'] 
            except TypeError:
                physics_url = 'n/a'
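
As an alternative to wrapping each access in try/except, the empty-cell checks can be moved into small helpers. This is only a sketch: `cell_text` and `cell_href` are hypothetical helper names, and the inline `HTML` string is a made-up fragment that imitates the table layout so the example runs without fetching the page.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment imitating two table rows, one with an empty cell.
HTML = """
<table class="wikitable">
  <tr><th>Year</th><th>Physics</th></tr>
  <tr><td>1901</td><td><a href="/wiki/Example_Laureate">Example Laureate</a></td></tr>
  <tr><td>1916</td><td></td></tr>
</table>
"""


def cell_text(cell, default="n/a"):
    """Return the stripped text of a cell, or default if the cell is empty."""
    text = cell.get_text(strip=True)
    return text if text else default


def cell_href(cell, default="n/a"):
    """Return the href of the first link in a cell, or default if there is none."""
    link = cell.find("a", href=True)
    return link["href"] if link is not None else default


soup = BeautifulSoup(HTML, "html.parser")
table = soup.find("table", class_="wikitable")

records = []
for row in table.find_all("tr")[1:]:  # skip the header row
    cells = row.find_all("td")
    if len(cells) < 2:
        continue  # not a data row (for example, a repeated header)
    records.append({
        "year": cell_text(cells[0]),
        "physics_winner": cell_text(cells[1]),
        "physics_url": cell_href(cells[1]),
    })

print(records)
```

The same loop should work on the table returned by `soup.find('table', class_='wikitable')` in the code above, though the real page's column layout may need different cell indexes.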