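I have been trying to extract a table from Wikipedia's list of Nobel laureates. The table contains some None values, and I don't know how to handle them. How can I handle the table cells that have no value while looping over the cells? The link to the Wikipedia page is: https://en.wikipedia.org/wiki/List_of_Nobel_laureates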
import requests
from bs4 import BeautifulSoup
import pandas as pd
r=requests.get('https://en.wikipedia.org/wiki/List_of_Nobel_laureates')
soup=BeautifulSoup(r.text, 'html.parser')
table = soup.find('table', class_='wikitable')
rows = table.find_all('tr')
del rows[0]
for row in rows:
    cells = row.find_all('td')
    records = []
    print(cells)
    year = cells[0].text
    print("contents", cells[1].contents[1].text)
    physics_winner = cells[1].contents[1].text
    physics_url = cells[1].find('a')['href']
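Answer 0 (score: 0)
As long as you know which exception will be raised for the None entries in the table, you can use try and except. I'm not sure what output you want, but I do see that you are importing pandas, so take a look at its replace function for a better alternative to the try/excepts.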
import requests
from bs4 import BeautifulSoup
import pandas as pd
r = requests.get('https://en.wikipedia.org/wiki/List_of_Nobel_laureates')
soup = BeautifulSoup(r.text, 'html.parser')
table = soup.find('table', class_='wikitable')
rows = table.find_all('tr')
del rows[0]
for row in rows[:20]:
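    cells = row.find_all('td')
    records = []
    # A row with no <td> cells (e.g. a header row) makes cells[0] raise IndexError.
    try:
        year = cells[0].text
    except IndexError:
        year = 'n/a'
    try:
        print("contents", cells[1].contents[1].text)
    except IndexError:
        print("contents", 'n/a')
    try:
        physics_winner = cells[1].contents[1].text
    except IndexError:
        physics_winner = 'n/a'
    # find('a') returns None when the cell has no link, so ['href'] raises TypeError;
    # a missing cells[1] raises IndexError instead, so catch both.
    try:
        physics_url = cells[1].find('a')['href']
    except (IndexError, TypeError):
        physics_url = 'n/a'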
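The pandas replace suggestion above could look roughly like the sketch below. It assumes the scraped values have already been collected into a list of dicts; the `records` list and its column names are hypothetical stand-ins, not output from the real page. `DataFrame.replace` then swaps the placeholder strings for a proper missing-value marker in one step.

```python
import pandas as pd

# Hypothetical records, shaped like what the loop above might collect.
records = [
    {"year": "1901", "physics_winner": "Wilhelm Röntgen",
     "physics_url": "/wiki/Wilhelm_R%C3%B6ntgen"},
    {"year": "1916", "physics_winner": "None", "physics_url": "n/a"},
]

df = pd.DataFrame(records)

# Replace both placeholder strings with pandas' missing-value marker.
cleaned = df.replace(["None", "n/a"], pd.NA)

# Missing entries can now be detected uniformly.
print(cleaned.isna())
```

Once the placeholders are real missing values, methods such as `isna`, `fillna`, or `dropna` can be used on the frame instead of comparing against strings.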
Answer 1 (score: 0)
To iterate over the cells:
import requests
from bs4 import BeautifulSoup
import pandas as pd
r = requests.get('https://en.wikipedia.org/wiki/List_of_Nobel_laureates')
soup = BeautifulSoup(r.text, 'html.parser')
table = soup.find('table', class_='wikitable')
rows = table.find_all('tr')
del rows[0]
for row in rows[:15]:
    cells = row.find_all('td')
    for cell in cells:
        print(cell.text)
        # print(cells)
        # records = []
        try:
            year = cell.text
        except IndexError:
            year = 'n/a'
        try:
            print("contents", cell.contents[1].text)
        except IndexError:
            print("contents", 'n/a')
        try:
            physics_winner = cell.contents[1].text
        except IndexError:
            physics_winner = 'n/a'
        try:
            physics_url = cell.find('a')['href']
        except TypeError:
            physics_url = 'n/a'
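As an alternative to wrapping every lookup in try/except, the missing pieces can be checked for explicitly. This is only a sketch: `SAMPLE_HTML` is a made-up, simplified stand-in for the Wikipedia markup (the real table has more columns and nesting), and `cell_text_and_link` and `parse_rows` are hypothetical helper names. It skips rows without `<td>` cells and falls back to a default when a cell has no link.

```python
from bs4 import BeautifulSoup

# Made-up, simplified stand-in for the Wikipedia table (assumption, not real markup).
SAMPLE_HTML = """
<table class="wikitable">
<tr><th>Year</th><th>Physics</th></tr>
<tr><td>1901</td><td><a href="/wiki/Wilhelm_R%C3%B6ntgen">Wilhelm Röntgen</a></td></tr>
<tr><td>1916</td><td>None</td></tr>
<tr><th>Year</th><th>Physics</th></tr>
</table>
"""


def cell_text_and_link(cells, index, default="n/a"):
    """Return (text, href) for cells[index], using default for anything missing."""
    if index >= len(cells):
        return default, default
    cell = cells[index]
    link = cell.find("a")  # None when the cell contains no link
    if link is None:
        return cell.get_text(strip=True) or default, default
    return link.get_text(strip=True), link.get("href", default)


def parse_rows(html):
    """Collect one record per data row of the first wikitable in html."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="wikitable")
    records = []
    for row in table.find_all("tr"):
        cells = row.find_all("td")
        if not cells:  # header rows contain only <th> cells
            continue
        year = cells[0].get_text(strip=True)
        winner, url = cell_text_and_link(cells, 1)
        records.append(
            {"year": year, "physics_winner": winner, "physics_url": url}
        )
    return records


records = parse_rows(SAMPLE_HTML)
for record in records:
    print(record)
```

Keeping `records` outside the loop means every row is accumulated, so the list could be passed to `pd.DataFrame(records)` afterwards.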