I am trying to scrape the data stored in the table on this Wikipedia page: https://en.wikipedia.org/wiki/Minister_of_Agriculture_(India). But I cannot capture the full data held in cells with a rowspan attribute. This is what I have written so far:
from bs4 import BeautifulSoup
from urllib.request import urlopen

wiki = urlopen("https://en.wikipedia.org/wiki/Minister_of_Agriculture_(India)")
soup = BeautifulSoup(wiki, "html.parser")
table = soup.find("table", {"class": "wikitable"})
for row in table.findAll("tr"):
    cells = row.findAll("td")
    if cells:
        name = cells[0].find(text=True)
        pic = cells[1].find("img")
        strt = cells[2].find(text=True)
        end = cells[3].find(text=True)
        pri = cells[6].find(text=True)
        z = name + '\n' + pic + '\n' + strt + '\n' + end + '\n' + pri
        print(z)
Answer (score: 0)
Here is the solution I found for this problem: convert the table with rowspan and colspan attributes into a simple table first. I wasted many days on this and could not find a good solution. In most Stack Overflow answers, developers only scrape the text, but in my case I also needed the URL links, so I wrote the code below. It works for me.
# this code was written for BeautifulSoup on Python 3.5
# it fetches one wikitable, with its links, from Wikipedia as HTML
from bs4 import BeautifulSoup
import requests
import codecs
import os

url = "https://en.wikipedia.org/wiki/Ministry_of_Agriculture_%26_Farmers_Welfare"
fullTable = '<table class="wikitable">'

rPage = requests.get(url)
soup = BeautifulSoup(rPage.content, "lxml")
table = soup.find("table", {"class": "wikitable"})
rows = table.findAll("tr")
row_lengths = [len(r.findAll(['th', 'td'])) for r in rows]
ncols = max(row_lengths)
nrows = len(rows)

# convert each row into a plain list of its th/td cells
for i in range(len(rows)):
    rows[i] = rows[i].findAll(['th', 'td'])

# header row: duplicate any cell that carries a colspan attribute
for i in range(len(rows[0])):
    col = rows[0][i]
    if col.get('colspan'):
        cSpanLen = int(col.get('colspan'))
        del col['colspan']
        for k in range(1, cSpanLen):
            rows[0].insert(i, col)

# full table: copy any cell with a rowspan attribute down into the rows it covers
for i in range(len(rows)):
    row = rows[i]
    for j in range(len(row)):
        col = row[j]
        del col['style']
        if col.get('rowspan'):
            rSpanLen = int(col.get('rowspan'))
            del col['rowspan']
            for k in range(1, rSpanLen):
                rows[i + k].insert(j, col)

# rebuild the table as plain HTML
for row in rows:
    fullTable += '<tr>'
    for col in row:
        fullTable += str(col)
    fullTable += '</tr>'
fullTable += '</table>'

# make the wiki links absolute
fullTable = fullTable.replace('/wiki/', 'https://en.wikipedia.org/wiki/')
fullTable = fullTable.replace('\n', '')
fullTable = fullTable.replace('<br/>', '')

# save the file, named after the last segment of the url
page = os.path.split(url)[1]
fname = 'output_{}.html'.format(page)
with codecs.open(fname, 'w', 'utf-8') as singleTable:
    singleTable.write(fullTable)

# now we can start scraping: the rowspan/colspan table has become a simple table
soupTable = BeautifulSoup(fullTable, "lxml")
urlLinks = soupTable.findAll('a')
print(urlLinks)
# and so on .............
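The rowspan expansion at the heart of this answer can be checked without hitting the network. A minimal sketch, using a small hand-written table (the table contents here are made up for illustration):

```python
# Self-contained check of the rowspan-expansion idea:
# copy each rowspan cell down into every row it covers,
# so all rows end up with the same number of cells.
from bs4 import BeautifulSoup

html = """<table class="wikitable">
<tr><th>Name</th><th>Term</th></tr>
<tr><td rowspan="2">Alice</td><td>2001</td></tr>
<tr><td>2002</td></tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")
rows = [tr.find_all(["th", "td"]) for tr in soup.find_all("tr")]

for i, row in enumerate(rows):
    for j, col in enumerate(row):
        if col.get("rowspan"):
            span = int(col["rowspan"])
            del col["rowspan"]
            for k in range(1, span):
                # insert the same cell into the later rows it spans
                rows[i + k].insert(j, col)

# every row now has two cells, and "Alice" appears in both data rows
print([[c.get_text() for c in row] for row in rows])
# → [['Name', 'Term'], ['Alice', '2001'], ['Alice', '2002']]
```

The same object is inserted into the later rows rather than a copy, which is fine here because the cells are only read back out afterwards.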