Question

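I'm trying to use Selenium and BeautifulSoup to extract a table from a web page (https://en.wikipedia.org/wiki/2018%E2%80%9319_Premier_League).

But I'm stuck on parsing the table. I only want one table from the page, the "League table", but whatever I try, I get an error message.

Here is the code I tried.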

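import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By

# Start a browser session (assumes a Chrome driver is available on this machine)
driver = webdriver.Chrome()
driver.get("https://google.com")
# Type the query into the Google search box and submit it
elem = driver.find_element(By.XPATH, '//*[@id="tsf"]/div[2]/div[1]/div[1]/div/div[2]/input')
elem.send_keys("2018 epl")
elem.submit()
print(driver.title)
# Follow the first result whose link text contains "Wikipedia"
driver.find_element(By.PARTIAL_LINK_TEXT, "Wikipedia").click()
# Download the page the browser ended up on and parse it
website = requests.get(driver.current_url).text

soup = BeautifulSoup(website, 'html.parser')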

Then I ran into trouble. I've tried several snippets; this is one of them.

rows=soup.find_all('td')

Could you help me complete my code? Thank you very much.

Answer 1

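You can simply use pandas read_html and pick the table by the appropriate index. However, I'll first show the :has selector (bs4 4.7.1+), which makes sure you select the h2 that contains the element with id League_table, followed by the adjacent sibling combinator (+) to get the table right after it.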

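from io import StringIO

from bs4 import BeautifulSoup as bs
import requests
import pandas as pd

r = requests.get('https://en.wikipedia.org/wiki/2018%E2%80%9319_Premier_League')
soup = bs(r.content, 'lxml')
# select_one returns None if nothing on the page matches the selector
league = soup.select_one('h2:has(#League_table) + table')
# read_html returns a list of DataFrames, one per <table> it finds
table = pd.read_html(StringIO(str(league)))
print(table)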
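
To see what that selector does without touching the network, here is a minimal sketch on a made-up HTML fragment. The fragment, the team names and the numbers are placeholders I invented to mimic the structure the selector expects (an h2 wrapping an element with id League_table, immediately followed by a table); the live page's markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating the structure the selector relies on.
html = """
<h2><span id="Season_summary">Season summary</span></h2>
<table><tr><th>Note</th></tr><tr><td>not this one</td></tr></table>
<h2><span id="League_table">League table</span></h2>
<table>
  <tr><th>Pos</th><th>Team</th><th>Pts</th></tr>
  <tr><td>1</td><td>Team A</td><td>98</td></tr>
  <tr><td>2</td><td>Team B</td><td>97</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# h2:has(#League_table) -> the h2 that contains the element with that id
# + table               -> the table that is its very next sibling element
table = soup.select_one("h2:has(#League_table) + table")

# Flatten the table into a list of rows, each a list of cell strings.
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    for tr in table.find_all("tr")
]
print(rows)
```

Because the selector is tied to the heading rather than to a position, it should keep working if other tables are added earlier on the page, as long as the heading markup stays the same.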

Using just read_html:

import pandas as pd

tables = pd.read_html('https://en.wikipedia.org/wiki/2018%E2%80%9319_Premier_League')
print(tables[4])
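
A positional index like tables[4] depends on how many tables come before the one you want. As a possible alternative, read_html accepts a match argument that keeps only the tables containing a given piece of text. This is a sketch on made-up HTML (the data is a placeholder, and I am assuming "Pts" appears only in the table you want); read_html also needs a parser backend such as lxml installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical page with one unrelated table and one standings-like table.
html = """
<table><tr><th>Note</th></tr><tr><td>unrelated</td></tr></table>
<table>
  <tr><th>Pos</th><th>Team</th><th>Pts</th></tr>
  <tr><td>1</td><td>Team A</td><td>98</td></tr>
  <tr><td>2</td><td>Team B</td><td>97</td></tr>
</table>
"""

# match= keeps only the tables whose text matches the string (or regex).
tables = pd.read_html(StringIO(html), match="Pts")
league = tables[0]
print(league)
```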

Answer 2

Maybe this will help you get started:

import requests
from bs4 import BeautifulSoup

respond = requests.get('https://en.wikipedia.org/wiki/2018%E2%80%9319_Premier_League')
soup = BeautifulSoup(respond.text, 'lxml')
table = soup.find_all('table', {'class': 'wikitable'})
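
Note that find_all returns a list of every matching table, not a single one, so you still have to pick an element and walk its rows. A rough sketch of that next step on made-up HTML; table_to_rows is a helper name I made up for illustration, not part of bs4.

```python
from bs4 import BeautifulSoup

# Hypothetical page with two tables that both carry the "wikitable" class.
html = """
<table class="wikitable"><tr><th>Note</th></tr><tr><td>first</td></tr></table>
<table class="wikitable sortable">
  <tr><th>Pos</th><th>Team</th></tr>
  <tr><td>1</td><td>Team A</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
# Matches any table whose class list includes "wikitable".
tables = soup.find_all("table", {"class": "wikitable"})


def table_to_rows(table):
    """Return the table as a list of rows, each a list of cell strings."""
    return [
        [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        for tr in table.find_all("tr")
    ]


second = table_to_rows(tables[1])
print(len(tables), second)
```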

Answer 3

I got the table by adding this line below your code.

soup.body.find_all("table", class_="wikitable")[3]

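I found the table by trial and error: I first looked at the table's class, then used find_all, then listed the individual items and checked the output.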
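
The trial-and-error step can be made less tedious by printing each table's index next to its header cells. This is only a sketch on made-up HTML; the header names "Pos" and "Pts" are my assumption about what identifies the table you are after.

```python
from bs4 import BeautifulSoup

# Hypothetical page with three "wikitable" tables.
html = """
<table class="wikitable"><tr><th>Team</th><th>Stadium</th></tr></table>
<table class="wikitable"><tr><th>Team</th><th>Manager</th></tr></table>
<table class="wikitable"><tr><th>Pos</th><th>Team</th><th>Pts</th></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

# Collect (index, header texts) for every candidate table and print them.
summary = []
for index, table in enumerate(soup.body.find_all("table", class_="wikitable")
                              if soup.body else
                              soup.find_all("table", class_="wikitable")):
    headers = [th.get_text(strip=True) for th in table.find_all("th")]
    summary.append((index, headers))
    print(index, headers)

# Pick the first table whose headers include both "Pos" and "Pts".
league_index = next(
    i for i, headers in summary if {"Pos", "Pts"} <= set(headers)
)
print(league_index)
```

Once you know the index, soup.find_all("table", class_="wikitable")[league_index] gives you the table itself.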

I'm having trouble parsing a table in a web page
