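I am trying to use BeautifulSoup's find_all() to search for elements with the tag "div" and the class "wisbb_name". The HTML I'm scraping comes from http://www.foxsports.com/mlb/scores. My end goal is to get the names of all the pitchers starting that day according to this site. The HTML for a pitcher's name looks like this:
<div class="wisbb_name">M. Fiers</div>
All of the pitcher elements have the same class, just with different text associated with them. I have been using the lines of code below to grab all of the results of find_all() and get the text associated with each one.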
for el in soup.find_all():
    print(el.get_text())
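This works fine. The problem is that find_all() doesn't find the elements I want it to find, no matter how much I change the parameters. According to the BeautifulSoup documentation, the lines of code below should find the elements with the class "wisbb_name" and the tag "div".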
variable = soup.find_all("div", class_="wisbb_name")
print(variable)
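After printing the variable, I just get an empty list. I'm not sure whether I'm approaching this the wrong way in Python or whether I need to learn more about how HTML works. I have the latest version of BeautifulSoup, and I'm using Python 3.6.2. My full code at the moment is below.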
import requests
from bs4 import BeautifulSoup
page = requests.get("http://www.foxsports.com/mlb/scores")  # the URL given at the top of the question
soup = BeautifulSoup(page.content, "lxml")
for el in soup.find_all("div", class_="wisbb_name"):
    print(el.get_text())
Answer 0 (score: 1)
The text is rendered with JavaScript, so it is not in the HTML that requests downloads. First render the page with dryscrape:
import bs4 as bs
import dryscrape
url = ("http://www.foxsports.com/mlb/scores")
session = dryscrape.Session()
session.visit(url)
dsire_get = session.body()
soup = bs.BeautifulSoup(dsire_get,'lxml')
for el in soup.find_all("div", class_="wisbb_name"):
    print(el.get_text())
Output:
A. Sanchez
E. Santana
J. Shields
I. Kennedy
T. Williams
J. Hoffman
M. Scherzer
Z. Godley
C. Sale
R. Nolasco
C. Sabathia
A. Moore
J. García
A. Wood
T. Cahill
J. Samardzija
Or use Selenium. First install it:
sudo pip3 install selenium
Then get a driver: https://sites.google.com/a/chromium.org/chromedriver/downloads
import bs4 as bs
from selenium import webdriver
browser = webdriver.Chrome()
url = ("http://www.foxsports.com/mlb/scores")
browser.get(url)
html_source = browser.page_source
browser.quit()
soup = bs.BeautifulSoup(html_source, "lxml")
for el in soup.find_all("div", class_="wisbb_name"):
    print(el.get_text())
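One caveat with the Selenium version: page_source is read as soon as get() returns, and if the site fills in the names with a later request they may not be in the HTML yet. Whether that happens on this page is an assumption on my part. A minimal sketch of polling until the class shows up is below; wait_for_marker and FakeBrowser are hypothetical names used only for illustration, and FakeBrowser stands in for a real webdriver so the snippet runs without a browser.

```python
import time


def wait_for_marker(get_html, marker, timeout=10.0, interval=0.5):
    """Call get_html() repeatedly until marker appears or timeout elapses.

    Returns the HTML that contains the marker, or None on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        html = get_html()
        if marker in html:
            return html
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)


class FakeBrowser:
    """Stand-in for a webdriver whose page gains the names on the third read."""

    def __init__(self):
        self.reads = 0

    @property
    def page_source(self):
        self.reads += 1
        if self.reads < 3:
            return '<div id="scores"></div>'
        return '<div class="wisbb_name">M. Fiers</div>'


fake = FakeBrowser()
html = wait_for_marker(lambda: fake.page_source, "wisbb_name",
                       timeout=2.0, interval=0.01)
print(html is not None, fake.reads)
```

With a real browser you would pass lambda: browser.page_source and call it before browser.quit(). Selenium also ships its own explicit waits (WebDriverWait), which are probably the better choice in real code; the loop above only shows the idea.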
Or PyQt5:
from PyQt5.QtGui import *
from PyQt5.QtCore import *
from PyQt5.QtWebKit import *
from PyQt5.QtWebKitWidgets import QWebPage
from PyQt5.QtWidgets import QApplication
import bs4 as bs
import sys
class Render(QWebPage):
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()  # block until _loadFinished quits the event loop

    def _loadFinished(self, result):
        self.frame = self.mainFrame()
        self.app.quit()
url = "http://www.foxsports.com/mlb/scores"
r = Render(url)
result = r.frame.toHtml()
soup = bs.BeautifulSoup(result,'lxml')
for el in soup.find_all("div", class_="wisbb_name"):
    print(el.get_text())
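Before reaching for any of the renderers above, it can help to confirm that the names really are missing from the HTML the server sends. The sketch below does that check with the standard library's HTMLParser instead of BeautifulSoup, purely so it runs without extra installs. ClassTextCollector and texts_with_class are hypothetical helper names, and the two HTML strings are made-up samples, not the real page.

```python
from html.parser import HTMLParser


class ClassTextCollector(HTMLParser):
    """Collect the text of every <div> whose class list contains target_class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0   # > 0 while inside a matching <div>
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth > 0:
            self.depth += 1          # nested <div> inside a match
        elif self.target_class in classes:
            self.depth = 1
            self.texts.append("")    # start collecting text for this match

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.texts[-1] += data


def texts_with_class(html, target_class):
    parser = ClassTextCollector(target_class)
    parser.feed(html)
    parser.close()
    return [text.strip() for text in parser.texts]


# Made-up sample: what a server might send before any script has run.
static_html = '<div id="scores"></div><script>/* fills #scores later */</script>'
# Made-up sample: the same page after a browser has run its scripts.
rendered_html = '<div id="scores"><div class="wisbb_name">M. Fiers</div></div>'

print(texts_with_class(static_html, "wisbb_name"))
print(texts_with_class(rendered_html, "wisbb_name"))
```

On the real site you would pass page.text from the requests call in the question. An empty list there, together with names visible in the browser's element inspector, points to JavaScript rendering rather than a mistake in the find_all() arguments.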