
Question

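I am trying to scrape matches and their respective odds from local bookie sites, but every site I point my web scraper at returns nothing. The script just prints "Process finished with exit code 0" and produces no output. Can someone help me crack open the containers and get their contents out?

I have tried all of the sites above for almost a month without success. The problem seems to lie in the exact layout of the div, class, or span elements.

For example, here is my attempt at link 2, shown in the code: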

import requests
from bs4 import BeautifulSoup

url = "https://www.betpawa.ug/"

response = requests.get(url, timeout=5)
content = BeautifulSoup(response.content, "html.parser")

for match in content.findAll("div", attrs={"class": "events-container prematch", "id": "Bp-Event-591531"}):
    print(match.text.strip())

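I expected the program to return a list of the matches, the odds, and all the other components of the container. Instead, the program runs and only displays "Process finished with exit code 0".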

Answer 1

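It looks like the site loads in two stages:

- load some HTML structure for the page,
- fill in the content using JavaScript.

You can verify this by right-clicking the page, choosing "View page source", and searching for "events-container" (it is not there).

So you will need something more powerful than requests + bs4. I have heard of people using Selenium for this, but I am not familiar with it.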
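
The "View page source" check can also be scripted. What follows is only a minimal sketch, not a definitive tool: the helper names `fetch_raw_html` and `is_in_static_html` are made up for illustration, and the two sample page sources are invented stand-ins rather than the real site's HTML. If the marker is missing from the raw source but visible in the browser, the content is most likely being filled in by JavaScript.

```python
from urllib.request import Request, urlopen


def fetch_raw_html(url, timeout=5):
    """Download the HTML exactly as the server sends it (no JavaScript is run)."""
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(req, timeout=timeout) as res:
        charset = res.headers.get_content_charset() or "utf-8"
        return res.read().decode(charset, errors="replace")


def is_in_static_html(html, marker):
    """Return True if the marker text appears in the raw page source."""
    return marker in html


# Offline demonstration with two made-up page sources (not the real site).
static_page = '<div class="events-container"><h3>Team A - Team B</h3></div>'
js_shell = '<div id="app"></div><script src="bundle.js"></script>'

print(is_in_static_html(static_page, "events-container"))  # True
print(is_in_static_html(js_shell, "events-container"))     # False
```

To run the check against a live page, pass the result of `fetch_raw_html(url)` as the first argument.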

Answer 2

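You should consider using urllib.request from the standard library instead of requests.

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

url = "https://www.betpawa.ug/"

- Build your request:
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})

- Retrieve the document:
res = urlopen(req)

- Parse it with bs4:
html = BeautifulSoup(res, 'html.parser')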
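
As a small sketch of what the request-building step does, the prepared request can be inspected before anything is sent. The URL is the one from the question. Note that urllib, like requests, only downloads the server's HTML and does not execute JavaScript, so this changes the headers the server sees rather than how the page is rendered.

```python
from urllib.request import Request

url = "https://www.betpawa.ug/"
req = Request(url, headers={"User-Agent": "Mozilla/5.0"})

# Nothing has been sent yet; these calls only inspect the prepared request.
print(req.full_url)                  # the target URL
print(req.get_method())              # GET, because no request body was given
print(req.get_header("User-agent"))  # urllib stores header names in this capitalization
```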

Answer 3

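As Chris Curvey described, the problem is that requests cannot execute the page's JavaScript. If you print the content variable, you can see that the page shows a message like: "JavaScript Required! To provide you with the best product, our website requires JavaScript to function..." With Selenium you control a full browser through a WebDriver (for example, the ChromeDriver binary for Google Chrome):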

from bs4 import BeautifulSoup
from selenium import webdriver

chrome_options = webdriver.ChromeOptions()
# chrome_options.add_argument('headless')
driver = webdriver.Chrome(options=chrome_options)

url = "https://www.betpawa.ug/"
driver.get(url)
page = driver.page_source
content = BeautifulSoup(page, 'html.parser')
for match in content.findAll("div", attrs={"class": "events-container"}):
    print(match.text.strip())

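Update:

In line 13, the command print(match.text.strip()) simply extracts the text of each match div that has the class attribute "events-container". If you want to extract something more specific, you can access each match through the match variable. You need to know:

- which of the available information you want,
- how to identify that information within the structure of the match div,
- which data type you need the information in.

To make this easier, open Chrome's developer tools with F12. In the top left you will see the "Select an element..." icon. Click it, then click the desired element in the browser, and the corresponding source appears in the panel below the icon. Analyze it carefully to get the information you need, for example:

- the title of the football match is the first h3 tag in the match div and is a string,
- the displayed odds are span tags with the class event-odds and contain a number (float/double).

Search Google or the reference of the package you are using (BeautifulSoup4) for the functions you need. Let's try it quick and dirty by using the BeautifulSoup functions on the match variable, so that we don't get elements from the whole site (place the snippet inside the for loop, replacing the leading whitespace with tabs or spaces to match your indentation):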

# (1) try to find the h3 tag
title_tags = match.findAll("h3")  # search only inside the current match
if len(title_tags) > 0:  # at least one found?
    title = title_tags[0].getText()  # get the text of the first one
    print("Title: ", title)  # show it
else:
    print("no h3 tags found")
    exit()
# (2) try to get the odds as numbers, in the order in which they are displayed
odds_tags = match.findAll("span", attrs={"class": "event-odds"})
if len(odds_tags) > 2:  # at least three found?
    odds = []  # create a list
    for tag in odds_tags:  # loop over the odds tags we found
        odd = tag.getText()  # get the text
        print("Odd: ", odd)
        # Good, but it is a string; you can't compare it with a number in
        # Python and expect a good result.
        # You have to clean it and convert it:
        clean_odd = odd.strip()  # remove surrounding whitespace
        odd = float(clean_odd)  # convert it to float
        odds.append(odd)  # keep the number in the list
        print("Odd as Number:", odd)
else:
    print("something went wrong with the odds")
    exit()
input("Press enter to try it on the next match!")
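
The clean-and-convert step can be wrapped in a small helper so that one malformed value does not stop the whole run. This is only a sketch under assumptions: `parse_odds` is a hypothetical name, the sample strings are invented, and treating a comma as a decimal separator is a guess about how some sites format odds, not something confirmed for this page.

```python
def parse_odds(texts):
    """Convert scraped odds strings to floats, skipping anything non-numeric."""
    odds = []
    for text in texts:
        # Assumption: a comma, if present, is a decimal separator.
        cleaned = text.strip().replace(",", ".")
        try:
            odds.append(float(cleaned))
        except ValueError:
            # Placeholders such as "-" or empty strings are ignored.
            continue
    return odds


# Invented sample values standing in for tag.getText() results.
sample = [" 1.85 ", "3.40\n", "-", "4,20"]
print(parse_odds(sample))  # [1.85, 3.4, 4.2]
```

Inside the loop above, it could be called as parse_odds(tag.getText() for tag in odds_tags).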

Python Beautiful Soup web scraper does not return tag contents

3 answers:
