Question

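What I want is for my Discord bot to send a message saying "hey, there's something new over there" whenever something new appears on a website. For example, there is a book website that uploads new posts about books along with their descriptions; my bot would just grab the text from that post and send it to my Discord server. I hope that is clear enough. Here is the basic Discord bot code I wrote in Python 3.9: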

import discord 
from discord.ext import commands

client = commands.Bot(command_prefix = '!')

@client.event 
async def on_ready():
    print("Bot is working.")

client.run('not today')

Answer 1

You can use tasks.loop to check for news:

import aiohttp
import bs4
from discord.ext import tasks

@tasks.loop(minutes=1)
async def check_news():
    # your_url is a placeholder for the address of the page you want to watch
    async with aiohttp.ClientSession() as ses:
        async with ses.get(your_url) as response:
            if response.status == 200:
                text = await response.text()
                soup = bs4.BeautifulSoup(text, "html.parser")
                # Find the news items in the parsed page here.
                # If there is a new post, you can send it to a specific channel.
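As a rough sketch of the "finding the news" step: one simple approach is to collect the links on the page and report the ones that have not been seen before. This version uses the standard library's html.parser instead of bs4 so that it runs without extra packages. The names LinkCollector and find_new_links are made up for illustration, and it assumes each new post shows up as a new link on the page.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def find_new_links(html, seen):
    """Return links in html that are not in seen, then remember them."""
    collector = LinkCollector()
    collector.feed(html)
    # dict.fromkeys drops duplicate links while keeping page order
    new = [link for link in dict.fromkeys(collector.links) if link not in seen]
    seen.update(new)
    return new
```

Inside check_news, the text returned by response.text() could be passed to find_new_links together with a set kept at module level.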

If you can share the link, I can help more.

Answer 2

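For more details, I recommend looking at the documentation for the discord.ext.tasks module, which lets you run background tasks for your bot. That is especially handy if you want a more customized implementation of the approach described here.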

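Neither part of the problem is too difficult:

1. Create a web scraper that checks for updates in the page's HTML.
2. Create a background task that uses that web scraper.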

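Creating the web scraper

The package you use for web scraping depends entirely on your wants and needs. Because discord.py uses asyncio, you should use an asynchronous HTTP library such as aiohttp or requests-html rather than urllib or requests, which are blocking.

Using AIOHTTP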

import aiohttp

RECENT_HTML = ""

async def download_webpage():
    global RECENT_HTML  # needed because the function reassigns the module-level variable

    async with aiohttp.ClientSession() as session:
        async with session.get("<url>") as response:
            if response.status != 200:
                # Notify users that the website could not be scraped
                return

            html = await response.text()
            if html != RECENT_HTML:
                # Notify users of changes within the website
                # An HTML parser could be used to identify specific changes within the HTML
                # Or you could just tell the members that a change occurred.
                pass
            RECENT_HTML = html

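The download_webpage() coroutine creates a session to download the web page (replace "<url>" with the website's actual URL), then checks whether the page has changed by comparing the page HTML with RECENT_HTML. RECENT_HTML simply stores the most recently scraped version of the HTML and is used for the comparison. The HTML to compare against does not have to be stored in a variable; it could be written to a file, for example.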
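If you go the file route, a minimal sketch could look like the following. It stores a SHA-256 digest of the page rather than the full HTML, which is my own choice to keep the file small, and the function name page_changed and the state file path are hypothetical. The first call only records a baseline and reports no change.

```python
import hashlib
from pathlib import Path


def page_changed(html, state_file):
    """Compare html with the digest saved on the previous call.

    Returns True only when an earlier digest exists and differs.
    """
    path = Path(state_file)
    digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
    previous = path.read_text(encoding="utf-8") if path.exists() else None
    # Always save the latest digest for the next comparison
    path.write_text(digest, encoding="utf-8")
    return previous is not None and previous != digest
```

Because the state lives on disk, the comparison survives a bot restart, which a module-level variable does not.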

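If the HTML differs, you can simply notify the members, or you can use an HTML parser to work out the exact differences. Note that the changes may be subtle and irrelevant (for example, an ad on the page changed between checks), so I recommend checking for changes in specific elements. (Doing so is beyond the scope of this question, however.)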
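To give an idea of what checking a specific element could involve, here is a small sketch that extracts only the text inside the element with a given id, so that changes elsewhere on the page (such as ads) are ignored. It uses the standard library's html.parser. The names ElementText and element_text are hypothetical, and it assumes the part of the page you care about has an id attribute and reasonably well-formed markup.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not affect nesting depth
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ElementText(HTMLParser):
    """Collect the text found inside the element with the given id."""

    def __init__(self, element_id):
        super().__init__()
        self.element_id = element_id
        self.depth = 0      # 0 means we are outside the target element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.element_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def element_text(html, element_id):
    parser = ElementText(element_id)
    parser.feed(html)
    return " ".join(chunk.strip() for chunk in parser.chunks if chunk.strip())
```

Comparing element_text(html, "posts") between two downloads, instead of the whole html string, would then only flag changes inside that element.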

Finally, the new copy of the page HTML is stored in the variable (or wherever you keep the most recent version of the HTML).

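Using Requests-HTML

from requests_html import AsyncHTMLSession

RECENT_HTML = ""

async def download_webpage():
    global RECENT_HTML  # needed because the function reassigns the module-level variable

    asession = AsyncHTMLSession()
    response = await asession.get("<url>")
    if response.status_code != 200:
        # Notify users that the website could not be scraped
        return

    html = response.html.text
    if html != RECENT_HTML:
        # Notify users of changes within the website
        # An HTML parser could be used to identify specific changes within the HTML
        # Or you could just tell the members that a change occurred.
        pass
    RECENT_HTML = html

Creating the background task

The discord.ext.tasks.loop decorator wraps a coroutine, scheduling it as a background task that runs at a fixed interval. The interval (a float or an integer) can be given in seconds, minutes, hours, or a combination of the three.

from discord.ext import tasks

@tasks.loop(seconds=5.0)
async def my_task():
    # Do something that is repeated every 5 seconds
    pass

So, combining the two, your web-scraping task could be as simple as decorating download_webpage() with tasks.loop, choosing an interval, and starting it once with download_webpage.start().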
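To show how the pieces fit together without tying the example to a particular HTTP library, here is a library-agnostic sketch of the check that the background task would run on each iteration. The class PageWatcher and its fetch and notify parameters are hypothetical names: in the bot, fetch would be a coroutine that downloads the page (with aiohttp or requests-html) and returns None on failure, and notify would wrap something like channel.send. Unlike the RECENT_HTML = "" version, this sketch treats the first download as a baseline and does not announce it, which is an assumption about the desired behavior.

```python
import asyncio


class PageWatcher:
    """Remember the last page seen and notify when it changes."""

    def __init__(self, fetch, notify):
        self.fetch = fetch      # async callable: returns page text, or None on failure
        self.notify = notify    # async callable: takes a message string
        self.recent = None      # most recent page text, None until the first download

    async def check_once(self):
        html = await self.fetch()
        if html is None:
            return False        # download failed, keep the old baseline
        changed = self.recent is not None and html != self.recent
        self.recent = html
        if changed:
            await self.notify("Hey, there is something new on the site!")
        return changed


async def demo():
    # Fake page contents standing in for successive downloads
    pages = iter(["v1", "v1", "v2"])

    async def fake_fetch():
        return next(pages)

    async def fake_notify(message):
        print(message)

    watcher = PageWatcher(fake_fetch, fake_notify)
    return [await watcher.check_once() for _ in range(3)]
```

In the bot, the body of the coroutine decorated with tasks.loop would then just be a call to watcher.check_once().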

Is it possible to make a bot send messages based on a website?
