Question

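I'm doing some web scraping with Python at the moment and ran into a problem. I want to scrape a website that holds a list of anime I have watched, but when I try to scrape it (with requests or Selenium), it only returns about 30 of the 110 anime titles on the page. Here is my Selenium code: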

from selenium import webdriver
from bs4 import BeautifulSoup

browser = webdriver.Firefox()
browser.get("https://anilist.co/user/Agusmaris/animelist/Completed")
data = BeautifulSoup(browser.page_source, 'lxml')
for title in data.find_all(class_="title"):
    print(title.getText())

When I run it, the page source only goes as far as the anime named "Golden Time", while 70 or more are still left on the page.

Thanks

Edit: thanks to 'supputuri', the code now works:

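from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import time

driver = webdriver.Firefox()
driver.get("https://anilist.co/user/Agusmaris/animelist/Completed")
time.sleep(3)  # give the first batch of entries time to render
# find_element(By.CSS_SELECTOR, ...) replaces find_element_by_css_selector,
# which newer Selenium 4 releases no longer provide
footer = driver.find_element(By.CSS_SELECTOR, "div.footer")
preY = 0
print(str(footer))
# keep scrolling to the footer until its position stops changing
while footer.rect['y'] != preY:
    preY = footer.rect['y']
    footer.location_once_scrolled_into_view  # reading this property scrolls to the element
    print('loading')
html = driver.page_source
soup = BeautifulSoup(html, 'lxml')
for title in soup.find_all(class_="title"):
    print(title.getText())
driver.close()
driver.quit()
ret = input()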

Answer 1

So, this is what I get when I load the page source:

AniList window.al_token = 'E1lPa1kzYco5hbdwT3GAMg3OG0rj47Gy5kF0PUmH'; Sorry, AniList requires Javascript.
Please enable Javascript or upgrade to a modern web browser (http://outdatedbrowser.com). Sorry, AniList. Please upgrade to a newer web browser (http://outdatedbrowser.com).

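Since I know perfectly well that I have Javascript enabled and that my version of Chrome is up to date, and since the listed URL takes you to an insecure website to "download" a new version of your browser, I think this is a spam site. I don't know whether you were aware of this when you posted, so I won't flag it, but I wanted you and anyone else who runs into this to know.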

Answer 2

Here is the solution. Make sure to add import time (and, on current Selenium versions, from selenium.webdriver.common.by import By).

driver.get("https://anilist.co/user/Agusmaris/animelist/Completed")
time.sleep(3)
footer = driver.find_element(By.CSS_SELECTOR, "div.footer")
preY = 0
while footer.rect['y'] != preY:
    preY = footer.rect['y']
    footer.location_once_scrolled_into_view
    time.sleep(1)
print(str(driver.page_source))

This repeats until all the anime have loaded, and then gets the page source. Let us know if this helps.
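For anyone who wants to reuse the idea, here is a minimal sketch of the same loop wrapped in a function. The name scroll_until_stable and the pause and max_rounds parameters are my own and not part of Selenium. It assumes driver is a Selenium WebDriver and footer is the element found above, and it swaps in a JavaScript scrollIntoView call in place of location_once_scrolled_into_view. The round cap is only there so the loop cannot spin forever if the page keeps growing.

```python
import time


def scroll_until_stable(driver, footer, pause=1.0, max_rounds=50):
    """Scroll `footer` into view until its y position stops changing.

    Returns True if the position settled, False if `max_rounds` ran out.
    Both parameters are illustrative defaults, not tuned values.
    """
    previous_y = None
    for _ in range(max_rounds):
        current_y = footer.rect["y"]
        if current_y == previous_y:
            # The footer did not move after the last scroll, so
            # presumably no new entries were appended to the list.
            return True
        previous_y = current_y
        # Scrolling to the footer triggers the lazy loading of more entries.
        driver.execute_script("arguments[0].scrollIntoView();", footer)
        # Give the page time to fetch and render the next batch.
        time.sleep(pause)
    return False
```

You would call scroll_until_stable(driver, footer) right after locating the footer and read driver.page_source only once it returns True. If it returns False, the page was probably still loading when the round limit was hit, so the source may be incomplete.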

Unable to get all titles from a list using Python web scraping
