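I am trying to scrape information from this website: "http://vlg.film/"
I am interested not only in the first 15 titles but in all of them. When I click the "Show more" button several times, the extra titles show up in the "Inspect Element" window, but the URL stays the same, i.e. "https://vlg.film/". Does anyone have any bright ideas? I am fairly new to this. Thanks.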
`
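import requests as re  # note: this alias shadows the name of the standard-library re module
from bs4 import BeautifulSoup as bs

url = "https://vlg.film/"
page = re.get(url)
soup = bs(page.content, 'html.parser')

# Each listed title sits in a div with these classes
wrap = soup.find_all('div', class_="column column--20 column--main")
for det in wrap:
    link = det.a['href']  # href of the first <a> inside the div
    print(link)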
`
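Answer 0 (score: 0)
It looks like you can simply add pagination to the URL. The trick is knowing when you have reached the end. From playing around with it, it appears that once you go past the last page, the site serves the first page again. So all you need to do is keep appending the links to a list and stop as soon as a link repeats.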
import requests as re
from bs4 import BeautifulSoup as bs

url = "https://vlg.film/"
next_page = True
page_num = 1
links = []

while next_page:
    # Request one page at a time through the PAGEN_1 query parameter
    payload = {'PAGEN_1': page_num}
    page = re.get(url, params=payload)
    soup = bs(page.content, 'html.parser')
    wrap = soup.find_all('div', class_="column column--20 column--main")
    if not wrap:
        # Guard: a page with no matches would otherwise loop forever
        break
    for det in wrap:
        link = det.a['href']
        if link in links:
            # A repeated link means the site has wrapped around to page 1
            next_page = False
            break
        links.append(link)
    page_num += 1

for link in links:
    print(link)
Output:
/films/ainbo/
/films/boss-level/
/films/i-care-a-lot/
/films/fear-of-rain/
/films/extinct/
/films/reckoning/
/films/marksman/
/films/breaking-news-in-yuba-county/
/films/promising-young-woman/
/films/knuckledust/
/films/rifkins-festival/
/films/petit-pays/
/films/life-as-it-should-be/
/films/human-voice/
/films/come-away/
/films/jiu-jitsu/
/films/comeback-trail/
/films/cagefighter/
/films/kolskaya/
/films/golden-voices/
/films/bad-hair/
/films/dragon-rider/
/films/lucky/
/films/zalozhnik/
/films/findind-steve-mcqueen/
/films/black-water-abyss/
/films/bigfoot-family/
/films/alone/
/films/marionette/
/films/after-we-collided/
/films/copperfield/
/films/her-blue-sky/
/films/secret-garden/
/films/hour-of-lead/
/films/eve/
/films/happier-times-grump/
/films/palm-springs/
/films/unhinged/
/films/mermaid-in-paris/
/films/lassie/
/films/sunlit-night/
/films/hello-world/
/films/blood-machines/
/films/samsam/
/films/search-and-destroy/
/films/play/
/films/mortal/
/films/debt-collector-2/
/films/chosen-ones/
/films/inheritance/
/films/tailgate/
/films/silent-voice/
/films/roads-not-taken/
/films/jim-marshall/
/films/goya-murders/
/films/SUFD/
/films/pinocchio/
/films/swallow/
/films/come-as-you-are/
/films/kelly-gang/
/films/corpus-christi/
/films/gentlemen/
/films/vic-the-viking/
/films/perfect-nanny/
/films/farmageddon/
/films/close-to-the-horizon/
/films/disturbing-the-peace/
/films/trauma-center/
/films/benjamin/
/films/COURIER/
/films/aeronauts/
/films/la-belle-epoque/
/films/arctic-dogs/
/films/paradise-hills/
/films/ditya-pogody/
/films/selma-v-gorode-prizrakov/
/films/rainy-day-in-ny/
/films/ty-umeesh-khranit-sekrety/
/films/after-the-wedding/
/films/the-room/
/films/kuda-ty-propala-bernadett/
/films/uglydolls/
/films/smert-i-zhizn-dzhona-f-donovana/
/films/sinyaya-bezdna-2/
/films/just-a-gigolo/
/films/i-am-mother/
/films/city-hunter/
/films/lets-dance/
/films/five-feet-apart/
/films/after/
/films/100-things/
/films/greta/
/films/CORGI/
/films/destroyer/
/films/vice/
/films/ayka/
/films/van-gogh/
/films/serenity/
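The links printed above are site-relative paths, so they need the site's base URL in front of them before they can be requested. A minimal sketch of that step, using only the standard library; `BASE_URL` and the helper name `to_absolute` are illustrative choices, not part of the answer's code:

```python
from urllib.parse import urljoin

# Assumption: the relative hrefs resolve against the site root
BASE_URL = "https://vlg.film/"


def to_absolute(links, base=BASE_URL):
    """Resolve each href against the base URL; absolute hrefs pass through unchanged."""
    return [urljoin(base, link) for link in links]


# Two paths taken from the output above
sample = ["/films/ainbo/", "/films/boss-level/"]
for full_url in to_absolute(sample):
    print(full_url)
```

In the loop from the answer, `links.append(urljoin(url, link))` would store full URLs directly, though the repeat check would then need to compare full URLs as well.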
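Answer 1 (score: 0)
This is a very simple website to extract data from. Build a list of page URLs, one for each page you want to extract, then loop over the pages with a for loop and extract the data from each.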
import requests as re
from bs4 import BeautifulSoup as bs

# One URL per page; range(1, 11) covers pages 1 through 10
urls = ["http://vlg.film/ajax/index_films.php?PAGEN_1={}".format(x) for x in range(1, 11)]

for url in urls:
    page = re.get(url)
    soup = bs(page.content, 'html.parser')
    wrap = soup.find_all('div', class_="column column--20 column--main")
    print(url)
    for det in wrap:
        link = det.a['href']
        print(link)
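The `range(1, 11)` above hard-codes the number of pages. One possible way around that is to combine this approach with the repeat check from the other answer. The sketch below keeps the paging logic separate from the network code: `collect_until_repeat`, `fetch_links`, `max_pages`, and the fake pages are all hypothetical names and data made up for illustration, and it is only an assumption that this endpoint wraps around to the first page the way the main URL reportedly does.

```python
def collect_until_repeat(fetch_links, max_pages=100):
    """Call fetch_links(page_num) for pages 1, 2, ... and gather links in order.

    Stops at the first repeated link, at an empty page, or after max_pages
    pages, whichever comes first.
    """
    seen = set()   # fast membership test
    ordered = []   # keeps the order in which links were found
    for page_num in range(1, max_pages + 1):
        page_links = fetch_links(page_num)
        if not page_links:
            return ordered
        for link in page_links:
            if link in seen:
                return ordered
            seen.add(link)
            ordered.append(link)
    return ordered


# Stand-in data so the sketch runs without network access (made-up paths)
FAKE_PAGES = [["/films/a/", "/films/b/"], ["/films/c/", "/films/d/"], ["/films/e/"]]


def fake_fetch(page_num):
    # Mimic the reported behaviour: past the last page, page 1 is served again
    if page_num > len(FAKE_PAGES):
        return FAKE_PAGES[0]
    return FAKE_PAGES[page_num - 1]


print(collect_until_repeat(fake_fetch))
```

To use it against the real site, `fetch_links` would wrap the `re.get` and `find_all` calls from the code above and return the list of `href` values for one page. The `max_pages` cap is there so the loop still ends if the site never repeats a link.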