I am trying to get all the events, plus their metadata, from this page: https://alando-palais.de/events
My problem is that the result (HTML) does not contain the information I am looking for. I suspect it is "hidden" behind some PHP script, namely this URL: https://alando-palais.de/wp/wp-admin/admin-ajax.php
Any ideas on how to wait for the page to load completely, or which method I have to use to get the event information?
This is my script so far :-):
from bs4 import BeautifulSoup
from urllib.request import urlopen, urljoin
from urllib.parse import urlparse
import re
import requests

if __name__ == '__main__':
    target_url = 'https://alando-palais.de/events'
    # target_url = 'https://alando-palais.de/wp/wp-admin/admin-ajax.php'

    soup = BeautifulSoup(requests.get(target_url).text, 'html.parser')
    print(soup)

    links = soup.find_all('a', href=True)
    for x, link in enumerate(links):
        print(x, link['href'])

    # for image in images:
    #     print(urljoin(target_url, image))
The expected output would be something like this, but none of it shows up in my result:
<div class="vc_gitem-zone vc_gitem-zone-b vc_custom_1547045488900 originalbild vc-gitem-zone-height-mode-auto vc_gitem-is-link" style="background-image: url(https://alando-palais.de/wp/wp-content/uploads/2019/02/0803_MaiwaiFriends-500x281.jpg) !important;">
<a href="https://alando-palais.de/event/penthouse-club-special-maiwai-friends" title="Penthouse Club Special: Maiwai & Friends" class="vc_gitem-link vc-zone-link"></a> <img src="https://alando-palais.de/wp/wp-content/uploads/2019/02/0803_MaiwaiFriends-500x281.jpg" class="vc_gitem-zone-img" alt=""> <div class="vc_gitem-zone-mini">
<div class="vc_gitem_row vc_row vc_gitem-row-position-top"><div class="vc_col-sm-6 vc_gitem-col vc_gitem-col-align-left"> <div class="vc_gitem-post-meta-field-Datum eventdatum vc_gitem-align-left"> 08.03.2019
</div>
Answer 0 (score: 2)
You can mimic the XHR POST request the page makes:
from bs4 import BeautifulSoup
import requests
import pandas as pd

url = 'https://alando-palais.de/wp/wp-admin/admin-ajax.php'

# Form data copied from the XHR request the events page sends
data = {
    'action': 'vc_get_vc_grid_data',
    'vc_action': 'vc_get_vc_grid_data',
    'tag': 'vc_basic_grid',
    'data[visible_pages]': 5,
    'data[page_id]': 30,
    'data[style]': 'all',
    'data[action]': 'vc_get_vc_grid_data',
    'data[shortcode_id]': '1551112413477-5fbaaae1-0622-2',
    'data[tag]': 'vc_basic_grid',
    'vc_post_id': '30',
    '_vcnonce': 'cc8cc954a4'
}

res = requests.post(url, data=data)
soup = BeautifulSoup(res.content, 'lxml')

dates = [item.text.strip() for item in soup.select('.vc_gitem-zone[style*="https://alando-palais.de"]')]
textInfo = [item for item in soup.select('.vc_gitem-link')][::2]  # every other anchor; each event has two
imageLinks = [item['src'].strip() for item in soup.select('img')]
titles = []
links = []

for item in textInfo:
    titles.append(item['title'])
    links.append(item['href'])

results = pd.DataFrame(list(zip(titles, dates, links, imageLinks)), columns=['title', 'date', 'link', 'imageLink'])
print(results)
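One caveat: the hard-coded _vcnonce is session-specific and will eventually expire. A minimal sketch of refreshing it first, assuming (not verified) that the grid element in the events page markup exposes it as a data-vc-public-nonce attribute, as WPBakery grids commonly do:

import re
import requests

# Fetch the events page and pull a fresh nonce out of the markup before POSTing.
# Assumption: the nonce appears as data-vc-public-nonce="..." in the HTML.
html = requests.get('https://alando-palais.de/events').text
m = re.search(r'data-vc-public-nonce="([0-9a-f]+)"', html)
if m:
    data['_vcnonce'] = m.group(1)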
Or, using Selenium:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd

url = 'https://alando-palais.de/events#'
driver = webdriver.Chrome()
driver.get(url)

# Wait up to 10s until the grid items are present before scraping
dates = [item.text.strip() for item in WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".vc_gitem-zone[style*='https://alando-palais.de']")))
    if len(item.text)]
textInfo = [item for item in driver.find_elements_by_css_selector('.vc_gitem-link')][::2]
textInfo = textInfo[: int(len(textInfo) / 2)]
imageLinks = [item.get_attribute('src').strip() for item in driver.find_elements_by_css_selector('a + img')][::2]
titles = []
links = []

for item in textInfo:
    titles.append(item.get_attribute('title'))
    links.append(item.get_attribute('href'))

results = pd.DataFrame(list(zip(titles, dates, links, imageLinks)), columns=['title', 'date', 'link', 'imageLink'])
print(results)
driver.quit()
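Note that newer Selenium releases (4.x) removed the find_elements_by_* helper methods; on a current version the equivalent calls use By locators, for example:

# Selenium 4 equivalent of the helper calls above
textInfo = driver.find_elements(By.CSS_SELECTOR, '.vc_gitem-link')[::2]
imageLinks = [el.get_attribute('src').strip()
              for el in driver.find_elements(By.CSS_SELECTOR, 'a + img')][::2]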
Answer 1 (score: 1)
I would rather suggest Selenium here, to get around any server-side restrictions.
Edited:
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("https://alando-palais.de/events")

# Collect every anchor that carries an href attribute
elems = driver.find_elements_by_xpath("//a[@href]")
for elem in elems:
    print(elem.get_attribute("href"))
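Once Selenium has rendered the page, you can also hand the DOM back to BeautifulSoup and reuse the parsing from the question; a short sketch:

from bs4 import BeautifulSoup

# Parse the fully rendered DOM with the same logic as the original script
soup = BeautifulSoup(driver.page_source, 'html.parser')
for x, link in enumerate(soup.find_all('a', href=True)):
    print(x, link['href'])
driver.quit()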