我正在尝试抓取该特定网站的内容:https://www.cineatlas.com/
我尝试刮过日期部分,如打印屏幕所示:
我使用了这个基本的beautifulsoup代码
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.text,'html.parser')
type(soup)
time = soup.find('ul',class_='slidee')
这是我得到的,而不是元素列表
<ul class="slidee">
<!-- adding dates -->
</ul>
答案 0 :(得分:2)
站点从Javascript内容动态创建HTML元素。例如,可以使用re
获取JS内容:
import re
import json
import requests
from ast import literal_eval
url = 'https://www.cineatlas.com/'
html_data = requests.get(url).text
movieData = re.findall(r'movieData = ({.*?}), movieDataByReleaseDate', html_data, flags=re.DOTALL)[0]
movieData = re.sub(r'\s*/\*.*?\*/\s*', '', movieData) # remove comments
movieData = literal_eval(movieData) # in movieData you have now the information about the current movies
print(json.dumps(movieData, indent=4)) # print data to the screen
打印:
{
"2019-08-06": [
{
"url": "fast--furious--hobbs--shaw",
"image-portrait": "https://d10u9ygjms7run.cloudfront.net/dd2qd1xaf4pceqxvb41s1xpzs0/1562603443098_891497ecc8b16b3a662ad8b036820ed1_500x735.jpg",
"image-landscape": "https://d10u9ygjms7run.cloudfront.net/dd2qd1xaf4pceqxvb41s1xpzs0/1562603421049_7c233477779f25725bf22aeaacba469a_700x259.jpg",
"title": "FAST & FURIOUS : HOBBS & SHAW",
"releaseDate": "2019-08-07",
"endpoint": "ST00000392",
"duration": "120 mins",
"rating": "Classification TOUT",
"director": "",
"actors": "",
"times": [
{
"time": "7:00pm",
"bookingLink": "https://ticketing.eu.veezi.com/purchase/8388?siteToken=b4ehk19v6cqkjfwdsyctqra72m",
"attributes": [
{
"_id": "5d468c20f67cc430833a5a2b",
"shortName": "VF",
"description": "Version Fran\u00e7aise"
},
{
"_id": "5d468c20f67cc430833a5a2a",
"shortName": "3D",
"description": "3D"
}
]
},
{
"time": "9:50pm",
"bookingLink": "https://ticketing.eu.veezi.com/purchase/8389?siteToken=b4ehk19v6cqkjfwdsyctqra72m",
... and so on.
答案 1 :(得分:0)
lis = time.findChildren()
这将返回子节点列表