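For my assignment, I'm trying to scrape information from the following website: https://www.blueroomcinebar.com/movies/now-showing/.
My code needs to find the movie names, times, and posters. The times and posters end up in the lists I created in the same order as they appear in the HTML, but the names seem to come out in alphabetical order.
We are not allowed to use BeautifulSoup.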
This is my current code for scraping the movies (at the moment the names come out in the wrong order):
from re import findall, finditer, MULTILINE, DOTALL
from urllib.request import urlopen

movies_name = []
movies_times = []
movies_image = []
movies_list = []

# Download the page and decode it to text
movies_page = urlopen("https://www.blueroomcinebar.com/movies/now-showing/").read().decode('utf-8')

# Add movies to Movies at Blue Room Screen
find_movie_names = findall(r'<h1>(.*?)</h1>', movies_page)
# (?:AM|PM) groups the alternation; a bare AM|PM would split the whole group in two
find_movie_times = findall(r'<p>([0-9]{1,2}:[0-9]{2} (?:AM|PM))</p>', movies_page)
find_movie_image = findall(r'<div class="poster" style="background-image: url\((.*?)\)">', movies_page)
print(find_movie_names)

# Add movies to arrays
for movie in find_movie_names:
    movies_name.append(movie)
for movie in find_movie_times:
    movies_times.append(movie)
for movie in find_movie_image:
    movies_image.append(movie)
print(movies_name)
print(movies_image)

# Combine name, time and poster into one "name;time;poster" string per movie
# NOTE: the poster index is offset by one (movie - 1), so index 0 picks the last poster
for movie in range(len(movies_name)):
    movies_list.append("{};{};{}".format(movies_name[movie], movies_times[movie], movies_image[movie - 1]))
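A side note on the time pattern: inside a group, `|` splits everything in the group, so `([0-9]{1,2}:[0-9]{2} AM|PM)` reads as "a time followed by AM" or "the bare text PM". The sketch below compares that form with a grouped `(?:AM|PM)` on made-up markup; the `sample` string is an assumption for illustration and is not copied from the real page.

```python
from re import findall

# Hypothetical sample markup, not taken from the real page.
sample = "<p>10:30 AM</p><p>1:15 PM</p><p>6:45 PM</p>"

# Ungrouped alternation: the capture group is either "<time> AM" or just "PM".
loose = findall(r'<p>([0-9]{1,2}:[0-9]{2} AM|PM)</p>', sample)

# Grouped alternation: the time must be followed by AM or PM.
strict = findall(r'<p>([0-9]{1,2}:[0-9]{2} (?:AM|PM))</p>', sample)

print(loose)   # only the AM time is captured
print(strict)  # all three times are captured
```

If the times list ever comes out shorter than the names list, this kind of mismatch may be one reason, since pairing the lists by index assumes they have the same length.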
They should be in this order:
['Aladdin', 'Avengers: Endgame', 'Chandigarh Amritsar Chandigarh', 'John Wick - Parabellum', 'Long Shot', 'Pokemon Detective Pikachu', 'Poms', 'The Hustle', 'Top End Wedding']
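One way to narrow down where the ordering comes from: `findall` and `finditer` scan the string left to right and return matches in the order they occur, so they do not sort anything themselves. The sketch below records the start offset of each match on made-up markup (the `sample` string is an assumption, with titles deliberately out of alphabetical order).

```python
from re import finditer

# Hypothetical markup: titles deliberately NOT in alphabetical order.
sample = "<h1>Poms</h1><h1>Aladdin</h1><h1>Long Shot</h1>"

# Keep each match's start offset so the order can be checked directly.
found = [(m.start(), m.group(1)) for m in finditer(r'<h1>(.*?)</h1>', sample)]
names = [name for _, name in found]
offsets = [pos for pos, _ in found]

print(found)
print(names)
```

If the same check on the real page shows increasing offsets with alphabetical names, that would suggest the `<h1>` elements already appear in that order somewhere in the HTML source, though that is an assumption about the page and not something verified here.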
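N.B. There may be a second movie showing as OCAP. I'm not 100% sure why that is, but it seems to be some kind of special screening that shows a different movie every day.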
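Because an extra entry like that could shift three separately built lists out of step, one possible alternative is to extract the name, times, and poster from each movie's own chunk of HTML. This is only a sketch under assumptions: the `<div class="movie">` wrapper and the `sample` markup are hypothetical, and the real page may be structured differently.

```python
from re import findall, search

# Hypothetical markup: one <div class="movie"> per film; the real page may differ.
sample = (
    '<div class="movie"><div class="poster" style="background-image: url(/a.jpg)"></div>'
    '<h1>Poms</h1><p>10:30 AM</p><p>6:45 PM</p></div>'
    '<div class="movie"><div class="poster" style="background-image: url(/b.jpg)"></div>'
    '<h1>Aladdin</h1><p>1:15 PM</p></div>'
)

movies = []
# Split on the (assumed) per-movie wrapper so each chunk holds one film.
for chunk in sample.split('<div class="movie">')[1:]:
    name = search(r'<h1>(.*?)</h1>', chunk)
    poster = search(r'background-image: url\((.*?)\)', chunk)
    times = findall(r'<p>([0-9]{1,2}:[0-9]{2} (?:AM|PM))</p>', chunk)
    if name and poster:  # skip chunks that lack a name or a poster
        movies.append("{};{};{}".format(name.group(1), ",".join(times), poster.group(1)))

print(movies)
```

Keeping the three fields together per chunk means a movie with several session times, or an entry with no poster, cannot push the other movies' data onto the wrong name.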