Question

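I want to scrape a movie website: http://www.21cineplex.com/nowplaying

I have uploaded a screenshot of the HTML body as an image in this question (link to screenshot here). I am having trouble scraping the movie titles and the descriptions inside the <P> tags. For some strange reason, the descriptions are not part of the response that requests returns. Also, when I try to use soup to find the ul by its class name, nothing is found. Does anyone know why? I am using Python 3. This is my code so far: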

    import requests
    import bs4

    r = requests.get('http://www.21cineplex.com/nowplaying')
    r.text  # no description here
    soup = bs4.BeautifulSoup(r.text, 'html.parser')
    soup.find('ul', class_='w462')  # why is this empty?

Answer 1

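This server checks the Referer header. If the request has no Referer, it sends the main page instead. However, it does not check the text in this header, so the value can even be an empty string.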

    import requests
    import bs4

    headers = {
        # 'Referer' can be any URL, random text, or even an empty string:
        # 'Referer': 'http://google.com',
        # 'Referer': 'http://www.21cineplex.com',
        # 'Referer': 'hello world!',
        'Referer': '',
    }

    s = requests.get('http://www.21cineplex.com/nowplaying', headers=headers)
    # Name the parser explicitly so bs4 does not have to guess one
    soup = bs4.BeautifulSoup(s.text, 'html.parser')

    # Full text of every matching list
    for x in soup.find_all('ul', class_='w462'):
        print(x.text)

    # Same result using a CSS selector
    for x in soup.select('ul.w462'):
        print(x.text)

    # First link text (presumably the title) and first paragraph (the description)
    for x in soup.select('ul.w462'):
        print(x.select('a')[0].text)
        print(x.select('p')[0].text)
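
The point that an empty Referer still works may look surprising, so here is a minimal sketch of the difference between an empty header and a missing one. It uses the standard library's urllib.request.Request instead of requests, only so the request can be inspected without contacting the site; build_request is a hypothetical helper name, not something from the answer above.

```python
from urllib.request import Request

URL = "http://www.21cineplex.com/nowplaying"


def build_request(referer=None):
    """Build (but do not send) a GET request, optionally with a Referer header."""
    # Hypothetical helper: leaving the key out is what "no Referer" means here.
    headers = {} if referer is None else {"Referer": referer}
    return Request(URL, headers=headers)


without_referer = build_request()
empty_referer = build_request("")

# An empty string still counts as the header being present;
# only leaving the key out removes it from the request.
print(without_referer.has_header("Referer"))
print(empty_referer.has_header("Referer"))
print(repr(empty_referer.get_header("Referer")))
```

If the server only tests whether the header exists, as the answer describes, the second request object is the one that would get the real page.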

Unable to scrape this movie website using BeautifulSoup
