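How to scrape the second <p> of a web page using Python and BeautifulSoup

Time: 2019-01-08 13:50:13

Tags: python html beautifulsoup

I have been trying to work with BeautifulSoup because I want to try scraping a web page (https://www.imdb.com/search/title?release_date=2017&sort=num_votes,desc&page=1). So far I have managed to scrape some elements, but now I want to scrape the movie descriptions and I have been struggling. The description is simply placed in the HTML like this: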

<div class="lister-item mode-advanced"> 
    <div class="lister-item-content"> 
       <p class="muted-text"> paragraph I don't need</p>
       <p class="muted-text"> paragraph I need</p>
    </div>
</div>

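I want to grab the second paragraph, which seems easy to do, but everything I tried gave me None as output. I have been looking around for an answer. In another Stack Overflow post I found that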

find('p:nth-of-type(1)')  

find_elements_by_css_selector('.lister-item-mode >p:nth-child(1)')

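should do the trick, but they still give me

None  # as output

Below you can find part of my code. It is fairly low-grade code, since I am just trying things out in order to learn.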

import urllib2
from bs4 import BeautifulSoup
from requests import get

url = 'http://www.imdb.com/search/title?release_date=2017&sort=num_votes,desc&page=1'
response = get(url)

html_soup = BeautifulSoup(response.text, 'html.parser')
type(html_soup)
movie_containers = html_soup.find_all('div', class_='lister-item mode-advanced')

first_movie = movie_containers[0]

first_title = first_movie.h3.a.text
print first_title

first_year = first_movie.h3.find('span', class_='lister-item-year text-muted unbold')
first_year = first_year.text
print first_year

first_imdb = float(first_movie.strong.text)
print first_imdb

# !!!! problem zone ---------------------------------------------
first_description = first_movie.find('p', class_='muted-text')
#first_description = first_description.text
print first_description

The code above gives me this output:

$ python scrape.py
Logan
(2017)
8.1
None

I would like to learn the correct way of selecting HTML tags, because it will be useful for future projects.

3 answers:

Answer 0 (score: 1)

  

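The find_all() method looks through a tag's descendants and retrieves all descendants that match your filters.

You can then use a list index to get the element you need. Indexing starts at 0, so 1 gives the second item.

Change first_description to this: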

first_description = first_movie.find_all('p', {"class":"text-muted"})[1].text.strip()
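
Indexing with [1] raises IndexError when a container has fewer than two matching paragraphs. The following is a minimal sketch of a guarded variant, not part of the original answer: the helper name nth_paragraph_text and the inline SAMPLE_HTML are hypothetical stand-ins, and nothing is fetched from IMDb.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for one movie container; not fetched from IMDb.
SAMPLE_HTML = """
<div class="lister-item mode-advanced">
  <div class="lister-item-content">
    <p class="text-muted">paragraph I don't need</p>
    <p class="text-muted">paragraph I need</p>
  </div>
</div>
"""


def nth_paragraph_text(container, index, css_class="text-muted"):
    """Return the stripped text of the index-th matching <p>, or None."""
    paragraphs = container.find_all("p", class_=css_class)
    if index >= len(paragraphs):
        return None  # fewer matches than expected; avoid IndexError
    return paragraphs[index].get_text(strip=True)


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
container = soup.find("div", class_="lister-item")

print(nth_paragraph_text(container, 1))  # second paragraph (index 1)
print(nth_paragraph_text(container, 5))  # out of range, so None
```

Returning None for a missing paragraph lets a loop over many containers skip incomplete entries instead of stopping on the first one.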

Full code

import urllib2
from bs4 import BeautifulSoup
from requests import get

url = 'http://www.imdb.com/search/title?release_date=2017&sort=num_votes,desc&page=1'
response = get(url)

html_soup = BeautifulSoup(response.text, 'html.parser')
type(html_soup)
movie_containers = html_soup.find_all('div', class_='lister-item mode-advanced')

first_movie = movie_containers[0]

first_title = first_movie.h3.a.text
print first_title

first_year = first_movie.h3.find('span', class_='lister-item-year text-muted unbold')
first_year = first_year.text
print first_year

first_imdb = float(first_movie.strong.text)
print first_imdb

# !!!! problem zone ---------------------------------------------
first_description = first_movie.find_all('p', {"class":"text-muted"})[1].text.strip()
#first_description = first_description.text
print first_description

Output

Logan
(2017)
8.1
In the near future, a weary Logan cares for an ailing Professor X. However, Logan's attempts to hide from the world and his legacy are upended when a young mutant arrives, pursued by dark forces.

Read the Documentation to learn the correct way to select HTML tags.

Also consider using Python 3.

Answer 1 (score: 0)
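
As a rough illustration of the Python 3 suggestion, here is a sketch of the same extraction steps with print() as a function. It runs against a small inline document rather than the live page. The markup is an assumption shaped after the selectors used in this thread (the lister-item-header and ratings-bar classes are made up), so the real page may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the selectors in this thread; the real page may differ.
SAMPLE_HTML = """
<div class="lister-item mode-advanced">
  <div class="lister-item-content">
    <h3 class="lister-item-header">
      <a href="#">Logan</a>
      <span class="lister-item-year text-muted unbold">(2017)</span>
    </h3>
    <p class="text-muted">certificate, runtime, genre</p>
    <div class="ratings-bar"><strong>8.1</strong></div>
    <p class="text-muted">
      In the near future, a weary Logan cares for an ailing Professor X.
    </p>
  </div>
</div>
"""

html_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
first_movie = html_soup.find_all("div", class_="lister-item mode-advanced")[0]

first_title = first_movie.h3.a.text
first_year = first_movie.h3.find("span", class_="lister-item-year text-muted unbold").text
first_imdb = float(first_movie.strong.text)

# The class name used in the question ('muted-text') matches nothing, hence None.
wrong_class = first_movie.find("p", class_="muted-text")

# With the class name 'text-muted', index 1 is the second paragraph.
first_description = first_movie.find_all("p", class_="text-muted")[1].text.strip()

print(first_title)
print(first_year)
print(first_imdb)
print(wrong_class)
print(first_description)
```

The only change that affects the result is the class name: 'muted-text' in the question versus 'text-muted' in the answer.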

You can get it with .next_sibling. There may be a more elegant way, though. At least this should give you a start and some direction:

from bs4 import BeautifulSoup

# Sample markup from the question (closing quote on the inner div's class restored)
html = '''<div class="lister-item mode-advanced"> 
    <div class="lister-item-content"> 
       <p class="muted-text"> paragraph I don't need</p>
       <p class="muted-text"> paragraph I need</p>
    </div>
</div>'''

soup = BeautifulSoup(html, 'html.parser')

# Start from the first <p>, not from the outer <div>
first_p = soup.find('p', {'class': 'muted-text'})
# The first .next_sibling is the whitespace between the tags; the second is the next <p>
second_p = first_p.next_sibling.next_sibling.text.strip()

print(second_p)
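
One possibly tidier alternative to chaining .next_sibling is find_next_sibling(), which only considers tags and so skips the whitespace text node between the two paragraphs. The sketch below uses the sample markup from the question, assuming the missing quote on the inner div is restored. It is an illustration, not part of the original answer.

```python
from bs4 import BeautifulSoup

# Sample markup from the question, with the inner div's class attribute properly quoted.
html = """<div class="lister-item mode-advanced">
    <div class="lister-item-content">
       <p class="muted-text"> paragraph I don't need</p>
       <p class="muted-text"> paragraph I need</p>
    </div>
</div>"""

soup = BeautifulSoup(html, "html.parser")
first_p = soup.find("p", class_="muted-text")

# .next_sibling is the whitespace text node sitting between the two <p> tags.
whitespace = first_p.next_sibling

# find_next_sibling() looks at tags only, so a single call reaches the second <p>.
second_p = first_p.find_next_sibling("p")

print(repr(whitespace))
print(second_p.get_text(strip=True))
```

This avoids depending on exactly how much whitespace the page puts between the tags.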

Answer 2 (score: 0)

BeautifulSoup 4.7.1 supports :nth-child() and other CSS4 selectors:

first_description = soup.select_one('.lister-item-content p:nth-child(4)')
# or 
#first_description = soup.select_one('.lister-item-content p:nth-of-type(2)')

print(first_description)
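
To make the difference between the two selectors concrete, here is a small sketch on hypothetical inline markup. The h3 and ratings-bar elements are assumptions about what sits between the paragraphs, and a BeautifulSoup version with CSS selector support through select_one() is assumed. It also shows why passing a selector string to find() returns None: find() treats its first argument as a tag name.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two <p> tags with other elements between and before them.
SAMPLE_HTML = """
<div class="lister-item-content">
  <h3>Logan</h3>
  <p class="text-muted">paragraph I don't need</p>
  <div class="ratings-bar">8.1</div>
  <p class="text-muted">paragraph I need</p>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find() expects a tag name, so a CSS selector string matches no tag at all.
via_find = soup.find("p:nth-of-type(2)")

# :nth-of-type(2) counts only <p> siblings; :nth-child(4) counts every child element.
by_type = soup.select_one(".lister-item-content > p:nth-of-type(2)")
by_child = soup.select_one(".lister-item-content > p:nth-child(4)")

print(via_find)
print(by_type.get_text(strip=True))
print(by_child.get_text(strip=True))
```

With this markup both selectors land on the same paragraph. :nth-of-type(2) is likely the more robust choice, since it does not depend on how many non-<p> elements precede the paragraph.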