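How do I scrape reviews when not all of the data I need is formatted as text?

Asked: 2016-08-08 21:06:33

Tags: python-3.x web-scraping beautifulsoup python-requests

I am trying to scrape reviews for university research. My code prints out most of the information I need, but I also need to find the rating and the userId.

Here is some of my code.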

import requests
from bs4 import BeautifulSoup


s = requests.Session()

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36',
           'Referer': "http://www.imdb.com/"}


url = 'http://www.imdb.com/title/tt0082158/reviews?ref_=tt_urv'
r = s.get(url).content
page = s.get(url)
soup = BeautifulSoup(page.content, "lxml")
soup.prettify()

cj = s.cookies
requests.utils.dict_from_cookiejar(cj)

s.post(url, headers=headers)

for i in soup('style'):
    i.decompose()
for s in soup('script'):
    s.decompose()
for t in soup('table'):
    t.decompose()
for ip in soup('input'):
    ip.decompose()

important = soup.find("div", id='tn15content')

print(important.text)

This returns most of the information I need in the printout.

OUTPUT (only one review is shown here; the code prints every review on the page)

120 out of 141 people found the following review useful:

This is one of the Oscar best pictures that actually deserved the honor.

Author:
gachronicled from USA
18 February 2001



I happened to be flipping channels today and saw this was on.  Since it
had
been several years since I last saw it I clicked it on, but didn't mean to
stay.  As it happened, I found this film to be just as gripping now as it
was before.  My own kids started watching it, too, and enjoyed it - which
was even more satisfying for me considering the kind of current junk
they're
used to.  No, this is not an action-packed thriller, nor are there juicy
love scenes between Abrahams and his actress girlfriend.  There is no
"colorful" language to speak of; no politically correct agenda underlying
its tale of a Cambridge Jew and Scottish Christian.This is a story about what drives people internally - what pushes them to
excel or at least to make the attempt to do so.  It is a story about
personal and societal values, loyalty, faith, desire to be accepted in
society and healthy competition without the utter selfishness that
characterizes so much of the athletic endeavors of our day.  Certainly the
characters are not alike in their motivation, but the end result is the
same
as far as their accomplishments.My early adolescent son (whose favorite movies are all of the Star Wars
movies and The Matrix) couldn't stop asking questions throughout the movie
he was so hooked.  It was a great educational opportunity as well as
entertainment.  If you've never seen this film or it's been a long time, I
recommend it unabashedly, regardless of the labels many have tried to give
it for being slow-paced or causing boredom.  In addition to the great
story
- based on real people and events - the photography and the music are
fabulous and moving.  It's no mistake that this movie has been spoofed and
otherwise stolen from in the last twenty years - it's an unforgettable
movie
and in my opinion its bashers are those who hate Oscar winners on
principle
or who don't like the philosophies espoused by its protagonists.

However, I also need the userID and the rating that go with each review.

The userID is contained in the href attribute of each user link, like so...

<a href="/user/ur0511587/">

The rating is contained in each img element, in the alt attribute, for example "10/10".

<img width="102" height="12" alt="10/10" src="http://i.media-imdb.com/images/showtimes/100.gif">

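Apart from the output that is easy to get by printing "important.text", are there any tips on how to scrape these two items without printing "important" itself? I am hesitant to just print "important" because it would be very messy with all the tags and other unnecessary material. Thanks for any input.

1 Answer:

Answer 0: (score: 3)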

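You can use CSS selectors: a[href^="/user/ur"] finds every anchor whose href starts with /user/ur, and img[alt*="/10"] finds every img tag whose alt attribute contains "/10", i.e. a value of the form "some_number/10". The attribute values are quoted because they contain slashes, which newer BeautifulSoup releases reject when left unquoted.

user_ids = [a["href"].split("ur")[1].rstrip("/") for a in important.select('a[href^="/user/ur"]')]
ratings = [img["alt"] for img in important.select('img[alt*="/10"]')]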

print(user_ids, ratings)
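
To try the two selectors without requesting the live page, here is a minimal sketch run against a hand-written fragment. The fragment and the use of the built-in html.parser are assumptions made for illustration; only the shapes of the a and img tags come from the question.

```python
from bs4 import BeautifulSoup

# Hand-written fragment (an assumption) that imitates the tags in the question.
html = """
<div id="tn15content">
  <div>
    <a href="/user/ur0511587/">gachronicled</a>
    <img width="102" height="12" alt="10/10" src="100.gif">
  </div>
  <a href="/title/tt0082158/">not a user link</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
important = soup.find("div", id="tn15content")

# ^= matches the start of the attribute value, *= matches any substring.
user_links = important.select('a[href^="/user/ur"]')
rating_imgs = important.select('img[alt*="/10"]')

user_ids = [a["href"].split("ur")[1].rstrip("/") for a in user_links]
ratings = [img["alt"] for img in rating_imgs]

print(user_ids, ratings)
```

The title link is skipped because its href does not start with /user/ur.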

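The problem now is that not every review has a rating, and simply finding every a[href^="/user/ur"] would give us more matches than we want. To handle that, we can find the specific div that contains both the anchor and the rating (if there is one) by locating the small tag that contains the text "review useful:" and then calling .parent to select the div.

import re
important = soup.find("div", id='tn15content')

# Every review header has a <small> tag whose text ends in "review useful:".
for small in important.find_all("small", string=re.compile("review useful:")):
    div = small.parent  # the div holding this review's author link and rating
    user_id = div.select_one('a[href^="/user/ur"]')["href"].split("ur")[1].rstrip("/")
    rating = div.select_one('img[alt*="/10"]')  # None when the review has no rating
    print(user_id, rating["alt"] if rating else "N/A")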

Now we get:

('0511587', '10/10')
('0209436', '9/10')
('1318093', 'N/A')
('0556711', '10/10')
('0075285', '9/10')
('0059151', '10/10')
('4445210', '9/10')
('0813687', 'N/A')
('0033913', '10/10')
('0819028', 'N/A')
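
The reason for scoping the search to each review's div can be seen in a small sketch. The fragment below is invented for illustration (two reviews, the second without a rating image) and is parsed with the built-in html.parser rather than lxml; the two user ids are taken from the output above.

```python
import re
from bs4 import BeautifulSoup

# Invented fragment: two reviews, the second one has no rating image.
html = """
<div id="tn15content">
  <div>
    <small>120 out of 141 people found the following review useful:</small>
    <a href="/user/ur0511587/">first</a>
    <img alt="10/10" src="100.gif">
  </div>
  <p>First review text.</p>
  <div>
    <small>3 out of 4 people found the following review useful:</small>
    <a href="/user/ur1318093/">second</a>
  </div>
  <p>Second review text.</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
important = soup.find("div", id="tn15content")

# Flat lists lose the pairing: two users but only one rating.
flat_users = important.select('a[href^="/user/ur"]')
flat_ratings = important.select('img[alt*="/10"]')
print(len(flat_users), len(flat_ratings))

# Searching inside each review's div keeps user and rating together.
pairs = []
for small in important.find_all("small", string=re.compile("review useful:")):
    div = small.parent
    link = div.select_one('a[href^="/user/ur"]')
    img = div.select_one('img[alt*="/10"]')
    user_id = link["href"].split("ur")[1].rstrip("/")
    pairs.append((user_id, img["alt"] if img else "N/A"))

print(pairs)
```

With the flat lists there is no way to tell which of the two users the single rating belongs to; the per-div loop records "N/A" for the review that has none.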

You are also doing more work than necessary to get the page source; a single GET request is all you need. The complete code would be:

import requests
from bs4 import BeautifulSoup
import re

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36'}
url = 'http://www.imdb.com/title/tt0082158/reviews?ref_=tt_urv'

soup = BeautifulSoup(requests.get(url, headers=headers).content, "lxml")


important = soup.find("div", id='tn15content')

for small in important.find_all("small", string=re.compile("review useful:")):
    div = small.parent
    user_id = div.select_one('a[href^="/user/ur"]')["href"].split("ur")[1].rstrip("/")
    rating = div.select_one('img[alt*="/10"]')
    print(user_id, rating["alt"] if rating else "N/A")
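
The split("ur")[1] step in the code above relies on the href having exactly the form /user/ur<digits>/. As a possible alternative, not part of the original answer, a regular expression can pull out the digits and return None for anything that is not a user link. The helper name extract_user_id and the pattern are assumptions for this sketch.

```python
import re

# Assumed pattern: user paths look like /user/ur0511587/, as in the question.
USER_ID_RE = re.compile(r"/user/ur(\d+)")


def extract_user_id(href):
    """Return the digits after 'ur', or None if href is not a user link."""
    match = USER_ID_RE.search(href or "")
    return match.group(1) if match else None


print(extract_user_id("/user/ur0511587/"))
print(extract_user_id("/title/tt0082158/"))
```

Inside the loop this would be called as extract_user_id(link["href"]) once the anchor has been found.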

To get the review text, just find the next p after the div:

for small in important.find_all("small", string=re.compile("review useful:")):
    div = small.parent
    user_id = div.select_one('a[href^="/user/ur"]')["href"].split("ur")[1].rstrip("/")
    rating = div.select_one('img[alt*="/10"]')
    print(user_id, rating["alt"] if rating else "N/A")
    print(div.find_next("p").text.strip())

This will give you output like the following:

('0511587', '10/10')
I happened to be flipping channels today and saw this was on.  Since it
had
been several years since I last saw it I clicked it on, but didn't mean to
stay.  As it happened, I found this film to be just as gripping now as it
was before.  My own kids started watching it, too, and enjoyed it - which
was even more satisfying for me considering the kind of current junk
they're
used to.  No, this is not an action-packed thriller, nor are there juicy
love scenes between Abrahams and his actress girlfriend.  There is no
"colorful" language to speak of; no politically correct agenda underlying
its tale of a Cambridge Jew and Scottish Christian.This is a story about what drives people internally - what pushes them to
excel or at least to make the attempt to do so.  It is a story about
personal and societal values, loyalty, faith, desire to be accepted in
society and healthy competition without the utter selfishness that
characterizes so much of the athletic endeavors of our day.  Certainly the
characters are not alike in their motivation, but the end result is the
same
as far as their accomplishments.My early adolescent son (whose favorite movies are all of the Star Wars
movies and The Matrix) couldn't stop asking questions throughout the movie
he was so hooked.  It was a great educational opportunity as well as
entertainment.  If you've never seen this film or it's been a long time, I
recommend it unabashedly, regardless of the labels many have tried to give
it for being slow-paced or causing boredom.  In addition to the great
story
- based on real people and events - the photography and the music are
fabulous and moving.  It's no mistake that this movie has been spoofed and
otherwise stolen from in the last twenty years - it's an unforgettable
movie
and in my opinion its bashers are those who hate Oscar winners on
principle
or who don't like the philosophies espoused by its protagonists.
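
Since the reviews are being gathered for research, it may be more convenient to store each one as a row than to print it. The following is a rough sketch of that idea using only the standard library; the field names, the shortened review text, and the placeholder for the second row are assumptions, while the ids and ratings are copied from the output shown earlier.

```python
import csv
import io

# Sample rows: ids and ratings come from the output above; the text is shortened.
rows = [
    {"user_id": "0511587", "rating": "10/10",
     "text": "I happened to be flipping channels today and saw this was on."},
    {"user_id": "1318093", "rating": "N/A",
     "text": "(review text omitted)"},
]

# An in-memory buffer stands in for a real file here.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["user_id", "rating", "text"])
writer.writeheader()
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```

In the scraping loop, each iteration would append one such dictionary to rows instead of calling print, and the buffer could be replaced with open("reviews.csv", "w", newline="") to write a file.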