Scraping text by HTML class with BeautifulSoup returns null

Asked: 2013-10-09 03:38:36

Tags: python beautifulsoup

I am trying to get all of the review text from this page (http://www.amazon.com/Learning-Java-Patrick-Niemeyer/dp/1449319246%3FSubscriptionId%3DAKIAIZJQKUHUCXRLH6MQ%26tag%3Dyuplayit-20%26linkCode%3Dxm2%26camp%3D2025%26creative%3D165953%26creativeASIN%3D1449319246), that is, the text inside the <div class="drkgry">....</div> tags, but the result is always []. I don't know what is going on.

Python:

from bs4 import BeautifulSoup
data = open("example_1.html").read()
soup = BeautifulSoup(data)
soup.find_all("div",class="drkgry")

I also tried soup.findall("div",class="drkgry") and soup.find_all('div', attrs={'class':'drkgry'}), but neither of them worked.

The source data I want to scrape:

</div>  <div class="txtsmall mt4 fvavp"><span class="inlineblock formatVariation"><span class="gr3 gry formatKey">Format:</span><span class="formatValue">Paperback</span></span></div>  <div class="mt9 reviewText">






<div class="drkgry">
  Learning Java (Fourth Edition) is book for Java practitioner as reference book. This covers lot of topics.<br><br>This is an excellent book for someone who knows basics of programming. This book is not beginners. This book lacks examples and exercises which may disappoint few people.<br><br>Book has 24 chapters covering almost all of basic Java.  The chapter one talks about historical aspects. Second chapter is brief introduction of java but it assumes that reader is aware of programming, OOP, threading etc which is difficult for any beginner.
</div>

</div>  <div class="clearboth txtsmall gt9 vtStripe">    <div class="fl cmt">

Can anyone help me solve this?

2 Answers:

Answer 0: (score: 3)

I ran this exact script:

import urllib
from bs4 import BeautifulSoup as BS

html = urllib.urlopen('http://www.amazon.com/dp/1449319246/?tag=stackoverfl08-20').read()

soup = BS(html)

print soup.findAll('div',{'class':'drkgry'})[1].get_text()

and it printed:

  

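Learning Java (Fourth Edition) is book for Java practitioner as reference book. This covers lot of topics.This is an excellent book for someone who knows basics of programming. This book is not beginners. This book lacks examples and exercises which may disappoint few people.Book has 24 chapters covering almost all of basic Java.  The chapter one talks about historical aspects. Second chapter is brief introduction of java but it assumes that reader is aware of programming, OOP, threading etc which is difficult for any beginner.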

If you run it without indexing the result of soup.findAll, it gives you a list of all the matching review divs on the page.
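
To see the difference between the indexed and the unindexed call without fetching the live page, here is a minimal sketch in Python 3 syntax. The markup in SAMPLE_HTML is made up for illustration (only the drkgry class name comes from the question), and passing "html.parser" is my own choice of parser, not something the script above specifies.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Amazon page; only the class name is from the question.
SAMPLE_HTML = """
<div class="mt9 reviewText"><div class="drkgry">First review.<br><br>Second paragraph.</div></div>
<div class="mt9 reviewText"><div class="drkgry">Another review.</div></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find_all (findAll is the older alias) returns a list-like ResultSet of every match.
reviews = soup.find_all("div", {"class": "drkgry"})

# Indexing picks a single element; get_text() drops the tags, including <br>.
second_review = reviews[1].get_text()

# Without indexing, loop over the list to collect the text of every review.
# The " " separator keeps the text on either side of a <br> from running together.
all_texts = [div.get_text(" ", strip=True) for div in reviews]

print(len(reviews))
print(all_texts)
```

Calling get_text() with no arguments joins the text pieces with nothing in between, which is why sentences separated only by <br> end up touching in the output above.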

Answer 1: (score: 0)

Use:

class_="drkgry"

instead of:

class = "drkgry"

That's all I can think of.
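
The reason is that class is a reserved word in Python, so it cannot be used as a keyword argument name; BeautifulSoup 4 accepts class_ in its place. A minimal sketch, assuming Python 3 and a made-up one-line HTML sample (the SAMPLE_HTML string and the "html.parser" argument are my own choices, not from the question):

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; only the drkgry class name is from the question.
SAMPLE_HTML = '<div class="drkgry">review text</div><div class="other">x</div>'
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# `class` is a Python keyword, so the call from the question does not even parse.
try:
    compile('soup.find_all("div", class="drkgry")', "<example>", "eval")
    parses = True
except SyntaxError:
    parses = False

# Two equivalent ways to filter on the CSS class.
by_keyword = soup.find_all("div", class_="drkgry")
by_attrs = soup.find_all("div", attrs={"class": "drkgry"})

print(parses)
print([tag.get_text() for tag in by_keyword])
print([tag.get_text() for tag in by_attrs])
```

Both spellings select the same elements here. Since the question reports that the attrs form also returned [], the keyword may not be the whole explanation in that case.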