BeautifulSoup: reading specific hrefs

Time: 2016-11-06 23:19:01

Tags: python-3.x beautifulsoup

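I want to read Shakespeare's plays from the web page below and collect the data into a data frame for further analysis: http://shakespeare.mit.edu/cymbeline/index.html

I am using BeautifulSoup to read the hyperlinks that lead to the page for each act, where I can collect the data. I use the code below to collect the hyperlinks for each act as a list: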

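> from urllib.request import urlopen
> from bs4 import BeautifulSoup
>
> play1 = "http://shakespeare.mit.edu/cymbeline/index.html"
> play = urlopen(play1).read()  # raw HTML of the index page
> soup = BeautifulSoup(play, "lxml")
> tr_act = soup.find_all("a")  # every anchor tag on the page
> for i in tr_act:
>     print(i.get('href'))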

Because of the HTML page structure, I also get some other items in the list that I don't need:

/Shakespeare
http://www.amazon.com/gp/product/1903436028?ie=UTF8&tag=theinteclasar-20&linkCode=as2&camp=1789&creative=9325&creativeASIN=1903436028
full.html
cymbeline.1.1.html
cymbeline.1.2.html....

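How can I programmatically avoid reading the first 3 href elements in my scraper code? The HTML structure is quite subtle, and I can't work out how to organize my code to get only the links I want: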

> <p>You can buy the Arden text of this play from the Amazon.com online bookstore:
> <a href="https://rads.stackoverflow.com/amzn/click/com/1903436028" rel="nofollow noreferrer">Cymbeline: Second Series - Paperback (The Arden Shakespeare. Second Series)</a></p>
> <p><a href="full.html">Entire play</a> in one page</p>
> <p>
>      Act 1, Scene 1: <a href="cymbeline.1.1.html">Britain. The garden of Cymbeline's palace.</a><br>
>      Act 1, Scene 2: <a href="cymbeline.1.2.html">The same. A public place.</a><br>
>      Act 1, Scene 3: <a href="cymbeline.1.3.html">A room in Cymbeline's palace.</a><br>
>      Act 1, Scene 4: <a href="cymbeline.1.4.html">Rome. Philario's house.</a><br>

1 answer:

Answer 0 (score: 1)

Try the following:

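One way to skip the unwanted links is to keep only the hrefs that look like scene pages, rather than dropping the first three by position. The sketch below is a minimal illustration, not necessarily the approach the original answer took: the pattern `SCENE_HREF` and the helper `scene_urls` are hypothetical names, and the pattern assumes every scene link has the form `cymbeline.<act>.<scene>.html`, as in the output shown in the question.

```python
import re
from urllib.parse import urljoin

# Assumed naming scheme, taken from the hrefs listed in the question:
# cymbeline.<act>.<scene>.html
SCENE_HREF = re.compile(r"^cymbeline\.\d+\.\d+\.html$")


def scene_urls(hrefs, base_url):
    """Keep only scene links and resolve them against the index page URL."""
    return [
        urljoin(base_url, href)
        for href in hrefs
        if href and SCENE_HREF.match(href)  # skips None and non-scene links
    ]


# The hrefs printed by the question's loop.
hrefs = [
    "/Shakespeare",
    "http://www.amazon.com/gp/product/1903436028?ie=UTF8&tag=theinteclasar-20"
    "&linkCode=as2&camp=1789&creative=9325&creativeASIN=1903436028",
    "full.html",
    "cymbeline.1.1.html",
    "cymbeline.1.2.html",
]

base = "http://shakespeare.mit.edu/cymbeline/index.html"
for url in scene_urls(hrefs, base):
    print(url)
```

With the question's code, the input list would be `[a.get('href') for a in soup.find_all("a")]`. BeautifulSoup also accepts a compiled pattern as an attribute filter, so `soup.find_all("a", href=SCENE_HREF)` should return just the scene anchors directly. Matching on the link pattern keeps working if the page adds or removes links above the scene list, which a fixed "skip three" slice would not.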