How to get specific text from a website using Python 2.7

Asked: 2017-09-04 16:09:24

Tags: python python-2.7 web-scraping beautifulsoup

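I want to create a simple program that collects plain text from a website. For example, if the user wants the lyrics of a song, how can I make the program collect them? For example: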

  

https://www.azlyrics.com/lyrics/runthejewels/closeyoureyesandcounttofuck.html   How can I collect the lyrics section from this page?

1 answer:

Answer 0: (score: 1)

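You can use requests to fetch the HTML and then parse it with BeautifulSoup. The following looks for the HTML comment that comes just before the lyrics, then takes the parent <div> that contains it. The text can be extracted from that element: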

import requests
from bs4 import BeautifulSoup, Comment

url = "https://www.azlyrics.com/lyrics/runthejewels/closeyoureyesandcounttofuck.html"
# Send a browser-like User-Agent header with the request
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.75 Safari/537.36'}
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.content, "html.parser")

# Walk over every HTML comment and pick the one that sits just before the lyrics
for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
    if "Usage of azlyrics.com content" in comment:
        # The comment's parent <div> holds the lyrics text
        print(comment.parent.text)

This gives you output starting with:

[Zack De La Rocha:]
Run them jewels fast, run them, run them jewels fast
...
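
To try the comment-then-parent lookup without hitting the network, here is a minimal sketch that runs the same idea against an inline HTML string. The sample markup, the marker text, and the helper name text_near_comment are all made up for illustration; the real page's markup may differ, so treat this as a starting point rather than a finished scraper.

```python
from __future__ import print_function  # print() behaves the same on Python 2.7 and 3

from bs4 import BeautifulSoup, Comment

# Hypothetical stand-in for a downloaded page; not the real site's markup.
SAMPLE_HTML = """
<html><body>
<div class="header">Site navigation</div>
<div>
<!-- Usage of example.com content is subject to a licence. -->
First line of the song<br>
Second line of the song
</div>
</body></html>
"""


def text_near_comment(html, marker):
    """Return the text of the element that directly contains a comment
    with `marker` in it, or None if no such comment exists.
    (Hypothetical helper, not part of BeautifulSoup.)"""
    soup = BeautifulSoup(html, "html.parser")
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        if marker in comment:
            # The comment's parent is the enclosing <div>
            return comment.parent.get_text().strip()
    return None


lyrics = text_near_comment(SAMPLE_HTML, "Usage of example.com content")
print(lyrics)
```

Returning None when the marker is missing lets the caller tell "page layout changed" apart from "lyrics are empty", which is useful because this approach depends on a comment that the site could remove at any time.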

If needed, the libraries can be installed as follows:

pip install beautifulsoup4
pip install requests