Using Python to visit links and print data

Date: 2015-06-03 21:55:30

Tags: python web-scraping beautifulsoup

I'm writing a web scraper and trying to retrieve Drake's lyrics. My scraper has to visit one site (the main MetroLyrics page), then visit each individual song link, and then print out the lyrics.

I'm having trouble visiting the second link. I've searched around BeautifulSoup and I'm confused. I was wondering if you could help.

# this is intended to print all of the drake song lyrics on metrolyrics

from pyquery import PyQuery as pq
from lxml import etree
import requests
from bs4 import BeautifulSoup

# this visits the website
response = requests.get('http://www.metrolyrics.com/drake-lyrics.html')

# this separates the different types of content
doc = pq(response.content)

# this finds the titles in the content
titles = doc('.title')

# this visits each title, then prints each verse
for title in titles:
    # this visits each title
    response_title = requests.get(title)
    # this separates the content
    doc2 = pq(response_title.content)
    # this finds the song lyrics
    verse = doc2('.verse')
    # this prints the song lyrics
    print verse.text

In response_title = requests.get(title), Python doesn't recognize that title is a link, which makes sense. But how can I get the actual URL? Thanks for your help.

2 Answers:

Answer 0 (score: 4)

Replace

response_title = requests.get(title)

with

response_title = requests.get(title.attrib['href'])
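The reason this works: iterating a PyQuery selection yields plain lxml elements, not URL strings, and lxml elements follow the ElementTree API, where attributes live in the `.attrib` dict. A minimal offline sketch of that API, using the standard library's `xml.etree.ElementTree` and a made-up HTML fragment in place of the real page:

```python
import xml.etree.ElementTree as ET

# hypothetical fragment standing in for the MetroLyrics page
fragment = ET.fromstring(
    '<p><a class="title" '
    'href="http://www.metrolyrics.com/best-i-ever-had-lyrics-drake.html">'
    'Best I Ever Had</a></p>'
)

for link in fragment.iter('a'):
    # the loop variable is an element object, not a string
    print(type(link).__name__)
    # the URL to pass to requests.get lives in the attrib dict
    print(link.attrib['href'])
```

Passing the element itself to `requests.get` fails precisely because it is an object like this, not a string.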

Full working script (the fix is marked in the comments below):

#!/usr/bin/python

from pyquery import PyQuery as pq
from lxml import etree
import requests
from bs4 import BeautifulSoup

# this visits the website
response = requests.get('http://www.metrolyrics.com/drake-lyrics.html')

# this separates the different types of content
doc = pq(response.content)

# this finds the titles in the content
titles = doc('.title')

# this visits each title, then prints each verse
for title in titles:
    # this visits each title
    #response_title = requests.get(title)
    response_title = requests.get(title.attrib['href'])

    # this separates the content
    doc2 = pq(response_title.content)
    # this finds the song lyrics
    verse = doc2('.verse')
    # this prints the song lyrics
    print verse.text()

Answer 1 (score: 0)

If you want all of the text using BeautifulSoup:

r = requests.get('http://www.metrolyrics.com/drake-lyrics.html')
soup = (a["href"] for a in BeautifulSoup(r.content).find_all("a", "title", href=True))
verses = (BeautifulSoup(requests.get(url).content).find_all("p", "verse") for url in soup)

for verse in verses:
    print([v.text for v in verse])
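Note that both `soup` and `verses` here are generator expressions, so they are lazy: no song page is actually fetched until the final loop iterates over `verses`. A minimal offline sketch of that chained-generator structure, with a hypothetical `pages` dict standing in for the network calls:

```python
# stand-in for "URL -> fetched verses" (no real requests are made)
pages = {
    'http://example.test/song-a': ['verse one', 'verse two'],
    'http://example.test/song-b': ['verse three'],
}

# like the href generator: nothing happens yet
urls = (u for u in pages)
# like the per-page fetch: still nothing happens
verses = (pages[u] for u in urls)

# only this loop drives both generators, one page at a time
for verse_list in verses:
    print(verse_list)
```

This keeps memory use low, but it also means any network error only surfaces inside the loop, page by page.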