Web scraping: trying to retrieve all songs, but only the lyrics of one of them are returned

Posted: 2020-01-21 19:15:37

Tags: python beautifulsoup

I want to scrape the lyrics of all the songs, but the problem is that it only prints the lyrics of one song.

Code snippet:

import requests
from bs4 import BeautifulSoup
import pandas

url = "https://www.lyrics.com/album/3769520/Now+20th+Anniversary%2C+Vol.+2"
req = requests.get(url)
html = req.content
soup = BeautifulSoup(html , 'html.parser')

tags = soup.find_all('strong')
name = ""
Length = len(tags)
Length = Length - 3 # because it gives extra things
newUrl = "https://www.lyrics.com/lyric/35873930/"
for index in range(1 , Length):
    SongName = tags[index].text.replace(" ","")
    FileName = tags[index].text + ".txt"
    newFetechedUrl = newUrl + SongName
#    print(newFetechedUrl)
    req1 = requests.get(newFetechedUrl)
    html1 = req1.content
    soup1 = BeautifulSoup(html1, 'html.parser')
    Lyrics = soup1.find_all("pre", {"id": "lyric-body-text"})
    print(Lyrics[0].text)
    req2 = requests.get(url)
    html2 = req2.content
    soup2 = BeautifulSoup(html2, 'html.parser')
    tags = soup2.find_all('strong')
#    print(tags[index].text.replace(" ",""))
    File = open(FileName,"w")
    File.close()

I want all the songs from that page, but I don't know why it only gives the lyrics of the first song.

2 Answers:

Answer 0 (score: 3)

Using BeautifulSoup in a linear fashion (or, more precisely, pinging or scraping the website with the requests module), especially repeatedly, can be slow and inefficient. I made a few changes to your code, adding multi-threading to shorten the execution time and to make it easier to read.

import requests
import concurrent.futures
from bs4 import BeautifulSoup

HOST = "https://www.lyrics.com"
url = HOST + "/album/3769520/Now+20th+Anniversary%2C+Vol.+2"

# Parse the initial 'album' website
req = requests.get(url)
html = req.content
soup = BeautifulSoup(html, 'html.parser')

# Find all song links on the 'album' site - these can be found under
# the 'strong' tags, inside 'a' tags
links = [tag.a["href"] for tag in soup.find_all('strong')[1:-3] if tag.a]

def getLyrics(url):
    url = HOST + url  # songs are found on the HOST website
    # Parse the 'song' site
    req = requests.get(url)
    html = req.content
    soup = BeautifulSoup(html, 'html.parser')
    # Obtain the lyrics, which can be found under the 'pre' tag
    return soup.find('pre').text

# Use multi-threading for faster performance - a small run-down:
# max_workers = number of threads - we use an individual thread for each song
with concurrent.futures.ThreadPoolExecutor(max_workers=len(links)) as executor:
    # submit one 'getLyrics' task per song so the requests run concurrently...
    futures = [executor.submit(getLyrics, link) for link in links]
    # ...then collect the lyrics as each thread finishes
    for future in futures:
        lyrics = future.result()
        # do whatever with the lyrics ... here they are simply printed
        print(lyrics)

The concurrent.futures module provides a nice interface for multi-threading, which you can read more about in its documentation - here.

Of course, you can still modify this further, make it more efficient, and change it to suit your needs - but this should be a basic solution to your problem.
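
As an illustration of the same idea, here is a minimal sketch using executor.map instead of submit; it assumes the getLyrics function and the links list defined above:

import concurrent.futures

# a minimal sketch, assuming 'getLyrics' and 'links' are defined as above;
# map() runs getLyrics once per link on the thread pool and yields the
# lyrics in the same order as the input links
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as executor:
    for lyrics in executor.map(getLyrics, links):
        print(lyrics)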

Answer 1 (score: 1)

This code extracts every song title and its lyrics from the page and stores them in a dictionary keyed by the song title:

import requests
from bs4 import BeautifulSoup

BASE_URL = "https://www.lyrics.com"

def get_song_and_lyrics(path):
    new_url = BASE_URL + path
    r = requests.get(new_url)
    soup = BeautifulSoup(r.content , 'html.parser')
    return soup.find('h1').text, soup.find('pre').text


url = BASE_URL + "/album/3769520/Now+20th+Anniversary%2C+Vol.+2"

r = requests.get(url)
soup = BeautifulSoup(r.content , 'html.parser')
tags = soup.find_all('strong')

song_links = []
# iterate over each song entry and grab the link to the lyrics
for s in tags:
    link = s.find('a')
    if link and link['href'].startswith('/lyric'):
        song_links.append(link['href'])

songs = {}
# then we iterate over all the lyric links and get the lyrics for each song
# those lyrics are then stored in songs[song_title]
for l in song_links:
    song,lyrics = get_song_and_lyrics(l)
    songs[song] = lyrics

For example:

print(songs['Toxic'])

will print the lyrics for "Toxic".

In the get_song_and_lyrics function, we pass in the relative path to a song's lyrics and create a new soup object from that page's content. The song title is stored in the first <h1> element and the lyrics in the first <pre> element.
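
Since your original code was also opening one .txt file per song, a minimal sketch of that last step could look like this; it assumes the songs dictionary built above, and the filename scheme (song title with unsafe characters stripped) is only an illustration, not part of either answer:

import re

# a minimal sketch, assuming 'songs' is the {title: lyrics} dictionary built above
for title, lyrics in songs.items():
    # strip characters that are not safe in filenames (hypothetical naming scheme)
    filename = re.sub(r'[^\w\- ]', '', title) + ".txt"
    with open(filename, "w", encoding="utf-8") as f:
        f.write(lyrics)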