I want to scrape the lyrics of every song, but my code only scrapes the lyrics of one song.
Code snippet:
import requests
from bs4 import BeautifulSoup
import pandas
url = "https://www.lyrics.com/album/3769520/Now+20th+Anniversary%2C+Vol.+2"
req = requests.get(url)
html = req.content
soup = BeautifulSoup(html , 'html.parser')
tags = soup.find_all('strong')
name = ""
Length = len(tags)
Length = Length - 3 # because it gives extra things
newUrl = "https://www.lyrics.com/lyric/35873930/"
for index in range(1 , Length):
    SongName = tags[index].text.replace(" ","")
    FileName = tags[index].text + ".txt"
    newFetechedUrl = newUrl + SongName
    # print(newFetechedUrl)
    req1 = requests.get(newFetechedUrl)
    html1 = req1.content
    soup1 = BeautifulSoup(html1, 'html.parser')
    Lyrics = soup1.find_all("pre", {"id": "lyric-body-text"})
    print(Lyrics[0].text)
    req2 = requests.get(url)
    html2 = req2.content
    soup2 = BeautifulSoup(html2, 'html.parser')
    tags = soup2.find_all('strong')
    # print(tags[index].text.replace(" ",""))
    File = open(FileName,"w")
    File.close()
I want the lyrics of every song on that page, but I don't understand why it only gives me the lyrics of the first song.
Answer 0 (score: 3)
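Using BeautifulSoup in a linear fashion (or, more precisely, using the requests module to ping or scrape a website) can be slow and inefficient, especially when done many times over. I made a few changes to your code, adding multithreading to cut the execution time and making it easier to read.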
import requests
import concurrent.futures
from bs4 import BeautifulSoup

HOST = "https://www.lyrics.com"  # song links on the album page are relative to this host
url = HOST + "/album/3769520/Now+20th+Anniversary%2C+Vol.+2"  # album URL from the question

# Parse the initial 'album' page
req = requests.get(url)
html = req.content
soup = BeautifulSoup(html, 'html.parser')

# Find all song links on the 'album' page - these can be found under
# the 'strong' tag, then the 'a' tag
links = [tag.a["href"] for tag in soup.find_all('strong')[1:-3]]

def getLyrics(url):
    url = HOST + url  # songs are found on the HOST website
    # Parse the 'song' page
    req = requests.get(url)
    html = req.content
    soup = BeautifulSoup(html, 'html.parser')
    # Obtain the lyrics, which can be found under the 'pre' tag
    return soup.find('pre').text

# Use multi-threading for faster performance - a quick rundown:
# max_workers = number of threads - we use an individual thread for each song
with concurrent.futures.ThreadPoolExecutor(max_workers=len(links)) as executor:
    # submit every song first so the requests actually run concurrently ...
    futures = [executor.submit(getLyrics, link) for link in links]
    # ... then collect the lyrics in album order
    for future in futures:
        lyrics = future.result()
        # do whatever with the lyrics ... I simply printed them
        print(lyrics)
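The concurrent.futures module provides a nice interface for multithreading; you can read more about it in its documentation.
Of course, you can still modify this further to make it more efficient and adapt it to your needs, but it should serve as a basic solution to your problem.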
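As a small illustration of the thread pool idea, here is a minimal sketch that runs without any network access. A future's result() blocks until that task finishes, so the calls only overlap if all of them are scheduled before any result is awaited; executor.map takes care of that and returns results in input order. The function fake_get_lyrics and the link paths are made up for this example and only stand in for a real request.

```python
import concurrent.futures
import time

def fake_get_lyrics(path):
    # Hypothetical stand-in for a real fetch: sleeps instead of making an HTTP request
    time.sleep(0.1)
    return "lyrics for " + path

# Made-up relative links, shaped like the hrefs found on an album page
links = ["/lyric/1/Song+A", "/lyric/2/Song+B", "/lyric/3/Song+C"]

with concurrent.futures.ThreadPoolExecutor(max_workers=len(links)) as executor:
    # map() schedules every call up front and yields results in input order
    results = list(executor.map(fake_get_lyrics, links))

for lyrics in results:
    print(lyrics)
```

With a real fetch function in place of the stand-in, the same pattern should apply, although a smaller max_workers value is probably kinder to the website.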
Answer 1 (score: 1)
This code extracts every song title and its lyrics from the album page's links and stores them in a dictionary keyed by song title:
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://www.lyrics.com"

def get_song_and_lyrics(path):
    # Build the full URL from the relative path and parse the song page
    new_url = BASE_URL + path
    r = requests.get(new_url)
    soup = BeautifulSoup(r.content, 'html.parser')
    # The title is in the first <h1>, the lyrics in the first <pre>
    return soup.find('h1').text, soup.find('pre').text

url = BASE_URL + "/album/3769520/Now+20th+Anniversary%2C+Vol.+2"
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
tags = soup.find_all('strong')

song_links = []
# iterate over each song entry and grab the link to the lyrics
for s in tags:
    link = s.find('a')
    if link and link['href'].startswith('/lyric'):
        song_links.append(link['href'])

songs = {}
# then we iterate over all the lyric links and get the lyrics for each song
# those lyrics are then stored in songs[song_title]
for path in song_links:
    song, lyrics = get_song_and_lyrics(path)
    songs[song] = lyrics
For example:
print(songs['Toxic'])
will print the lyrics of "Toxic".
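The question also tried to write each song to its own .txt file. A minimal sketch of that step, assuming a dictionary shaped like songs (title mapped to lyrics), might look as follows. The sample data, the safe_filename helper, and the temporary output directory are all made up for illustration.

```python
import os
import re
import tempfile

# Hypothetical sample data shaped like the songs dictionary (title -> lyrics)
songs = {"Song One": "first line\nsecond line", "What/Ever?": "la la la"}

def safe_filename(title):
    # Replace characters that are awkward in file names with underscores
    return re.sub(r"[^\w\- ]", "_", title).strip() + ".txt"

out_dir = tempfile.mkdtemp()  # assumption: write into a temporary directory
written = []
for title, lyrics in songs.items():
    path = os.path.join(out_dir, safe_filename(title))
    # Unlike open() followed directly by close(), this actually writes the text
    with open(path, "w", encoding="utf-8") as f:
        f.write(lyrics)
    written.append(path)
```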
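In the get_song_and_lyrics function, we pass in the relative path to a song's lyrics and create a new soup object from that page's content. The song title is stored in the first <h1> element and the lyrics in the first <pre> element.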
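The link filtering and the relative-path handling can be tried out without touching the website. The sketch below uses made-up href values and the standard library's urljoin in place of plain string concatenation. It presumably also hints at why building every URL from one fixed prefix, as in the question, keeps landing on the same song: each song page has its own path, which has to be read from the album page.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.lyrics.com"

# Made-up hrefs for illustration; only the '/lyric' ones point at song pages
hrefs = [
    "/lyric/111/Artist/Song+One",
    "/artist/Some-Artist/222",
    "/lyric/333/Artist/Song+Two",
]

# Keep only the song links, as the loop over the 'strong' tags does
song_links = [h for h in hrefs if h.startswith("/lyric")]

# Turn each relative path into an absolute URL
full_urls = [urljoin(BASE_URL, h) for h in song_links]

for u in full_urls:
    print(u)
```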