I'm somewhat of a coding newbie, and I've been trying to scrape Rap Genius's Andre 3000 lyrics using Beautiful Soup (a Python library for pulling data out of HTML and XML files). My end goal is to have the data in string format. Here is what I have so far:
from bs4 import BeautifulSoup
from urllib2 import urlopen

artist_url = "http://rapgenius.com/artists/Andre-3000"

def get_song_links(url):
    html = urlopen(url).read()
    # print html
    soup = BeautifulSoup(html, "lxml")
    container = soup.find("div", "container")
    song_links = [BASE_URL + dd.a["href"] for dd in container.findAll("dd")]
    print song_links

get_song_links(artist_url)

for link in soup.find_all('a'):
    print(link.get('href'))
So I need help with the rest of the code. How do I get his lyrics into string format? And how do I then use the Natural Language Toolkit (NLTK) to tokenize the sentences and words?
Answer 0 (score: 4)
Here is an example of how to grab all of the song links on the page, follow them, and get the lyrics:
from urlparse import urljoin
from bs4 import BeautifulSoup
import requests

BASE_URL = "http://genius.com"
artist_url = "http://genius.com/artists/Andre-3000/"

response = requests.get(artist_url, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36'})

soup = BeautifulSoup(response.text, "lxml")
for song_link in soup.select('ul.song_list > li > a'):
    link = urljoin(BASE_URL, song_link['href'])

    response = requests.get(link)
    soup = BeautifulSoup(response.text, "lxml")
    lyrics = soup.find('div', class_='lyrics').text.strip()

    # tokenize `lyrics` with nltk
Note that the requests module is used here. Also note that the User-Agent header is required, since the site returns 403 - Forbidden without it.
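To see the `soup.find('div', class_='lyrics')` step in isolation, here is a self-contained sketch that runs BeautifulSoup on a hypothetical fragment shaped like a lyrics `<div>` (the real markup on the live site may differ):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment shaped like the lyrics <div> used above;
# the real Genius markup may differ.
html = """
<div class="lyrics">
  <p>So fresh and so clean<br/>Ain't nobody dope as me</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
# get_text("\n") joins the text nodes with newlines; strip() drops
# the surrounding whitespace so only the lyric lines remain.
lyrics = soup.find("div", class_="lyrics").get_text("\n").strip()
print(lyrics)
```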
Answer 1 (score: 1)
First, for each link you will need to download that page and parse it with BeautifulSoup. Then look for a distinguishing attribute on that page that separates the lyrics from the rest of the page content. I found <a data-editorial-state="accepted" data-classification="accepted" data-group="0"> helpful. Then run .find_all on the lyrics page content to get all of the lyric lines. For each line, you can call .get_text() to get the text of that lyric line.
As for NLTK, once it is installed you can import it and parse sentences like this:
from nltk.tokenize import word_tokenize, sent_tokenize
words = [word_tokenize(t) for t in sent_tokenize(lyric_text)]
This will give you a list of all of the words in each sentence.
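To see the shape of that nested list, here is a rough, dependency-free approximation of the sent_tokenize/word_tokenize pipeline. The `naive_sent_tokenize` and `naive_word_tokenize` helpers are hypothetical stand-ins; NLTK's trained models handle far more edge cases (abbreviations, contractions, unusual punctuation):

```python
import re

def naive_sent_tokenize(text):
    # Split on whitespace that follows sentence-ending punctuation.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def naive_word_tokenize(sentence):
    # A word is a run of word characters; punctuation becomes its own token.
    return re.findall(r"\w+|[^\w\s]", sentence)

lyric_text = "The South got something to say. That is all I got to say."
words = [naive_word_tokenize(s) for s in naive_sent_tokenize(lyric_text)]
print(words)
```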
Answer 2 (score: 1)
GitHub/jashanj0tsingh/LyricsScraper.py does basic scraping of lyrics from genius.com into a text file, where each line represents a song. It takes the artist name as input. The generated text file can then easily be fed to your custom NLTK or general-purpose parser to do what you want.
The code is as follows:
# A simple script to scrape lyrics from genius.com based on artist name.
import re
import requests
import time
import codecs
from bs4 import BeautifulSoup
from selenium import webdriver

mybrowser = webdriver.Chrome(r"path\to\chromedriver\binary")  # Path to the WebDriver binary you wish to automate.
user_input = input("Enter Artist Name = ").replace(" ", "+")  # Artist name from the user
base_url = "https://genius.com/search?q=" + user_input  # Append the artist name to the search query
mybrowser.get(base_url)  # Open in the browser

t_sec = time.time() + 60 * 20  # seconds*minutes
while time.time() < t_sec:  # Scroll for a fixed time to reach the bottom of the page. TODO: better condition to detect the end of the page.
    mybrowser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    html = mybrowser.page_source
    soup = BeautifulSoup(html, "html.parser")
    time.sleep(5)

pattern = re.compile(r"[\S]+-lyrics$")  # Keep only links that end with "-lyrics".
pattern2 = re.compile(r"\[(.*?)\]")  # Remove unnecessary text from the lyrics such as [Intro], [Chorus] etc.

with codecs.open('lyrics.txt', 'a', 'utf-8-sig') as myfile:
    for link in soup.find_all('a', href=True):
        if pattern.match(link['href']):
            f = requests.get(link['href'])
            lyricsoup = BeautifulSoup(f.content, "html.parser")
            # lyrics = lyricsoup.find("lyrics").get_text().replace("\n", "")  # Each song on one line.
            lyrics = lyricsoup.find("lyrics").get_text()  # Line by line
            lyrics = re.sub(pattern2, "", lyrics)
            myfile.write(lyrics + "\n")

mybrowser.close()
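The `pattern2` substitution in that script can be tried in isolation; the sample string here is made up for illustration:

```python
import re

pattern2 = re.compile(r"\[(.*?)\]")  # same pattern as in the script above

# Made-up sample with section markers like the ones the script strips out.
sample = "[Intro]\nWell, here it begins\n[Chorus]\nShake it, shake it"
cleaned = re.sub(pattern2, "", sample)
print(cleaned)
```

The non-greedy `(.*?)` matters: with a greedy `(.*)`, a single match would swallow everything from the first `[` to the last `]`, lyrics included.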
Answer 3 (score: 0)
Hope this is still relevant! I did the same thing with Eminem's lyrics, but from lyrics.com. Does it have to be from Rap Genius? I found lyrics.com easier to scrape.
To get Andre 3000, just change the code accordingly.
Here is my code; it grabs the song links, then scrapes those pages for the lyrics and appends the lyrics to a list:
import re
import requests
import nltk
from bs4 import BeautifulSoup

url = 'http://www.lyrics.com/eminem'
r = requests.get(url)
soup = BeautifulSoup(r.content)
gdata = soup.find_all('div', {'class': 'row'})

eminemLyrics = []

for item in gdata:
    title = item.find_all('a', {'itemprop': 'name'})[0].text
    lyricsdotcom = 'http://www.lyrics.com'
    for link in item('a'):
        try:
            lyriclink = lyricsdotcom + link.get('href')
            req = requests.get(lyriclink)
            lyricsoup = BeautifulSoup(req.content)
            lyricdata = lyricsoup.find_all('div', {'id': re.compile('lyric_space|lyrics')})[0].text
            eminemLyrics.append([title, lyricdata])
            print title
            print lyricdata
            print
        except:
            pass
This will give you the lyrics in a list. To print all of the titles:
titles = [i[0] for i in eminemLyrics]
print titles
To get a specific song:
titles.index('Cleaning out My Closet')
120
To tokenize the song, plug that value (120) into:
song = nltk.word_tokenize(eminemLyrics[120][1])
nltk.pos_tag(song)
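The [title, lyrics] lookup described above can be sketched with toy data (the titles and lyric snippets here are placeholders, not scraped results):

```python
# Toy stand-in for the eminemLyrics list that the scraper above builds:
# each entry is a [title, lyrics] pair.
eminemLyrics = [
    ["Lose Yourself", "placeholder lyric one"],
    ["Cleaning out My Closet", "placeholder lyric two"],
]

titles = [pair[0] for pair in eminemLyrics]
idx = titles.index("Cleaning out My Closet")  # position of the song in the list
lyrics_for_song = eminemLyrics[idx][1]        # its lyrics
print(idx)
print(lyrics_for_song)
```

Looking the index up by title this way avoids hard-coding a magic number like 120, which shifts whenever the scrape order changes.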
Answer 4 (score: 0)
Even if you can scrape the website, it doesn't mean that you should. Instead, you can use Genius's API; just create an access token from the Genius API site.

import lyricsgenius as genius  # calling the API

api = genius.Genius('youraccesstokenhere12345678901234567890isreallylongiknow')
artist = api.search_artist('The artist name here')
# You can change the parameters according to your needs. I don't recommend using
# this file directly, because it saves a lot of data that you might not need and
# it will take more time to clean it.
aux = artist.save_lyrics(format='json', filename='artist.txt', overwrite=True, skip_duplicates=True, verbose=True)

# In this case, for example, I just want the title and the lyrics.
titles = [song['title'] for song in aux['songs']]
lyrics = [song['lyrics'] for song in aux['songs']]

thingstosave = []
for i in range(0, 128):
    thingstosave.append(titles[i])
    thingstosave.append(lyrics[i])

with open("C:/whateverfolder/alllyrics.txt", "w") as output:
    output.write(str(thingstosave))
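Since the script above writes str(thingstosave), the file ends up holding a Python list literal. Assuming you later want to load it back into Python (an assumption about how the file is reused), ast.literal_eval from the standard library can parse it safely:

```python
import ast

# Simulate what the script writes: the str() of a flat title/lyrics list
# (titles and lyric snippets here are placeholders).
thingstosave = ["Hey Ya!", "placeholder lyrics", "Roses", "more placeholder lyrics"]
saved = str(thingstosave)

# literal_eval parses Python literals only, so it is safe on untrusted text,
# unlike eval().
restored = ast.literal_eval(saved)
print(restored == thingstosave)  # the round trip preserves the list
```

A format such as JSON would make the file readable from other languages too, but for a pure-Python round trip the list literal works.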