Question

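I'm trying to retrieve the complete lyrics of a band from the web. I noticed that the site builds its URLs using the pattern ".../firstletter/bandname/songname.html".

Here is an example.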

http://www.azlyrics.com/lyrics/acdc/itsalongwaytothetopifyouwannarocknroll.html

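I was thinking of writing a function that calls read.csv on each URL. That part is easy, because I can get the song titles with a simple copy-paste and save them as a .csv. I would then take that vector and pass each value to the function to construct the URL.

But I tried reading the first page just to see what it looks like, and I found there would be too much "data cleaning" if my goal is to build a csv file containing each lyric.

x <- read.csv(url("http://www.azlyrics.com/lyrics/acdc/itsalongwaytothetopifyouwannarocknroll.html"))

I don't think my approach is the best one (or maybe I need a better data-cleaning strategy).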

Answer 1

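The HTML page has a giveaway that tells you where the lyrics begin:

Usage of azlyrics.com content by any third-party lyrics provider is prohibited by our licensing agreement. Sorry about that.

Taking advantage of that, you can detect this string and then read everything up to the end of the div: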

m <- readLines("http://www.azlyrics.com/lyrics/acdc/itsalongwaytothetopifyouwannarocknroll.html")

giveaway <- "Sorry about that."
#You can add the full line in case you think one of the lyrics might have this sentence in it.

start <- grep(giveaway, m) + 1 # Where the lyric starts
end <- grep("</div>", m[start:length(m)])[1] + start - 1
# Take the first </div> after the start of the lyric. grep() returns a position relative to
# m[start:length(m)], so add start - 1 to convert it back to a position in m.

lyrics <- paste(gsub("<br>|</div>", "", m[start:end]), collapse = "\n") 
#This is just an example of how to clear the remaining tags and join the text.

Then:

> cat(lyrics) #using cat() prints the line breaks
Ridin' down the highway
Goin' to a show
Stop in all the byways
Playin' rock 'n' roll 
.
.
.
Well it's a long way
It's a long way, you should've told me
It's a long way, such a long way
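
To tie this back to the goal of one csv row per song, here is a minimal sketch in base R that wraps the marker-based approach in functions. The helper names song_url, extract_lyrics and fetch_lyrics are hypothetical, and the rule for turning a title into a URL (lowercase, keep only letters and digits) is an assumption inferred from the single example URL in the question.

```r
# Build a lyrics URL from a band and song name (hypothetical helper).
# Assumption: the site lowercases names and drops everything except letters and digits.
song_url <- function(band, song) {
  slug <- function(x) gsub("[^a-z0-9]", "", tolower(x))
  paste0("http://www.azlyrics.com/lyrics/", slug(band), "/", slug(song), ".html")
}

# Extract the lyric text from the lines of one page (hypothetical helper).
# Returns NA when the marker or the closing </div> cannot be found.
extract_lyrics <- function(page_lines, giveaway = "Sorry about that.") {
  marker <- grep(giveaway, page_lines, fixed = TRUE)
  if (length(marker) == 0) return(NA_character_)
  start <- marker[1] + 1
  if (start > length(page_lines)) return(NA_character_)
  closing <- grep("</div>", page_lines[start:length(page_lines)], fixed = TRUE)
  if (length(closing) == 0) return(NA_character_)
  end <- start + closing[1] - 1
  # Strip the remaining tags, then drop leading and trailing blank lines
  text <- trimws(gsub("<br ?/?>|</div>", "", page_lines[start:end]))
  trimws(paste(text, collapse = "\n"))
}

# Download one page and extract its lyrics; NA if the page cannot be read.
fetch_lyrics <- function(band, song) {
  page_lines <- tryCatch(readLines(song_url(band, song), warn = FALSE),
                         error = function(e) character(0))
  extract_lyrics(page_lines)
}

# Usage (needs network access, so it is left commented out):
# songs <- c("It's a Long Way to the Top (If You Wanna Rock 'n' Roll)")
# result <- data.frame(
#   song = songs,
#   lyrics = vapply(songs, function(s) fetch_lyrics("AC/DC", s),
#                   character(1), USE.NAMES = FALSE),
#   stringsAsFactors = FALSE
# )
# write.csv(result, "acdc_lyrics.csv", row.names = FALSE)
```

Keeping extract_lyrics separate from the download step means it can be checked against a few saved lines without hitting the site.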

Answer 2

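I assume that by "data cleaning" you mean parsing your way through the HTML tags. I suggest using a DOM scraping library that extracts only the lyric text from the page, and then saving those lyrics to a CSV, a database, or wherever you like. That way you won't have to do any data cleaning. I don't know which programming language you are using, but a simple Google search will show you plenty of DOM querying and parsing libraries for any language. Here is an example for PHP: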

http://simplehtmldom.sourceforge.net/manual.htm

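include 'simple_html_dom.php'; // provides file_get_html()

$html = file_get_html('http://www.azlyrics.com/lyrics/acdc/itsalongwaytothetopifyouwannarocknroll.html');

// Take the second div.ringtone (the index is zero-based) and move to the element that follows it
$lyrics = $html->find('div.ringtone', 1)->next_sibling();
echo $lyrics->innertext;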

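Now you have the lyrics. Save them. (The code is untested.)

If you are using R, use this library. You will be able to query the DOM easily and extract the lyrics: https://github.com/hadley/rvest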
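
As a rough illustration of the rvest route, the sketch below parses a small inline page instead of the live site. The sample HTML is a hypothetical stand-in: it assumes, as the PHP example does, that the lyrics sit in the div right after a div with class "ringtone", and that layout is not verified against the real page.

```r
library(rvest)

# Hypothetical stand-in for a lyrics page: a "ringtone" div followed by the lyrics div.
sample_page <- '<html><body>
<div class="ringtone">ringtone ad</div>
<div>First line<br>Second line<br></div>
</body></html>'

# For a live page, pass the URL to read_html() instead of the string
page <- read_html(sample_page)

# First div that follows the ringtone div, loosely mirroring next_sibling() in the PHP example
lyrics_node <- html_element(
  page,
  xpath = "//div[@class='ringtone']/following-sibling::div[1]"
)

# html_text2() turns <br> tags into line breaks, so no manual tag stripping is needed
lyrics <- html_text2(lyrics_node)
cat(lyrics)
```

Selecting the node by its position in the DOM avoids the line-by-line tag cleanup, at the cost of depending on the page's markup staying the same.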

Retrieving entire lyrics from a URL
