Question

I have a page that contains links to .mp3 / .wav files in this format:

<a href="http://siteName/subfolder/filename.mp3">File Name</a>

I need to write a script that downloads all of these files, instead of downloading each one by hand.

I know I could do something like this with regular expressions, but I don't know how. What would be the best choice (Java, C#, JavaScript)?

Any help would be appreciated.

Thanks in advance.

Answer 1

You could use SgmlReader to parse the DOM and extract all the anchor links, then download the corresponding resources:

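using System.Net;   // WebClient
using System.Xml;   // XmlDocument, XmlAttribute
using Sgml;         // SgmlReader

class Program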
{
    static void Main()
    {
        using (var reader = new SgmlReader())
        {
            reader.DocType = "HTML";
            reader.Href = "http://www.example.com";
            var doc = new XmlDocument();
            doc.Load(reader);
            var anchors = doc.SelectNodes("//a/@href[contains(., 'mp3') or contains(., 'wav')]");
            foreach (XmlAttribute href in anchors)
            {
                using (var client = new WebClient())
                {
                    var data = client.DownloadData(href.Value);
                    // TODO: do something with the downloaded data
                }
            }
        }
    }
}

Answer 2

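Well, if you want to go hardcore, and if you're OK with PHP, I think parsing the page with DOMDocument (http://php.net/manual/en/class.domdocument.php) and retrieving the files with cURL would do the job.

How many files are we talking about here?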
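
As a rough illustration of the parse-then-fetch idea above, here is the parsing half sketched in Python rather than PHP, to match the other Python code in this thread. It uses only the standard library. The names `AudioLinkParser`, `extract_audio_links` and `AUDIO_EXTENSIONS` are made up for this example, and it assumes that only links whose path ends in .mp3 or .wav are of interest.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Assumption: only these two file types matter, as in the question.
AUDIO_EXTENSIONS = (".mp3", ".wav")


class AudioLinkParser(HTMLParser):
    """Collects the href of every <a> tag that points at an audio file."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return  # anchor without a usable href, e.g. <a name="top">
        # Resolve relative links against the page URL.
        absolute = urljoin(self.base_url, href)
        # Check only the path, so a query string does not hide the extension.
        if urlparse(absolute).path.lower().endswith(AUDIO_EXTENSIONS):
            self.links.append(absolute)


def extract_audio_links(html, base_url):
    """Return absolute URLs of all .mp3 / .wav links found in the HTML text."""
    parser = AudioLinkParser(base_url)
    parser.feed(html)
    return parser.links


if __name__ == "__main__":
    sample = '<a href="http://siteName/subfolder/filename.mp3">File Name</a>'
    print(extract_audio_links(sample, "http://siteName/"))
```

Using a real HTML parser instead of a regular expression means attribute order, quoting style and extra whitespace in the markup should not matter.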

Answer 3

Python's Beautiful Soup library is well suited to this task: http://www.crummy.com/software/BeautifulSoup/

It can be used like this:

import urllib2, re
from BeautifulSoup import BeautifulSoup

#open the URL
page = urllib2.urlopen("http://www.foo.com")
#parse the page
soup = BeautifulSoup(page)
#get all anchor elements that actually have an href attribute
anchors = soup.findAll("a", href=True)
#filter anchors based on their href attribute
filteredAnchors = filter(lambda a : re.search(r"\.(wav|mp3)", a["href"]), anchors)
urlsToDownload = map(lambda a : a["href"], filteredAnchors)
#download each anchor url...

For how to download an mp3 from its URL, see: How do I download a file over HTTP using Python?
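
For completeness, here is a minimal sketch of the download step itself. It is written for Python 3 with only the standard library, so it is a separate snippet from the Python 2 code above. `download_all`, `filename_from_url`, the `downloads` target directory and the `download.bin` fallback name are all hypothetical choices for this example. It assumes the URLs are absolute and that each file name is unique.

```python
import os
import shutil
from urllib.parse import unquote, urlparse
from urllib.request import urlopen


def filename_from_url(url):
    """Derive a local file name from the last path segment of the URL."""
    name = os.path.basename(unquote(urlparse(url).path))
    return name or "download.bin"  # hypothetical fallback when the path has no file name


def download_all(urls, target_dir="downloads"):
    """Download every URL into target_dir and return the saved file paths."""
    os.makedirs(target_dir, exist_ok=True)
    saved = []
    for url in urls:
        destination = os.path.join(target_dir, filename_from_url(url))
        # Stream the response to disk instead of holding the whole file in memory.
        with urlopen(url) as response, open(destination, "wb") as out_file:
            shutil.copyfileobj(response, out_file)
        saved.append(destination)
    return saved
```

Calling `download_all(["http://siteName/subfolder/filename.mp3"])` would save the file as `downloads/filename.mp3`. Two URLs with the same file name would overwrite each other, so a real script would probably want to handle that case, along with network errors.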

Make a script to download all MP3 files from a page
