Scraping text from a list of URLs (BeautifulSoup4)

Asked: 2013-01-24 22:56:13

Tags: python python-2.7 beautifulsoup urllib2

I have a working text scraper for a single URL. The problem is that I need to scrape 25 more URLs. These URLs are almost identical; the only difference is the last letter. Here is the code, to make things clearer:

import urllib2
from bs4 import BeautifulSoup

f = open('(path to file)/names', 'a')
links = ['http://guardsmanbob.com/media/playlist.php?char='+ chr(i) for i in range(97,123)]

response = urllib2.urlopen(links[0]).read()
soup = BeautifulSoup(response)

for tr in soup.findAll('tr'):
    if not tr.find('td'): continue
    for td in tr.find('td').findAll('a'):
        f.write(td.contents[0] + '\n')

I can't make this script run all the URLs from the list in one go. All I have managed to get is the first song title from each URL. Sorry for my bad English; I hope you understand me.

1 answer:

Answer 0 (score: 1)

I can't make this script run all the URLs from the list in one go.

Wrap your code in a function that takes a single parameter, *args (or any name you like, just don't forget the *). In the function definition, the * packs all positional arguments into a tuple; at the call site, prefixing a list with * unpacks it into separate arguments. The * has no official name, but some people (including me) like to call it the splat operator.

def start_download(*args):
    for value in args:
        ##for debugging purposes
        ##print value

        response = urllib2.urlopen(value).read()
        ##put the rest of your code here

if __name__ == '__main__':
    links = ['http://guardsmanbob.com/media/playlist.php?char='+ 
              chr(i) for i in range(97,123)]

    start_download(*links)
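
Note the * at the call site: without it, args would be a one-element tuple whose only item is the whole list, and urlopen would be handed the list itself rather than a URL string. A minimal demo of the difference (a throwaway function, just for illustration):

def show(*args):
    ## args is always a tuple of the positional arguments
    print len(args), args

urls = ['url-a', 'url-b', 'url-c']
show(urls)   ## 1 (['url-a', 'url-b', 'url-c'],) -- one argument: the list itself
show(*urls)  ## 3 ('url-a', 'url-b', 'url-c')    -- the splat unpacks the list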

Edit: Or you can simply iterate over the list of links directly and download each one:

links = ['http://guardsmanbob.com/media/playlist.php?char=' +
         chr(i) for i in range(97, 123)]

for link in links:
    response = urllib2.urlopen(link).read()
    ## put the rest of your code here
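
Putting this loop together with the row-scraping code from the question, a minimal sketch could look like the following (it assumes the same table layout as the original pages, keeps the question's placeholder path, and opens the output file once in append mode so all 26 pages end up in one file):

import urllib2
from bs4 import BeautifulSoup

links = ['http://guardsmanbob.com/media/playlist.php?char=' +
         chr(i) for i in range(97, 123)]

with open('(path to file)/names', 'a') as f:
    for link in links:
        response = urllib2.urlopen(link).read()
        soup = BeautifulSoup(response)
        for tr in soup.findAll('tr'):
            td = tr.find('td')  ## first cell of the row, as in the question
            if not td:
                continue
            for a in td.findAll('a'):
                ## encode to UTF-8 bytes so non-ASCII titles don't raise in Python 2
                f.write(a.contents[0].encode('utf-8') + '\n')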

编辑2:

To grab all the links and then save them to a file, here is the whole code with comments at the relevant points:

import urllib2
from bs4 import BeautifulSoup, SoupStrainer

links = ['http://guardsmanbob.com/media/playlist.php?char='+ 
          chr(i) for i in range(97,123)]

## link texts we don't want in the output (site navigation and boilerplate)
not_included = ['News', 'FAQ', 'Stream', 'Chat', 'Media',
                'League of Legends', 'Forum', 'Latest', 'Wallpapers',
                'Links', 'Playlist', 'Sessions', 'BobRadio', 'All',
                'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J',
                'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T',
                'U', 'V', 'W', 'X', 'Y', 'Z', 'Misc', 'Play',
                'Learn more about me', 'Chat info', 'Boblights',
                'Music Playlist', 'Official Facebook',
                'Latest Music Played', 'Muppets - Closing Theme',
                'Billy Joel - The River Of Dreams',
                'Manic Street Preachers - If You Tolerate This Your Children Will Be Next',
                'The Bravery - An Honest Mistake',
                'The Black Keys - Strange Times',
                'View whole playlist', 'View latest sessions',
                'Referral Link', 'Donate to BoB',
                'Guardsman Bob', 'Website template',
                'Arcsin']

for link in links:
    response = urllib2.urlopen(link).read()
    ## parse only the <a> tags
    soup = BeautifulSoup(response, parse_only=SoupStrainer('a'))

    ## create a file named "test.txt",
    ## write to it, and close it afterwards
    with open("test.txt", 'w') as output:
        for hyperlink in soup:
            if hyperlink.text:
                if hyperlink.text not in not_included:
                    ##print hyperlink.text
                    output.write("%s\n" % hyperlink.text.encode('utf-8'))

Here is the output saved in test.txt:

[screenshot of the song titles written to test.txt]

I suggest changing "test.txt" to a different file name on each pass through the links loop (e.g. Song Titles), because as written each iteration overwrites the file from the previous one.
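
A minimal sketch of that suggestion, naming one output file per page (the songs_<letter>.txt scheme is hypothetical, and the not_included list from the full example above is assumed to be in scope):

import urllib2
from bs4 import BeautifulSoup, SoupStrainer

links = ['http://guardsmanbob.com/media/playlist.php?char=' +
         chr(i) for i in range(97, 123)]

for link in links:
    response = urllib2.urlopen(link).read()
    soup = BeautifulSoup(response, parse_only=SoupStrainer('a'))
    ## the last character of each URL is its page letter, 'a' through 'z'
    letter = link[-1]
    ## one file per page, so nothing is overwritten
    with open('songs_%s.txt' % letter, 'w') as output:
        for hyperlink in soup:
            ## not_included: the exclusion list defined in the full example above
            if hyperlink.text and hyperlink.text not in not_included:
                output.write('%s\n' % hyperlink.text.encode('utf-8'))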