I am writing a web scraper to pull some sheet music database data from the JW Pepper website. I am using BeautifulSoup and Python to do this.
Here is my code:
# a barebones program I created to scrape the description and audio file off the JW Pepper website, will eventually be used in a music database
import urllib2
import re
from bs4 import BeautifulSoup

linkgot = 0

def linkget():
    search = "http://www.jwpepper.com/sheet-music/search.jsp?keywords="  # this is the url without the keyword that comes up when searching something
    print("enter the name of the desired piece")
    keyword = raw_input("> ")  # this will add the keyword to the url
    url = search + keyword
    page = urllib2.urlopen(url)
    soup = BeautifulSoup(page)
    all_links = soup.findAll("a")
    link_dict = []
    item_dict = []
    for link in all_links:
        link_dict.append(link.get('href'))  # adds the links found on the page to link_dict
        item_dict.append(x for x in link_dict if '.item' in x)  # sorts them according to .item
    print item_dict

linkget()
The print command returns: [<generator object <genexpr> at 0x10ec6dc80>], and when I google that, nothing comes up.
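For context, that is the repr of a generator object: appending a generator expression to a list stores the generator itself, not the strings it would yield. A minimal snippet reproducing the symptom (the sample hrefs here are made up for illustration):

    # appending a generator expression stores the generator object, not its items
    link_dict = ['/Festival-of-Carols/4929683.item', '/about.html']  # made-up sample hrefs
    item_dict = []
    item_dict.append(x for x in link_dict if '.item' in x)
    print item_dict           # [<generator object <genexpr> at 0x...>]
    print list(item_dict[0])  # ['/Festival-of-Carols/4929683.item']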
Answer (score: 0)
Your filtering of the list went wrong: item_dict.append(x for x in link_dict if '.item' in x) appends the generator expression itself to the list, not the hrefs it would yield, which is why printing item_dict shows generator objects. You can instead build the list as you go, adding an href only when it contains .item, rather than filtering in a separate pass, like this:
from bs4 import BeautifulSoup
import urllib2

def linkget():
    search = "http://www.jwpepper.com/sheet-music/search.jsp?keywords="  # this is the url without the keyword that comes up when searching something
    print("enter the name of the desired piece")
    keyword = raw_input("> ")  # this will add the keyword to the url
    url = search + keyword
    page = urllib2.urlopen(url)
    soup = BeautifulSoup(page, "html.parser")
    link_dict = []
    item_dict = []
    for link in soup.findAll("a", href=True):
        href = link.get('href')
        link_dict.append(href)  # adds the links found on the page to link_dict
        if '.item' in href:
            item_dict.append(href)
    for href in item_dict:
        print href

linkget()
This gives you something like:
/Festival-of-Carols/4929683.item
/Festival-of-Carols/4929683.item
/Festival-of-Carols/4929683.item
...
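If you are on Python 3, where urllib2, raw_input, and the print statement no longer exist, a rough sketch of the same approach using urllib.request from the standard library might look like the following (untested against the live site; quote_plus is an addition here to URL-encode keywords containing spaces):

    # Python 3 sketch: urllib2 -> urllib.request, raw_input -> input, print -> print()
    from urllib.request import urlopen
    from urllib.parse import quote_plus
    from bs4 import BeautifulSoup

    def linkget():
        search = "http://www.jwpepper.com/sheet-music/search.jsp?keywords="
        print("enter the name of the desired piece")
        keyword = input("> ")
        url = search + quote_plus(keyword)  # URL-encode the keyword before appending it
        soup = BeautifulSoup(urlopen(url), "html.parser")
        # same idea as above: keep only the hrefs containing .item
        item_links = [a['href'] for a in soup.find_all("a", href=True) if '.item' in a['href']]
        for href in item_links:
            print(href)

    linkget()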