I am writing a web scraper to pull some sheet music database data from the JW Pepper website. I am using BeautifulSoup and Python to do this.
Here is my code:
# a barebones program I created to scrape the description and audio file off the JW Pepper website, will eventually be used in a music database
import urllib2
import re
from bs4 import BeautifulSoup

linkgot = 0

def linkget():
    search = "http://www.jwpepper.com/sheet-music/search.jsp?keywords="  # this is the url without the keyword that comes up when searching something
    print("enter the name of the desired piece")
    keyword = raw_input("> ")  # this will add the keyword to the url
    url = search + keyword
    page = urllib2.urlopen(url)
    soup = BeautifulSoup(page)
    all_links = soup.findAll("a")
    link_dict = []
    item_dict = []
    for link in all_links:
        link_dict.append(link.get('href'))  # adds the links found on the page to link_dict
        item_dict.append(x for x in link_dict if '.item' in x)  # sorts them according to .item
    print item_dict

linkget()
The print command returns: [<generator object <genexpr> at 0x10ec6dc80>], and when I google that, nothing comes up.
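For context, that is the repr of a generator object: appending a generator expression to a list stores the generator itself, not the strings it would yield. A minimal snippet reproducing the symptom (the sample hrefs here are made up for illustration):

    # appending a generator expression stores the generator object, not its items
    link_dict = ['/Festival-of-Carols/4929683.item', '/about.html']  # made-up sample hrefs
    item_dict = []
    item_dict.append(x for x in link_dict if '.item' in x)
    print item_dict           # [<generator object <genexpr> at 0x...>]
    print list(item_dict[0])  # ['/Festival-of-Carols/4929683.item']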
Answer (score: 0)
Your filtering of the list went wrong: item_dict.append(x for x in link_dict if '.item' in x) appends the generator expression itself to the list, not the hrefs it would yield, which is why printing item_dict shows generator objects. You can instead build the list as you go, adding an href only when it contains .item, rather than filtering in a separate pass, like this:
from bs4 import BeautifulSoup
import urllib2

def linkget():
    search = "http://www.jwpepper.com/sheet-music/search.jsp?keywords="  # this is the url without the keyword that comes up when searching something
    print("enter the name of the desired piece")
    keyword = raw_input("> ")  # this will add the keyword to the url
    url = search + keyword
    page = urllib2.urlopen(url)
    soup = BeautifulSoup(page, "html.parser")
    link_dict = []
    item_dict = []
    for link in soup.findAll("a", href=True):
        href = link.get('href')
        link_dict.append(href)  # adds the links found on the page to link_dict
        if '.item' in href:
            item_dict.append(href)
    for href in item_dict:
        print href

linkget()
This gives you something like:
/Festival-of-Carols/4929683.item
/Festival-of-Carols/4929683.item
/Festival-of-Carols/4929683.item
...
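If you are on Python 3, where urllib2, raw_input, and the print statement no longer exist, a rough sketch of the same approach using urllib.request from the standard library might look like the following (untested against the live site; quote_plus is an addition here to URL-encode keywords containing spaces):

    # Python 3 sketch: urllib2 -> urllib.request, raw_input -> input, print -> print()
    from urllib.request import urlopen
    from urllib.parse import quote_plus
    from bs4 import BeautifulSoup

    def linkget():
        search = "http://www.jwpepper.com/sheet-music/search.jsp?keywords="
        print("enter the name of the desired piece")
        keyword = input("> ")
        url = search + quote_plus(keyword)  # URL-encode the keyword before appending it
        soup = BeautifulSoup(urlopen(url), "html.parser")
        # same idea as above: keep only the hrefs containing .item
        item_links = [a['href'] for a in soup.find_all("a", href=True) if '.item' in a['href']]
        for href in item_links:
            print(href)

    linkget()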