Question

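I'm trying to learn Python and Beautiful Soup using ScraperWiki. I want a list of all the Kickstarter projects in Edmonton.

I've successfully scraped the page I'm looking for and extracted the data I want. What I can't do is get that data formatted and exported to the database.

Console output: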

Line 42 - url = link["href"]

/usr/local/lib/python2.7/dist-packages/bs4/element.py:879 -- __getitem__((self=<h2 class="bbcard_nam...more

KeyError: 'href'

Code:

import scraperwiki
from bs4 import BeautifulSoup

search_page ="http://www.kickstarter.com/projects/search?term=edmonton"
html = scraperwiki.scrape(search_page)
soup = BeautifulSoup(html)

max = soup.find("p", { "class" : "blurb" }).get_text()
num = int(max.split(" ")[0])

if num % 12 != 0:
    last_page = int(num/12) + 1
else:
    last_page = int(num/12)

for n in range(1, last_page + 1):
    html = scraperwiki.scrape(search_page + "&page=" + str(n))
    soup = BeautifulSoup(html)
    projects = soup.find_all("h2", { "class" : "bbcard_name" })
    counter = (n-1)*12 + 1
    print projects

    for link in projects:
        url = link["href"]
        data = {"URL": url, "id": counter}
        # save into the data store, giving the unique parameter
        scraperwiki.sqlite.save(["URL"],data)
        counter+=1

There are anchors with an href inside projects. How can I get the URL from each <h2> element in the for loop?

Answer 1

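Well, you asked for <h2> tags, so that's what BeautifulSoup gave you. None of them has an href attribute, because headers can't have href attributes.

Writing for link in projects merely gives each item in projects (the second-level headers) the name link; it doesn't magically turn them into links.

At the risk of stating the insultingly obvious, if you want links, why not look for <a> tags...? Or perhaps you want all the links inside each header, for example: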

for project in projects:
    for link in project.find_all("a"):
        url = link["href"]  # link is now an <a> tag, so href exists

Or maybe skip looking for the projects and look directly for the links:

for link in soup.select("h2.bbcard_name a"):
    url = link["href"]  # the selector matches <a> tags inside the headers
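
As a minimal sketch of the point above, the snippet below parses a small hand-written HTML fragment and shows that indexing an <h2> with "href" raises KeyError, while both suggested approaches reach the anchors. The markup and URLs are invented for illustration and are only assumed to resemble the real search page; the "html.parser" argument is likewise a choice made here, not something from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for illustration; the real page may differ.
html = """
<h2 class="bbcard_name"><a href="/projects/1/alpha">Alpha</a></h2>
<h2 class="bbcard_name"><a href="/projects/2/beta">Beta</a></h2>
"""
soup = BeautifulSoup(html, "html.parser")
projects = soup.find_all("h2", {"class": "bbcard_name"})

# An <h2> tag carries no href attribute, so indexing it raises KeyError.
try:
    projects[0]["href"]
    h2_has_href = True
except KeyError:
    h2_has_href = False

# Approach 1: search each header for the anchors nested inside it.
nested_urls = [a["href"] for h2 in projects for a in h2.find_all("a")]

# Approach 2: a CSS selector that goes straight to the anchors.
selected_urls = [a["href"] for a in soup.select("h2.bbcard_name a")]

print(h2_has_href, nested_urls, selected_urls)
```

Both approaches should yield the same URLs here; the selector version is shorter, while the nested loop keeps each header available if other data from it is needed.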

Answer 2

Your code is looking for an href attribute on the <h2> tags.

This code:

for link in projects:

iterates over projects, which contains <h2> tags, not links.

I'm not quite sure what you want, but I assume you want the href attribute of the <a> tags inside the <h2> tags. Try this:

data = {"URL": [], "id": counter}
for header in projects:  # take each header
    links = header.find_all("a")  # find the anchors inside it
    for link in links:
        url = link["href"]

Also, data = {"URL": url, "id": counter} overwrites the dictionary data on every loop iteration. So change it to:

data["URL"].append(url)  # store it in this format: {'URL': [link1, link2, link3]}
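
For comparison, here is a rough sketch of a different layout from the list-valued dictionary above: one dictionary per link, each with its own id, which is closer to what the question's save call expects. The HTML fragment is invented for illustration, and using link.get("href") to skip anchors without an href is an assumption about what is wanted, not part of either answer. Nothing is written to a database here; the rows are only collected in a list.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for illustration.
html = """
<h2 class="bbcard_name"><a href="/projects/1/alpha">Alpha</a></h2>
<h2 class="bbcard_name"><a>No link here</a></h2>
<h2 class="bbcard_name"><a href="/projects/3/gamma">Gamma</a></h2>
"""
soup = BeautifulSoup(html, "html.parser")

rows = []
counter = 1
for header in soup.find_all("h2", {"class": "bbcard_name"}):
    for link in header.find_all("a"):
        url = link.get("href")  # returns None instead of raising KeyError
        if url is None:
            continue  # skip anchors that carry no href
        rows.append({"URL": url, "id": counter})
        counter += 1

print(rows)
```

On ScraperWiki, each dictionary in rows could then be passed to scraperwiki.sqlite.save(["URL"], row), as in the question's original loop.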

Sending data from BeautifulSoup to SQLite on ScraperWiki, but getting KeyError: 'href'
