Script skips elements that exist - BeautifulSoup

Time: 2018-06-04 18:09:56

Tags: python python-3.x beautifulsoup

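I have a script that loops over multiple pages. It works most of the time, but I get the error TypeError: 'NoneType' object is not subscriptable for the link, even though the element exists. I added an if/else statement that lets the script run, but it leaves the link field blank on one or two records where a link should be present. Here is my working script, with the if/else statements. Any suggestions on how to get it working without the if/else statements?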

from urllib.request import urlopen
from bs4 import BeautifulSoup as soup
import json

base_url = "https://www.doabooks.org/"

books = []
n = 5
for i in range(1, n+1):
    # The original also requested base_url for the first page, but that
    # response was immediately overwritten, so only the browse URL is fetched.
    url = base_url + "doab?func=browse&page=" + str(i) + "&queryField=A&uiLanguage=en"
    with urlopen(url) as response:
        page_html = response.read()

    #html parsing
    page_soup = soup(page_html, "html.parser")

    #grabs info for each textbook
    containers = page_soup.findAll("div",{"class":"data"})

    for container in containers:
       item = {}
       item['type'] = "Open Access Book"
       item['title'] = container.span.text.strip()
       item['author'] = container.a.text
       if container.find('a', {'itemprop' : 'url'}):
          item['link'] = "https://www.doabooks.org" + container.find('a', {'itemprop' : 'url'})['href']
       else:
          item['link'] = ''
       item['source'] = "Directory of Open Access Books"
       if container.find("a",{"itemprop":"about"}):
          item['subject'] = container.find("a",{"itemprop":"about"}).text
       else:
          item['subject'] = ''
       item['base_url'] = "https://www.doabooks.org/"
       books.append(item) # add the item to the list

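# Write all collected records once the loop is finished.
# UTF-8 is set explicitly because ensure_ascii=False emits non-ASCII characters as-is.
with open("./json/doab-test.json", "w", encoding="utf-8") as writeJSON:
    json.dump(books, writeJSON, ensure_ascii=False)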
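
For context, the TypeError in the question comes from subscripting the result of find() when it returned None. The sketch below reproduces that situation on made-up markup and shows one way to keep the lookup in a single place instead of repeating an if/else per field. The helper name href_or_default and the sample HTML are hypothetical, not part of the original script, and the helper does not explain why the parser misses a link that is present on the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second container has no <a itemprop="url">.
html = """
<div class="data"><a itemprop="url" href="/book/1">First</a></div>
<div class="data"><span>No link here</span></div>
"""


def href_or_default(container, default=""):
    """Return the href of the first <a itemprop="url">, or `default` if absent."""
    tag = container.find("a", {"itemprop": "url"})
    # find() returns None when nothing matches; None["href"] is what raises
    # TypeError: 'NoneType' object is not subscriptable.
    if tag is None:
        return default
    return tag.get("href", default)


page_soup = BeautifulSoup(html, "html.parser")
links = [href_or_default(c) for c in page_soup.find_all("div", {"class": "data"})]
print(links)
```

In the script above, the link line could then read item['link'] = "https://www.doabooks.org" + href_or_default(container), although a missing match would still leave only the base address in that field.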

1 answer:

Answer 0 (score: 0)

I think this may be a parser issue (I'm not sure). However, I was able to get the data from the URL:

import requests
from bs4 import BeautifulSoup as soup
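# Fetch one results page directly.
x = requests.get("https://www.doabooks.org/doab?func=browse&page=2&queryField=A&uiLanguage=en")
# No parser is named here, so Beautiful Soup picks the best one installed.
print(soup(x.content).find_all("div", {"class": "data"})[5].find_all("a", {"itemprop": "url"}))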

Edit

I noticed that removing "html.parser" as the argument makes your script work fine, i.e. just don't pass the second argument when declaring page_soup.
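
When no parser is named, Beautiful Soup chooses the best one installed (lxml first, then html5lib, then the built-in html.parser) and emits a warning, so the result can differ between machines. A rough way to check whether the parser really is the cause is to run the same HTML through each parser and compare how many links are found. The function name count_links and the sample string below are assumptions for illustration; on the real site, page_html from the script would be passed in instead.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def count_links(page_html, parsers=("html.parser", "lxml", "html5lib")):
    """Count <a itemprop="url"> tags found by each parser (None if not installed)."""
    counts = {}
    for name in parsers:
        try:
            page_soup = BeautifulSoup(page_html, name)
        except FeatureNotFound:
            # lxml and html5lib are optional third-party packages.
            counts[name] = None
            continue
        counts[name] = len(page_soup.find_all("a", {"itemprop": "url"}))
    return counts


# Hypothetical, well-formed sample; malformed real-world pages are where
# the parsers tend to disagree.
sample = '<div class="data"><a itemprop="url" href="/book/1">Title</a></div>'
print(count_links(sample))
```

If the counts differ on the real page, naming the parser that finds every link, for example soup(page_html, "lxml"), keeps the behaviour the same wherever the script runs.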