Python web scraper with BeautifulSoup, can't get the URLs

Time: 2016-04-08 06:33:45

Tags: python-2.7 web-scraping beautifulsoup web-crawler

So I am trying to build a dynamic web crawler to get all of the URL links within links. So far I am able to get all of the chapter links, but when I try to get the section links from each chapter, my output prints nothing.

The code I am using:

#########################Chapters#######################

import requests
from bs4 import BeautifulSoup, SoupStrainer
import re


base_url = "http://law.justia.com/codes/alabama/2015/title-{title:01d}/"

for title in range (1,4): 
url = base_url.format(title=title)
r = requests.get(url)

 for link in BeautifulSoup((r.content),"html.parser",parse_only=SoupStrainer('a')):
  if link.has_attr('href'):
    if 'chapt' in link['href']:
        href = "http://law.justia.com" + link['href']
        leveltwo(href)

#########################Sections#######################

def leveltwo(item_url):
 r = requests.get(item_url)
 soup = BeautifulSoup((r.content),"html.parser")
 section = soup.find('div', {'class': 'primary-content' })
 for sublinks in section.find_all('a'):
        sectionlinks = sublinks.get('href')
        print (sectionlinks)

2 Answers:

Answer 0: (score: 0)

With a few minor modifications to your code I was able to get it to run and output the sections. Mainly, you need to fix your indentation and define the function before you call it.

#########################Chapters#######################

import requests
from bs4 import BeautifulSoup, SoupStrainer
import re

def leveltwo(item_url):
    r = requests.get(item_url)
    soup = BeautifulSoup((r.content),"html.parser")
    section = soup.find('div', {'class': 'primary-content' })
    for sublinks in section.find_all('a'):
        sectionlinks = sublinks.get('href')
        print (sectionlinks)

base_url = "http://law.justia.com/codes/alabama/2015/title-{title:01d}/"

for title in range (1,4): 
    url = base_url.format(title=title)
    r = requests.get(url)

for link in BeautifulSoup((r.content),"html.parser",parse_only=SoupStrainer('a')):
    try:
        if 'chapt' in link['href']:
            href = "http://law.justia.com" + link['href']
            leveltwo(href)
        else:
            continue
    except KeyError:
        continue
#########################Sections#######################

Output:

/codes/alabama/2015/title-3/chapter-1/section-3-1-1/index.html
/codes/alabama/2015/title-3/chapter-1/section-3-1-2/index.html
/codes/alabama/2015/title-3/chapter-1/section-3-1-3/index.html
/codes/alabama/2015/title-3/chapter-1/section-3-1-4/index.html
etc.

Answer 1: (score: 0)

You don't need any try/except blocks; you can use href=True with find or find_all to select only the anchor tags that actually have an href, or select a[href] as shown below. The chapter links are in the first ul inside the article tag with the id #maincontent, so you don't need to filter at all:

import requests
from bs4 import BeautifulSoup

base_url = "http://law.justia.com/codes/alabama/2015/title-{title:01d}/"

def leveltwo(item_url):
    r = requests.get(item_url)
    soup = BeautifulSoup(r.content, "html.parser")
    section_links = [a["href"] for a in soup.select('div .primary-content a[href]')]
    print(section_links)

for title in range(1, 4):
    url = base_url.format(title=title)
    r = requests.get(url)
    for link in BeautifulSoup(r.content, "html.parser").select("#maincontent ul:nth-of-type(1) a[href]"):
        href = "http://law.justia.com" + link['href']
        leveltwo(href)

If you do want to use find_all, you only need to pass href=True to filter your anchor tags so that only the ones with an href are selected.
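
For reference, a minimal sketch of that find_all variant, assuming the same title-1 page from the question is used purely for illustration:

import requests
from bs4 import BeautifulSoup

# Example page from the question; any chapter listing page works the same way.
r = requests.get("http://law.justia.com/codes/alabama/2015/title-1/")
soup = BeautifulSoup(r.content, "html.parser")

# href=True keeps only the <a> tags that actually carry an href attribute,
# so no has_attr check or try/except around link['href'] is needed.
for link in soup.find_all("a", href=True):
    if 'chapt' in link['href']:
        print("http://law.justia.com" + link['href'])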