Defining the URL list for a crawler, syntax issue

Date: 2016-06-01 12:59:13

Tags: syntax beautifulsoup href findall urlparse

I am currently running the following code:

import requests
from bs4 import BeautifulSoup
from urlparse import urljoin

def hltv_match_list(max_offset):
    offset = 0
    while offset < max_offset:
        url = 'http://www.hltv.org/?pageid=188&offset=' + str(offset)
        base = "http://www.hltv.org/"
        soup = BeautifulSoup(requests.get("http://www.hltv.org/?pageid=188&offset=0").content, 'html.parser')
        cont = soup.select("div.covMainBoxContent a[href*=matchid=]")
        href =  urljoin(base, (a["href"] for a in cont))
        # print([urljoin(base, a["href"]) for a in cont])
        get_hltv_match_data(href)
        offset += 50

def get_hltv_match_data(matchid_url):
    source_code = requests.get(matchid_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, 'html.parser')
    for teamid in soup.findAll("div.covSmallHeadline a[href*=teamid=]"):
        print teamid.string

hltv_match_list(5)

Error:

  File "C:/Users/mdupo/PycharmProjects/HLTVCrawler/Crawler.py", line 12, in hltv_match_list
    href =  urljoin(base, (a["href"] for a in cont))
  File "C:\Python27\lib\urlparse.py", line 261, in urljoin
    urlparse(url, bscheme, allow_fragments)
  File "C:\Python27\lib\urlparse.py", line 143, in urlparse
    tuple = urlsplit(url, scheme, allow_fragments)
  File "C:\Python27\lib\urlparse.py", line 182, in urlsplit
    i = url.find(':')
AttributeError: 'generator' object has no attribute 'find'

Process finished with exit code 1

I think I'm running into trouble in the href = urljoin(base, (a["href"] for a in cont)) part, because I'm trying to build a list of URLs that I can feed into get_hltv_match_data and then grab various items from each of those pages. Am I going about this the wrong way?

Cheers

1 Answer:

Answer 0 (score: 0)

You need to join each href individually, as in your commented-out code:

urls = [urljoin(base, a["href"]) for a in cont]

You are trying to join the base URL to a generator, i.e. (a["href"] for a in cont), which makes no sense: urljoin expects a single URL string, not a generator of them.

You should also pass url to the request, otherwise you will be requesting the same page (offset=0) over and over:

soup = BeautifulSoup(requests.get(url).content, 'html.parser')
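
Putting the two fixes together, hltv_match_list could look something like the sketch below (kept in Python 2 to match the question's urlparse import; untested against the live site):

import requests
from bs4 import BeautifulSoup
from urlparse import urljoin  # Python 2; in Python 3 this lives in urllib.parse

def hltv_match_list(max_offset):
    base = "http://www.hltv.org/"
    offset = 0
    while offset < max_offset:
        # Build the paginated URL and actually request it,
        # rather than the hard-coded offset=0 page.
        url = 'http://www.hltv.org/?pageid=188&offset=' + str(offset)
        soup = BeautifulSoup(requests.get(url).content, 'html.parser')
        cont = soup.select("div.covMainBoxContent a[href*=matchid=]")
        # Join each href individually; urljoin takes one URL at a time.
        urls = [urljoin(base, a["href"]) for a in cont]
        for match_url in urls:
            get_hltv_match_data(match_url)
        offset += 50

Each call to get_hltv_match_data then receives a single absolute match URL instead of a generator.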