Defining the URL list for a crawler, syntax issue

Date: 2016-06-01 12:59:13

Tags: syntax beautifulsoup href findall urlparse

I am currently running the following code:

import requests
from bs4 import BeautifulSoup
from urlparse import urljoin

def hltv_match_list(max_offset):
    offset = 0
    while offset < max_offset:
        url = 'http://www.hltv.org/?pageid=188&offset=' + str(offset)
        base = "http://www.hltv.org/"
        soup = BeautifulSoup(requests.get("http://www.hltv.org/?pageid=188&offset=0").content, 'html.parser')
        cont = soup.select("div.covMainBoxContent a[href*=matchid=]")
        href =  urljoin(base, (a["href"] for a in cont))
        # print([urljoin(base, a["href"]) for a in cont])
        get_hltv_match_data(href)
        offset += 50

def get_hltv_match_data(matchid_url):
    source_code = requests.get(matchid_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, 'html.parser')
    for teamid in soup.findAll("div.covSmallHeadline a[href*=teamid=]"):
        print teamid.string

hltv_match_list(5)

Error:

  File "C:/Users/mdupo/PycharmProjects/HLTVCrawler/Crawler.py", line 12, in hltv_match_list
    href =  urljoin(base, (a["href"] for a in cont))
  File "C:\Python27\lib\urlparse.py", line 261, in urljoin
    urlparse(url, bscheme, allow_fragments)
  File "C:\Python27\lib\urlparse.py", line 143, in urlparse
    tuple = urlsplit(url, scheme, allow_fragments)
  File "C:\Python27\lib\urlparse.py", line 182, in urlsplit
    i = url.find(':')
AttributeError: 'generator' object has no attribute 'find'

Process finished with exit code 1

I think I'm running into trouble in the href = urljoin(base, (a["href"] for a in cont)) part, because I'm trying to build a list of URLs that I can feed into get_hltv_match_data and then grab various items from each of those pages. Am I going about this the wrong way?

Cheers

1 Answer:

Answer 0 (score: 0)

You need to join each href individually, as in your commented-out code:

urls = [urljoin(base, a["href"]) for a in cont]

You are trying to join the base URL to a generator, i.e. (a["href"] for a in cont), which makes no sense: urljoin expects a single URL string, not a generator of them.

You should also pass url to the request, otherwise you will be requesting the same page (offset=0) over and over:

soup = BeautifulSoup(requests.get(url).content, 'html.parser')
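
Putting the two fixes together, hltv_match_list could look something like the sketch below (kept in Python 2 to match the question's urlparse import; untested against the live site):

import requests
from bs4 import BeautifulSoup
from urlparse import urljoin  # Python 2; in Python 3 this lives in urllib.parse

def hltv_match_list(max_offset):
    base = "http://www.hltv.org/"
    offset = 0
    while offset < max_offset:
        # Build the paginated URL and actually request it,
        # rather than the hard-coded offset=0 page.
        url = 'http://www.hltv.org/?pageid=188&offset=' + str(offset)
        soup = BeautifulSoup(requests.get(url).content, 'html.parser')
        cont = soup.select("div.covMainBoxContent a[href*=matchid=]")
        # Join each href individually; urljoin takes one URL at a time.
        urls = [urljoin(base, a["href"]) for a in cont]
        for match_url in urls:
            get_hltv_match_data(match_url)
        offset += 50

Each call to get_hltv_match_data then receives a single absolute match URL instead of a generator.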