我目前正在运行以下代码:
import requests
from bs4 import BeautifulSoup
from urlparse import urljoin
def hltv_match_list(max_offset):
offset = 0
while offset < max_offset:
url = 'http://www.hltv.org/?pageid=188&offset=' + str(offset)
base = "http://www.hltv.org/"
soup = BeautifulSoup(requests.get("http://www.hltv.org/?pageid=188&offset=0").content, 'html.parser')
cont = soup.select("div.covMainBoxContent a[href*=matchid=]")
href = urljoin(base, (a["href"] for a in cont))
# print([urljoin(base, a["href"]) for a in cont])
get_hltv_match_data(href)
offset += 50
def get_hltv_match_data(matchid_url):
source_code = requests.get(matchid_url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text, 'html.parser')
for teamid in soup.findAll("div.covSmallHeadline a[href*=teamid=]"):
print teamid.string
hltv_match_list(5)
错误:
File "C:/Users/mdupo/PycharmProjects/HLTVCrawler/Crawler.py", line 12, in hltv_match_list
href = urljoin(base, (a["href"] for a in cont))
File "C:\Python27\lib\urlparse.py", line 261, in urljoin
urlparse(url, bscheme, allow_fragments)
File "C:\Python27\lib\urlparse.py", line 143, in urlparse
tuple = urlsplit(url, scheme, allow_fragments)
File "C:\Python27\lib\urlparse.py", line 182, in urlsplit
i = url.find(':')
AttributeError: 'generator' object has no attribute 'find'
Process finished with exit code 1
我认为我在href = urljoin(base, (a["href"] for a in cont))
部分遇到问题,因为我正在尝试创建一个网址列表,我可以将其输入get_hltv_match_data
,然后捕获该网页中的各种项目。我是不是错了?
干杯
答案 0 :(得分:0)
您需要根据评论的代码加入每个href:
urls = [urljoin(base,a["href"]) for a in cont]
您正在尝试将基本网址加入generator,即(a["href"] for a in cont)
,这是没有意义的。
您还应该将网址传递给请求,否则您将一遍又一遍地请求同一页面。
soup = BeautifulSoup(requests.get(url).content, 'html.parser')