I'm trying to join the hrefs I scraped from a page to the site's base URL (the stub) and then add the result to full_url []. I can't get this to work. Any ideas?
import pandas as pd
import requests
from bs4 import BeautifulSoup
from urllib import parse
from urllib.parse import urljoin
url = 'http://www.owgr.com/events?pageNo=1&pageSize=400&tour=Eur&year=2019'
stub = 'http://www.owgr.com'
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.text, 'html.parser')
full_url = []
full_url_elem = soup.find_all(id='ctl5')
for item in full_url_elem:
    full_url.item.find('a').get('href')
    full_url.append(item(urljoin('stub', 'event_url'))
Answer 0 (score: 0)
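There are a few problems here:
You are iterating over your full_url_elem, which is fine, but what is full_url.item.find('a').get('href') supposed to do? full_url is a list and has no item attribute, so that line raises an AttributeError. Get rid of the full_url. prefix.
What you want is to get the <a> tag from item, and then get the href from that tag. Secondly, you need to fix what you are trying to join.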
for item in full_url_elem:
    full_url.item.find('a').get('href') # <--- remove full_url, and store the result in a variable
    full_url.append(item(urljoin('stub', 'event_url')) # <--- no need to call item(...) here, and the parentheses are unbalanced. What is event_url? It is never assigned. You are also joining the literal strings 'stub' and 'event_url', which returns just 'event_url'; what you really want is the variable stub and the variable event_url. Plain + concatenation of those two strings is enough.
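To illustrate the difference between joining string literals and joining variables, here is a minimal sketch. The event_url value is an assumed example in the shape of the hrefs on that page, not something fetched live.

```python
from urllib.parse import urljoin

stub = 'http://www.owgr.com'
# Assumed example href, shaped like the links on the events page
event_url = '/en/Events/EventResult.aspx?eventid=7631'

# Joining the literal strings ignores the variables entirely
literal_join = urljoin('stub', 'event_url')

# Joining the variables resolves the href against the base URL
variable_join = urljoin(stub, event_url)

# Plain concatenation gives the same result here because the href starts with '/'
concatenated = stub + event_url

print(literal_join)
print(variable_join)
print(concatenated)
```

Either urljoin(stub, event_url) or stub + event_url works for root-relative hrefs like this one; urljoin is the safer choice if some hrefs might already be absolute.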
The corrected script should look like this:
import pandas as pd
import requests
from bs4 import BeautifulSoup
from urllib import parse
from urllib.parse import urljoin
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36'}
url = 'http://www.owgr.com/events?pageNo=1&pageSize=400&tour=Eur&year=2019'
stub = 'http://www.owgr.com'
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.text, 'html.parser')
full_url = []
full_url_elem = soup.find_all(id='ctl5')
for item in full_url_elem:
    event_url = item.find('a').get('href')
    full_url.append(stub + event_url)
Output:
print (full_url)
['http://www.owgr.com/en/Events/EventResult.aspx?eventid=7631', 'http://www.owgr.com/en/Events/EventResult.aspx?eventid=7621', 'http://www.owgr.com/en/Events/EventResult.aspx?eventid=7611', 'http://www.owgr.com/en/Events/EventResult.aspx?eventid=7599', 'http://www.owgr.com/en/Events/EventResult.aspx?eventid=7587', 'http://www.owgr.com/en/Events/EventResult.aspx?eventid=7562', 'http://www.owgr.com/en/Events/EventResult.aspx?eventid=7555', 'http://www.owgr.com/en/Events/EventResult.aspx?eventid=7554', 'http://www.owgr.com/en/Events/EventResult.aspx?eventid=7538', 'http://www.owgr.com/en/Events/EventResult.aspx?eventid=7516', 'http://www.owgr.com/en/Events/EventResult.aspx?eventid=7507', 'http://www.owgr.com/en/Events/EventResult.aspx?eventid=7493', 'http://www.owgr.com/en/Events/EventResult.aspx?eventid=7480', 'http://www.owgr.com/en/Events/EventResult.aspx?eventid=7457', 'http://www.owgr.com/en/Events/EventResult.aspx?eventid=7440', 'http://www.owgr.com/en/Events/EventResult.aspx?eventid=7427', 'http://www.owgr.com/en/Events/EventResult.aspx?eventid=7415', 'http://www.owgr.com/en/Events/EventResult.aspx?eventid=7403', 'http://www.owgr.com/en/Events/EventResult.aspx?eventid=7388', 'http://www.owgr.com/en/Events/EventResult.aspx?eventid=7374', 'http://www.owgr.com/en/Events/EventResult.aspx?eventid=7369', 'http://www.owgr.com/en/Events/EventResult.aspx?eventid=7364', 'http://www.owgr.com/en/Events/EventResult.aspx?eventid=7359', 'http://www.owgr.com/en/Events/EventResult.aspx?eventid=7346', 'http://www.owgr.com/en/Events/EventResult.aspx?eventid=7329', 'http://www.owgr.com/en/Events/EventResult.aspx?eventid=7323', 'http://www.owgr.com/en/Events/EventResult.aspx?eventid=7314', 'http://www.owgr.com/en/Events/EventResult.aspx?eventid=7304', 'http://www.owgr.com/en/Events/EventResult.aspx?eventid=7302', 'http://www.owgr.com/en/Events/EventResult.aspx?eventid=7297', 'http://www.owgr.com/en/Events/EventResult.aspx?eventid=7290']
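If some elements might lack an href (in which case .get('href') returns None) or might already carry an absolute link, a slightly more defensive variant could look like the sketch below. build_full_urls and the sample hrefs are hypothetical, made up for illustration; they are not part of the original script.

```python
from urllib.parse import urljoin

def build_full_urls(base, hrefs):
    """Join each href onto base, skipping missing (None or empty) hrefs."""
    return [urljoin(base, href) for href in hrefs if href]

stub = 'http://www.owgr.com'

# Hypothetical sample values: a root-relative link, a missing href,
# and a link that is already absolute
sample_hrefs = [
    '/en/Events/EventResult.aspx?eventid=7631',
    None,
    'http://www.owgr.com/en/Events/EventResult.aspx?eventid=7621',
]

full_url = build_full_urls(stub, sample_hrefs)
print(full_url)
```

In the real script the hrefs would come from item.find('a').get('href') inside the loop; urljoin leaves already-absolute links unchanged, whereas stub + event_url would prepend the base a second time.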