Web scraping: joining a URL stub with hrefs scraped from the page

Time: 2019-09-19 09:59:59

Tags: pandas web-scraping python-requests

I'm trying to join each href I scrape from the page to the site's main URL (the stub) and then append the result to full_url []. I can't get it to work. Any ideas?

import pandas as pd
import requests
from bs4 import BeautifulSoup
from urllib import parse
from urllib.parse import urljoin

url = 'http://www.owgr.com/events?pageNo=1&pageSize=400&tour=Eur&year=2019'
stub = 'http://www.owgr.com'
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.text, 'html.parser')


full_url = []

full_url_elem = soup.find_all(id='ctl5')  

for item in full_url_elem:
    full_url.item.find('a').get('href')
    full_url.append(item(urljoin('stub', 'event_url'))

1 Answer:

Answer 0 (score: 0)

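There are a few issues here:

You're iterating over your full_url_elem, which is fine, but what is full_url.item.find('a').get('href') supposed to do? Get rid of the full_url prefix.

What you want is to get the <a> tag from item and then get the href from that tag. Second, you need to fix what you are joining.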

for item in full_url_elem:
    full_url.item.find('a').get('href')  # <--- remove full_url
    # Next line: what is event_url? It is never assigned. You are also joining the
    # literal strings 'stub' and 'event_url' instead of the variables stub and
    # event_url, so the result never contains the site address. Store the href in
    # event_url, then combine the two variables (a plain + on the 2 strings works).
    full_url.append(item(urljoin('stub', 'event_url'))  # <--- use the variables, and drop item(...)
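
To make the literal-versus-variable point concrete, here is a minimal sketch using only urllib.parse. The event_url value is a hypothetical href, chosen to match the shape of the links this page appears to return; it is not taken from a live request.

```python
from urllib.parse import urljoin

stub = 'http://www.owgr.com'
# Hypothetical href, in the shape the scraped links appear to have.
event_url = '/en/Events/EventResult.aspx?eventid=7631'

# Quoted names are plain string literals, so the variables above are never used.
literal_join = urljoin('stub', 'event_url')

# Passing the variables joins the real base URL with the scraped href.
variable_join = urljoin(stub, event_url)

print(literal_join)
print(variable_join)
```

The first result contains no trace of the site address, while the second is the absolute URL you are after.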

It should look like this:

import pandas as pd
import requests
from bs4 import BeautifulSoup
from urllib import parse
from urllib.parse import urljoin


headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36'}

url = 'http://www.owgr.com/events?pageNo=1&pageSize=400&tour=Eur&year=2019'
stub = 'http://www.owgr.com'
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.text, 'html.parser')


full_url = []

full_url_elem = soup.find_all(id='ctl5')  

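for item in full_url_elem:
    # Pull the href out of the <a> tag inside each matched element
    event_url = item.find('a').get('href')
    # Prepend the site stub to build the absolute URL
    full_url.append(stub + event_url)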

Output:

print (full_url)
['http://www.owgr.com/en/Events/EventResult.aspx?eventid=7631', 'http://www.owgr.com/en/Events/EventResult.aspx?eventid=7621', 'http://www.owgr.com/en/Events/EventResult.aspx?eventid=7611', 'http://www.owgr.com/en/Events/EventResult.aspx?eventid=7599', 'http://www.owgr.com/en/Events/EventResult.aspx?eventid=7587', 'http://www.owgr.com/en/Events/EventResult.aspx?eventid=7562', 'http://www.owgr.com/en/Events/EventResult.aspx?eventid=7555', 'http://www.owgr.com/en/Events/EventResult.aspx?eventid=7554', 'http://www.owgr.com/en/Events/EventResult.aspx?eventid=7538', 'http://www.owgr.com/en/Events/EventResult.aspx?eventid=7516', 'http://www.owgr.com/en/Events/EventResult.aspx?eventid=7507', 'http://www.owgr.com/en/Events/EventResult.aspx?eventid=7493', 'http://www.owgr.com/en/Events/EventResult.aspx?eventid=7480', 'http://www.owgr.com/en/Events/EventResult.aspx?eventid=7457', 'http://www.owgr.com/en/Events/EventResult.aspx?eventid=7440', 'http://www.owgr.com/en/Events/EventResult.aspx?eventid=7427', 'http://www.owgr.com/en/Events/EventResult.aspx?eventid=7415', 'http://www.owgr.com/en/Events/EventResult.aspx?eventid=7403', 'http://www.owgr.com/en/Events/EventResult.aspx?eventid=7388', 'http://www.owgr.com/en/Events/EventResult.aspx?eventid=7374', 'http://www.owgr.com/en/Events/EventResult.aspx?eventid=7369', 'http://www.owgr.com/en/Events/EventResult.aspx?eventid=7364', 'http://www.owgr.com/en/Events/EventResult.aspx?eventid=7359', 'http://www.owgr.com/en/Events/EventResult.aspx?eventid=7346', 'http://www.owgr.com/en/Events/EventResult.aspx?eventid=7329', 'http://www.owgr.com/en/Events/EventResult.aspx?eventid=7323', 'http://www.owgr.com/en/Events/EventResult.aspx?eventid=7314', 'http://www.owgr.com/en/Events/EventResult.aspx?eventid=7304', 'http://www.owgr.com/en/Events/EventResult.aspx?eventid=7302', 'http://www.owgr.com/en/Events/EventResult.aspx?eventid=7297', 'http://www.owgr.com/en/Events/EventResult.aspx?eventid=7290']
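
Plain + works here because every href on this page starts with a slash. If that ever stops being true, or an element has no href at all, urljoin (already imported above) is a bit more forgiving. The following is a rough sketch under those assumptions: build_full_urls is a hypothetical helper name, and the hrefs list is made-up sample data standing in for the values scraped in the loop.

```python
from urllib.parse import urljoin


def build_full_urls(base, hrefs):
    """Join each href onto base, skipping missing (None or empty) values."""
    return [urljoin(base, href) for href in hrefs if href]


stub = 'http://www.owgr.com'
# Hypothetical sample data: one href with a leading slash, one without,
# and one missing value (what .get('href') returns when there is no href).
hrefs = [
    '/en/Events/EventResult.aspx?eventid=7631',
    'en/Events/EventResult.aspx?eventid=7621',
    None,
]

full_url = build_full_urls(stub, hrefs)
print(full_url)
```

With stub + href, the second sample value would be glued straight onto the host name with no slash in between, and the None entry would raise a TypeError; urljoin plus the filter sidesteps both cases.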