This is not just a simple question of how to retrieve links. When I scrape the page, the href comes back as something like '/people/4849247002', but if you inspect the page itself, clicking that href actually leads to 'https://website/people/4849247002'. How can I get the link as 'https://website/people/4849247002'?
Also, as noted in the comments, what is the correct way to fetch a web page with BeautifulSoup? I have been using both of the following:
from bs4 import BeautifulSoup   # the old "BeautifulSoup" package is Python 2 only; use bs4 on Python 3
import urllib.request           # urllib2 was split into urllib.request/urllib.error in Python 3
import re
html_page = urllib.request.urlopen("http://www.yourwebsite.com")
soup = BeautifulSoup(html_page, 'html.parser')  # name a parser explicitly to avoid the parser warning
and
import requests
from bs4 import BeautifulSoup
import re
import time
source_code = requests.get('https://stackoverflow.com/')
soup = BeautifulSoup(source_code.content, 'lxml')
I am currently using Python 3.8.
Answer 0 (score: 3)
Here is another approach.
from simplified_scrapy import SimplifiedDoc, utils, req
url = 'https://boards.greenhouse.io/adhocexternal'
html = req.get(url)
doc = SimplifiedDoc(html)
print(doc.listA(url).url)  # print all links, resolved against the base url
# Or: take only the <a data-mapped="true"> hrefs and resolve them explicitly
lstA = doc.selects('a@data-mapped=true>href()')
print([utils.absoluteUrl(url, a) for a in lstA])
Result:
['https://adhoc.team/join/', 'https://adhoc.team/blog/', 'https://boards.greenhouse.io/adhocexternal/jobs/4877141002', 'https://boards.greenhouse.io/adhocexternal/jobs/4877155002', 'https://boards.greenhouse.io/adhocexternal/jobs/4869701002', 'https://boards.greenhouse.io/adhocexternal/jobs/4877146002', ...
['https://boards.greenhouse.io/adhocexternal/jobs/4877141002', 'https://boards.greenhouse.io/adhocexternal/jobs/4877155002', ...
Or you can use the framework directly:
from simplified_scrapy import Spider, SimplifiedDoc, SimplifiedMain, utils

class MySpider(Spider):
    name = 'greenhouse'
    start_urls = ['https://boards.greenhouse.io/adhocexternal']

    def extract(self, url, html, models, modelNames):
        doc = SimplifiedDoc(html)
        urls = doc.listA(url.url)
        data = doc.title  # Whatever data you want to get
        return {'Urls': urls, 'Data': data}

SimplifiedMain.startThread(MySpider())  # Start download
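If you would rather keep the requests + BeautifulSoup setup from your own snippets, the relative hrefs can be resolved with urllib.parse.urljoin from the standard library. A minimal sketch, assuming the same Greenhouse board page as above (the plain href filter is only an illustration; narrow it down to whatever markup you actually need):
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

base_url = 'https://boards.greenhouse.io/adhocexternal'
resp = requests.get(base_url)
soup = BeautifulSoup(resp.content, 'lxml')

# hrefs may be relative ('/adhocexternal/jobs/123') or already absolute;
# urljoin handles both cases and returns a full URL either way.
links = [urljoin(base_url, a['href']) for a in soup.find_all('a', href=True)]
print(links)
This produces the same kind of absolute URLs as utils.absoluteUrl above, just without the extra dependency.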