I'm using BeautifulSoup and urllib to scrape a web page. I've set the User-Agent and a cookie, but I can't get all the links from the page... Here is my code:
import bs4 as bs
import urllib.request
import requests
#sauce = urllib.request.urlopen('https://github.com/search?q=javascript&type=Code&utf8=%E2%9C%93').read()
#soup = bs.BeautifulSoup(sauce,'lxml')
'''
session = requests.Session()
response = session.get(url)
print(session.cookies.get_dict())
'''
url = 'https://github.com/search?q=javascript&type=Code&utf8=%E2%9C%93'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36',
    # Cookie pairs go on one line, written as name=value and separated by '; '
    'Cookie': '_gh_sess=eyJzZXNzaW9uX2lkIjoiMDNhMGI2NjQxZjY4Mjc1YmQ3ZjAyNmJiODM2YzIzMTUiLCJfY3NyZl90b2tlbiI6IlJJOUtrd3E3WVFOYldVUzkwdmUxZ0Z4MHZLN3M2eE83SzhIdVJTUFVsVVU9In0%3D--4485d36d4c86aec01cde254e34db68005193546e; logged_in=no',
}
response = requests.get(url,headers=headers)
print(response.cookies)
soup = bs.BeautifulSoup(response.content,'lxml')
# Print the href of every <a> tag (loop variable renamed so it does not shadow `url`)
for link in soup.find_all('a'):
    print(link.get('href'))
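Is there something I'm missing? In the browser I get links to all the code results, but in the script I only get some of the links, and none of them point to code...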
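For reference, here is a minimal sketch of how the returned HTML could be checked for code links. It assumes that links to code results contain `/blob/` in their path, which is an assumption about GitHub's URL layout and not something confirmed by the page above. The names `HrefCollector` and `split_links` are hypothetical, and the sketch uses only the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra packages:

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def split_links(html, marker="/blob/"):
    """Return (code_links, other_links) found in the given HTML text.

    `marker` is an assumption: file links are expected to contain "/blob/".
    """
    parser = HrefCollector()
    parser.feed(html)
    code_links = [h for h in parser.hrefs if marker in h]
    other_links = [h for h in parser.hrefs if marker not in h]
    return code_links, other_links


if __name__ == "__main__":
    # Small made-up sample standing in for response.text
    sample = (
        '<a href="/user/repo/blob/master/app.js">app.js</a>'
        '<a href="/login">Sign in</a>'
    )
    code_links, other_links = split_links(sample)
    print(len(code_links), "code links;", len(other_links), "other links")
```

Passing `response.text` from the script above to `split_links` would show whether any code links arrived at all. If none did, the HTML the script received may simply differ from what the browser displays, and comparing `response.status_code` and `response.url` with the browser's would be a reasonable next check.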