Scraping file paths from a GitHub repo returns a 400 response, but the URL works fine in a browser

Time: 2017-10-05 01:14:36

Tags: python python-3.x web-scraping beautifulsoup python-requests

I am trying to scrape all the file paths from a link like this one: https://github.com/themichaelusa/Trinitum/find/master, without using the GitHub API at all.

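The HTML at the link above contains a data-url attribute (on the table with id='tree-finder-results' and class='tree-browser css-truncate'), which is used to build a URL like this: https://github.com/themichaelusa/Trinitum/tree-list/45a2ca7145369bee6c31a54c30fca8d3f0aae6cd

That URL displays this dictionary: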

{"paths":["Examples/advanced_example.py","Examples/basic_example.py","LICENSE","README.md","Trinitum/AsyncManager.py","Trinitum/Constants.py","Trinitum/DatabaseManager.py","Trinitum/Diagnostics.py","Trinitum/Order.py","Trinitum/Pipeline.py","Trinitum/Position.py","Trinitum/RSU.py","Trinitum/Strategy.py","Trinitum/TradingInstance.py","Trinitum/Trinitum.py","Trinitum/Utilities.py","Trinitum/__init__.py","setup.cfg","setup.py"]}

when viewed in Chrome. A GET request made with requests, however, returns <Response [400]>.

Here is the code I am using:

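import requests
from bs4 import BeautifulSoup

username, repo = 'themichaelusa', 'Trinitum'
ghURL = 'https://github.com'
# Page whose HTML carries the data-url attribute
url = ghURL + '/{}/{}/find/master'.format(username, repo)
html = requests.get(url)
soup = BeautifulSoup(html.text, "lxml")
repoContent = soup.find('div', class_='tree-finder clearfix')
# Build the tree-list URL from the table's data-url attribute
fileLinksURL = ghURL + str(repoContent.find('table').attrs['data-url'])
filePaths = requests.get(fileLinksURL)
print(filePaths)  # this second request is the one that comes back with a 400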

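I'm not sure what is wrong with it. My theory is that the first link sets a cookie that allows the second link to show the file paths of the repo being targeted. I'm just not sure how to do that in code. Any pointers would be much appreciated!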
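One way to try the cookie theory is to make both requests through a single session, so that any cookies from the first response are sent with the second. The sketch below is untested against GitHub and rests on assumptions: the names DataUrlFinder and fetch_file_paths are hypothetical, session is expected to behave like requests.Session, and the extra X-Requested-With and Referer headers are a guess that the tree-list endpoint only answers requests resembling the page's own background calls. The standard-library HTMLParser stands in for BeautifulSoup only to keep the sketch free of extra dependencies.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

GH_URL = "https://github.com"


class DataUrlFinder(HTMLParser):
    """Hypothetical helper: records data-url of the 'tree-finder-results' table."""

    def __init__(self):
        super().__init__()
        self.data_url = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "table" and attributes.get("id") == "tree-finder-results":
            self.data_url = attributes.get("data-url")


def fetch_file_paths(session, username, repo, branch="master"):
    """Fetch the finder page, then the tree-list URL, through the same session."""
    find_url = "{}/{}/{}/find/{}".format(GH_URL, username, repo, branch)

    # First request: any cookies set here stay on the session object.
    page = session.get(find_url)
    page.raise_for_status()

    finder = DataUrlFinder()
    finder.feed(page.text)
    if finder.data_url is None:
        raise ValueError("no data-url attribute found on the finder page")

    # Assumption: the endpoint wants headers similar to a browser's XHR call.
    headers = {
        "X-Requested-With": "XMLHttpRequest",
        "Referer": find_url,
        "Accept": "application/json",
    }
    listing = session.get(urljoin(GH_URL, finder.data_url), headers=headers)
    listing.raise_for_status()  # raises if the 400 persists
    return listing.json()["paths"]
```

It could be called as fetch_file_paths(requests.Session(), 'themichaelusa', 'Trinitum'). If the second request still raises for a 400, the cookie and header assumptions above do not hold.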

1 Answer:

Answer 0: (score: 0)

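Give this a try. The links to the .py files are generated dynamically, so to capture them you need to use Selenium. I think this is the output you are expecting.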

from urllib.parse import urljoin

from bs4 import BeautifulSoup
from selenium import webdriver

url = 'https://github.com/themichaelusa/Trinitum/find/master'
driver = webdriver.Chrome()  # launches a Chrome browser session
driver.get(url)
# Parse the page as rendered by the browser, then close it
soup = BeautifulSoup(driver.page_source, "lxml")
driver.quit()
# Resolve each result link's href against the page URL
for link in soup.select('#tree-finder-results .js-tree-finder-path'):
    print(urljoin(url, link['href']))

Partial results:

https://github.com/themichaelusa/Trinitum/blob/master
https://github.com/themichaelusa/Trinitum/blob/master/Examples/advanced_example.py
https://github.com/themichaelusa/Trinitum/blob/master/Examples/basic_example.py
https://github.com/themichaelusa/Trinitum/blob/master/LICENSE
https://github.com/themichaelusa/Trinitum/blob/master/README.md
https://github.com/themichaelusa/Trinitum/blob/master/Trinitum/AsyncManager.py
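
The same blob URLs can also be assembled directly from the tree-list dictionary shown in the question, once that payload is available. This is a minimal sketch, assuming the payload is JSON text with a "paths" key as in the question; the function name blob_urls and its parameters are made up for illustration.

```python
import json
from urllib.parse import quote


def blob_urls(payload, username, repo, branch="master"):
    """Turn the tree-list JSON text into one blob URL per file path."""
    base = "https://github.com/{}/{}/blob/{}/".format(username, repo, branch)
    paths = json.loads(payload)["paths"]
    # quote() percent-encodes unusual characters but leaves '/' untouched
    return [base + quote(path) for path in paths]


if __name__ == "__main__":
    sample = '{"paths":["Examples/advanced_example.py","LICENSE","README.md"]}'
    for file_url in blob_urls(sample, "themichaelusa", "Trinitum"):
        print(file_url)
```

Unlike the Selenium loop, this produces only file URLs, so there is no bare .../blob/master entry at the top of the list.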