How do I scrape the view-source of a page?

Time: 2017-12-23 14:42:30

Tags: python facebook

I am trying to scrape a page through its view-source, for example: view-source:https://www.youtube.com/watch?v=t3-zAlsCJ4c&t=1607s

I can't fetch it with this code:

res = requests.get('view-source:https://www.youtube.com/watch?v=t3-zAlsCJ4c&t=1607s')

It fails with the following error:

  

Traceback (most recent call last):

File "C:\Users\hdtra\Desktop\In processing\Facebook_spider.py", line 31, in <module>
  res = requests.get('view-source:https://www.facebook.com/pg/vuonraunhatrang/about/?ref=page_internal')
File "C:\Program Files\Python36\lib\site-packages\requests\api.py", line 72, in get
  return request('get', url, params=params, **kwargs)
File "C:\Program Files\Python36\lib\site-packages\requests\api.py", line 58, in request
  return session.request(method=method, url=url, **kwargs)
File "C:\Program Files\Python36\lib\site-packages\requests\sessions.py", line 508, in request
  resp = self.send(prep, **send_kwargs)
File "C:\Program Files\Python36\lib\site-packages\requests\sessions.py", line 612, in send
  adapter = self.get_adapter(url=request.url)
File "C:\Program Files\Python36\lib\site-packages\requests\sessions.py", line 703, in get_adapter
  raise InvalidSchema("No connection adapters were found for '%s'" % url)
requests.exceptions.InvalidSchema: No connection adapters were found for 'view-source:https://www.facebook.com/pg/vuonraunhatrang/about/?ref=page_internal'

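How can I scrape this view-source link?

Scraping the normal page with the .get() method does not work for my project.

The normal page does not give me enough information, but the view-source window shows everything I need.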

1 answer:

Answer 0: (score: 0)

You can use BeautifulSoup.

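from bs4 import BeautifulSoup
from urllib.request import urlopen  # Python 3; urllib.urlopen existed only in Python 2

# Placeholder: put the real http(s) URL of the page here.
url = "<url_to_scrape>"

r = urlopen(url).read()                 # raw bytes of the HTML the server returned
soup = BeautifulSoup(r, "html.parser")  # name the parser explicitly to avoid a warning
print(soup.prettify())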
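
The InvalidSchema error in the question most likely comes from the view-source: prefix itself: that prefix is something the browser understands, not a scheme that an HTTP client can connect to. The view-source window shows the HTML the server sent, so requesting the plain https:// URL should give the same kind of content. Below is a minimal sketch under that assumption. The helper names strip_view_source and fetch_source and the User-Agent value are made up for illustration, and it uses only the standard library; passing the cleaned URL to requests.get() should work the same way.

```python
from urllib.request import Request, urlopen

VIEW_SOURCE_PREFIX = "view-source:"


def strip_view_source(url: str) -> str:
    """Return url without a leading 'view-source:' prefix, if present."""
    if url.lower().startswith(VIEW_SOURCE_PREFIX):
        return url[len(VIEW_SOURCE_PREFIX):]
    return url


def fetch_source(url: str, timeout: float = 10.0) -> str:
    """Download the raw HTML for url (hypothetical helper, not a library API)."""
    request = Request(
        strip_view_source(url),
        # Assumption: a browser-like User-Agent; some sites reply differently without one.
        headers={"User-Agent": "Mozilla/5.0"},
    )
    with urlopen(request, timeout=timeout) as response:
        # Use the charset declared by the server, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


if __name__ == "__main__":
    html = fetch_source(
        "view-source:https://www.youtube.com/watch?v=t3-zAlsCJ4c&t=1607s"
    )
    print(html[:500])  # first 500 characters of the page source
```

If the data you need is still missing from this output, it is probably added by JavaScript after the page loads or served only to logged-in users; in that case neither this sketch nor the view-source window of a fresh session would contain it.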