How would I go about reversing the process of Google's AMP API?
I am looking to take an AMP (Accelerated Mobile Pages) URL and recover the regular (original) URL. Does anyone know how to do this in Python (or any other language, for that matter)? Any help would be greatly appreciated.
An example:
https://amp.cnn.com/cnn/2018/03/08/politics/jeff-flake-anti-tariff-bill/
Expected output:
https://cnn.com/2018/03/08/politics/jeff-flake-anti-tariff-bill/
A second example:
https://www.google.ca/amp/s/mobile.nytimes.com/2018/03/08/us/politics/trump-tariff-announcement.amp.html
Expected output:
https://www.nytimes.com/2018/03/08/us/politics/trump-tariff-announcement.html
A third (and final) example:
https://www.google.ca/amp/s/www.theverge.com/platform/amp/2018/3/8/17097904/android-ios-smartphone-brand-loyalty
Expected output:
https://www.theverge.com/2018/3/8/17097904/android-ios-smartphone-brand-loyalty
Unfortunately, AMP implementations appear to vary considerably from site to site. One approach would be to strip out any "amp" together with its surrounding dots (.) or slashes (/). However, I can imagine cases where that would be unwise, mainly when "amp" legitimately belongs in the page's regular URL (for example, at the end of the path) and also appears there in normal browsing.
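To make the concern concrete, here is a minimal sketch of that stripping idea. The function name `naive_strip_amp` and the three patterns are assumptions chosen for illustration, not an established algorithm, and the `example.com` URL is made up.

```python
import re


def naive_strip_amp(url):
    """Hypothetical heuristic: drop "amp" tokens delimited by dots or slashes."""
    # "/amp" as a whole path segment, e.g. ".../platform/amp/2018/..."
    url = re.sub(r"/amp(?=/|$)", "", url)
    # ".amp" as a dotted component, e.g. "...announcement.amp.html"
    url = re.sub(r"\.amp(?=\.|/|$)", "", url)
    # "amp." as the leading subdomain, e.g. "https://amp.cnn.com/..."
    url = re.sub(r"(?<=//)amp\.", "", url)
    return url


# The CNN example keeps its extra "/cnn" segment, so the result is still wrong.
print(naive_strip_amp(
    "https://amp.cnn.com/cnn/2018/03/08/politics/jeff-flake-anti-tariff-bill/"))

# A made-up page whose real path ends in "amp" gets mangled.
print(naive_strip_amp("https://example.com/products/amp/"))
```

Both calls return something other than the real page URL, which is why pure string surgery does not seem reliable across publishers.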
Answer 0 (score: 1)
AMP pages are required to reference their canonical version via:
<link rel="canonical" href="https://www.example.com/url/to/full/document.html">
The correct way to discover the non-AMP version of a page is to fetch the AMP document and extract the href value of its canonical link tag.
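As a rough illustration of that approach, the sketch below uses only the Python 3 standard library. The names `CanonicalLinkParser`, `extract_canonical`, and `canonical_for` are hypothetical helpers introduced here, and the code assumes the server returns the AMP HTML directly to a plain `urllib` request.

```python
import urllib.request
from html.parser import HTMLParser


class CanonicalLinkParser(HTMLParser):
    """Remembers the href of the first <link rel="canonical"> tag seen."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        # rel may hold several space-separated tokens, so split before testing.
        rel_tokens = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_tokens:
            self.canonical = attributes.get("href")


def extract_canonical(html_text):
    """Return the canonical href found in html_text, or None if absent."""
    parser = CanonicalLinkParser()
    parser.feed(html_text)
    return parser.canonical


def canonical_for(url):
    """Fetch url and return its canonical link (performs a network request)."""
    with urllib.request.urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        html_text = response.read().decode(charset, errors="replace")
    return extract_canonical(html_text)
```

Calling `canonical_for(amp_url)` should then give the original URL whenever the page declares one. Some sites may reject requests without a browser-like User-Agent header, which this sketch does not handle.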
Answer 1 (score: 1)
For Python 3, another option is to open the URL and read the final URL from the response. Following @jadelord's answer on another question:
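import urllib.request  # "import urllib" alone does not load the request submodule

def resolve(url):
    # urlopen follows HTTP redirects; geturl() reports the URL finally reached
    return urllib.request.urlopen(url).geturl()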
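Answer 2 (score: 0)
For anyone who runs into this in the future, I thought I would share my solution. Using the information from @daKmoR, I ended up with the following: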
import metadata_parser

# Stray trailing space inside the URL string removed
page = metadata_parser.MetadataParser(url="https://amp.cnn.com/cnn/2018/03/08/politics/jeff-flake-anti-tariff-bill/")
#page = metadata_parser.MetadataParser(url="https://www.google.ca/amp/s/www.theverge.com/platform/amp/2018/3/8/17097904/android-ios-smartphone-brand-loyalty/")
#print(page.metadata)
#TODO: Doesn't work for verge
print("New")
real_URL = page.get_metadata_link('url')
if real_URL:
    print(real_URL)
else:
    print("Boo")
If you hit an error like "TLSV1_ALERT_PROTOCOL_VERSION", you are probably running an outdated version of Python. The "metadata_parser" package referenced above is available on GitHub.
Edit: Here is the updated code, following @sebastian-benz's suggestion.
import metadata_parser

#page = metadata_parser.MetadataParser(url="https://amp.cnn.com/cnn/2018/03/08/politics/jeff-flake-anti-tariff-bill/")
page = metadata_parser.MetadataParser(url="https://www.google.ca/amp/s/mobile.nytimes.com/2018/03/08/us/politics/trump-tariff-announcement.amp.html")
#print(page.metadata)
#TODO: Doesn't work for verge
print("New")
#real_URL = page.get_metadata_link('url')
# Read the <link rel="canonical"> value instead of the generic metadata URL
real_URL = page.get_url_canonical()
if real_URL:
    print(real_URL)
else:
    print("Boo")