Reverse an AMP URL with Python (get the regular URL from an AMP URL)

Date: 2018-03-08 22:10:25

Tags: python python-3.x url amp-html

How would I go about reversing the process of Google's AMP API?

I want to take an AMP (Accelerated Mobile Pages) URL and recover the regular (original) URL. Does anyone know how to do this in Python (or any other language, for that matter)? Any help would be greatly appreciated.

An example:

https://amp.cnn.com/cnn/2018/03/08/politics/jeff-flake-anti-tariff-bill/
Expected output:
https://cnn.com/2018/03/08/politics/jeff-flake-anti-tariff-bill/

A second example:

https://www.google.ca/amp/s/mobile.nytimes.com/2018/03/08/us/politics/trump-tariff-announcement.amp.html
Expected output:
https://www.nytimes.com/2018/03/08/us/politics/trump-tariff-announcement.html

A third (and final) example:

https://www.google.ca/amp/s/www.theverge.com/platform/amp/2018/3/8/17097904/android-ios-smartphone-brand-loyalty
Expected output:
https://www.theverge.com/2018/3/8/17097904/android-ios-smartphone-brand-loyalty

Unfortunately, the way sites implement AMP appears to vary considerably. One approach would be to chop out any "amp" along with its surrounding dots (.) or slashes (/). However, I can imagine cases where that would be unwise, mainly when the regular page URL legitimately contains "amp" (for example at its end) and appears that way in normal browsing.
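
For illustration, here is a minimal sketch of that "chop out amp" idea, written only to show where it breaks. The function name naive_strip_amp and the example.com URL are hypothetical, and the three patterns are just one possible reading of "amp and surrounding dots or slashes".

```python
import re


def naive_strip_amp(url):
    """Remove 'amp' tokens from a URL using plain pattern matching."""
    # Drop a leading "amp." host label, e.g. https://amp.cnn.com -> https://cnn.com
    url = re.sub(r"(?<=//)amp\.", "", url)
    # Drop a whole "/amp" path segment
    url = re.sub(r"/amp(?=/|$)", "", url)
    # Drop ".amp" before an extension or at the end, e.g. page.amp.html -> page.html
    url = re.sub(r"\.amp(?=\.|/|$)", "", url)
    return url


if __name__ == "__main__":
    # The CNN example keeps its extra /cnn/ segment, so it misses the expected output.
    print(naive_strip_amp("https://amp.cnn.com/cnn/2018/03/08/politics/jeff-flake-anti-tariff-bill/"))
    # A hypothetical regular URL that legitimately contains "amp" gets mangled.
    print(naive_strip_amp("https://example.com/guitar/amp/reviews"))
```

On the CNN example this leaves the extra /cnn/ path segment in place, and on the hypothetical guitar URL it removes a segment that belongs there, which is the failure mode described above.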

3 answers:

Answer 0 (score: 1)

AMP pages are required to reference their canonical version via:

<link rel="canonical" href="https://www.example.com/url/to/full/document.html">

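The correct way to discover the non-AMP version of a page is to fetch the AMP document and extract the href value of its canonical link tag.

You can read more about this in the official documentation.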
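
As a rough sketch of that idea using only the standard library: the names CanonicalLinkParser, extract_canonical and canonical_url_for are hypothetical, and the User-Agent header and timeout are assumptions rather than anything AMP requires.

```python
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin


class CanonicalLinkParser(HTMLParser):
    """Remembers the href of the first <link rel="canonical"> tag it sees."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        # rel may hold several space-separated tokens, or be missing entirely
        rel_tokens = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_tokens and attributes.get("href"):
            self.canonical = attributes["href"]


def extract_canonical(html_text):
    """Return the canonical href found in html_text, or None."""
    parser = CanonicalLinkParser()
    parser.feed(html_text)
    return parser.canonical


def canonical_url_for(amp_url, timeout=10):
    """Fetch amp_url and return its canonical URL, or None if it has none."""
    # Assumption: some sites reject the default urllib User-Agent
    request = urllib.request.Request(amp_url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        html_text = response.read().decode(charset, errors="replace")
        final_url = response.geturl()
    canonical = extract_canonical(html_text)
    # Resolve a relative href against the URL that was actually served
    return urljoin(final_url, canonical) if canonical else None
```

Calling canonical_url_for on one of the AMP URLs from the question should return the publisher's regular URL, provided the page is reachable and includes the canonical link tag.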

Answer 1 (score: 1)

For Python 3, another option is to open the URL and read the final URL from the response, following @jadelord's answer on another question:

import urllib.request

def resolve(url):
    # Follow any HTTP redirects and return the URL of the final response.
    return urllib.request.urlopen(url).geturl()

Answer 2 (score: 0)

For anyone who runs into this in the future, I thought I would share my solution. Using the information from @daKmoR, I ended up with the following:

import metadata_parser

# Fetch the AMP page and parse its metadata.
page = metadata_parser.MetadataParser(url="https://amp.cnn.com/cnn/2018/03/08/politics/jeff-flake-anti-tariff-bill/")
#page = metadata_parser.MetadataParser(url="https://www.google.ca/amp/s/www.theverge.com/platform/amp/2018/3/8/17097904/android-ios-smartphone-brand-loyalty/")
#print(page.metadata)
# TODO: Doesn't work for the Verge URL
print("New")
real_URL = page.get_metadata_link('url')
if real_URL:
    print(real_URL)
else:
    print("Boo")

If you run into an error like "TLSV1_ALERT_PROTOCOL_VERSION", you are probably using an outdated version of Python. The "metadata_parser" package referenced above is available on GitHub.

Edit: Here is the updated code, thanks to @sebastian-benz.

import metadata_parser

#page = metadata_parser.MetadataParser(url="https://amp.cnn.com/cnn/2018/03/08/politics/jeff-flake-anti-tariff-bill/")
page = metadata_parser.MetadataParser(url="https://www.google.ca/amp/s/mobile.nytimes.com/2018/03/08/us/politics/trump-tariff-announcement.amp.html")
#print(page.metadata)
# TODO: Doesn't work for the Verge URL
print("New")
#real_URL = page.get_metadata_link('url')
# Read the canonical URL from the page instead of the generic 'url' metadata link.
real_URL = page.get_url_canonical()
if real_URL:
    print(real_URL)
else:
    print("Boo")
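
The Google-hosted examples in the question wrap the publisher's AMP URL behind an /amp/s/ path prefix. Below is a hedged sketch that unwraps that prefix so the publisher's own AMP page can be fetched directly before looking up its canonical link. The helper name unwrap_google_amp is hypothetical, and the convention it relies on (/amp/s/ for HTTPS pages, /amp/ for HTTP pages) is an assumption based on those examples. Whether this resolves the Verge case noted in the TODO has not been verified.

```python
from urllib.parse import urlsplit


def unwrap_google_amp(url):
    """Return the publisher URL wrapped in a Google /amp/ URL, else url unchanged."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    # Only touch hosts such as www.google.ca or google.com
    if not (host.startswith("www.google.") or host.startswith("google.")):
        return url
    path = parts.path
    if path.startswith("/amp/s/"):
        # Assumption: the "s" marker means the wrapped page is served over HTTPS
        scheme, rest = "https", path[len("/amp/s/"):]
    elif path.startswith("/amp/"):
        scheme, rest = "http", path[len("/amp/"):]
    else:
        return url
    if not rest:
        return url
    unwrapped = scheme + "://" + rest
    # Assumption: any query string belongs to the wrapped page
    if parts.query:
        unwrapped += "?" + parts.query
    return unwrapped


if __name__ == "__main__":
    print(unwrap_google_amp(
        "https://www.google.ca/amp/s/www.theverge.com/platform/amp/"
        "2018/3/8/17097904/android-ios-smartphone-brand-loyalty"
    ))
```

The result is still an AMP URL on the publisher's site, so it would then be passed to MetadataParser (or any other canonical-link lookup) as in the code above.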