Question

So I have these URLs that keep changing:

http://news.google.com/news/url?sa=t&fd=R&usg=AFQjCNFcQAQ4S3H5xUuU4N-LoM2I9tLxJg&url=http://www.washingtonpost.com/blogs/going-out-guide/wp/2013/11/08/dallas-buyers-club-thor-the-dark-world-and-other-new-movies-reviewed/

But I want to strip off the changing first part and be left with:

http://www.washingtonpost.com/blogs/going-out-guide/wp/2013/11/08/dallas-buyers-club-thor-the-dark-world-and-other-new-movies-reviewed/

What regular expression would I use to remove everything before it?

I can't use startswith() because the "usg" value in the URL changes.

Answer 1

Use the right tool for the job; use the urlparse module to parse the query string:

import urlparse

qs = urlparse.urlsplit(inputurl).query
url = urlparse.parse_qs(qs).get('url', [None])[0]

This sets url to None if the query string contains no url= element, and to the URL value otherwise.

Demo:

>>> import urlparse
>>> inputurl = 'http://news.google.com/news/url?sa=t&fd=R&usg=AFQjCNFcQAQ4S3H5xUuU4N-LoM2I9tLxJg&url=http://www.washingtonpost.com/blogs/going-out-guide/wp/2013/11/08/dallas-buyers-club-thor-the-dark-world-and-other-new-movies-reviewed/'
>>> qs = urlparse.urlsplit(inputurl).query
>>> urlparse.parse_qs(qs).get('url', [None])[0]
'http://www.washingtonpost.com/blogs/going-out-guide/wp/2013/11/08/dallas-buyers-club-thor-the-dark-world-and-other-new-movies-reviewed/'
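The answer above targets Python 2. In Python 3 the urlparse module was renamed to urllib.parse, so the same idea might look like the sketch below. The function name extract_target_url is made up for illustration and is not part of the original answer.

```python
from urllib.parse import parse_qs, urlsplit


def extract_target_url(redirect_url):
    """Return the value of the 'url' query parameter, or None if it is absent.

    Hypothetical helper name; only urlsplit and parse_qs come from the stdlib.
    """
    query = urlsplit(redirect_url).query
    # parse_qs maps each parameter name to a list of percent-decoded values.
    return parse_qs(query).get("url", [None])[0]


redirect = (
    "http://news.google.com/news/url?sa=t&fd=R"
    "&usg=AFQjCNFcQAQ4S3H5xUuU4N-LoM2I9tLxJg"
    "&url=http://www.washingtonpost.com/blogs/going-out-guide/wp/2013/11/08/"
    "dallas-buyers-club-thor-the-dark-world-and-other-new-movies-reviewed/"
)
print(extract_target_url(redirect))
# A URL without a url= parameter falls through to the [None] default.
print(extract_target_url("http://news.google.com/news/url?sa=t"))
```

Because the lookup goes through .get with a default, a missing url= parameter yields None instead of raising an exception.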

Answer 2

Why not:

print data.split("&url=", 1)[1].split("&", 1)[0]

Sample run:

data = "http://news.google.com/news/url?sa=t&fd=R&usg=AFQjCNFcQAQ4S3H5xUuU4N-LoM2I9tLxJg&url=http://www.washingtonpost.com/blogs/going-out-guide/wp/2013/11/08/dallas-buyers-club-thor-the-dark-world-and-other-new-movies-reviewed/"
print data.split("&url=", 1)[1].split("&", 1)[0]

Output:

http://www.washingtonpost.com/blogs/going-out-guide/wp/2013/11/08/dallas-buyers-club-thor-the-dark-world-and-other-new-movies-reviewed/
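Since the question asked for a regular expression specifically, here is a rough regex counterpart to the split approach. It is only a sketch: the pattern and the names URL_PARAM and extract_with_regex are my own assumptions, not something taken from the answers, and the captured text is still whatever appears in the query string, with no decoding applied.

```python
import re

# Hypothetical pattern: find "url=" right after "?" or "&", then capture
# everything up to the next "&" or "#" (or the end of the string).
URL_PARAM = re.compile(r"[?&]url=([^&#]*)")


def extract_with_regex(redirect_url):
    """Return the raw text of the url= parameter, or None if it is missing."""
    match = URL_PARAM.search(redirect_url)
    return match.group(1) if match else None


data = (
    "http://news.google.com/news/url?sa=t&fd=R"
    "&usg=AFQjCNFcQAQ4S3H5xUuU4N-LoM2I9tLxJg"
    "&url=http://www.washingtonpost.com/blogs/going-out-guide/wp/2013/11/08/"
    "dallas-buyers-club-thor-the-dark-world-and-other-new-movies-reviewed/"
)
print(extract_with_regex(data))
```

Unlike the bare split, this returns None rather than raising IndexError when the marker is absent, and it also matches when url= is the first parameter.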

Answer 3

This works fine:

url = "http://news.google.com/news/url?sa=t&fd=R&usg=AFQjCNFcQAQ4S3H5xUuU4N-LoM2I9tLxJg&url=http://www.washingtonpost.com/blogs/going-out-guide/wp/2013/11/08/dallas-buyers-club-thor-the-dark-world-and-other-new-movies-reviewed/"

In [148]: url.split('&url=')[1]
Out[148]: 'http://www.washingtonpost.com/blogs/going-out-guide/wp/2013/11/08/dallas-buyers-club-thor-the-dark-world-and-other-new-movies-reviewed/'

I would use urlparse.parse_qs(url), as @MartijnPieters mentioned in the comments.

Answer 4

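Note that what is to the right of "&url=" is not a URL. It is a URL-encoded URL. So, for example, if the original URL contained "&", the encoded form will contain "%26" instead. Using it without decoding will work for many URLs, but is not guaranteed in general.

As Martijn said, this will always work: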

import urlparse
data = "http://news.google.com/news/url?sa=t&fd=R&usg=AFQjCNFcQAQ4S3H5xUuU4N-LoM2I9tLxJg&url=http://www.washingtonpost.com/blogs/going-out-guide/wp/2013/11/08/dallas-buyers-club-thor-the-dark-world-and-other-new-movies-reviewed/"
o = urlparse.urlparse(data)
q = urlparse.parse_qs(o.query)
print q['url']  # a list of decoded values; q['url'][0] is the URL itself
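To make the encoding point concrete, here is a small Python 3 sketch (urllib.parse is the Python 3 name for urlparse). The target URL and the shortened redirect prefix are invented for illustration. I build the redirect with quote() on the assumption that the redirector percent-encodes its target, which is what this answer describes.

```python
from urllib.parse import parse_qs, quote, unquote, urlsplit

# Hypothetical target URL whose own query string contains "&".
target = "http://example.com/search?q=thor&page=2"

# Assumed redirector behaviour: the target is percent-encoded so that its
# "&" cannot be mistaken for a separator in the outer query string.
redirect = "http://news.google.com/news/url?sa=t&url=" + quote(target, safe="")

# Plain splitting returns the still-encoded text, not a usable URL.
raw = redirect.split("&url=", 1)[1].split("&", 1)[0]
print(raw)

# Decoding by hand recovers the target; parse_qs does the decoding for you.
print(unquote(raw) == target)
print(parse_qs(urlsplit(redirect).query)["url"][0] == target)
```

The split-based answers return the raw value here, so they would need an extra unquote() step, while parse_qs gives back the decoded URL directly.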

Python regex to remove the unwanted part of a URL
