Python regex to remove the unwanted part of a URL

Date: 2013-11-10 01:44:05

Tags: python regex python-2.7

So I have these URLs that keep changing:

http://news.google.com/news/url?sa=t&fd=R&usg=AFQjCNFcQAQ4S3H5xUuU4N-LoM2I9tLxJg&url=http://www.washingtonpost.com/blogs/going-out-guide/wp/2013/11/08/dallas-buyers-club-thor-the-dark-world-and-other-new-movies-reviewed/

But I want to strip off the changing first part and be left with:

http://www.washingtonpost.com/blogs/going-out-guide/wp/2013/11/08/dallas-buyers-club-thor-the-dark-world-and-other-new-movies-reviewed/

What regex would I use to remove everything before that?

I can't use startswith() because the "usg" value in the URL changes.

4 answers:

Answer 0 (score: 3)

Use the right tool for the job; use the urlparse module to parse the query string:

import urlparse

qs = urlparse.urlsplit(inputurl).query
url = urlparse.parse_qs(qs).get('url', [None])[0]

This sets url to None if the query string contains no url= element, and to the URL value otherwise.

Demo:

>>> import urlparse
>>> inputurl = 'http://news.google.com/news/url?sa=t&fd=R&usg=AFQjCNFcQAQ4S3H5xUuU4N-LoM2I9tLxJg&url=http://www.washingtonpost.com/blogs/going-out-guide/wp/2013/11/08/dallas-buyers-club-thor-the-dark-world-and-other-new-movies-reviewed/'
>>> qs = urlparse.urlsplit(inputurl).query
>>> urlparse.parse_qs(qs).get('url', [None])[0]
'http://www.washingtonpost.com/blogs/going-out-guide/wp/2013/11/08/dallas-buyers-club-thor-the-dark-world-and-other-new-movies-reviewed/'
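For readers on Python 3, where the `urlparse` module was renamed to `urllib.parse`, the same approach might look like the sketch below. The function name `extract_target` is made up for illustration; the calls themselves are the Python 3 counterparts of the ones used above.

```python
from urllib.parse import urlsplit, parse_qs

def extract_target(inputurl):
    """Hypothetical helper: return the decoded 'url' query parameter, or None."""
    qs = urlsplit(inputurl).query
    # parse_qs maps each parameter name to a list of values.
    return parse_qs(qs).get('url', [None])[0]

inputurl = ('http://news.google.com/news/url?sa=t&fd=R'
            '&usg=AFQjCNFcQAQ4S3H5xUuU4N-LoM2I9tLxJg'
            '&url=http://www.washingtonpost.com/blogs/going-out-guide/wp/2013/11/08/'
            'dallas-buyers-club-thor-the-dark-world-and-other-new-movies-reviewed/')
print(extract_target(inputurl))
```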

Answer 1 (score: 1)

Why not

print data.split("&url=", 1)[1].split("&", 1)[0]

Sample run

data = "http://news.google.com/news/url?sa=t&fd=R&usg=AFQjCNFcQAQ4S3H5xUuU4N-LoM2I9tLxJg&url=http://www.washingtonpost.com/blogs/going-out-guide/wp/2013/11/08/dallas-buyers-club-thor-the-dark-world-and-other-new-movies-reviewed/"
print data.split("&url=", 1)[1].split("&", 1)[0]

Output

http://www.washingtonpost.com/blogs/going-out-guide/wp/2013/11/08/dallas-buyers-club-thor-the-dark-world-and-other-new-movies-reviewed/
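Since the question asks for a regular expression, the same idea can be sketched with `re`. This is only an illustration, written for Python 3 (`urllib.parse.unquote`); the helper name `extract_with_regex` and the pattern are assumptions, not something from the answers here. Unlike the plain `split`, the pattern also matches when `url=` is the first parameter (preceded by `?`), and the result is percent-decoded.

```python
import re
from urllib.parse import unquote

# Match "url=" only when it starts a parameter (after "?" or "&"),
# and capture everything up to the next "&" or "#".
URL_PARAM = re.compile(r'[?&]url=([^&#]*)')

def extract_with_regex(data):
    """Hypothetical helper: return the decoded url= value, or None if absent."""
    match = URL_PARAM.search(data)
    if match is None:
        return None
    # The captured value is still percent-encoded, so decode it.
    return unquote(match.group(1))

data = ("http://news.google.com/news/url?sa=t&fd=R"
        "&usg=AFQjCNFcQAQ4S3H5xUuU4N-LoM2I9tLxJg"
        "&url=http://www.washingtonpost.com/blogs/going-out-guide/wp/2013/11/08/"
        "dallas-buyers-club-thor-the-dark-world-and-other-new-movies-reviewed/")
print(extract_with_regex(data))
```

Note that `unquote` does not turn `+` into a space the way `parse_qs` does, so the query-string parser is probably still the safer choice.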

Answer 2 (score: 1)

This works fine:

url = "http://news.google.com/news/url?sa=t&fd=R&usg=AFQjCNFcQAQ4S3H5xUuU4N-LoM2I9tLxJg&url=http://www.washingtonpost.com/blogs/going-out-guide/wp/2013/11/08/dallas-buyers-club-thor-the-dark-world-and-other-new-movies-reviewed/"

In [148]: url.split('&url=')[1]
Out[148]: 'http://www.washingtonpost.com/blogs/going-out-guide/wp/2013/11/08/dallas-buyers-club-thor-the-dark-world-and-other-new-movies-reviewed/'

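I would use urlparse.parse_qs(url), as @MartijnPieters mentioned in the comments.

Answer 3 (score: 1)

Note that what is to the right of "&url=" is not a URL. It is a URL-encoded URL. So, for example, if the original URL contains "&", the encoded value will contain "%26" instead. Using it without decoding will work for many URLs, but it is not guaranteed to work in general.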

As Martijn said, this will always work correctly:

import urlparse
data = "http://news.google.com/news/url?sa=t&fd=R&usg=AFQjCNFcQAQ4S3H5xUuU4N-LoM2I9tLxJg&url=http://www.washingtonpost.com/blogs/going-out-guide/wp/2013/11/08/dallas-buyers-club-thor-the-dark-world-and-other-new-movies-reviewed/"
o = urlparse.urlparse(data)
q = urlparse.parse_qs(o.query)
print q['url']
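To make the encoding point concrete, here is a small Python 3 sketch (using `urllib.parse`, the renamed `urlparse`). The target URL and the way the wrapper URL is built are invented for illustration; they only assume that the redirect link percent-encodes its url= value, as described above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical target URL that itself contains "?" and "&".
target = 'http://example.com/search?q=python&page=2'

# Build a wrapper URL whose url= value is percent-encoded.
wrapped = 'http://news.google.com/news/url?' + urlencode({'sa': 't', 'url': target})

# String splitting returns the value still encoded ("%26" instead of "&").
raw = wrapped.split('&url=', 1)[1].split('&', 1)[0]

# Parsing the query string decodes the value.
decoded = parse_qs(urlsplit(wrapped).query)['url'][0]

print(raw == target)      # False
print(decoded == target)  # True
```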