So I have these URLs that keep changing:
http://news.google.com/news/url?sa=t&fd=R&usg=AFQjCNFcQAQ4S3H5xUuU4N-LoM2I9tLxJg&url=http://www.washingtonpost.com/blogs/going-out-guide/wp/2013/11/08/dallas-buyers-club-thor-the-dark-world-and-other-new-movies-reviewed/
But I want to strip off the changing first part and be left with:
http://www.washingtonpost.com/blogs/going-out-guide/wp/2013/11/08/dallas-buyers-club-thor-the-dark-world-and-other-new-movies-reviewed/
What regular expression would I use to remove everything before it?
I can't use startswith(), because the "usg" value in the URL changes.
Answer 0 (score: 3)
Use the right tool for the job; parse the query string with the urlparse module:
import urlparse
qs = urlparse.urlsplit(inputurl).query
url = urlparse.parse_qs(qs).get('url', [None])[0]
If the query string contains no url element, this sets url to None; otherwise url is set to the value of that element.
Demo:
>>> import urlparse
>>> inputurl = 'http://news.google.com/news/url?sa=t&fd=R&usg=AFQjCNFcQAQ4S3H5xUuU4N-LoM2I9tLxJg&url=http://www.washingtonpost.com/blogs/going-out-guide/wp/2013/11/08/dallas-buyers-club-thor-the-dark-world-and-other-new-movies-reviewed/'
>>> qs = urlparse.urlsplit(inputurl).query
>>> urlparse.parse_qs(qs).get('url', [None])[0]
'http://www.washingtonpost.com/blogs/going-out-guide/wp/2013/11/08/dallas-buyers-club-thor-the-dark-world-and-other-new-movies-reviewed/'
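The snippets in this answer are Python 2. In Python 3 the urlparse module was merged into urllib.parse, so the same approach might look roughly like the sketch below. The helper name extract_target is made up for illustration and is not part of any library.

```python
from urllib.parse import urlsplit, parse_qs


def extract_target(inputurl):
    """Return the value of the 'url' query parameter, or None if it is absent.

    extract_target is a hypothetical helper name used only in this sketch.
    """
    qs = urlsplit(inputurl).query
    # parse_qs maps each key to a list of values; take the first one.
    return parse_qs(qs).get('url', [None])[0]


inputurl = (
    'http://news.google.com/news/url?sa=t&fd=R'
    '&usg=AFQjCNFcQAQ4S3H5xUuU4N-LoM2I9tLxJg'
    '&url=http://www.washingtonpost.com/blogs/going-out-guide/wp/2013/11/08/'
    'dallas-buyers-club-thor-the-dark-world-and-other-new-movies-reviewed/'
)
print(extract_target(inputurl))
```

Because the lookup goes through .get('url', [None]), a URL without a url parameter yields None instead of raising.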
Answer 1 (score: 1)
Why not simply:
print data.split("&url=", 1)[1].split("&", 1)[0]
Sample run
data = "http://news.google.com/news/url?sa=t&fd=R&usg=AFQjCNFcQAQ4S3H5xUuU4N-LoM2I9tLxJg&url=http://www.washingtonpost.com/blogs/going-out-guide/wp/2013/11/08/dallas-buyers-club-thor-the-dark-world-and-other-new-movies-reviewed/"
print data.split("&url=", 1)[1].split("&", 1)[0]
Output
http://www.washingtonpost.com/blogs/going-out-guide/wp/2013/11/08/dallas-buyers-club-thor-the-dark-world-and-other-new-movies-reviewed/
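One caveat with the split one-liner is that indexing [1] raises IndexError when "&url=" does not occur in the string. A minimal sketch of a guarded variant using str.partition follows. The function name strip_redirect and its marker parameter are hypothetical, and the sketch still assumes the target URL is not percent-encoded.

```python
def strip_redirect(data, marker="&url="):
    """Return the text after `marker` up to the next '&', or None if absent.

    strip_redirect and `marker` are illustrative names, not an existing API.
    """
    head, sep, tail = data.partition(marker)
    if not sep:
        # The marker was not found, so there is nothing to extract.
        return None
    # Drop any query parameters that follow the embedded URL.
    return tail.split("&", 1)[0]


data = (
    "http://news.google.com/news/url?sa=t&fd=R"
    "&usg=AFQjCNFcQAQ4S3H5xUuU4N-LoM2I9tLxJg"
    "&url=http://www.washingtonpost.com/blogs/going-out-guide/wp/2013/11/08/"
    "dallas-buyers-club-thor-the-dark-world-and-other-new-movies-reviewed/"
)
print(strip_redirect(data))
```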
Answer 2 (score: 1)
This works fine:
url = "http://news.google.com/news/url?sa=t&fd=R&usg=AFQjCNFcQAQ4S3H5xUuU4N-LoM2I9tLxJg&url=http://www.washingtonpost.com/blogs/going-out-guide/wp/2013/11/08/dallas-buyers-club-thor-the-dark-world-and-other-new-movies-reviewed/"
In [148]: url.split('&url=')[1]
Out[148]: 'http://www.washingtonpost.com/blogs/going-out-guide/wp/2013/11/08/dallas-buyers-club-thor-the-dark-world-and-other-new-movies-reviewed/'
That said, I would use urlparse.parse_qs(url), as @MartijnPieters mentioned in the comments.
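Answer 3 (score: 1)
Note that what appears to the right of "&url=" is not a URL. It is a URL-encoded URL. So, for example, if the original URL contains "&", the encoded form will contain "%26" instead. Using it without decoding will work for many URLs, but it is not guaranteed to work in general.
As Martijn said, this handles the decoding correctly: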
import urlparse
data = "http://news.google.com/news/url?sa=t&fd=R&usg=AFQjCNFcQAQ4S3H5xUuU4N-LoM2I9tLxJg&url=http://www.washingtonpost.com/blogs/going-out-guide/wp/2013/11/08/dallas-buyers-club-thor-the-dark-world-and-other-new-movies-reviewed/"
o = urlparse.urlparse(data)
q = urlparse.parse_qs(o.query)
print q['url'][0]  # parse_qs returns a list of values per key; take the first
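To make the encoding point concrete, here is a small Python 3 sketch (urllib.parse replaces the Python 2 urlparse module). The target URL on example.com is invented for illustration, and the wrapper is built with urlencode, on the assumption that a redirect service percent-encodes the target as described above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical target that itself contains '&' in its query string.
target = 'http://example.com/search?a=1&b=2'

# Build a redirect-style URL; urlencode percent-encodes the target,
# so its '&' becomes '%26'.
wrapped = 'http://news.google.com/news/url?' + urlencode({'sa': 't', 'url': target})

# Naive string splitting returns the still-encoded value.
naive = wrapped.split('&url=', 1)[1].split('&', 1)[0]

# parse_qs decodes the value, recovering the original target.
parsed = parse_qs(urlsplit(wrapped).query)['url'][0]

print(naive)
print(parsed)
```

The split-based result still contains "%26" and would need a separate decoding step, while parse_qs returns the original target directly.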