Question

I want to extract shortened URLs from tweets, if there are any. These URLs follow a standard format: http://t.co (details here)

To do this, I used the following regular expression, which works fine when I test it on tweet text stored as a plain string.

Note: I am using https://shortnedurl/string in place of the real shortened URL, because StackOverflow does not allow posting such URLs here.

Sample code:

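import re

tweet = "Grim discovery in the USS McCain collision probe https://shortnedurl.com @MattRiversCNN reports #TheLead"

# Match http/https URLs built from letters, digits, URL punctuation, and %-escapes.
# Note: inside the character class, $-_ is a range that also covers /, :, <, =, >, and ?.
urls = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+',
                  tweet)
for url in urls:
    print("printing urls", url)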

Output of this code:

printing urls https://shortnedurl.com

However, when I read tweets from Twitter through its API and run the same regex on them, I get the following output, which is not what I want.

printing urls https://https://shortnedurl/string
printing urls https://https://shortnedurl/string</a></span>
printing urls https://twitter.com/MattRiversCNN
printing urls https://twitter.com/search?q=%23TheLead

It seems to be picking up the URLs for the Twitter ID as well as the hashtag.

How can I handle this? I only want to read the http://t.co URLs.

UPDATE1: I tried https?://t.co/\S*, but I still get the following noisy URLs:

printing urls https://https://shortnedurl/string
printing urls https://https://shortnedurl/string>https://https://shortnedurl/string</a></span>

I am not sure why the same URL is found again along with </a><span>.

With https?://t.co/\S+, I get an invalid URL, because it merges the two URLs above into one:

printing urls https://https://shortnedurl/string>https://https://shortnedurl/string</a></span>

UPDATE2: The tweet text looks slightly different from what I expected:

    Grim discovery in the USS McCain collision probe 
<span class="link"><a href="https://shortenedurl">https://shortenedurl</a></span> <span class="username"><a 
href="https://twitter.com/MattRiversCNN">@MattRiversCNN</a></span>
     reports <span class="tag"><a href="https://twitter.com/search?
    q=%23TheLead">#TheLead</a></span>

Answer 1

If I understand correctly, just put the string you want to match into the regex, like this:

https?://shortnedurl.com/\S*
# look for http:// or https://
# shortnedurl.com/ literally
# followed by anything not a whitespace character, 0+

See a demo on regex101.com. For your specific case:

https?://t\.co/\S*
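
As a minimal sketch of applying this in Python: since the text in UPDATE2 is HTML, \S will also match quotes and angle brackets, so the match can run past the end of the URL into the markup. The variant below swaps \S for a character class that stops at whitespace, quotes, and angle brackets. That tweak, the helper name extract_tco_urls, and the t.co path in the sample are assumptions made for illustration, not part of the pattern above.

```python
import re

# Hypothetical HTML fragment modeled on UPDATE2; the t.co path is made up.
html = (
    'probe <span class="link"><a href="https://t.co/abc123">'
    'https://t.co/abc123</a></span> '
    '<span class="username"><a href="https://twitter.com/MattRiversCNN">'
    '@MattRiversCNN</a></span>'
)

# Stop at whitespace, quotes, and angle brackets so markup is not swallowed.
T_CO = re.compile(r'https?://t\.co/[^\s"\'<>]+')

def extract_tco_urls(text):
    """Return unique t.co URLs in order of first appearance."""
    seen = []
    for url in T_CO.findall(text):
        if url not in seen:
            seen.append(url)
    return seen

print(extract_tco_urls(html))
```

The URL appears twice in the markup (once in the href attribute and once as the link text), which is why the result is de-duplicated before it is returned.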

Answer 2

You can use the regex

https?://t\.co/\S+
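
If the text you receive is HTML like the fragment in UPDATE2, an alternative worth considering is to parse the markup and read the href attributes instead of running a regex over the raw string. The following is only a sketch using Python's standard html.parser and urllib.parse modules; the class name TcoLinkCollector, the function tco_links, and the t.co path in the sample are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

class TcoLinkCollector(HTMLParser):
    """Collect href values of <a> tags whose host is t.co (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # Keep only links whose host part is exactly t.co.
            if name == "href" and value and urlparse(value).netloc == "t.co":
                self.urls.append(value)

def tco_links(html):
    """Return the t.co hrefs found in an HTML fragment, in document order."""
    collector = TcoLinkCollector()
    collector.feed(html)
    collector.close()
    return collector.urls

# Sample modeled on UPDATE2; the t.co path is made up.
sample = (
    '<span class="link"><a href="https://t.co/abc123">https://t.co/abc123</a></span> '
    '<span class="tag"><a href="https://twitter.com/search?q=%23TheLead">#TheLead</a></span>'
)
print(tco_links(sample))
```

Reading the href attribute avoids both the trailing </a></span> and the duplicate match on the link text, because only the attribute value is inspected.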

How do I extract or get all shortened URLs from a tweet?
