Question

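I want to extract the Facebook or other social media profile URLs from a SoundCloud profile page.

Example URL: https://soundcloud.com/netztherapie

This profile contains links to social media profiles. When I use a regular expression for Facebook, it fails to capture the link.

I would like to know what the regular expression should be for this code:

CODE: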

<a href="https://exit.sc?url=https%3A%2F%2Fwww.facebook.com%2FNetztherapie-641177919313762%2F" target="_blank" rel="me nofollow" class="web-profile sc-link-light sc-social-logo-interactive">
  <span class="sc-social-logo sc-social-logo-facebook"></span>
  Wir auf Facebook!
</a>

I want to extract:

https://www.facebook.com/Netztherapie/

Answer 1

Something along these lines is probably what you want:

regex = r"www\.facebook\.com%2F([^-]+)-"

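You probably do not need to capture www.facebook.com or https:// at all, since you already know them. It is easier to grab just the name and then build the string yourself. Here is an easy-to-read example (although the string concatenation is not idiomatic Python):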

import re

# Raw string so that "\." reaches the regex engine unchanged.
# Captures the page name between the encoded slash (%2F) and the first "-".
regex = r"www\.facebook\.com%2F([^-]+)-"

match = re.search(regex, """a href="https://exit.sc?url=https%3A%2F%2Fwww.facebook.com%2FNetztherapie-641177919313762%2F" target="_blank" rel="me nofollow" class="web-profile sc-link-light sc-social-logo-interactive"> Wir auf Facebook!""")
if match:
    print("yep")
    thename = match.group(1)
    # Rebuild the profile URL from the captured name.
    print("https://www.facebook.com/" + thename + "/")
else:
    print("nope")

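The regex simply grabs every character that is not a "-" after the base URL, so a page name that itself contains a hyphen would be cut short. This should at least get you heading in the right direction. You may need to tweak the regex after running some tests; for example, you may not want the "www." in there. I am not sure how uniform all the SoundCloud URLs are.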
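If the links turn out to be less uniform, one possible tweak is to decode the exit.sc redirect first and only then look at the Facebook URL. The following is a minimal sketch, not a tested scraper: the helper names extract_social_urls and facebook_profile are made up for illustration, and it assumes every social link has the href="https://exit.sc?url=..." shape shown in the question and that a trailing "-digits" part is always a page ID.

```python
import re
from urllib.parse import urlparse, parse_qs


def extract_social_urls(html):
    """Return the decoded target of every exit.sc redirect link in html.

    Hypothetical helper: assumes the href layout shown in the question.
    """
    targets = []
    for href in re.findall(r'href="(https://exit\.sc/?\?[^"]+)"', html):
        # parse_qs percent-decodes the value of the "url" parameter.
        query = parse_qs(urlparse(href).query)
        targets.extend(query.get("url", []))
    return targets


def facebook_profile(url):
    """Return the Facebook page URL without a trailing numeric ID, or None.

    Hypothetical helper: "www." is optional, and the name may contain hyphens.
    """
    match = re.match(
        r"https?://(?:www\.)?facebook\.com/([^/?#]+?)(?:-\d+)?/?$", url
    )
    if not match:
        return None
    return f"https://www.facebook.com/{match.group(1)}/"


snippet = (
    '<a href="https://exit.sc?url=https%3A%2F%2Fwww.facebook.com'
    '%2FNetztherapie-641177919313762%2F" target="_blank">'
)
for target in extract_social_urls(snippet):
    print(target, "->", facebook_profile(target))
```

Decoding first means the same loop also yields Twitter, Instagram, or other links, for which facebook_profile simply returns None. Note that a page whose real name ends in "-digits" would lose that suffix under this assumption.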

Regex for SoundCloud social media profiles/URLs
