Question

I have a string like this:

<iframe src="https://www.facebook.com/plugins/post.php?href=https%3A%2F%2Fwww.facebook.com%2FDoctorTaniya%2Fposts%2F1906676949620646&width=500" width="500" height="482" style="border:none;overflow:hidden" scrolling="no" frameborder="0" allowTransparency="true"></iframe>

I want to extract the link:

www.facebook.com/DoctorTaniya/posts/1906676949620646

How can I write a Python script to do this?

Answer 1

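I think it is best to use Beautiful Soup.

The text to parse is an iframe tag with a src attribute. The URL you are trying to retrieve sits inside src, after href= and before &width.

After that, you need to decode the percent-encoded URL back into plain text.

First, feed the string to Beautiful Soup and read the attribute from it: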

text = '<iframe src="https://www.facebook.com/plugins/post.php?href=https%3A%2F%2Fwww.facebook.com%2FDoctorTaniya%2Fposts%2F1906676949620646&width=500" width="500" height="482" style="border:none;overflow:hidden" scrolling="no" frameborder="0" allowTransparency="true"></iframe>'
soup = BeautifulSoup(text, "html.parser")

src_attribute = soup.find("iframe")["src"]

Then you can either use a regular expression or .split() (very hacky):

# Regex
link = re.search(r'href=([^&]+)', src_attribute).group(1)

# .split()
link = src_attribute.split("href=")[1].split("&")[0]

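Finally, you need to decode the URL with unquote (urllib.parse.unquote on Python 3, urllib2.unquote on Python 2):

link = urllib.parse.unquote(link)

And you're done!

The resulting code looks like this: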

import re
import urllib.parse

from bs4 import BeautifulSoup

text = '<iframe src="https://www.facebook.com/plugins/post.php?href=https%3A%2F%2Fwww.facebook.com%2FDoctorTaniya%2Fposts%2F1906676949620646&width=500" width="500" height="482" style="border:none;overflow:hidden" scrolling="no" frameborder="0" allowTransparency="true"></iframe>'
soup = BeautifulSoup(text, "html.parser")

# Read the src attribute of the iframe tag
src_attribute = soup.find("iframe")["src"]

# Option 1: regex, capturing everything after href= up to the next &
link = re.search(r'href=([^&]+)', src_attribute).group(1)
# Option 2: .split() (gives the same result here)
link = src_attribute.split("href=")[1].split("&")[0]

# Percent-decode, e.g. %3A becomes ":" and %2F becomes "/"
link = urllib.parse.unquote(link)
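
As a possible alternative to the regex and .split() steps: the src value is an ordinary URL, so the standard library's urllib.parse can pick out the href query parameter and decode it in one go. This is only a minimal Python 3 sketch, not part of the original answer. The helper name extract_href and its keep_scheme flag are made up for illustration, and the src value is assumed to have been obtained already (for example via soup.find("iframe")["src"]).

```python
from urllib.parse import urlparse, parse_qs

def extract_href(src, keep_scheme=True):
    """Return the decoded 'href' query parameter of src.

    extract_href is a hypothetical helper name, not a library function.
    """
    query = urlparse(src).query       # 'href=https%3A%2F%2F...&width=500'
    params = parse_qs(query)          # values come back percent-decoded, as lists
    link = params["href"][0]
    if not keep_scheme:
        # Drop the 'https://' prefix to match the form asked for in the question.
        parsed = urlparse(link)
        link = parsed.netloc + parsed.path
    return link

src = ("https://www.facebook.com/plugins/post.php"
       "?href=https%3A%2F%2Fwww.facebook.com%2FDoctorTaniya%2Fposts%2F1906676949620646"
       "&width=500")
print(extract_href(src))
print(extract_href(src, keep_scheme=False))
```

Because parse_qs splits on & and decodes each value, no separate unquote call is needed in this variant.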

Answer 2

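Here is some useful information about using regex to find URLs in Python.

If all of your encoded URLs start right after .php?href=, you can write a loop that stops when it finds ?href= and splits the string there.

Alternatively, you could use PHP's $_GET[] and print the value; here is another post you may want to read.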
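
The "find ?href= and split" idea could look roughly like the sketch below. It assumes Python 3, and the function name href_after_marker is invented for illustration. It also assumes the encoded URL ends at the next & or at the double quote that closes the src attribute.

```python
from urllib.parse import unquote

def href_after_marker(text, marker="?href="):
    """Return the decoded URL that follows marker, or None if marker is absent."""
    _, found, rest = text.partition(marker)
    if not found:
        return None
    # The encoded URL runs until the next '&' or the closing quote of src;
    # a literal '&' inside the target URL would itself be encoded as %26.
    end = len(rest)
    for stop in ("&", '"'):
        pos = rest.find(stop)
        if pos != -1:
            end = min(end, pos)
    return unquote(rest[:end])

text = '<iframe src="https://www.facebook.com/plugins/post.php?href=https%3A%2F%2Fwww.facebook.com%2FDoctorTaniya%2Fposts%2F1906676949620646&width=500" width="500" height="482"></iframe>'
print(href_after_marker(text))
```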

Answer 3

import re

string = '<iframe src="https://www.facebook.com/plugins/post.php?href=https%3A%2F%2Fwww.facebook.com%2FDoctorTaniya%2Fposts%2F1906676949620646&width=500" width="500" height="482" style="border:none;overflow:hidden" scrolling="no" frameborder="0" allowTransparency="true"></iframe>'

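# Capture everything between the encoded "https://" and "&width"
m = re.search(r'href=https%3A%2F%2F(.*)&width', string)
str2 = m.group(1)
# Turn the encoded slashes back into "/"
str2.replace('%2F', '/')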

Output

>>> str2.replace('%2F', '/')
'www.facebook.com/DoctorTaniya/posts/1906676949620646'
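
Note that replacing only %2F leaves any other percent escape (for example %3A or %20) untouched, and the pattern hard-codes both the https scheme and the &width parameter. A slightly more general variant, offered as a sketch that assumes Python 3 and that the href value ends at the next & or double quote, might be:

```python
import re
from urllib.parse import unquote

string = '<iframe src="https://www.facebook.com/plugins/post.php?href=https%3A%2F%2Fwww.facebook.com%2FDoctorTaniya%2Fposts%2F1906676949620646&width=500" width="500" height="482" style="border:none;overflow:hidden" scrolling="no" frameborder="0" allowTransparency="true"></iframe>'

# Capture everything after 'href=' up to the next '&' or double quote.
m = re.search(r'href=([^&"]+)', string)
decoded = unquote(m.group(1))        # decodes %3A, %2F and any other escape

# Strip the scheme to get the form asked for in the question.
without_scheme = re.sub(r'^https?://', '', decoded)
print(without_scheme)
```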

Answer 4

Use a combination of BeautifulSoup, re and urllib:

from bs4 import BeautifulSoup
import re
from urllib.parse import unquote

html = """
<iframe src="https://www.facebook.com/plugins/post.php?href=https%3A%2F%2Fwww.facebook.com%2FDoctorTaniya%2Fposts%2F1906676949620646&width=500" width="500" height="482" style="border:none;overflow:hidden" scrolling="no" frameborder="0" allowTransparency="true"></iframe>
<p>some other rubbish here</p>
"""

# da soup
soup = BeautifulSoup(html, 'html5lib')

# href, (anything not &) afterwards
rx = re.compile(r'href=([^&]+)')

for iframe in soup.find_all('iframe'):
    link = unquote(rx.search(iframe['src']).group(1))
    print(link)
    # https://www.facebook.com/DoctorTaniya/posts/1906676949620646

It parses the HTML, looks for iframes, analyzes them with a regular expression and unquotes the URL it finds. Thus, you do not run the regular expression against the DOM directly.
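
If installing bs4 and html5lib is not an option, the same parse-then-extract idea can be approximated with the standard library alone. The sketch below assumes Python 3; the class name IframeHrefCollector is invented for illustration, and it swaps the regex for urllib.parse.parse_qs, so it is a variation on the approach rather than the answer's own code.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse, parse_qs

class IframeHrefCollector(HTMLParser):
    """Collects the decoded 'href' query parameter of every iframe src."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "iframe":
            return
        src = dict(attrs).get("src")
        if not src:
            return
        # parse_qs percent-decodes the values and returns them as lists.
        hrefs = parse_qs(urlparse(src).query).get("href")
        if hrefs:
            self.links.append(hrefs[0])

html = """
<iframe src="https://www.facebook.com/plugins/post.php?href=https%3A%2F%2Fwww.facebook.com%2FDoctorTaniya%2Fposts%2F1906676949620646&width=500" width="500" height="482"></iframe>
<p>some other rubbish here</p>
"""

collector = IframeHrefCollector()
collector.feed(html)
print(collector.links)
```

Iframes without a src, or whose src has no href parameter, are simply skipped.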
