Question

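Please help with Python. I have been trying to scrape a web page with Python. When I try to get the iframe src values from this URL, I only get one iframe source back.

This is the page I am trying to scrape.

Source 1

Source 2

Here is my Python code: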

iframe = re.compile( '<iframe.*src="(.*?)"' ).findall( html )

This gives me only 1 iframe, but the page has 4 iframes.

Thanks

Answer 1

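I strongly recommend Beautiful Soup. For Python, it is a widely used option that will do the parsing for you.

To extract the sources of your <iframe/> elements, you could use something like: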

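from bs4 import BeautifulSoup
import requests

# The page from the question
url = "http://kathaltamil.com/?v=Kanithan"

resp = requests.get(url)
# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(resp.text, "html.parser")
for frame in soup.find_all('iframe'):
    print(frame['src'])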

For the URL you specified, this produces the following output:

http://www.playhd.video/embed.php?vid=xxx
http://mersalaayitten.com/embed/xxx
http://www.playhd.video/embed.php?vid=xxx
http://googleplay.tv/videos/kanithan?iframe=true
//www.facebook.com/plugins/likebox.php?href=https%3A%2F%2Fwww.facebook.com%2Fkathaltamilmovie&width=600&height=188&colorscheme=light&show_faces=true&header=false&stream=false&show_border=true
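
Note that the last line of that output is protocol-relative: it starts with // and has no scheme. If you plan to fetch the iframe sources afterwards, one way to normalize them is the standard library's urljoin, resolved against the page URL. This is only a minimal sketch; the shortened src values below are stand-ins for illustration, not the real scraped output.

```python
from urllib.parse import urljoin

# The page the iframes were scraped from
page_url = "http://kathaltamil.com/?v=Kanithan"

# Stand-in src values (assumed for illustration): one absolute, one protocol-relative
srcs = [
    "http://googleplay.tv/videos/kanithan?iframe=true",
    "//www.facebook.com/plugins/likebox.php",
]

# urljoin leaves absolute URLs alone and borrows the page's scheme for "//host/..." ones
absolute_srcs = [urljoin(page_url, src) for src in srcs]

for src in absolute_srcs:
    print(src)
```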

Answer 2

If you only want the four together, you can use BeautifulSoup CSS selectors to pull them from the second table, which is where the four iframes are. Specifically, use nth-of-type(2) to select the second table:

from bs4 import BeautifulSoup
import requests

html = requests.get("http://kathaltamil.com/?v=Kanithan").content
soup = BeautifulSoup(html, "html.parser")

urls = [ifr["src"] for ifr in soup.select("table:nth-of-type(2)")[0].select("iframe")]

That gives you just the four:

['http://www.playhd.video/embed.php?vid=621', 
'http://mersalaayitten.com/embed/3752', 
'http://www.playhd.video/embed.php?vid=584', 
'http://googleplay.tv/videos/kanithan?iframe=true']

Using lxml and XPath is simpler, and gives you the same result:

import requests
from lxml.etree import fromstring, HTMLParser

html = requests.get("http://kathaltamil.com/?v=Kanithan").content

# Parse with lxml's HTML parser, then select every iframe src under the second table
xml = fromstring(html, HTMLParser())

print(xml.xpath("//table[2]//iframe/@src"))

Whichever you choose, it will be better than your regex.
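
If installing a third-party library is not an option, the standard library's html.parser module can also collect iframe sources without a regex. The sketch below is not what either approach above does: it gathers every iframe on the page rather than only those in the second table, and the class name and sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class IframeSrcCollector(HTMLParser):
    """Hypothetical helper: records the src of every <iframe> start tag it sees."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; attrs is a list of (name, value) pairs
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src:  # skip iframes that have no src attribute
                self.sources.append(src)


# Made-up markup standing in for the real page
sample_html = """
<table><tr><td><iframe src="http://example.com/embed/1"></iframe></td></tr></table>
<table><tr><td>
  <iframe width="600" src="http://example.com/embed/2"></iframe>
  <iframe src="//example.com/embed/3"></iframe>
  <iframe></iframe>
</td></tr></table>
"""

collector = IframeSrcCollector()
collector.feed(sample_html)
sources = collector.sources
print(sources)
```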

Answer 3

It looks like you forgot a question mark (?) after the first .* in your pattern. The correct version is:

iframe = re.compile( '<iframe.*?src="(.*?)"' ).findall( html )

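In general, keep in mind that regular expressions are not a good way to parse HTML pages. Beautiful Soup, lxml, Scrapy and other libraries will be more efficient and more robust.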
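
To see why the missing question mark matters, here is a small sketch on a made-up one-line HTML string (the real page is assumed to have several iframes on a single line, which is what triggers the problem). The greedy .* runs to the last src=" on the line, so findall reports a single match; the lazy .*? stops at the nearest one.

```python
import re

# Made-up markup: three iframes on one line
html = '<iframe src="a"></iframe><iframe src="b"></iframe><iframe src="c"></iframe>'

# Greedy: ".*" swallows everything up to the LAST src=" on the line
greedy = re.findall(r'<iframe.*src="(.*?)"', html)

# Lazy: ".*?" stops at the FIRST src=" after each <iframe
lazy = re.findall(r'<iframe.*?src="(.*?)"', html)

print(greedy)
print(lazy)
```

Even the lazy pattern can misfire, for example when an iframe has no src and the match runs on into the next tag that does, which is part of why a real parser is the safer choice.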
