Question

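I am trying to extract some links for a specific file host from the watchseriesfree.to website. In the example below I want the rapidvideo links, so I use a regular expression to select the tags whose text contains rapidvideo: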

import re
import urllib2
from bs4 import BeautifulSoup

def gethtml(link):
    req = urllib2.Request(link, headers={'User-Agent': "Magic Browser"})
    con = urllib2.urlopen(req)
    html = con.read()
    return html


def findLatest():
    url = "https://watchseriesfree.to/serie/Madam-Secretary"
    head = "https://watchseriesfree.to"

    soup = BeautifulSoup(gethtml(url), 'html.parser')
    latep = soup.find("a", title=re.compile('Latest Episode'))

    soup = BeautifulSoup(gethtml(head + latep['href']), 'html.parser')
    firstVod = soup.findAll("tr",text=re.compile('rapidvideo'))

    return firstVod

print(findLatest())

However, the code above returns an empty list. What am I doing wrong?

Answer 1

The problem is in this line:

firstVod = soup.findAll("tr",text=re.compile('rapidvideo'))

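When BeautifulSoup applies your text regex pattern, it tests it against the .string attribute of every matched tr element. Now, .string has an important caveat: when an element has multiple children, .string is None:

    If a tag contains more than one thing, then it's not clear what .string should refer to, so .string is defined to be None.

Hence, you get no results.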
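The .string caveat can be reproduced in isolation. The snippet below is a minimal sketch; the two one-row fragments are made-up markup, not the real page, and the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Hypothetical fragments standing in for rows of the real page.
single = BeautifulSoup("<tr><td>rapidvideo.com</td></tr>", "html.parser").tr
multi = BeautifulSoup(
    "<tr><td>rapidvideo.com</td><td>Watch</td></tr>", "html.parser"
).tr

# One child chain (tr -> td -> text): .string resolves to that text.
single_string = single.string

# Two td children: .string is None, so a text regex has nothing to match.
multi_string = multi.string

# get_text() concatenates the text of all descendants instead.
multi_text = multi.get_text()

print(repr(single_string), repr(multi_string), repr(multi_text))
```

Rows on a real listing page usually hold several cells, so they behave like the second fragment.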

What you can do instead is check the actual text of the tr elements by using a searching function and calling .get_text():

soup.find_all(lambda tag: tag.name == 'tr' and 'rapidvideo' in tag.get_text())
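To see the function-based filter end to end, here is a small self-contained sketch. The SAMPLE markup, the /link/... paths, and the helper name rows_mentioning are assumptions made for illustration; the real page layout may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the structure of the real page is an assumption.
SAMPLE = """
<table>
  <tr><td>rapidvideo.com</td><td><a href="/link/1">Watch This Link!</a></td></tr>
  <tr><td>openload.co</td><td><a href="/link/2">Watch This Link!</a></td></tr>
  <tr><td>rapidvideo.com</td><td><a href="/link/3">Watch This Link!</a></td></tr>
</table>
"""


def rows_mentioning(soup, needle):
    """Return every tr whose combined descendant text contains needle."""
    return soup.find_all(
        lambda tag: tag.name == "tr" and needle in tag.get_text()
    )


soup = BeautifulSoup(SAMPLE, "html.parser")
rows = rows_mentioning(soup, "rapidvideo")

# Collect the href of every link inside the matching rows.
links = [a["href"] for row in rows for a in row.find_all("a", href=True)]
print(links)
```

Because the filter looks at get_text() rather than .string, it matches rows with any number of cells.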
