Question

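I am trying to extract links from this page: http://www.tadpoletunes.com/tunes/celtic1/ (view-source:http://www.tadpoletunes.com/tunes/celtic1/), but I only want the reels. In the page, the reels section is delimited as follows. It starts with: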

<th align="left"><b><a name="reels">REELS</a></b></th>

and ends just before the following line:

<th align="left"><b><a name="slides">SLIDES</a></b></th>

The question is how to do this. I have the following code, which gets the links for everything with a .mid extension:

import urllib.request
import bs4 as bs

def import_midifiles():
    archive_url = "http://www.tadpoletunes.com/tunes/celtic1/"
    sauce = urllib.request.urlopen("http://www.tadpoletunes.com/tunes/celtic1/celtic.htm").read()
    soup = bs.BeautifulSoup(sauce, 'lxml')
    tables = soup.find_all('table')
    listoflists = []
    for table in tables:
        listofmidis = []  # .mid links found in this table
        for link in table.find_all('a', href=True):
            if link['href'].endswith('.mid'):
                listofmidis.append(archive_url + link['href'])
        if listofmidis:
            listoflists.append(listofmidis)
    # flatten the per-table lists into a single list
    midi_list = [item for sublist in listoflists for item in sublist]
    return midi_list

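I cannot figure this out from the BeautifulSoup docs. I need the code because I will be repeating this on other sites in order to scrape data for training a model.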

Answer 1

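To get all the "REELS" links, you need to do the following:

Get the links between "REELS" and "SLIDES", as you said. To do that, first find the <tr> tag that contains <a name="reels">REELS</a>. This can be done with the .find_parent() method.

reels_tr = soup.find('a', {'name': 'reels'}).find_parent('tr')

Now you can use the .find_next_siblings() method to get all the <tr> tags after "REELS". The loop can break once it reaches the <tr> tag that contains <a name="slides">SLIDES</a> (that is, when .find('a').text == 'SLIDES').

Putting it together, the complete code locates reels_tr, loops over its following <tr> siblings collecting the .mid links, and breaks at the SLIDES row.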
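The steps above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the answerer's original listing: the function name section_midi_links, its start/end parameters, and the SAMPLE_HTML snippet (including the jig and slide file names) are made up for illustration, and it uses the built-in "html.parser" instead of lxml so it has no extra dependency. It also stops on the name="slides" anchor rather than comparing .text, so a row without any <a> tag does not raise an error, and it builds URLs with urljoin instead of string concatenation.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page, trimmed to the structure
# described in the question (section headers are <th> rows in one table).
SAMPLE_HTML = """
<table>
<tr><th align="left"><b><a name="jigs">JIGS</a></b></th></tr>
<tr><td><a href="blarney.mid">Blarney Pilgrim</a></td></tr>
<tr><th align="left"><b><a name="reels">REELS</a></b></th></tr>
<tr><td><a href="ashplant.mid">Ashplant</a></td><td><a href="ashplant.abc">abc</a></td></tr>
<tr><td>no link in this row</td></tr>
<tr><td><a href="bashful.mid">Bashful Bachelor</a></td></tr>
<tr><th align="left"><b><a name="slides">SLIDES</a></b></th></tr>
<tr><td><a href="dingle.mid">Dingle Regatta</a></td></tr>
</table>
"""


def section_midi_links(html, base_url, start="reels", end="slides"):
    """Return absolute .mid links found between two named anchor rows."""
    soup = BeautifulSoup(html, "html.parser")
    # The <tr> that holds the section header, e.g. <a name="reels">.
    start_tr = soup.find("a", {"name": start}).find_parent("tr")
    links = []
    # Walk the rows that follow the header row, in document order.
    for tr in start_tr.find_next_siblings("tr"):
        # Stop at the row that holds the next section's anchor.
        if tr.find("a", {"name": end}) is not None:
            break
        for a in tr.find_all("a", href=True):
            if a["href"].endswith(".mid"):
                links.append(urljoin(base_url, a["href"]))
    return links


reels = section_midi_links(SAMPLE_HTML, "http://www.tadpoletunes.com/tunes/celtic1/")
```

Called on the sample, it returns the two reel links and skips the jig and slide entries. For the live page, the html argument would be the bytes read with urllib.request.urlopen in the question, and other sections or sites could be handled by passing different start and end anchor names, assuming they use the same header-row layout.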

Partial output:

['http://www.tadpoletunes.com/tunes/celtic1/ashplant.mid',
 'http://www.tadpoletunes.com/tunes/celtic1/bashful.mid',
 'http://www.tadpoletunes.com/tunes/celtic1/bigpat.mid',
 'http://www.tadpoletunes.com/tunes/celtic1/birdcage.mid',
 'http://www.tadpoletunes.com/tunes/celtic1/boatstre.mid',
 ...
 ...
 'http://www.tadpoletunes.com/tunes/celtic1/silspear.mid',
 'http://www.tadpoletunes.com/tunes/celtic1/stafreel.mid',
 'http://www.tadpoletunes.com/tunes/celtic1/kilkenny.mid',
 'http://www.tadpoletunes.com/tunes/celtic1/swaltail.mid',
 'http://www.tadpoletunes.com/tunes/celtic1/cuptea.mid']
