Question

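I'm trying to parse the links for genres and sub-genres from the IMDB page http://www.imdb.com/genre/?ref_=nv_ch_gr_3

So far I have been able to parse the main genre tags into something usable with the following code: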

table = soup.find_all("table", {"class": "genre-table"})

for item in table:
    for x in range(100):

        try:
            print(item.contents[x].find_all("h3"))
            print(len(item.contents[x].find_all("h3")))
        except:
            pass

My output is 11 lists, each containing two tags, like this:

[<h3><a href="http://www.imdb.com/genre/action/?ref_=gnr_mn_ac_mp">Action <span class="normal">»</span></a></h3>, <h3><a href="http://www.imdb.com/genre/adventure/?ref_=gnr_mn_ad_mp">Adventure <span class="normal">»</span></a></h3>]
2

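I understand this happens because the containers have a class of "even" or "odd" and each container holds two h3 tags, but I never told it to distinguish between even and odd. I think I'm answering my own question here, but am I right that, because the tags sit inside a container of class odd or even, bs4 puts them in one list just to show that, and it is up to me to separate them?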

A second, more important question:

How do I put each h3 link and title into a data frame that I have set up as

df = pd.DataFrame(columns= ['Genre', 'Sub-Genre', 'Link'])

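I have tried

for y in range(2):
    df.append({'Genre':'item.contents[x].find_all("h3"))[y].text)},     ignore_index = true)

This is of course nested inside the for loop over x (not used on its own), but it doesn't seem to work. Any ideas? Karma coming your way!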

Answer 1

First, there is no need to find all tables, since only the first one is needed:

table = soup.find("table", {'class': 'genre-table'})

And since every other item is redundant (starting with the first one), you can iterate over the table like this:

for item in list(table)[1::2]:
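To see why the [1::2] slice picks out the rows, here is a minimal sketch on hypothetical markup shaped like the table in the question (the HTML string is an assumption, not the real IMDB page). With html.parser, the line breaks between rows become text nodes, so the children of the table alternate between strings and <tr> tags. Filtering on Tag is an alternative that does not depend on that alternation.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup: two rows separated by newlines, as in a typical page source.
html = """<table class="genre-table">
<tr class="odd"><td><h3><a href="/genre/action/">Action <span class="normal">»</span></a></h3></td></tr>
<tr class="even"><td><h3><a href="/genre/comedy/">Comedy <span class="normal">»</span></a></h3></td></tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"class": "genre-table"})

# Iterating over a Tag yields its direct children, text nodes included.
children = list(table)
kinds = [type(child).__name__ for child in children]
print(kinds)

# The slice used in this answer: every second child, starting at index 1.
rows = children[1::2]

# Alternative: keep only the children that are real tags.
tag_rows = [child for child in table.children if isinstance(child, Tag)]

print(len(rows), len(tag_rows))
```

If the page source has no whitespace between rows, the slice would skip real rows, so the Tag filter (or table.find_all("tr")) is the safer choice when the layout is not known.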

After that we can get the 'h3' tags in each row and loop over them:

    row = item.find_all("h3")

    for col in row:

Because each 'h3' element returns the genre in this format: 'Somegenre \xc2\xbb', I remove the span element before getting the text:

        col.span.extract()
        link = col.a['href']
        genre = col.text.strip()
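The effect of removing the span can be checked on a single h3 element. This is a small sketch using one tag copied from the output in the question; the variable names are only for illustration.

```python
from bs4 import BeautifulSoup

# One <h3> element taken from the output shown in the question.
html = (
    '<h3><a href="http://www.imdb.com/genre/action/?ref_=gnr_mn_ac_mp">'
    'Action <span class="normal">»</span></a></h3>'
)
h3 = BeautifulSoup(html, "html.parser").h3

# Before the span is removed, the text still carries the "»" marker.
text_before = h3.text.strip()

# extract() detaches the <span> from the tree, so it no longer contributes text.
h3.span.extract()
text_after = h3.text.strip()

link = h3.a["href"]
print(repr(text_before), repr(text_after), link)
```

Note that extract() changes the parsed tree in place, so the span is gone for any later lookups on the same soup object.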

After that, just add the elements to the data frame by index:

        df.loc[len(df)]=[genre, None, link]
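As a side note on why the attempt in the question had no visible effect: DataFrame.append returned a new data frame instead of changing df, so its result had to be assigned back, and the method was removed in pandas 2.0. Assigning to df.loc[len(df)] changes df in place. A minimal sketch, with hypothetical (genre, link) pairs standing in for the scraped values:

```python
import pandas as pd

df = pd.DataFrame(columns=["Genre", "Sub-Genre", "Link"])

# Hypothetical values standing in for what the scraping loop would produce.
scraped = [
    ("Action", "http://www.imdb.com/genre/action/?ref_=gnr_mn_ac_mp"),
    ("Adventure", "http://www.imdb.com/genre/adventure/?ref_=gnr_mn_ad_mp"),
]

for genre, link in scraped:
    # len(df) is the next unused integer label, so this adds one row each time.
    df.loc[len(df)] = [genre, None, link]

print(df)
```

This relies on the index being the default 0, 1, 2, ... range; with a custom index, len(df) might collide with an existing label and overwrite that row.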

Full code:

import pandas as pd
import requests
from bs4 import BeautifulSoup

df = pd.DataFrame(columns=['Genre', 'Sub-Genre', 'Link'])

req = requests.get('http://www.imdb.com/genre/?ref_=nv_ch_gr_3')
soup = BeautifulSoup(req.content, 'html.parser')

# Only the first genre table is needed.
table = soup.find("table", {'class': 'genre-table'})

# Skip every other child of the table, starting with the first.
for item in list(table)[1::2]:
    row = item.find_all("h3")

    for col in row:
        # Drop the <span> so its marker does not end up in the genre text.
        col.span.extract()
        link = col.a['href']
        genre = col.text.strip()

        # Add a new row at the next integer index.
        df.loc[len(df)] = [genre, None, link]
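For trying this out without a network request, the same steps can be run against a saved or inline HTML string. The sketch below makes a few assumptions: SAMPLE_HTML is a hand-written fragment built from the tags shown in the question, genres_to_frame is a hypothetical helper name, and the rows are collected in a list and turned into a data frame once at the end rather than added one by one.

```python
import pandas as pd
from bs4 import BeautifulSoup

COLUMNS = ["Genre", "Sub-Genre", "Link"]

# Hand-written sample based on the tags shown in the question (not the live page).
SAMPLE_HTML = """<table class="genre-table">
<tr class="odd">
<td><h3><a href="http://www.imdb.com/genre/action/?ref_=gnr_mn_ac_mp">Action <span class="normal">»</span></a></h3></td>
<td><h3><a href="http://www.imdb.com/genre/adventure/?ref_=gnr_mn_ad_mp">Adventure <span class="normal">»</span></a></h3></td>
</tr>
</table>"""


def genres_to_frame(html):
    """Hypothetical helper: build the Genre / Sub-Genre / Link frame from HTML."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"class": "genre-table"})
    records = []
    if table is None:
        # No genre table in this document: return an empty frame.
        return pd.DataFrame(records, columns=COLUMNS)

    # find_all searches all descendants, so whitespace nodes are not an issue.
    for h3 in table.find_all("h3"):
        if h3.a is None:
            continue
        if h3.span is not None:
            h3.span.extract()  # remove the "»" marker before reading the text
        records.append(
            {"Genre": h3.text.strip(), "Sub-Genre": None, "Link": h3.a["href"]}
        )
    return pd.DataFrame(records, columns=COLUMNS)


df = genres_to_frame(SAMPLE_HTML)
print(df)
```

To use it on the real page, pass req.content from the full code above instead of SAMPLE_HTML, assuming the page still contains a table with the genre-table class.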
