Web scraping - Get a tag through the text in its "sibling" tag - Beautiful Soup

Time: 2020-05-15 10:15:31

Tags: beautifulsoup wikipedia

I am trying to get text from a table on Wikipedia, and I will be doing this for many pages (books, in this case). I want to get each book's genre.

HTML code for the page

When the text in the th is "Genre", I need to extract the td that contains the genre.

I did this:

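import urllib.request
from bs4 import BeautifulSoup

# url2 is the address of the Wikipedia page to scrape (assumed to be defined earlier)
page2 = urllib.request.urlopen(url2)

soup2 = BeautifulSoup(page2, 'html.parser')
for table in soup2.find_all('table', class_='infobox vcard'):
    # Only the row at index 5 is read, so this breaks when the row order differs
    for tr in table.find_all('tr')[5:6]:
        for td in tr.find_all('td'):
            print(td.get_text(separator="\n"))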

This gets me the genre, but only on some pages, because the number of rows above the genre differs from page to page.

Example of a page where this does not work:

https://en.wikipedia.org/wiki/The_Catcher_in_the_Rye (table on the right side)

Does anyone know how to find the row by searching for the string "Genre"? Thank you.
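For reference, here is a minimal sketch of what matching on the header text (rather than on a fixed row index) could look like. It assumes each infobox row pairs a th label with a td value in the same tr, which may not hold on every page. `SAMPLE_HTML` is a hypothetical, trimmed-down stand-in for the real page, and `infobox_value` is an illustrative helper name, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down stand-in for a Wikipedia infobox (real pages have more rows).
SAMPLE_HTML = """
<table class="infobox vcard">
  <tr><th scope="row">Author</th><td>J. D. Salinger</td></tr>
  <tr><th scope="row">Genre</th><td><a>Realistic fiction</a><br><a>Coming-of-age fiction</a></td></tr>
  <tr><th scope="row">Pages</th><td>234 (may vary)</td></tr>
</table>
"""


def infobox_value(soup, label):
    """Return the text of the <td> that follows the <th> whose text equals `label`.

    Returns None if no matching header is found. (Illustrative helper.)
    """
    for table in soup.find_all("table", class_="infobox"):
        for th in table.find_all("th"):
            # Compare case-insensitively so "genre" and "Genre" both match.
            if th.get_text(strip=True).lower() == label.lower():
                td = th.find_next_sibling("td")
                if td is not None:
                    return td.get_text(separator="\n", strip=True)
    return None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
genre = infobox_value(soup, "Genre")
print(genre)
```

Because the lookup is keyed on the label text, it should not matter which row the genre ends up in. On a live page, the soup would come from the downloaded HTML instead of `SAMPLE_HTML`.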

1 answer:

Answer 0: (score: 0)

In this particular case, you don't need to go through all of that. Just try:

import pandas as pd
tables = pd.read_html('https://en.wikipedia.org/wiki/The_Catcher_in_the_Rye')
print(tables[0])

Output:

                     0                                       1
0   First edition cover                     First edition cover
1                Author                          J. D. Salinger
2          Cover artist               E. Michael Mitchell[1][2]
3               Country                           United States
4              Language                                 English
5                 Genre  Realistic fictionComing-of-age fiction
6             Published                           July 16, 1951
7             Publisher               Little, Brown and Company
8            Media type                                   Print
9                 Pages                          234 (may vary)
10                 OCLC                                  287628
11        Dewey Decimal                                  813.54

From here, you can use standard pandas methods to extract what you need.
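As a rough sketch of that last step, the genre can be looked up by its label instead of by row position. The frame below is a hand-built stand-in for `tables[0]` with only a few of the rows shown above. It assumes the two columns are labeled 0 and 1 as in that output, which may differ depending on the pandas version and the page.

```python
import pandas as pd

# Hand-built stand-in for tables[0]: two columns labeled 0 and 1, as in the output above.
# Only a few rows are reproduced here.
infobox = pd.DataFrame(
    [
        ["Author", "J. D. Salinger"],
        ["Language", "English"],
        ["Genre", "Realistic fictionComing-of-age fiction"],
        ["Pages", "234 (may vary)"],
    ]
)

# Index the values (column 1) by their labels (column 0),
# then look up by label instead of by row position.
fields = infobox.set_index(0)[1]
genre = fields.get("Genre")  # None if there is no "Genre" row
print(genre)
```

Note that the two genres come back run together in a single string, exactly as in the printed table, so splitting them would need extra handling.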