我试图在Wikipedia的表格中获取文本,但是在很多情况下(在这种情况下是书),我都会这样做。我想了解这本书的类型。
当流派中的文本时,我需要提取包含流派的td。
我这样做了
page2 = urllib.request.urlopen(url2)
soup2 = BeautifulSoup(page2, 'html.parser')
for table in soup2.find_all('table', class_='infobox vcard'):
for tr in table.findAll('tr')[5:6]:
for td in tr.findAll('td'):
print(td.getText(separator="\n"))```
This gets me the genre but only in some pages due to the row count which differs.
Example of page where this does not work
https://en.wikipedia.org/wiki/The_Catcher_in_the_Rye (table on the right side)
Anyone knows how to search through string with "genre"? Thank you
答案 0 :(得分:0)
在这种特殊情况下,您无需理会所有这些事情。只需尝试:
import pandas as pd
tables = pd.read_html('https://en.wikipedia.org/wiki/The_Catcher_in_the_Rye')
print(tables[0])
输出:
0 1
0 First edition cover First edition cover
1 Author J. D. Salinger
2 Cover artist E. Michael Mitchell[1][2]
3 Country United States
4 Language English
5 Genre Realistic fictionComing-of-age fiction
6 Published July 16, 1951
7 Publisher Little, Brown and Company
8 Media type Print
9 Pages 234 (may vary)
10 OCLC 287628
11 Dewey Decimal 813.54
您可以从此处使用标准的pandas方法提取所需的内容。