I build the soup as follows:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import os
import string

# ascii_uppercase already holds exactly the 26 letters, so no slice is needed
for i in string.ascii_uppercase:
    url = "https://myanimelist.net/anime.php?letter={}".format(i)
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
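I am trying to scrape the website https://myanimelist.net into a data frame. As a first step, I want to collect each anime's title, episode count (eps) and type.
Second, from each anime's detail page (a page like https://myanimelist.net/anime/2928/hack__GU_Returner), I want to collect the scores assigned by users, together with the user who gave them, for example: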
<a href="https://myanimelist.net/profile/Tii__">Tii__</a>
and
<table border="0" width="105" cellpadding="0" cellspacing="0" class="borderClass" style="border-width: 1px;">
<tbody><tr>
<td class="borderClass bgColor1"><strong>Overall</strong></td>
<td class="borderClass bgColor1"><strong>10</strong></td>
</tr>
<tr>
<td class="borderClass" align="left">Story</td>
<td class="borderClass">10</td>
</tr>
<tr>
<td class="borderClass" align="left">Animation</td>
<td class="borderClass">9</td>
</tr>
<tr>
<td class="borderClass" align="left">Sound</td>
<td class="borderClass">9</td>
</tr>
<tr>
<td class="borderClass" align="left">Character</td>
<td class="borderClass">9</td>
</tr>
<tr>
<td class="borderClass" style="border-width: 0;" align="left">Enjoyment</td>
<td class="borderClass" style="border-width: 0;">10</td>
</tr>
</tbody></table>
Could you help me collect all of this information?
If my request is unclear, please let me know.
Answer 0 (score: 2)
This can be done directly in pandas with the read_html() function:
import pandas as pd
import string

df = pd.DataFrame()

# Only "A" while testing; drop the [:1] slice to cover every letter
for i in string.ascii_uppercase[:1]:
    url = "https://myanimelist.net/anime.php?letter={}".format(i)
    print(url)
    # read_html returns a list with one DataFrame per table found on the page
    tables = pd.read_html(url, header=0)
    if df.empty:
        df = tables[2]
    else:
        df = pd.concat([df, tables[2]])

print(df)
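read_html() returns a list of all the tables found at the given URL. In your case you only need the table at index 2. This gives you a data frame to start with: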
Unnamed: 0 Title Type Eps. Score
0 NaN A Kite add Sawa is a school girl, an orphan, ... OVA 2 6.67
1 NaN A Piece of Phantasmagoria add A collection of... OVA 15 6.25
2 NaN A Play add Music Video for the group ALT, mad... Music 1 4.62
3 NaN A Smart Experiment add Bonus short included o... Special 1 4.95
4 NaN A-Channel add Tooru and Run have been best fr... TV 12 7.04
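In the output above, each Title cell appears to hold the title, the text of an "add" link, and the start of the synopsis run together. As a rough sketch, assuming every such cell really uses " add " as the separator (an assumption read off this sample output, not something the site guarantees), the title could be separated like this. The helper name split_title is hypothetical.

```python
def split_title(cell):
    """Split a scraped Title cell into (title, synopsis).

    Assumes the cell looks like "<title> add <synopsis>"; a title that
    itself contains " add " would be split too early.
    """
    title, sep, synopsis = cell.partition(' add ')
    if not sep:
        # No separator found: treat the whole cell as the title
        return cell.strip(), ''
    return title.strip(), synopsis.strip()


example = split_title('A Kite add Sawa is a school girl, an orphan, ...')
print(example)
```

On the data frame this could be applied with something like df['Title'].map(lambda cell: split_title(cell)[0]).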
To do this with BeautifulSoup, you could use the following approach:
from bs4 import BeautifulSoup
import pandas as pd
import string
import requests

columns = ['Title', 'Type', 'Eps.', 'Score']
df = pd.DataFrame()

for i in string.ascii_uppercase:
    url = "https://myanimelist.net/anime.php?letter={}".format(i)
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    table = soup.find_all('table')[2]   # the listing is the table at index 2

    for tr in table.find_all('tr')[1:]:   # skip the header row
        row = [td.get_text(strip=True) for td in tr.find_all('td')[1:5]]
        url_sub = tr.find('a')['href']    # link to the title's detail page
        print(url_sub)
        r_sub = requests.get(url_sub)
        soup_sub = BeautifulSoup(r_sub.text, 'html.parser')
        all_scores = []     # each title has multiple lists of scores

        # Select all of the user assigned score tables
        for div in soup_sub.select('div.spaceit.textReadability.word-break.pt8.mt8'):
            scores = []     # scores for one block
            for tr_sub in div.div.table.find_all('tr'):
                scores.append([td_sub.text for td_sub in tr_sub.find_all('td')])
            all_scores.append(scores)

        print(all_scores)   # These probably need adding to the row. Not all have scores.
        df_row = pd.DataFrame([row], columns=columns)

        if df.empty:
            df = df_row
        else:
            df = pd.concat([df, df_row])

print(df)
For each title, a list of all the scores found is built and appended to all_scores, but it is not clear how these should be added to the main data frame.
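The inner loop can be tried offline on the score table quoted in the question. The following is only a sketch: the helper name parse_score_table is hypothetical, the HTML is a trimmed copy of the question's fragment, and it assumes every row holds exactly a label cell and a numeric cell.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the score table quoted in the question
SAMPLE_HTML = """
<table border="0" width="105" class="borderClass">
<tbody><tr>
<td class="borderClass bgColor1"><strong>Overall</strong></td>
<td class="borderClass bgColor1"><strong>10</strong></td>
</tr>
<tr>
<td class="borderClass" align="left">Story</td>
<td class="borderClass">10</td>
</tr>
<tr>
<td class="borderClass" align="left">Animation</td>
<td class="borderClass">9</td>
</tr>
</tbody></table>
"""


def parse_score_table(table):
    """Return {label: score} for one user-assigned score table."""
    scores = {}
    for tr in table.find_all('tr'):
        cells = [td.get_text(strip=True) for td in tr.find_all('td')]
        # Keep only well-formed "label, number" rows
        if len(cells) == 2 and cells[1].isdigit():
            scores[cells[0]] = int(cells[1])
    return scores


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
result = parse_score_table(soup.find('table'))
print(result)
```

Returning a dict keyed by label, rather than a list of pairs, makes each block easier to turn into named columns later.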
For example, all_scores for one title might look like this:
https://myanimelist.net/anime/320/A_Kite
[[[u'Overall', u'8'], [u'Story', u'8'], [u'Animation', u'7'], [u'Sound', u'7'], [u'Character', u'7'], [u'Enjoyment', u'8']], [[u'Overall', u'8'], [u'Story', u'8'], [u'Animation', u'10'], [u'Sound', u'0'], [u'Character', u'7'], [u'Enjoyment', u'10']], [[u'Overall', u'7'], [u'Story', u'7'], [u'Animation', u'8'], [u'Sound', u'6'], [u'Character', u'7'], [u'Enjoyment', u'8']], [[u'Overall', u'2'], [u'Story', u'2'], [u'Animation', u'2'], [u'Sound', u'2'], [u'Character', u'2'], [u'Enjoyment', u'2']]]
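One possible way to connect these scores to the main data frame is to flatten them into one record per review, keyed by the title's URL. This is a sketch under assumptions, not the answer's method: flatten_scores and the 'url' and 'review' keys are hypothetical names, and the sample holds the first and last score blocks from the list above.

```python
def flatten_scores(url, all_scores):
    """Turn the nested all_scores list into one flat dict per review."""
    rows = []
    for review_no, scores in enumerate(all_scores):
        row = {'url': url, 'review': review_no}
        for pair in scores:
            if len(pair) != 2:
                continue  # skip malformed rows
            label, value = pair
            # Scores arrive as text; store them as integers where possible
            row[label] = int(value) if value.isdigit() else None
        rows.append(row)
    return rows


sample_scores = [
    [['Overall', '8'], ['Story', '8'], ['Animation', '7'],
     ['Sound', '7'], ['Character', '7'], ['Enjoyment', '8']],
    [['Overall', '2'], ['Story', '2'], ['Animation', '2'],
     ['Sound', '2'], ['Character', '2'], ['Enjoyment', '2']],
]

rows = flatten_scores('https://myanimelist.net/anime/320/A_Kite', sample_scores)
for row in rows:
    print(row)
```

Passing the accumulated rows to pd.DataFrame(rows) would give a second, long-format frame with one row per review. It could then be joined to the main frame, provided url_sub is also stored there as a column, which the code above does not yet do.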