I'm new to BeautifulSoup. I'm trying to parse an HTML page that I fetch with requests. The code I have so far is:
import requests
from bs4 import BeautifulSoup

link = "SOME_URL"
f = requests.get(link)
soup = BeautifulSoup(f.text, 'html.parser')
# print every table cell that carries the g-res-tab-cell class
for el in soup.find_all("td", {"class": "g-res-tab-cell"}):
    print(el)
The output is:
<td class="g-res-tab-cell">
<div style="padding:8px;">
<div style="padding-top:8px;">
<table cellspacing="0" cellpadding="0" border="0" style="width:100%;">
<tr>
<td valign="top">
<div itemscope itemtype="URL">
<table cellspacing="0" cellpadding="0" style="width:100%;">
<tr>
<td valign="top" class="g-res-tab-cell" style="width:100%;">
<div style="width:100%;padding-left:4px;">
<div class="subtext_view_med" itemprop="name">
<a href="NAME1-URL" itemprop="url">NAME1</a>
</div>
<div style="direction:ltr;padding-left:5px;margin-bottom:2px;" class="smtext">
<span class="Gray">In English:</span> ENGLISH_NAME1
</div>
<div style="padding-bottom:2px;padding-top:8px;font-size:14px;text-align:justify;min-height:158px;" itemprop="description">DESCRIPTION1</div>
</div>
</td>
</tr>
</table>
</div>
</td>
</tr>
</table>
<table cellspacing="0" cellpadding="0" border="0" style="width:100%;">
<tr>
<td valign="top">
<div itemscope itemtype="URL">
<table cellspacing="0" cellpadding="0" style="width:100%;">
<tr>
<td valign="top" class="g-res-tab-cell" style="width:100%;">
<div style="width:100%;padding-left:4px;">
<div class="subtext_view_med" itemprop="name">
<a href="NAME2-URL" itemprop="url">NAME2</a>
</div>
<div style="direction:ltr;padding-left:5px;margin-bottom:2px;" class="smtext">
<span class="Gray">In English:</span> ENGLISH_NAME2
</div>
</div>
<div style="padding-bottom:2px;padding-top:8px;font-size:14px;text-align:justify;min-height:158px;" itemprop="description">DESCRIPTION2</div>
</td>
</tr>
</table>
</div>
</td>
</tr>
</table>
</div>
</div>
</td>
Now I'm stuck. I'm trying to parse NAME, DESCRIPTION, and ENGLISH_NAME for each block. I want to print each of them, so that the output is:
name = NAME1
en_name = ENGLISH_NAME1
description = DESCRIPTION1
name = NAME2
en_name = ENGLISH_NAME2
description = DESCRIPTION2
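I tried reading the documentation, but I couldn't find how to handle nested elements, especially ones that have no class or id. As I understand it, each block starts with <table cellspacing="0" cellpadding="0" border="0" style="width:100%;">. In each block I should find the a tag with itemprop="url" and get NAME from it, then get en_name from the text next to <span class="Gray">In English:</span>, and description from the element with itemprop="description". But I feel that BeautifulSoup can't do this (or at least makes it hard). How can I solve this?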
Answer 0 (score: 0)
You can use soup.find_all to iterate over each td with the class g-res-tab-cell:
from bs4 import BeautifulSoup as soup
# content is assumed to hold the HTML printed in the question
d = soup(content, 'html.parser').td.find_all('td', {'class': 'g-res-tab-cell'})
# span.next_sibling is the text node that follows the "In English:" label
results = [[i.find('div', {'class': 'subtext_view_med'}).a.text,
            i.find('div', {'class': 'smtext'}).span.next_sibling.strip(),
            i.find('div', {'itemprop': 'description'}).text] for i in d]
Output:
[['NAME1', 'ENGLISH_NAME1', 'DESCRIPTION1'], ['NAME2', 'ENGLISH_NAME2', 'DESCRIPTION2']]
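To get the exact "name = ..." format asked for in the question, the same per-cell lookups can be wrapped in a small function. This is only a minimal sketch, assuming the real page nests its cells the way the sample in the question does; SAMPLE_HTML and extract_movies are names made up here for illustration, and the markup is a trimmed copy of the question's output.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question (assumption: the live page
# nests the inner cells inside an outer g-res-tab-cell in the same way).
SAMPLE_HTML = """
<table><tr><td class="g-res-tab-cell">
  <table><tr><td valign="top" class="g-res-tab-cell">
    <div class="subtext_view_med" itemprop="name">
      <a href="NAME1-URL" itemprop="url">NAME1</a>
    </div>
    <div class="smtext"><span class="Gray">In English:</span> ENGLISH_NAME1
    </div>
    <div itemprop="description">DESCRIPTION1</div>
  </td></tr></table>
  <table><tr><td valign="top" class="g-res-tab-cell">
    <div class="subtext_view_med" itemprop="name">
      <a href="NAME2-URL" itemprop="url">NAME2</a>
    </div>
    <div class="smtext"><span class="Gray">In English:</span> ENGLISH_NAME2
    </div>
    <div itemprop="description">DESCRIPTION2</div>
  </td></tr></table>
</td></tr></table>
"""


def extract_movies(html):
    """Hypothetical helper: return one dict per inner result cell."""
    soup = BeautifulSoup(html, "html.parser")
    movies = []
    # Inner cells are g-res-tab-cell tds nested inside the outer one.
    for cell in soup.select("td.g-res-tab-cell td.g-res-tab-cell"):
        label = cell.find("span", class_="Gray")
        movies.append({
            "name": cell.find("a", itemprop="url").get_text(strip=True),
            # The English name is the bare text node right after the label.
            "en_name": label.next_sibling.strip(),
            "description": cell.find(
                attrs={"itemprop": "description"}
            ).get_text(strip=True),
        })
    return movies


for movie in extract_movies(SAMPLE_HTML):
    for key in ("name", "en_name", "description"):
        print(f"{key} = {movie[key]}")
```

The lookups assume every cell has all three elements; on a page where some are missing, each find result would need a None check first.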
Edit: working from the link:
import requests
from bs4 import BeautifulSoup as soup
d = soup(requests.get('https://www.sratim.co.il/browsenewmovies.php?page=1').text, 'html.parser')
movies = d.find_all('div', {'itemtype':'http://schema.org/Movie'})
result = [[getattr(i.find('a', {'itemprop': 'url'}), 'text', 'N/A'),
           getattr(i.find('div', {'class': 'smtext'}), 'text', 'N/A'),
           getattr(i.find('div', {'itemprop': 'description'}), 'text', 'N/A')] for i in movies]
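The getattr(..., 'text', 'N/A') calls matter because find returns None when nothing matches, and None has no text attribute. A small illustration of that fallback, using made-up markup in which the second movie has no description; HTML and text_or_default are hypothetical names, not part of the page or of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Made-up markup: the second movie deliberately lacks a description.
HTML = """
<div itemtype="http://schema.org/Movie">
  <a itemprop="url" href="#">NAME1</a>
  <div itemprop="description">DESCRIPTION1</div>
</div>
<div itemtype="http://schema.org/Movie">
  <a itemprop="url" href="#">NAME2</a>
</div>
"""


def text_or_default(tag, default="N/A"):
    # find() returns None on a miss; getattr then falls back to the default.
    return getattr(tag, "text", default)


soup = BeautifulSoup(HTML, "html.parser")
movies = soup.find_all("div", {"itemtype": "http://schema.org/Movie"})
rows = [
    [
        text_or_default(m.find("a", {"itemprop": "url"})),
        text_or_default(m.find("div", {"itemprop": "description"})),
    ]
    for m in movies
]
print(rows)
```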
Answer 1 (score: 0)
Here is another way. Since every movie has this information, you should end up with a fully populated result set.
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd
r = requests.get('https://www.sratim.co.il/browsenewmovies.php?page=1')
soup = bs(r.content, 'lxml')
names = [item.text for item in soup.select('[itemprop=url]')] #32
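# next_sibling is the text node after the "In English:" span; newer soupsieve releases spell :contains as :-soup-contains
english_names = [item.next_sibling.strip() for item in soup.select('.smtext:contains("In English: ") span')]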
descriptions = [item.text for item in soup.select('[itemprop=description]')]
results = list(zip(names, english_names, descriptions))
df = pd.DataFrame(results, columns = ['Name', 'English_Name', 'Description'])
print(df)
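One thing to keep in mind with three separate lists: zip pairs items purely by position, so if a movie ever lacked one of the fields, the later rows would shift. A hedged variant that walks each movie container and uses select_one, assuming the containers carry itemtype="http://schema.org/Movie" as in the other answer; the sample markup and parse_movies are made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up markup: the second movie deliberately lacks an English name.
HTML = """
<div itemscope itemtype="http://schema.org/Movie">
  <a itemprop="url" href="#">NAME1</a>
  <div class="smtext"><span class="Gray">In English:</span> ENGLISH_NAME1</div>
  <div itemprop="description">DESCRIPTION1</div>
</div>
<div itemscope itemtype="http://schema.org/Movie">
  <a itemprop="url" href="#">NAME2</a>
  <div itemprop="description">DESCRIPTION2</div>
</div>
"""


def parse_movies(html):
    """Hypothetical helper: one (name, en_name, description) tuple per movie."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for movie in soup.select('[itemtype="http://schema.org/Movie"]'):
        name = movie.select_one("[itemprop=url]")
        label = movie.select_one(".smtext span")
        desc = movie.select_one("[itemprop=description]")
        # The English name is the text node following the label span, if any.
        sibling = label.next_sibling if label else None
        rows.append((
            name.get_text(strip=True) if name else None,
            sibling.strip() if isinstance(sibling, str) else None,
            desc.get_text(strip=True) if desc else None,
        ))
    return rows


rows = parse_movies(HTML)
print(rows)
```

The resulting list of tuples can be passed to pd.DataFrame with the same columns argument as above; a missing field then shows up as an empty cell in its own row instead of misaligning the rest.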