Classifying information from HTML tags

Date: 2017-10-29 19:46:56

Tags: python html web-scraping beautifulsoup

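My question could probably have a better title, but here it is: I am using BeautifulSoup and findAll to return elements from HTML into a list. Below is a sample of what I get: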

[<div class="tightLt col span-1-3">
    <div class="middle">
        <div class="cell"><i class="sqLed middle sm yellow margRtXs "></i></div>
        <div class="cell"><span class="middle">Neutral Outlook</span></div>
    </div>
</div>,
<div class="tightLt col span-1-3">
    <div class="middle">
        <div class="cell"><i class="sqLed middle sm yellow margRtXs "></i></div>
        <div class="cell"><span class="middle"><span class="showDesk">No opinion of</span> CEO</span>
        </div>
    </div>
</div>]
[<div class="tightLt col span-1-3">
    <div class="middle">
        <div class="cell"><i class="sqLed middle sm red margRtXs "></i></div>
        <div class="cell"><span class="middle">Doesn't Recommend</span></div>
    </div>
</div>,
<div class="tightLt col span-1-3">
    <div class="middle">
        <div class="cell"><i class="sqLed middle sm red margRtXs "></i></div>
        <div class="cell"><span class="middle">Negative Outlook</span></div>
    </div>
</div>,
<div class="tightLt col span-1-3">
    <div class="middle">
        <div class="cell"><i class="sqLed middle sm yellow margRtXs "></i></div>
        <div class="cell"><span class="middle"><span class="showDesk">No opinion of</span> CEO</span>
        </div>
    </div>
</div>]

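The problem is that in the first HTML sample the CEO approval entry is the second element in the list, but in the second sample it is the third. (In both samples the value, found inside the "span" tag, is "No opinion of CEO", but it could also be "Disapproves of CEO" or "Approves of CEO".) So I cannot pick the element out of the list by index. How can I solve this?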

Here is the part of the code that returns the list:

from bs4 import BeautifulSoup
import requests

url = "https://www.glassdoor.com/Reviews/Walmart-Reviews-E715.htm"
# Pass the response body, not the Response object, to BeautifulSoup
html_content = requests.get(url).content
soup = BeautifulSoup(html_content, "lxml")
reviews = soup.find("div", id="EmployerReviews").find_all("li", class_="empReview")
for review in reviews:
    # Search within the current review rather than the whole page
    x = review.find_all("div", class_="cell reviewBodyCell")
    for z in x:
        z.find_all("div", class_="tightLt col span-1-3")  # returns the list that contains the needed information

1 Answer:

Answer 0 (score: 1)

An extended and optimized solution using BeautifulSoup CSS selectors:

from bs4 import BeautifulSoup
import requests

url = "https://www.glassdoor.com/Reviews/Walmart-Reviews-E715.htm"
html_content = requests.get(url, headers={'user-agent': 'Mozilla/5.0'}).content
soup = BeautifulSoup(html_content, "lxml")

selector = "div#EmployerReviews li.empReview div.cell.reviewBodyCell span[class='showDesk']"
for x in soup.select(selector):
    print(x.parent.text)

Output:

No opinion of CEO
No opinion of CEO
No opinion of CEO
No opinion of CEO
Approves of CEO
No opinion of CEO
Approves of CEO
Approves of CEO
Approves of CEO
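
The same idea can be tried offline against the two sample lists from the question. The sketch below is only an illustration: the helper name ceo_opinion and the constants REVIEW_A and REVIEW_B are made up here, the markup is copied from the question, and "html.parser" is used instead of "lxml" purely to avoid an extra dependency. It assumes, as the output above suggests, that the CEO entry is the one containing a span with class "showDesk".

```python
from bs4 import BeautifulSoup

# Sample markup from the question: the CEO entry is the 2nd item here...
REVIEW_A = """
<div class="tightLt col span-1-3"><div class="middle">
  <div class="cell"><i class="sqLed middle sm yellow margRtXs "></i></div>
  <div class="cell"><span class="middle">Neutral Outlook</span></div>
</div></div>
<div class="tightLt col span-1-3"><div class="middle">
  <div class="cell"><i class="sqLed middle sm yellow margRtXs "></i></div>
  <div class="cell"><span class="middle"><span class="showDesk">No opinion of</span> CEO</span></div>
</div></div>
"""

# ...and the 3rd item here, so a fixed list index cannot work.
REVIEW_B = """
<div class="tightLt col span-1-3"><div class="middle">
  <div class="cell"><i class="sqLed middle sm red margRtXs "></i></div>
  <div class="cell"><span class="middle">Doesn't Recommend</span></div>
</div></div>
<div class="tightLt col span-1-3"><div class="middle">
  <div class="cell"><i class="sqLed middle sm red margRtXs "></i></div>
  <div class="cell"><span class="middle">Negative Outlook</span></div>
</div></div>
<div class="tightLt col span-1-3"><div class="middle">
  <div class="cell"><i class="sqLed middle sm yellow margRtXs "></i></div>
  <div class="cell"><span class="middle"><span class="showDesk">No opinion of</span> CEO</span></div>
</div></div>
"""


def ceo_opinion(review_html):
    """Return the CEO-approval text of one review, or None if it is absent.

    Hypothetical helper: it locates the entry by its inner "showDesk" span
    instead of by its position in the list.
    """
    soup = BeautifulSoup(review_html, "html.parser")
    marker = soup.select_one("div.tightLt span.showDesk")
    if marker is None:
        return None
    # The parent span holds the full phrase; collapse any stray whitespace.
    return " ".join(marker.parent.get_text().split())


if __name__ == "__main__":
    for html in (REVIEW_A, REVIEW_B):
        print(ceo_opinion(html))
```

Because the lookup keys on the content of the entry rather than its index, it returns the CEO text for both samples, and None for a review that has no such entry.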