Extracting text from a div tag using BeautifulSoup 4 in Python

Time: 2017-08-13 16:35:24

Tags: python html parsing beautifulsoup

I am trying to extract text from a div tag using BeautifulSoup 4 and Python. The following HTML is stored in a file (example.html).

My HTML:

<table class="NZX1058422900" cols="20" style="border-collapse: collapse; width: 1496px;" cellspacing="0" cellpadding="0" border="0">
<tbody>
<td class="A10dbmytr2499b">
<div class="VWP1058422499" alt="Total Cases: 5 - Level 1, Level 2, or On Hold 2 - Completed" title="Total Cases: 5 - Level 1, Level 2, On Hold 2 - Completed">5/2</div>
</td>
</tbody>
</table>

I want the output to look like below:
Total Cases: 
5 - Level 1, Level 2, or On Hold
2 - Completed

My code so far is:

from bs4 import BeautifulSoup
openFile = open("C:\\example.html")
readFile = openFile.read()
soup = BeautifulSoup(readFile, "lxml")

I tried the following code without any success:

soup.find("div", class_="VWP1058422499")

Can anyone help me with how to extract the data above?

2 answers:

Answer 0 (score: 1)

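# The text you want is in the div's alt attribute, not in the tag's text ("5/2")
alt = soup.find("div", {"class": "VWP1058422499"}).get("alt")
print(alt)  # .get() returns a plain str, which has no .text attribute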
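
To make the difference between a tag's text and its attributes concrete, here is a minimal sketch. It assumes the markup is pasted inline rather than read from example.html, trims the attributes that do not matter here, adds the `<tr>` that the posted snippet lacks, and uses the built-in "html.parser" instead of "lxml" so that nothing beyond bs4 is needed. The variable names `visible_text` and `alt_text` are illustrative only.

```python
from bs4 import BeautifulSoup

# Inline copy of the relevant markup so the sketch runs without example.html
html = """
<table><tbody><tr><td class="A10dbmytr2499b">
<div class="VWP1058422499" alt="Total Cases: 5 - Level 1, Level 2, or On Hold 2 - Completed">5/2</div>
</td></tr></tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", class_="VWP1058422499")

visible_text = div.get_text()  # the text between the tags
alt_text = div.get("alt")      # the attribute value as a plain str (None if absent)

print(visible_text)
print(alt_text)
```

So the `soup.find(...)` call from the question does locate the tag. What is missing is the extra step of reading the `alt` attribute from it.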

Answer 1 (score: 0)

Expanding on @so1989's answer: since you also want to know how to print the result in the format you specified, I suggest this approach:

from bs4 import BeautifulSoup

with open("C:\\example.html") as f:
    soup = BeautifulSoup(f.read(), "lxml")

# Split the alt attribute into whitespace-separated words
words = soup.find("div", {"class": "VWP1058422499"}).get("alt").split()

for i, word in enumerate(words):
    if word == '-' and i > 0:
        # The count comes right before each dash, so start a new line there
        words[i-1] = '\n' + words[i-1]

# Rejoin with single spaces and drop the space left before each newline
alt = ' '.join(words).replace(' \n', '\n')
print(alt)
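
As an alternative to walking the words one by one, the same line breaks can be sketched with a regular expression. This assumes every entry in the attribute has the shape "<count> - <label>", as in the question's example. The helper name `format_cases` is hypothetical and the attribute string is hard-coded so the snippet runs on its own.

```python
import re


def format_cases(alt):
    """Start a new line before every '<number> - ' group in the alt text."""
    # The lookahead leaves the number and dash in place; only the
    # whitespace in front of them is replaced by a newline.
    return re.sub(r"\s+(?=\d+ - )", "\n", alt)


alt = "Total Cases: 5 - Level 1, Level 2, or On Hold 2 - Completed"
print(format_cases(alt))
```

The lookahead only matches digits followed by " - ", so the "1," and "2," inside "Level 1, Level 2," do not trigger a line break.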