This is part of the sample test.html file:
<html>
<body>
<div>
...
...
<table class="width-max">
<tr>
<td style="max-width: 300px; min-width:300px;">
<a href="nowhere.com">
<h2>
<b>
<font size="3">
My College
</font>
</b>
</h2>
</a>
<h4>
<font size="2">
My Name
</font>
<br/>
</h4>
My Address
<br/>
My City, XY 19604
<br/>
My Country
<br/>
<br/>
Email:
<a href="javascript:NewWindow=window.open('nowhere.com;email=example@nowhere.edu','NewWindow','width=600,height=600,menubar=0');NewWindow.focus()">
example@nowhere.edu
</a>
<br/>
Website:
<a href="http://www.nowhere.edu" target="newwindow">
http://www.nowhere.edu
</a>
<br/>
<br/>
<br/>
</td>
...
...
</table>
<hr/>
<table class="width-max">
<tr>
<td style="max-width: 300px; min-width:300px;">
<a href="nowhere.com">
<h2>
<b>
<font size="3">
His College
</font>
</b>
</h2>
</a>
<h4>
<font size="2">
His name
</font>
<br/>
</h4>
His Address
<br/>
His City, YX 49506
<br/>
His Country
<br/>
<br/>
Phone: XX-YY-ZZ
<br/>
Email:
<a href="javascript:NewWindow=window.open('nowhere.com;email=example@nowhere2.edu','NewWindow','width=600,height=600,menubar=0');NewWindow.focus()">
example@nowhere2.edu
</a>
<br/>
Website:
<a href="http://nowhere2.edu/" target="newwindow">
http://nowhere2.edu
</a>
<br/>
<br/>
...
...
</table>
...
...
</div>
</body>
</html>
The output I want:
My College
My Name
My Address
My City, XY 19604
My Country
Email:
example@nowhere.edu
Website:
http://www.nowhere.edu
His College
His Name
His Address
His City, YX 49506
His Country
Phone: XX-YY-ZZ
Email:
example@nowhere2.edu
Website:
http://www.nowhere2.edu
At first I tried:
from bs4 import BeautifulSoup

with open('test.html', 'r') as html_file:
    soup = BeautifulSoup(html_file, 'lxml')
    tables = soup.find_all('table', class_='width-max')
    for table in tables:
        print(table.get_text())
It prints the text on separate lines, but it also produces a lot of blank lines and white spaces:
My College
My Name
...
Then I tried:
from bs4 import BeautifulSoup

with open('test.html', 'r') as html_file:
    soup = BeautifulSoup(html_file, 'lxml')
    tables = soup.find_all('table', class_='width-max')
    for table in tables:
        texts = ' '.join(table.text.split())
        print(texts)
This removes the blank lines and white spaces, but it merges all the text into a single line:
My College My Name My Address ... ... http://www.nowhere2.edu
Finally, I tried using strip() and stripped_strings, and I also tried replacing <br> with \n using the replace_with() method. But I still could not print the exact output I want.
Answer 0 (score: 0)
Try joining with a newline instead of a space:
from bs4 import BeautifulSoup

with open('test.html', 'r') as html_file:
    soup = BeautifulSoup(html_file, 'lxml')
    tables = soup.find_all('table', class_='width-max')
    for table in tables:
        texts = '\n'.join(table.text.split())
        print(texts)
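Edit: the previous snippet splits on every whitespace character, so a line that contains several words is broken into one word per line. Try this instead: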
from bs4 import BeautifulSoup

with open('test.html', 'r') as html_file:
    soup = BeautifulSoup(html_file, 'lxml')
    tables = soup.find_all('table', class_='width-max')
    for table in tables:
        # Skip tables whose text is nothing but whitespace.
        if not table.get_text().isspace():
            # Keep only the non-blank lines and strip their indentation.
            text = '\n'.join(
                line.strip() for line in table.get_text().splitlines() if line.strip()
            )
            print(text)
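A related option, which is not part of this answer, is Beautiful Soup's stripped_strings generator: it yields every text node with the surrounding whitespace removed and skips nodes that are whitespace only. The sketch below is a minimal illustration under stated assumptions: SAMPLE_HTML is a hypothetical, shortened stand-in for one table of test.html, table_lines is a helper name made up for this example, and the built-in 'html.parser' is used instead of 'lxml' so that no extra parser is required.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for one table from test.html.
SAMPLE_HTML = """
<table class="width-max">
<tr>
<td>
<a href="nowhere.com"><h2><b><font size="3">
My College
</font></b></h2></a>
My Address
<br/>
Email:
<a href="#">
example@nowhere.edu
</a>
</td>
</tr>
</table>
"""


def table_lines(html):
    """Return the stripped text nodes of every 'width-max' table, in document order."""
    soup = BeautifulSoup(html, "html.parser")
    lines = []
    for table in soup.find_all("table", class_="width-max"):
        # stripped_strings is a generator attribute, so it is used without parentheses.
        lines.extend(table.stripped_strings)
    return lines


print("\n".join(table_lines(SAMPLE_HTML)))
```

On the real file, the literal "..." placeholders are text nodes as well, so they would still appear in the result and would need to be filtered out separately.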
Answer 1 (score: 0)
Just change your print statement and add a newline to it:
print('\n' + texts)
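Answer 2 (score: 0)
You need to clean up the value returned by table.get_text() so that each line is printed in turn.
With two regular expressions, you can do it like this: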
from bs4 import BeautifulSoup
import re

with open('test.html', 'r') as html_file:
    soup = BeautifulSoup(html_file, 'lxml')
    tables = soup.find_all('table', class_='width-max')
    for table in tables:
        print(re.sub(r"(\n)+", r"\n", re.sub(r" {3,}", "", table.get_text().replace('...', ''))), end="")
This outputs:
My College
My Name
My Address
My City, XY 19604
My Country
Email:
example@nowhere.edu
Website:
http://www.nowhere.edu
His College
His name
His Address
His City, YX 49506
His Country
Phone: XX-YY-ZZ
Email:
example@nowhere2.edu
Website:
http://nowhere2.edu
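The first regular expression, " {3,}", removes every run of three or more consecutive spaces. The second, "(\n)+" with the replacement "\n", collapses each run of consecutive newlines into a single \n, so that print outputs the data line by line.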
Also, to match your expected output, get_text().replace('...', '') was added to remove the ... from the text.
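To make the three steps easier to follow, here is a small sketch that applies them one at a time to a plain string, without Beautiful Soup. The helper name clean_text and the sample string are assumptions made up for this illustration; the patterns are the ones used in the one-liner.

```python
import re


def clean_text(raw):
    """Apply the same three cleaning steps as the one-liner, one at a time."""
    no_dots = raw.replace('...', '')            # drop the literal "..." placeholders
    no_indent = re.sub(r" {3,}", "", no_dots)   # delete runs of 3 or more spaces
    return re.sub(r"(\n)+", r"\n", no_indent)   # collapse repeated newlines into one


# Hypothetical sample resembling indented get_text() output.
sample = "\n\n      My College\n\n\n      My Name\n...\n...\n"
print(repr(clean_text(sample)))
# → '\nMy College\nMy Name\n'
```

The single space inside "My College" survives because the first pattern only matches three or more spaces in a row; indentation of one or two spaces would survive too.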