Question

This is part of a sample test.html file:

<html>
<body>
<div>
...
...
<table class="width-max">
            <tr>
             <td style="max-width: 300px; min-width:300px;">
              <a href="nowhere.com">
               <h2>
                <b>
                 <font size="3">
                  My College
                 </font>
                </b>
               </h2>
              </a>
              <h4>
               <font size="2">
                My Name
               </font>
               <br/>
              </h4>
              My Address
              <br/>
              My City, XY 19604
              <br/>
              My Country
              <br/>
              <br/>
              Email:
              <a href="javascript:NewWindow=window.open('nowhere.com;email=example@nowhere.edu','NewWindow','width=600,height=600,menubar=0');NewWindow.focus()">
               example@nowhere.edu
              </a>
              <br/>
              Website:
              <a href="http://www.nowhere.edu" target="newwindow">
               http://www.nowhere.edu
              </a>
              <br/>
              <br/>
              <br/>
             </td>
              ...
              ...
</table>
<hr/>
<table class="width-max">
            <tr>
             <td style="max-width: 300px; min-width:300px;">
              <a href="nowhere.com">
               <h2>
                <b>
                 <font size="3">
                  His College
                 </font>
                </b>
               </h2>
              </a>
              <h4>
               <font size="2">
                His name
               </font>
               <br/>
              </h4>
              His Address
              <br/>
              His City, YX 49506
              <br/>
              His Country
              <br/>
              <br/>
              Phone: XX-YY-ZZ
              <br/>
              Email:
              <a href="javascript:NewWindow=window.open('nowhere.com;email=example@nowhere2.edu','NewWindow','width=600,height=600,menubar=0');NewWindow.focus()">
               example@nowhere2.edu
              </a>
              <br/>
              Website:
              <a href="http://nowhere2.edu/" target="newwindow">
               http://nowhere2.edu
              </a>
              <br/>
              <br/>
              ...
              ...
</table>
...
...
</div>
</body>
</html>

The output I want:

My College
My Name
My Address
My City, XY 19604
My Country
Email:
example@nowhere.edu
Website:
http://www.nowhere.edu

His College
His Name
His Address
His City, YX 49506
His Country
Phone: XX-YY-ZZ
Email:
example@nowhere2.edu
Website:
http://www.nowhere2.edu

At first I tried:

from bs4 import BeautifulSoup

with open('test.html', 'r') as html_file:
    soup = BeautifulSoup(html_file, 'lxml')

    tables = soup.find_all('table', class_='width-max')

    for table in tables:
        print(table.get_text())

This prints the text on separate lines, but with a lot of blank lines and whitespace:



         My College

      My Name
...

Then I tried:

from bs4 import BeautifulSoup

with open('test.html', 'r') as html_file:
    soup = BeautifulSoup(html_file, 'lxml')
    tables = soup.find_all('table', class_='width-max')

    for table in tables:
        texts = ' '.join(table.text.split())
        print(texts)

This removes the blank lines and whitespace, but merges all the text into a single line:

My College My Name My Address ... ... http://www.nowhere2.edu

Finally, I tried strip() and stripped_strings, and I also tried replacing each <br> with \n using replace_with(). I still could not get the exact output I want.

Answer 1

Try joining with a newline instead of a space:

from bs4 import BeautifulSoup
with open('test.html', 'r') as html_file:
    soup = BeautifulSoup(html_file, 'lxml')
    tables = soup.find_all('table', class_='width-max')
    for table in tables:
        texts = '\n'.join(table.text.split())
        print(texts)

Edit: the previous snippet breaks multi-word lines into one word per line. Try this instead:

import os

from bs4 import BeautifulSoup

with open('test.html', 'r') as html_file:
    soup = BeautifulSoup(html_file, 'lxml')
    tables = soup.find_all('table', class_='width-max')
    for table in tables:
        raw_text = table.get_text()
        # Skip tables that contain only whitespace
        if not raw_text.isspace():
            # Strip each line and drop the ones that end up empty
            lines = [l.strip() for l in raw_text.splitlines() if l.strip()]
            print(os.linesep.join(lines))
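
To see what this kind of line filtering does without needing test.html, here is a minimal sketch that works on a plain string. The helper name clean_lines and the sample text are made up for illustration; the sample only imitates the sort of indented text that table.get_text() returns for the question's HTML.

```python
def clean_lines(raw_text):
    """Return raw_text with every line stripped and blank lines dropped."""
    lines = (line.strip() for line in raw_text.splitlines())
    return "\n".join(line for line in lines if line)


# Hypothetical sample imitating table.get_text() output for the question's HTML.
sample = "\n\n\n         My College\n\n      My Name\n   \n     My City, XY 19604\n"

print(clean_lines(sample))
```

Inside the loop over the tables, the same helper could be called as print(clean_lines(table.get_text())).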

Answer 2

Just change your print statement and add a newline to it:

print('\n' + texts)
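
For context, the '\n' here ends up in front of each table's text. If each string should also land on its own line, one option (a sketch, not part of the answer above) is to let get_text() insert the newlines through its separator and strip arguments. The inline HTML fragment is a shortened stand-in for the question's file, and html.parser is used only so the sketch does not depend on lxml being installed.

```python
from bs4 import BeautifulSoup

# Shortened, made-up fragment modelled on the question's HTML.
html = """
<table class="width-max">
  <tr>
    <td>
      <a href="nowhere.com"><h2><b>My College</b></h2></a>
      <h4>My Name<br/></h4>
      My Address
      <br/>
      My City, XY 19604
      <br/>
      Email:
      <a href="nowhere.com">example@nowhere.edu</a>
    </td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

blocks = []
for table in soup.find_all("table", class_="width-max"):
    # strip=True trims every text node and skips the ones that end up empty;
    # the separator then puts each remaining string on its own line.
    blocks.append(table.get_text(separator="\n", strip=True))

# One blank line between tables, as in the desired output.
result = "\n\n".join(blocks)
print(result)
```

A text node that itself spans several lines would keep its inner line breaks and indentation, so this is worth checking against the full file.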

Answer 3

You need to clean up the value of table.get_text() so that each line is printed in turn.
You can do this with two regular expressions:

from bs4 import BeautifulSoup
import re

with open('test.html', 'r') as html_file:
    soup = BeautifulSoup(html_file, 'lxml')
    tables = soup.find_all('table', class_='width-max')

    for table in tables:
        print(re.sub(r"(\n)+", r"\n", re.sub(r" {3,}", "", table.get_text().replace('...', ''))) , end="")

This will output:

My College
My Name
My Address
My City, XY 19604
My Country
Email:
example@nowhere.edu
Website:
http://www.nowhere.edu    

His College
His name
His Address
His City, YX 49506
His Country
Phone: XX-YY-ZZ
Email:
example@nowhere2.edu
Website:
http://nowhere2.edu

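The first regular expression, " {3,}", removes every run of three or more spaces. The second call, re.sub(r"(\n)+", r"\n", ...), replaces each run of consecutive \n with a single \n, so that print outputs the data line by line.
Also, to match your expected output, get_text().replace('...', '') was added to remove the ... from the text.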
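
The three steps can be followed one at a time on a small string. This is only a sketch: the sample text is made up to imitate table.get_text() output, and the names raw, step1, step2 and step3 are illustrative.

```python
import re

# Hypothetical sample imitating table.get_text() output: indented lines,
# runs of blank lines, and the literal "..." placeholders from the question.
raw = "\n\n      My College\n\n\n     My Name\n     ...\n     ...\n"

step1 = raw.replace("...", "")          # drop the "..." placeholders
step2 = re.sub(r" {3,}", "", step1)     # delete runs of three or more spaces
step3 = re.sub(r"(\n)+", r"\n", step2)  # collapse consecutive newlines into one

print(repr(step3))
```

In this sample the result keeps one leading and one trailing newline, which is why end="" still leaves a blank line between tables. Note that a line indented by fewer than three spaces would keep its indentation.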

How do I print text line by line using BeautifulSoup?
