用beautifulsoup4展开元素:它是否会影响父元素的.string?

时间:2015-02-15 17:01:15

标签: html web-scraping beautifulsoup python-3.4

我正在抓取一个表中的文本数据,如下所示,我希望获得结果:

  

Lorem ipsum
dolor sit amet
consectetur adipiscing elit,
  sed do eiusmod tempor incididunt ut labore et dolore magna   alqua。使用adim veniam,quis nostrud exercitation   ullamco laboris nisi ut aliquip ex ea commodo consequat。

    html = '''
<table>
<tr class="title last ">
  <td>
   Lorem ipsum
  </td>
  <td>
  </td>
 </tr>
 <tr>
  <td>
   <span class="caps">dolor
   </span>
   sit amet
  </td>
  <td>
  </td>
 </tr>
 <tr>
  <td>
   consectetur adipiscing elit,
  </td>
  <td>
  </td>
 </tr>
 <tr>
  <td>
   sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
  </td>
  <td>
  </td>
 </tr>
 <tr>
  <td>
    Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
  </td>
  <td>
  </td>
 </tr>
</table>
'''

我用beautifulsoup4:

打开了<span>元素
soup = BeautifulSoup(html)

# remove <span> tag but keep content
spans = soup.find_all('span')
for tag in spans:
    tag.unwrap()

但是,我为所有空的<td>元素提出了空行,或者行&#39; dolor坐在那里。不打印,即使我在用美化打印html时也能看到它。

# text with empty lines
for line in soup.find_all('td'):
    print(line.get_text().strip())
    print(line.string) # line with <span> prints None

# missing line <span>
for line in soup.find_all('td', text=re.compile(r'\w')):
    print(line.get_text().strip())

print(soup.prettify())

我做错了吗?如何使用unwrap()并仍然访问所有文本内容而不使用空行?

感谢您的帮助!

1 个答案:

答案 0 :(得分:0)

我可以测试,你就在附近。应用strip(),然后使用re模块将多个空格替换为仅一个,例如:

from bs4 import BeautifulSoup
import re

html = ''' 
<table>
<tr class="title last ">
  <td>
   Lorem ipsum
  </td>
  <td>
  </td>
 </tr>
 <tr>
  <td>
   <span class="caps">dolor
   </span>
   sit amet
  </td>
  <td>
  </td>
 </tr>
 <tr>
  <td>
   consectetur adipiscing elit,
  </td>
  <td>
  </td>
 </tr>
 <tr>
  <td>
   sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
  </td>
  <td>
  </td>
 </tr>
 <tr>
  <td>
    Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
  </td>
  <td>
  </td>
 </tr>
</table>
'''

soup = BeautifulSoup(html)

# remove <span> tag but keep content
spans = soup.find_all('span')
for tag in spans:
    tag.unwrap()

print('\n'.join(
  re.sub(r'\s+', ' ', td.text.strip()) 
    for td in soup.find_all('td') if td.text.strip()))

它产生:

Lorem ipsum
dolor sit amet
consectetur adipiscing elit,
sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.