How to remove extra spaces or gaps between tags in Python

Time: 2016-02-27 07:11:11

Tags: python web-scraping beautifulsoup

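Hello, I am scraping the li tags inside a div on a website. My output contains a lot of extra whitespace. How do I remove the extra whitespace from the tags? I am using Python 3.5.1 and BeautifulSoup for scraping. My output is shown below; I want the same text without the surrounding whitespace: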

[<li>

        GUANGZHOU ADS AUDIO SCIENCE &amp; TECHNOLOGY CO.,LTD.

            </li>, <li>

              SHIMA ADS INDUSTRIAL DISTRICT GUANGZHOU GUANGDONG CHINA

            </li>, <li>

        GUANGDONGGUANGZHOU

            </li>, <li>

              510440

            </li>, <li>

              http://www.adsaudio.cc

            </li>]
[<li>

        GUANGDONG TEXTILES IMPORT &amp; EXPORT COMPANY LTD.

            </li>, <li>

              GUANGDONG ,NO.168 XIAO BEI RD.,GUANGZHOU

            </li>, <li>

        GUANGDONGGUANGZHOU

            </li>, <li>

              510045

            </li>, <li>

              http://www.gdtex.com

            </li>]

How can I remove the extra spaces or gaps?

2 answers:

Answer 0: (score: 2)

You can use BeautifulSoup's get_text method and call strip() on the result

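# `soup` is the BeautifulSoup object for the page you already parsed
items = soup.find_all("li")
for item in items:
    # print() is a function in Python 3; strip() trims the surrounding whitespace
    print(item.get_text().strip())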
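
For reference, here is a self-contained sketch of the same idea. It assumes bs4 is installed and uses a small made-up HTML string (based on the output in the question) in place of the real page. It also uses get_text(strip=True), which asks Beautiful Soup to strip each piece of text itself, as an alternative to calling .strip() afterwards.

```python
from bs4 import BeautifulSoup

# Stand-in HTML for illustration only; the real page markup is not shown in the question.
html = """<ul>
<li>

        GUANGDONG TEXTILES IMPORT &amp; EXPORT COMPANY LTD.

            </li>
<li>

              510045

            </li>
</ul>"""

soup = BeautifulSoup(html, "html.parser")

# strip=True trims the whitespace around each text fragment inside the tag
names = [li.get_text(strip=True) for li in soup.find_all("li")]
print(names)
```

For an li that contains only plain text, as here, both spellings should give the same result. If the li contains nested tags, get_text(strip=True) joins the stripped fragments with no separator, so you may want to pass a separator such as " " as the first argument.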

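Answer 1: (score: 0)

Try calling strip on the text you get back from Beautiful Soup

Let's assume you are extracting the text from the li tag with something like text = soup.find('li').get_text(). Then add a call to strip() on that text, i.e. text.strip(). This should remove the whitespace at both ends.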

from bs4 import BeautifulSoup

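def get_li_texts(html):
    """Return the stripped text of every <li> element in the html."""
    # Name the parser explicitly so bs4 does not warn or pick a different one per machine
    soup = BeautifulSoup(html, 'html.parser')
    li_list = soup.find_all('li')

    li_texts = []
    for li in li_list:
        # strip() removes the leading and trailing whitespace and newlines
        text = li.get_text().strip()
        li_texts.append(text)
    return li_texts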

html = '<li>\n\n        GUANGZHOU ADS AUDIO SCIENCE &amp; TECHNOLOGY CO.,LTD.\n\n            </li>, <li>\n\n              SHIMA ADS INDUSTRIAL DISTRICT GUANGZHOU GUANGDONG CHINA\n\n            </li>, <li>\n\n        GUANGDONGGUANGZHOU\n\n            </li>, <li>\n\n              510440\n\n            </li>, <li>\n\n              http://www.adsaudio.cc\n\n            </li>'
texts = get_li_texts(html)
print(texts)
# Output on Python 3 (wrapped here for readability):
# ['GUANGZHOU ADS AUDIO SCIENCE & TECHNOLOGY CO.,LTD.',
#  'SHIMA ADS INDUSTRIAL DISTRICT GUANGZHOU GUANGDONG CHINA',
#  'GUANGDONGGUANGZHOU',
#  '510440',
#  'http://www.adsaudio.cc']
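
Note that strip() only trims the two ends of the string. If the scraped text also has runs of spaces or newlines in the middle, one common approach is to split on whitespace and join with single spaces. The sketch below uses only the standard library; normalize_whitespace is a hypothetical helper name and the raw string is made up for illustration.

```python
def normalize_whitespace(text):
    """Trim both ends and collapse every inner run of whitespace to one space."""
    # str.split() with no arguments splits on any whitespace and drops empty pieces
    return " ".join(text.split())


# Made-up sample with padding at the ends and extra gaps in the middle
raw = "\n\n        GUANGZHOU   ADS AUDIO\n SCIENCE\n\n            "

trimmed = raw.strip()                  # only the ends are cleaned
collapsed = normalize_whitespace(raw)  # ends and inner gaps are cleaned

print(repr(trimmed))
print(repr(collapsed))
```

Whether collapsing inner whitespace is wanted depends on the data; for addresses and company names like the ones in the question it is usually harmless.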