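I am using BeautifulSoup to parse an HTML article. I use a few functions to strip out the HTML so that only the main article remains.
I also want to save the soup output to a file. The error I get is: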
soup = soup.prettify("utf-8")
AttributeError: 'unicode' object has no attribute 'prettify'
Source code:
#!/usr/bin/env python
import urllib2
from bs4 import BeautifulSoup
import nltk
import argparse

def cleaner():
    url = "https://www.ceid.upatras.gr/en/announcements/job-offers/full-stack-web-developer-papergo"
    ourUrl = urllib2.urlopen(url).read()
    soup = BeautifulSoup(ourUrl)
    #remove scripts
    for script in soup.find_all('script'):
        script.extract()
    soup = soup.find("div", class_="clearfix")
    #below code will delete tags except /br
    soup = soup.encode('utf-8')
    soup = soup.replace('<br/>' , '^')
    soup = BeautifulSoup(soup)
    soup = (soup.get_text())
    soup=soup.replace('^' , '<br/>')
    print soup
    with open('out.txt','w',encoding='utf-8-sig') as f:
        f.write(soup.prettify())

if __name__ == '__main__':
    cleaner()
Answer 0 (score: 2)
This is because soup is no longer a BeautifulSoup or Tag instance after these lines:
soup = (soup.get_text())
soup = soup.replace('^' , '<br/>')
It becomes a unicode string, which of course has no .prettify() method.
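A minimal sketch of that type change, assuming Python 3 and the built-in "html.parser" backend (the question itself uses Python 2, where the string type is reported as unicode). The inline HTML snippet is a made-up stand-in, not the real page:

```python
from bs4 import BeautifulSoup

# Hypothetical one-line document, used only to illustrate the type change.
soup = BeautifulSoup("<div>Hello<br/>world</div>", "html.parser")

# The parsed tree is a BeautifulSoup object, so .prettify() is available.
has_prettify_before = hasattr(soup, "prettify")

# .get_text() returns a plain string, not a tree.
text = soup.get_text()

# A plain string has no .prettify() method, which is what the traceback reports.
has_prettify_after = hasattr(text, "prettify")
```

Storing the extracted text under a different name, as `text` does here, keeps the parsed tree available for later calls such as .prettify().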
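Depending on your desired output, you should be able to clean up the HTML with .get_text(), .replace_with(), .unwrap(), .extract() and other BeautifulSoup methods instead of trying to process it as a regular string.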
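One possible way to do this is sketched below. It is an illustration under several assumptions, not the only approach: it targets Python 3, uses the built-in "html.parser" backend, and works on a made-up SAMPLE_HTML string instead of downloading the page. The names clean_article and save_article are hypothetical helpers. The idea is to unwrap every tag except br inside the article div, so the result is still a Tag and .prettify() keeps working:

```python
import io

from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page, so the sketch runs offline.
SAMPLE_HTML = (
    "<html><body>"
    "<script>var x = 1;</script>"
    '<div class="clearfix"><p>First <b>line</b><br/>Second line</p></div>'
    "</body></html>"
)


def clean_article(html):
    """Return the article <div> with every tag except <br> unwrapped."""
    soup = BeautifulSoup(html, "html.parser")

    # Remove <script> elements together with their contents.
    for script in soup.find_all("script"):
        script.extract()

    article = soup.find("div", class_="clearfix")

    # find_all(True) matches every descendant tag; unwrap() removes the
    # tag itself but leaves its children and text in place.
    for tag in article.find_all(True):
        if tag.name != "br":
            tag.unwrap()

    return article  # still a Tag, not a string


def save_article(article, path):
    """Write the prettified markup of the cleaned Tag to a UTF-8 file."""
    with io.open(path, "w", encoding="utf-8") as f:
        f.write(article.prettify())


article = clean_article(SAMPLE_HTML)
```

Calling save_article(article, "out.txt") would then write the cleaned markup to disk. If only the inner markup is wanted, without the surrounding div, article.decode_contents() returns it as a string.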