Question

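I am using BeautifulSoup to parse an HTML article. I run it through a few cleanup steps so that only the main article remains.

I also want to save the soup output to a file. The error I get is: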

soup = soup.prettify("utf-8")
AttributeError: 'unicode' object has no attribute 'prettify'

Source code:

#!/usr/bin/env python
import urllib2
from bs4 import BeautifulSoup
import nltk
import argparse

def cleaner():
    url = "https://www.ceid.upatras.gr/en/announcements/job-offers/full-stack-web-developer-papergo"
    ourUrl  = urllib2.urlopen(url).read()
    soup = BeautifulSoup(ourUrl)

    #remove scripts
    for script in soup.find_all('script'):
        script.extract()
    soup = soup.find("div", class_="clearfix")

    #below code will delete tags except /br
    soup = soup.encode('utf-8')
    soup = soup.replace('<br/>' , '^')
    soup = BeautifulSoup(soup)
    soup = (soup.get_text())
    soup=soup.replace('^' , '<br/>')

    print soup
    with open('out.txt','w',encoding='utf-8-sig') as f:
        f.write(soup.prettify())

if __name__ == '__main__':
    cleaner()

Answer 1

This happens because, after the following lines, soup is no longer a BeautifulSoup or Tag instance:

soup = (soup.get_text())
soup = soup.replace('^' , '<br/>')

It has become a unicode string, which of course has no .prettify() method.
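To see the type change in isolation, here is a minimal sketch. The inline HTML snippet is made up for illustration, and it is written so it also runs on Python 3, where the string type is called str rather than unicode:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page.
soup = BeautifulSoup("<div><p>Hello<br/>world</p></div>", "html.parser")

# get_text() returns a plain string, not a BeautifulSoup/Tag object.
text = soup.get_text()

print(type(soup).__name__)        # BeautifulSoup
print(type(text).__name__)        # str on Python 3, unicode on Python 2
print(hasattr(soup, "prettify"))  # True
print(hasattr(text, "prettify"))  # False
```

Once the name soup is rebound to the result of get_text(), any later call to soup.prettify() fails with the AttributeError from the question.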

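Depending on the output you need, you should be able to clean up the HTML with .get_text(), .replace_with(), .unwrap(), .extract() and other BeautifulSoup methods, rather than trying to process it as a regular string.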
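As a rough sketch of that approach (not necessarily the exact output you are after), the following keeps everything as a Tag until the moment of writing. The sample HTML and the helper name clean_article are assumptions made for illustration; in your script the HTML would come from urlopen(url).read(). It strips every tag except <br> by unwrapping it, so the result still has a .prettify() method:

```python
import io
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page.
SAMPLE_HTML = u"""
<html><body>
<script>var x = 1;</script>
<div class="clearfix"><p>First line<br/>Second <b>bold</b> line</p></div>
</body></html>
"""

def clean_article(html):
    """Return the article <div> with all inner tags except <br> unwrapped."""
    soup = BeautifulSoup(html, "html.parser")

    # Remove scripts from the tree entirely.
    for script in soup.find_all("script"):
        script.extract()

    article = soup.find("div", class_="clearfix")

    # unwrap() drops the tag itself but keeps its children in place.
    for tag in article.find_all(True):
        if tag.name != "br":
            tag.unwrap()
    return article

article = clean_article(SAMPLE_HTML)

# article is still a Tag, so prettify() is available here.
# io.open accepts encoding= on both Python 2 and Python 3.
with io.open("out.txt", "w", encoding="utf-8") as f:
    f.write(article.prettify())
```

The key point is that no step rebinds the variable to a string, so there is no need for the '^' placeholder trick to protect the <br/> tags.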
