Question

Below is the Python code I am using to scrape the headlines and output them:

# Imports are assumed (the original post did not show them); Python 2 with bs4
from urllib2 import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.imdb.com/news/top")
imdbNews = BeautifulSoup(html)
lines = []
for headLine in imdbNews.findAll("h2"):
    #headLine.encode('ascii', 'ignore')
    imdb_news = headLine.get_text()
    lines.append(imdb_news)
    #f = open("output.txt", "a")
    #f.write(imdb_news)
    #f.close()

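The commented-out lines (the #s) are my attempts to get rid of the Unicode errors, but they only lead to more errors that I can't make sense of. The current code produces the following output: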

[u'Warner Bros. Brings \u2018Wonder Woman,\u2019 \u2018Suicide Squad,\u2019 \u2018Fantastic Beasts\u2019 to Comic-Con',
 u"\u2018Ghostbusters': Is There a Post-Credit Scene?",
 u'Javier Bardem Eyed for Frankenstein Role in Universal\u2019s Monster Universe (Exclusive)',
 u'\u2018Battlefield\u2019 Video Game Being Developed for TV Series by Paramount Television & Anonymous Content',
 u'\u2018Ghostbusters\u2019 Review Roundup: Critics Generally Positive On Female-Led Blockbuster',
 u'\u2018Assassin\u2019s Creed\u2019 Movie Won\u2019t Make Money, Ubisoft Chief Says',
 u"Fargo Taps The Leftovers' Carrie Coon as Female Lead in Season 3",
 u'Ridley Scott Long-Time Collaborator Julie Payne Dies at 64',
 u'Ridley Scott Longtime Collaborator Julie Payne Dies at 64',
 u'15 Highest Paid Music Stars of 2016, From The Weeknd to Taylor Swift (Photos)',
 u'South Africa\u2019s Pubcaster Draws Ire From Demonstrators, the Government',
 u'Jerry Greer, Son of Country Music Singer Craig Morgan, Dies at 19',
 u'Queen Latifah Says Racism Is "Still Alive and Kicking" at VH1 Hip Hop Honors',
 u'Jerry Greer, Son of Country Singer Craig Morgan, Found Dead After Boating Accident',
 u'[Watch] Emmy Awards movie/mini slugfest: \u2018The People v. O.J. Simpson\u2019 and \u2018Fargo\u2019 battle for the win',
 u'Amanda Evans Wraps Videovision\u2019s Thriller \u2018Serpent\u2019',
 u'\u2018Oslo\u2019 Theater Review: The Handshake That Shook the World',
 u'\u2018The Bachelorette\u2019 Recap: JoJo Tames Some Wild Horses',
 u'Disney Accelerator Names 9 Startups to Participate in 2016 Mentorship Program',
 u'Karlovy Vary Film Review: \u2018The Teacher\u2019',
 u'Top News',
 u'Movie News',
 u'TV News',
 u'Celebrity News']

How do I get rid of the u's, the \u2019 and so on, and get my results into a .txt file?

Answer 1

 s = u'\u2018Battlefield\u2019'.encode('utf-8')
 with open("some_file", "w") as f:
     f.write(s)

Just call .encode('utf-8') on your string before writing it to the file.
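
As a rough sketch of the same idea applied to a whole list of headlines (not part of the original answer): instead of encoding each string by hand, the file can be opened with an explicit encoding so that text is converted to UTF-8 on write. The helper names `write_lines_utf8` and `read_lines_utf8`, the sample data, and the temporary output path are made up for illustration. `io.open` is used because it accepts `encoding=` on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def write_lines_utf8(path, lines):
    """Write each text string on its own line, encoded as UTF-8."""
    with io.open(path, "w", encoding="utf-8") as f:
        for line in lines:
            f.write(line + u"\n")


def read_lines_utf8(path):
    """Read the file back as text, one list entry per line."""
    with io.open(path, "r", encoding="utf-8") as f:
        return f.read().splitlines()


# Hypothetical sample standing in for the scraped `lines` list.
sample = [u"\u2018Battlefield\u2019 Video Game", u"Top News"]
out_path = os.path.join(tempfile.gettempdir(), "headlines_utf8_demo.txt")

write_lines_utf8(out_path, sample)
round_trip = read_lines_utf8(out_path)
```

Written this way, the curly quotes are stored in the file as real characters rather than being dropped, and reading the file back with the same encoding returns the original strings.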

Answer 2

Update

Since you don't want the \u escape sequences in your strings, this should work:

# urlopen and BeautifulSoup are imported as in the question's script
html = urlopen("http://www.imdb.com/news/top")
imdbNews = BeautifulSoup(html)
lines = []
for headLine in imdbNews.findAll("h2"):
    imdb_news = headLine.get_text()
    lines.append(imdb_news.encode('ascii', 'ignore'))
    f = open("output.txt", "a")
    f.write(imdb_news.encode('ascii', 'ignore'))
    f.close()

That is, encode the Unicode characters to ASCII before writing them to the file.
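
To make the trade-off concrete, here is a small sketch (the sample headline is taken from the question's output; the variable names are only illustrative). With the 'ignore' error handler, any character that has no ASCII equivalent is silently dropped, so the curly quotes disappear. Encoding to UTF-8 keeps them.

```python
# -*- coding: utf-8 -*-
headline = u"\u2018Ghostbusters\u2019 Review Roundup"

# 'ignore' discards every character that cannot be represented in ASCII.
ascii_bytes = headline.encode("ascii", "ignore")

# UTF-8 can represent every Unicode character, so nothing is lost.
utf8_bytes = headline.encode("utf-8")

# Decoding the UTF-8 bytes gives back the original text.
restored = utf8_bytes.decode("utf-8")
```

Here `ascii_bytes` holds `Ghostbusters Review Roundup` with the quotes removed, while `restored` is identical to `headline`. Which one is preferable depends on whether the quote characters matter in the output file.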

What you did wrong was this:

headLine.encode('ascii', 'ignore')

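This does not modify headLine, because encode returns a new value instead of changing the object in place. You need to assign the result back to headLine, like this: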

headLine = headLine.encode('ascii', 'ignore')
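
The same behaviour can be seen with a plain string, as in this sketch (the sample text comes from the question's output; the variable names are illustrative only):

```python
# -*- coding: utf-8 -*-
text = u"Universal\u2019s Monster Universe"

# The return value is thrown away here, so `text` is left untouched.
text.encode("ascii", "ignore")
unchanged = text

# Keeping the return value is what actually gives you the encoded form.
text_ascii = text.encode("ascii", "ignore")
```

After the first call `unchanged` still contains the \u2019 character. Only `text_ascii`, which captured the return value, holds the ASCII-only result.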

How to get rid of Unicode encoding errors when trying to output web scraping results to a .txt file

2 answers: