Question

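I'm trying to read a web page and output its formatted text to a text file. The code below prints to the shell with the formatting intact, but when I write it to a file, everything ends up on one line (with literal \n newline markers in the text).

I've tried various things, such as not converting it to a string and using Beautiful Soup's prettify, but nothing seems to produce a formatted text file. I assume I'm missing something fairly basic. Any help or pointers would be much appreciated.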

# Import 
from urllib.request import urlopen
from bs4 import BeautifulSoup

#The actual code


URL = "https://simple.wikipedia.org/wiki/castle" #The target URL
html = urlopen(URL).read()  # Reads the url to variable html
soup = BeautifulSoup(html, "lxml") # Uses BS4 to create the soup using the lxml parser
soup = soup.get_text() # Extracts the text
print(soup) # Prints to python 3.5.1 shell, formatted as I would expect


# Now writing what I have extracted to a text file
file = open("TextOutput.txt", 'w') # Creates the file and opens as write (w)
file.writelines(str(soup.encode('UTF-8'))) # Tried file.write/lines(soup), conversion to string and encoding as UTF-8 needed to avoid errors
file.close()

An example of the file output is as follows:

b'\n\n\nCastle - Simple English Wikipedia, the free encyclopedia\ndocument.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );\n(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Castle","wgTitle":"Castle","wgCurRevisionId":5333370,"wgRevisionId":5333370,"wgArticleId":15933,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Castles"],"wgBreakFrames":false,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgMonthNamesShort":["","Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"],"wgRelevantPageName":"Castle","wgRelevantArticleId":15933,"wgRequestId":"VxUR5gpAIDAAAEXY6FMAAACC","wgIsProbablyEditable":true,"wgRestrictionEdit":[],"wgRestrictionMove":[],"wgWikiEditorEnabledModules":{"toolbar":true,"dialogs":true,"preview":false,"publish":false},"wgBetaFeaturesFeatures":[],"wgMediaViewerOnClick":true,"wgMediaViewerEnabledByDefault":true,"wgVisualEditor":{"pageLanguageCode":"en","pageLanguageDir":"ltr","usePageImages":true,"usePageDescriptions":true},"wgPreferredVariant":"en","wgRelatedArticles":null,"wgRelatedArticlesUseCirrusSearch":true,"wgRelatedArticlesOnlyUseCirrusSearch":false,"wgULSAcceptLanguageList":[],"wgULSCurrentAutonym":"English","wgCategoryTreePageCategoryOptions":"{\"mode\":0,\"hideprefix\":20,\"showcount\":true,\"namespaces\":false}","wgNoticeProject":"wikipedia","wgCentralNoticeCategoriesUsingLegacy":["Fundraising","fundraising"],"wgCentralAuthMobileDomain":false,"wgWikibaseItemId":"Q23413","wgVisualEditorToolbarScrollOffset":0});mw.loader.implement("user.options",function($,jQuery){mw.user.options.set({"variant":"en"});});mw.loader.implement("user.tokens",function ( $, jQuery ) {\nmw.user.tokens.set({"editToken":"+\\","patrolToken":"+\\","watchToken":"+\\","csrfToken":"+\\"});/*@nomin*/;\n\n});mw.loader.load(["mw.MediaWikiPlayer.loader","mw.PopUpMediaTransform","mw.TMHGalleryHook.js","mediawiki.page.startup","mediawiki.legacy.wikibits","ext.centralauth.centralautologin","mmv.head","ext.visualEditor.desktopArticleTarget.init","ext.uls.init","ext.uls.interface","ext.centralNotice.bannerController","skins.vector.js"]);});\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nCastle\n\nFrom Wikipedia, the free encyclopedia\n\n\n\t\t\t\t\tJump to:\t\t\t\t\tnavigation, \t\t\t\t\tsearch\n\n\n\n\n\n\nBodiam Castle in England surrounded by a water-filled moat.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nCastles\n\n\n\nA castle (from the Latin word castellum) is a fortified structure made in Europe and the Middle East during the Middle Ages. People argue about what the word castle means. However, it usually means a private structure of a lord or noble. This is different from a fortress, which is not a home, and from a fortified town, which is a public defence. For the approximately 900 years that castles were built, they had many different shapes and different details.\nCastles began in Europe in the 9th and 10th centuries. They controlled the places surrounding them, and could help in both attacking and defending. Weapons could be fired from castles, and people could be protected from enemies inside them. However, castles were also a symbol of power. They could be used to control the people and roads around them.\nMany castles were originally made from earth and wood, often using manual labour, and then had their defences replaced by stone. Early castles often used natural defences, and did not have towers. However, by the late 12th and early 13th centuries, castles became more complex.\n

Answer 1

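file.writelines(str(soup.encode('UTF-8'))) is kind of crazy; it is:

1. Encoding the text (str) to binary (bytes)
2. Getting the text representation of those bytes by wrapping them in str (so it's what you would type to recreate the bytes, not the original binary data)
3. Writing it out one character at a time (writelines iterates whatever you give it, and a str iterates by character)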

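Step 3 is silly and inefficient, but mostly harmless. Step #1 would be fine if you opened the file for binary write and actually wrote the bytes object itself. But #1 and #2 together mean that things like newlines are converted to a literal \n in the output instead of actually breaking the line. Non-ASCII content like é is output as \xc3\xa9, and the whole thing is wrapped in b'' (or b"").

You want something like this: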
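To make the three steps concrete, here is a small sketch that needs no network access. The string `text` is only an assumed stand-in for what `get_text()` returns; the variable names are illustrative and not from the question.

```python
# Hypothetical stand-in for the text returned by soup.get_text()
text = "Castle\ncafé\n"

encoded = text.encode("utf-8")   # step 1: str -> bytes
as_repr = str(encoded)           # step 2: str() of a bytes object yields its repr

# The repr is one line of plain characters: b'Castle\ncaf\xc3\xa9\n'
print(as_repr)

# No real newline character is left in the string...
print("\n" in as_repr)           # False
# ...each one became a backslash followed by the letter n
print(as_repr.count("\\n"))      # 2

# step 3: writelines() iterates its argument, and a str iterates by character
print(list(as_repr)[:4])         # ['b', "'", 'C', 'a']
```

This is why the file contains the `b'...'` wrapper and escape sequences: the string being written is a description of the bytes, not the text itself.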

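# open with UTF-8 encoding (in case your system defaults to something else)
with open("TextOutput.txt", 'w', encoding='utf-8') as file:
    # Get the text and write it as a single block
    # (soup must still be the BeautifulSoup object here; if you already did
    # soup = soup.get_text() as in the question, write soup directly instead)
    file.write(soup.get_text())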
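For comparison, here is a minimal self-contained sketch of the two approaches that should both work: writing the str to a file opened in text mode with an explicit encoding, or writing the encoded bytes to a file opened in binary mode. The sample string, the temporary directory, and the file names are assumptions made for illustration only.

```python
import os
import tempfile

# Hypothetical stand-in for the text returned by soup.get_text()
text = "Castle\ncafé\n"

with tempfile.TemporaryDirectory() as tmp:
    text_path = os.path.join(tmp, "text_mode.txt")
    bytes_path = os.path.join(tmp, "binary_mode.txt")

    # Text mode: pass the str itself and let open() handle the encoding
    with open(text_path, "w", encoding="utf-8") as f:
        f.write(text)

    # Binary mode: encode once and write the bytes object itself
    with open(bytes_path, "wb") as f:
        f.write(text.encode("utf-8"))

    # Read both files back as text and split them into lines
    with open(text_path, encoding="utf-8") as f:
        lines_text = f.read().splitlines()
    with open(bytes_path, encoding="utf-8") as f:
        lines_bytes = f.read().splitlines()

print(lines_text)
print(lines_bytes)
```

In both cases the newlines reach the file as real line breaks, so either file opens as multi-line text in an editor. Text mode is usually the simpler choice when the data is already a str.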

Why is the text formatting in the Python 3 shell different from the generated text file?
