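I'm trying to use Python to retrieve some text from a website and then create a .txt file from that text. I'm using Beautiful Soup 4 and Requests to fetch the page. I can pull the text and create the file without any problem, but when I open the resulting file in VSCode, I get: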
�It�s the year 3486 of the Saint Origin calendar. I was dead for over a hundred years. Jiang Chen, my name is�Jiang Chen. Why have I been reborn after a hundred years?�
Comparing this with the website, we can see that each `�` should be some punctuation mark. I then tried using:
text = text.replace(u"\u201c", '"')
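to replace some of the double quotes, but this only solves part of the problem. It leaves many `�` characters behind, and hunting down every punctuation mark with the same method is not feasible.
Is there a way to solve this, perhaps by forcing the character encoding I want to use?
If needed, here is my source code: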
# MODULES NEEDED:
from bs4 import BeautifulSoup
import requests
# Link from which we want the text:
link = "http://liberspark.com/read/dragon-marked-war-god/chapter-1"
# Getting the page's source code:
source = requests.get(link)
# Creating the BeautifulSoup object:
source = BeautifulSoup(source.content.decode("utf-8"), "html.parser")
# Finding the div which holds the text:
container = source.find("div", class_="reader-content")
# Variable that will hold all the text:
text = ""
# Going through all the <p> tags in the container:
for p in container.find_all("p"):
    text += str(p.text) + "\n\n"
text = text.replace(u"\u2019", "'")
with open("test.txt", "w") as file:
    file.write(text)
Answer 0 (score: 1)
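This is because test.txt is not being written as UTF-8. When open() is called in text mode without an explicit encoding, Python uses the platform's default encoding, which on Windows is often not UTF-8. Open the file with the wb flag and write the text with .encode('utf-8'):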
with open("test.txt", "wb") as file:
    file.write(text.encode('utf-8'))
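A minimal sketch of what is probably going on, plus an equivalent way to write the file. It assumes the original file was written with Windows-1252 (cp1252) as the default encoding, which the question does not confirm; the sample string, the save_text_utf8 helper, and the temporary path are hypothetical names used only for illustration. Passing encoding="utf-8" to open() in text mode should have the same effect as the wb plus .encode('utf-8') approach above.

```python
import os
import tempfile

# Sample text with curly quotes, similar to the scraped chapter.
curly = "\u201cIt\u2019s the year 3486\u201d"

# Assumption: the file was written as cp1252 and then read as UTF-8.
# The cp1252 bytes for curly quotes are not valid UTF-8, so an editor
# shows the replacement character U+FFFD in their place.
misread = curly.encode("cp1252").decode("utf-8", errors="replace")


def save_text_utf8(path, text):
    """Write text to path in text mode, always encoded as UTF-8."""
    # newline="\n" keeps line endings identical on every platform.
    with open(path, "w", encoding="utf-8", newline="\n") as file:
        file.write(text)


# Hypothetical output location, used only for this demonstration.
demo_path = os.path.join(tempfile.mkdtemp(), "test.txt")
save_text_utf8(demo_path, curly + "\n")

# Reading the file back as UTF-8 returns the original characters.
with open(demo_path, encoding="utf-8") as file:
    round_trip = file.read()
```

With the explicit encoding, the text.replace(...) calls for individual quotation marks are no longer needed just to make the file readable.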