Question

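I'm trying to use Python to retrieve some text from a website and then create a .txt file from that text. I'm using Beautiful Soup 4 and Requests to fetch the page. I can extract the text and create the file without any problem, but when I open the resulting file in VSCode, I get: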

�It�s the year 3486 of the Saint Origin calendar. I was dead for over a hundred years. Jiang Chen, my name is�Jiang Chen. Why have I been reborn after a hundred years?�

Comparing this with the website, we can see that each � should be some punctuation mark. I then tried using:

text = text.replace(u"\u201c", '"')

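to replace some of the double quotes, but this only solves part of the problem: it leaves plenty of � characters behind, and tracking down every punctuation mark with the same approach isn't feasible.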

Is there a way to fix this, perhaps by forcing the character encoding I want to use?

Here is my source code, in case it's needed:

# MODULES NEEDED:
from bs4 import BeautifulSoup
import requests

# Link from which we want the text:
link = "http://liberspark.com/read/dragon-marked-war-god/chapter-1"

# Getting the page's source code:
source = requests.get(link)

# Creating the BeautifulSoup object:
source = BeautifulSoup(source.content.decode("utf-8"), "html.parser")

# Finding the div which holds the text:
container = source.find("div", class_="reader-content")

# Variable that will hold all the text:
text = ""

# Going through all the <p> tags in the container:
for p in container.find_all("p"):
    text += str(p.text) + "\n\n"

text = text.replace(u"\u2019", "'")

with open("test.txt", "w") as file:
    file.write(text)

Answer 1

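This happens because test.txt is not being written as UTF-8. Without an explicit encoding, open() in text mode falls back to the platform's default encoding, which is often not UTF-8 on Windows. To fix it, write the file in binary mode using the wb flag and .encode('utf-8'):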

with open("test.txt", "wb") as file:
    file.write(text.encode('utf-8'))
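
An alternative that should behave the same is to stay in text mode and pass encoding="utf-8" to open(), so Python does the encoding on write. The sketch below is only an illustration: the sample string and the temporary file path are assumptions standing in for the scraped text and test.txt. It also shows how the � characters can arise, assuming the original file was written as cp1252 (a common Windows default) and then read as UTF-8 by the editor.

```python
import os
import tempfile

# Sample text with curly punctuation like the question's (assumed stand-in
# for the scraped chapter text)
text = "\u201cIt\u2019s the year 3486\u2026\u201d"

# Temporary path used here instead of test.txt (assumption for the demo)
path = os.path.join(tempfile.mkdtemp(), "test.txt")

# Text mode with an explicit encoding: the file is written as UTF-8
with open(path, "w", encoding="utf-8") as file:
    file.write(text)

# Reading it back as UTF-8 recovers the original characters
with open(path, encoding="utf-8") as file:
    round_trip = file.read()

# What a UTF-8 editor would show if the bytes had been written as cp1252:
# the punctuation bytes are invalid UTF-8 and become U+FFFD
garbled = text.encode("cp1252").decode("utf-8", errors="replace")

print(round_trip == text)
print(garbled)
```

With either approach, the text.replace(...) calls for individual punctuation marks should no longer be necessary, since the curly quotes are stored correctly rather than worked around.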
