Python web scraping returns strange characters

Asked: 2019-01-25 04:42:53

Tags: python-3.x web-scraping beautifulsoup

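I am trying to use Python to retrieve some text from a website and then create a .txt file from it. I am using Beautiful Soup 4 and Requests to fetch the page. I can extract the text and create the file without any problem, but when I open the resulting file in VSCode, I get: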

�It�s the year 3486 of the Saint Origin calendar. I was dead for over a hundred years. Jiang Chen, my name is�Jiang Chen. Why have I been reborn after a hundred years?�

Comparing this with the website, we can see that each `�` should be some punctuation mark. I then tried using:

text = text.replace(u"\u201c", '"')

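to replace some of the double quotes, but that only solves part of the problem: it leaves a lot of `�` behind, and hunting down every punctuation mark with the same approach is not feasible.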

Is there a way to fix this, perhaps by forcing the character encoding I want to use?

Here is my source code, in case it is needed:

# MODULES NEEDED:
from bs4 import BeautifulSoup
import requests

# Link from which we want the text:
link = "http://liberspark.com/read/dragon-marked-war-god/chapter-1"

# Getting the page's source code:
source = requests.get(link)

# Creating the BeautifulSoup object:
source = BeautifulSoup(source.content.decode("utf-8"), "html.parser")

# Finding the div which holds the text:
container = source.find("div", class_="reader-content")

# Variable that will hold all the text:
text = ""

# Going through all the <p> tags in the container:
for p in container.find_all("p"):
    text += p.text + "\n\n"

text = text.replace(u"\u2019", "'")

with open("test.txt", "w") as file:
    file.write(text)

1 answer:

Answer 0 (score: 1)

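This happens because test.txt is not being written as UTF-8: in text mode, open() uses the platform's default encoding unless one is specified. Write the file in binary mode instead, using the wb flag and .encode('utf-8'):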

with open("test.txt", "wb") as file:
    file.write(text.encode('utf-8'))
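An equivalent approach is to stay in text mode and pass encoding="utf-8" to open(). The sketch below first reproduces the symptom, on the assumption that the platform default encoding was cp1252 (common on Windows; the question does not say which platform was used), and then writes the file as UTF-8. The sample string and the temporary file path are made up for illustration.

```python
import os
import tempfile

# Sample text with the curly quotes seen in the question's output (illustrative).
text = "\u201cIt\u2019s the year 3486.\u201d"

# Assumption: the default encoding was cp1252. Bytes written that way are not
# valid UTF-8, so an editor reading the file as UTF-8 shows U+FFFD instead.
misread = text.encode("cp1252").decode("utf-8", errors="replace")
print(misread)

# Text mode with an explicit encoding, so the platform default is not used.
path = os.path.join(tempfile.mkdtemp(), "test.txt")  # hypothetical location
with open(path, "w", encoding="utf-8") as file:
    file.write(text)

# Read the raw bytes back to check that the file really is UTF-8.
with open(path, "rb") as file:
    raw = file.read()
print(raw.decode("utf-8") == text)
```

With either variant the text.replace(...) calls from the question should no longer be needed, since the curly quotes are stored as they are rather than substituted.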