How to save the BeautifulSoup object to a file and then read from it as BeautifulSoup?

Time: 2018-10-24 16:16:48

Tags: python beautifulsoup

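I want to save a BeautifulSoup object to a file. To do that, I convert it to a string and write the string to a file. Later, after reading the string back, I convert it to a BeautifulSoup object again. This would help with my testing, because the data being scraped is dynamic.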

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "https://coinmarketcap.com/all/views/all/"
html = urlopen(url)
soup = BeautifulSoup(html, "lxml")

Writing the soup object like this:

  new_soup = str(soup)
  with open("coin.txt", "w+") as f:
      f.write(new_soup)

produces this error:

UnicodeEncodeError: 'charmap' codec can't encode 
characters in position 28127-28132: character maps to <undefined>

Also, if I do manage to save it to a file, how would I read the string back as a BeautifulSoup object?

1 Answer:

Answer 0 (score: 1)

Edit

The old code failed to pickle the soup object because of a RecursionError:

Traceback (most recent call last):
  File "soup.py", line 20, in <module>
    pickle.dump(soup, f)
RecursionError: maximum recursion depth exceeded while calling a Python object

A solution is to increase the recursion limit. They do the same in this answer, which in turn refers to the docs.
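
As a rough sketch of raising the limit only around the pickle call: the helper name recursion_limit below is my own invention, not something from the standard library, and restoring the previous limit afterwards is an assumption about what you would want rather than a requirement.

```python
import sys
from contextlib import contextmanager

@contextmanager
def recursion_limit(limit):
    """Temporarily raise the interpreter recursion limit (hypothetical helper)."""
    old = sys.getrecursionlimit()
    # Never lower the limit below what is already in effect
    sys.setrecursionlimit(max(old, limit))
    try:
        yield
    finally:
        # Put the previous limit back, even if the body raised
        sys.setrecursionlimit(old)

# Intended usage (assumes `soup` and an open binary file `f` already exist):
# with recursion_limit(8000):
#     pickle.dump(soup, f)
```

Keep in mind that a very large limit can still crash the interpreter on a deeply nested document, so this only helps for moderately nested pages.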

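However, the particular site you are trying to load and save is extremely deeply nested. My computer cannot go past a recursion limit of 50000, which is still not enough for your site, and the script crashes: 10008 segmentation fault (core dumped) python soup.py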

So, if you need to download the HTML and use it later, you can do this:

from bs4 import BeautifulSoup
from urllib.request import urlopen

url = "https://coinmarketcap.com/all/views/all/"
html = urlopen(url)

# Save HTML to a file
with open("soup.html", "wb") as f:
    while True:
        chunk = html.read(1024)
        if not chunk:
            break
        f.write(chunk)
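
For what it's worth, the manual chunk loop above can probably be replaced with shutil.copyfileobj, which copies between file-like objects in chunks. A small sketch: the function name save_stream and the io.BytesIO stand-in are only illustrative; in the real script the source would be the urlopen response.

```python
import io
import os
import shutil
import tempfile

def save_stream(src, path, chunk_size=1024):
    """Copy a binary file-like object (e.g. a urlopen response) to `path`."""
    with open(path, "wb") as f:
        shutil.copyfileobj(src, f, chunk_size)

# Stand-in for the HTTP response so the sketch runs offline
fake_response = io.BytesIO(b"<html><title>demo</title></html>")
demo_path = os.path.join(tempfile.gettempdir(), "soup_demo.html")
save_stream(fake_response, demo_path)
```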

Then you can read the saved HTML file and instantiate a bs4 object from it:

# Read HTML from a file
with open("soup.html", "rb") as f:
    soup = BeautifulSoup(f.read(), "lxml")

print(soup.title)
# <title>All Cryptocurrencies | CoinMarketCap</title>
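
Regarding the UnicodeEncodeError in the question: it most likely comes from open() falling back to the platform default encoding (the 'charmap' codec suggests a Windows code page), which cannot represent some characters on the page. If you would rather keep writing str(soup) in text mode, passing encoding="utf-8" explicitly should avoid it. A sketch with hypothetical helper names, using a plain string as a stand-in for str(soup):

```python
import os
import tempfile

def save_text(path, text):
    # Explicit UTF-8 so the result does not depend on the platform default
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

def load_text(path):
    with open(path, "r", encoding="utf-8") as f:
        return f.read()

# Stand-in for str(soup); contains characters a Windows code page cannot encode
markup = "<html><title>Bitcoin \u20bf \u65e5\u672c\u8a9e</title></html>"
path = os.path.join(tempfile.gettempdir(), "coin_demo.txt")
save_text(path, markup)
restored = load_text(path)

# With bs4 installed, the string can be parsed again:
# soup = BeautifulSoup(restored, "lxml")
```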

Also, this is the code I would use for less deeply nested sites:

import pickle
from bs4 import BeautifulSoup
from urllib.request import urlopen
import sys

url = "https://stackoverflow.com/questions/52973700/how-to-save-the-beautifulsoup-object-to-a-file-and-then-read-from-it-as-beautifu"
html = urlopen(url)
soup = BeautifulSoup(html,"lxml")

sys.setrecursionlimit(8000)

# Save the soup object to a file
with open("soup.pickle", "wb") as f:
    pickle.dump(soup, f)

# Read the soup object from a file
with open("soup.pickle", "rb") as f:
    soup_obj = pickle.load(f)

print(soup_obj.title)

# <title>python - How to save the BeautifulSoup object to a file and then read from it as BeautifulSoup? - Stack Overflow</title>