File mode error when trying to write with BeautifulSoup

Asked: 2019-12-05 14:05:22

Tags: python beautifulsoup

I'm having a problem processing files. The function searches all files for certain strings and replaces them with new values. What I can't figure out is how to write the new content back to the same file. I think the problem is the file mode, but I'm not sure how to handle it, because when I change the mode I get a new error.

 def replace_urls(self):
        find_string_1 = '/blog/'
        find_string_2 = '/contakt/'
        replace_string_1 = 'blog.html'
        replace_string_2 = 'contact.html'

        exclude_dirs = ['media', 'static']

        for (root_path, dirs, files) in os.walk(f'{settings.BASE_DIR}/static/'):
            dirs[:] = [d for d in dirs if d not in exclude_dirs]
            for file in files:
                get_file = os.path.join(root_path, file)
                with open(get_file, 'wb', encoding='utf-8') as f:
                    soup = BeautifulSoup(f, "lxml", from_encoding="utf-8")
                    blog_text = soup.find('a', attrs={'href':find_string_1})
                    contact_text = soup.find('a', attrs={'href':find_string_2})
                    blog_text.attrs['href'] = replace_string_1
                    contact_text.attrs['href'] = replace_string_2
                    f.write(soup.prettify('utf-8'))

The traceback from the code above:

with open(get_file, 'wb', encoding='utf-8') as f:

ValueError: binary mode doesn't take an encoding argument
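The error can be reproduced in isolation: `open` refuses the `encoding` argument whenever the mode string contains `'b'` (the filename below is just a placeholder):

```python
# Binary mode ('b' in the mode string) rejects the encoding argument,
# because bytes are written as-is, without a text encoding step.
try:
    open("dummy.html", "wb", encoding="utf-8")
except ValueError as exc:
    print(exc)  # binary mode doesn't take an encoding argument
```

The exception is raised before the file is even created, so the mode has to be fixed before anything else.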

Important:

I want to use this function as a Django command, so I call it with python manage.py command_name:
from django.core.management.base import BaseCommand
from django.conf import settings
import os
import codecs
from bs4 import BeautifulSoup
from lxml import etree


class Command(BaseCommand):
    help='change urls in each header to static version'


    def replace_urls(self):
        find_string_1 = '/blog/'
        find_string_2 = '/contact/'
        replace_string_1 = 'blog.html'
        replace_string_2 = 'contact.html'

        exclude_dirs = ['media', 'static']

        for (root_path, dirs, files) in os.walk(f'{settings.BASE_DIR}/static/'):
            dirs[:] = [d for d in dirs if d not in exclude_dirs]
            for file in files:
                get_file = os.path.join(root_path, file)
                with open(get_file, 'wb', encoding='utf-8') as f:
                    soup = BeautifulSoup(f, "lxml", from_encoding="utf-8")
                    blog_text = soup.find('a', attrs={'href':find_string_1})
                    contact_text = soup.find('a', attrs={'href':find_string_2})
                    blog_text.attrs['href'] = replace_string_1
                    contact_text.attrs['href'] = replace_string_2
                    f.write(soup.prettify('utf-8'))


    def handle(self, *args, **kwargs):
        try:
            self.replace_urls()
            self.stdout.write(self.style.SUCCESS(f'********** Command has been execute without any error **********'))
        except Exception:
            self.stdout.write(self.style.NOTICE(f'********** Command  does not exist ! **********'))

2 Answers:

Answer 0 (score: 0)

Adding "b" to the mode opens the file in binary mode, and binary mode does not support an encoding argument.

You can use the codecs library for this.

Here is my suggestion:

import codecs

def replace_urls(self):
    find_string_1 = '/blog/'
    find_string_2 = '/contakt/'
    replace_string_1 = 'blog.html'
    replace_string_2 = 'contact.html'

    exclude_dirs = ['media', 'static']

    for (root_path, dirs, files) in os.walk(f'{settings.BASE_DIR}/static/'):
        dirs[:] = [d for d in dirs if d not in exclude_dirs]
        for file in files:
            get_file = os.path.join(root_path, file)
            # Read and parse first; opening with "w" would truncate
            # the file before BeautifulSoup could read it.
            with codecs.open(get_file, "r", "utf-8") as f:
                soup = BeautifulSoup(f, "lxml")
            blog_text = soup.find('a', attrs={'href': find_string_1})
            contact_text = soup.find('a', attrs={'href': find_string_2})
            blog_text.attrs['href'] = replace_string_1
            contact_text.attrs['href'] = replace_string_2
            # Reopen for writing; prettify() with no argument returns
            # a str, which is what a codecs text stream expects.
            with codecs.open(get_file, "w", "utf-8") as f:
                f.write(soup.prettify())

A simple test of codecs.open:

import codecs

file = codecs.open("test.txt", "w", "utf-8")
file.write(u'\ufeff')
file.close()

Another possibility is to skip codecs and pass the encoding to open in text mode:

with open(get_file, 'w', encoding='utf-8') as f:
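If you take the text-mode route, note that soup.prettify('utf-8') returns bytes, which a text-mode file rejects; call prettify() with no argument to get a str. A minimal sketch of the write step, using the stdlib html.parser and an inline document so it runs on its own:

```python
from bs4 import BeautifulSoup

html = '<html><body><a href="/blog/">Blog</a></body></html>'
soup = BeautifulSoup(html, "html.parser")
soup.find('a')['href'] = 'blog.html'

# Text mode expects str, so use prettify() without an encoding
# argument; prettify('utf-8') returns bytes and would raise TypeError.
with open("test.html", "w", encoding="utf-8") as f:
    f.write(soup.prettify())
```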

Answer 1 (score: 0)

As the traceback says, you are writing in byte mode, which means the data is already encoded, so you basically need to write bytes to the file. You can either encode the data yourself before writing, or write bytes that are already encoded.
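The two options produce identical files: either encode to bytes yourself and write in binary mode, or write the str and let a text-mode file encode it on the way out. A small illustration (filenames are placeholders):

```python
text = '<a href="blog.html">Blóg</a>'

# Option 1: encode first, then write the bytes in binary mode.
with open("out_binary.html", "wb") as f:
    f.write(text.encode("utf-8"))

# Option 2: write the str and let text mode do the encoding.
with open("out_text.html", "w", encoding="utf-8") as f:
    f.write(text)
```

Reading both files back as raw bytes shows the same content.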

You have already encoded the HTML with soup.prettify('utf-8'), which means there is no need to pass an encoding argument to the open function, for example:

from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><header></header></html>")
with open("test.html", "wb") as f:
    f.write(soup.prettify('utf-8'))

This should work for you:

def replace_urls(self):
    find_string_1 = '/blog/'
    find_string_2 = '/contakt/'
    replace_string_1 = 'blog.html'
    replace_string_2 = 'contact.html'

    exclude_dirs = ['media', 'static']

    for (root_path, dirs, files) in os.walk(f'{settings.BASE_DIR}/static/'):
        dirs[:] = [d for d in dirs if d not in exclude_dirs]
        for file in files:
            get_file = os.path.join(root_path, file)
            # Read and parse before writing; opening with 'wb'
            # truncates the file, so the parse must come first.
            with open(get_file, 'rb') as f:
                soup = BeautifulSoup(f, "lxml", from_encoding="utf-8")
            blog_text = soup.find('a', attrs={'href': find_string_1})
            contact_text = soup.find('a', attrs={'href': find_string_2})
            blog_text.attrs['href'] = replace_string_1
            contact_text.attrs['href'] = replace_string_2
            # prettify('utf-8') returns bytes, so write in binary mode.
            with open(get_file, 'wb') as f:
                f.write(soup.prettify('utf-8'))