
Question

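I am using urllib.request and regex to parse HTML, but when I write the result to a JSON file, the text contains double backslashes. How can I replace them with a single backslash? I have looked at many solutions, but none of them worked.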

import io
import json
import re
from urllib.request import Request, urlopen

headers = {}
headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0'
req = Request('https://www.manga-tr.com/manga-list.html', headers=headers)
response = urlopen(req).read()
a = re.findall(r'<b><a[^>]* href="([^"]*)"',str(response))
sub_req = Request('https://www.manga-tr.com/'+a[3], headers=headers)
sub_response = urlopen(sub_req).read()
manga = {}
manga['manga'] = []
manga_subject = re.findall(r'<h3>Tan.xc4.xb1t.xc4.xb1m<.h3>[^<]*.n.t([^<]*).t',str(sub_response))
manga['manga'].append({'msubject': manga_subject })
with io.open('allmanga.json', 'w', encoding='utf-8-sig') as outfile:
    outfile.write(json.dumps(manga, indent=4))

This is my JSON file:

{
    "manga": [
        {
            "msubject": [
                "  Minami Ria 16 ya\\xc5\\x9f\\xc4\\xb1ndad\\xc4\\xb1r. \\xc4\\xb0lk erkek arkada\\xc5\\x9f\\xc4\\xb1 sakatani jirou(16) ile yakla\\xc5\\x9f\\xc4\\xb1k 6 ayd\\xc4\\xb1r beraberdir. Herkes taraf\\xc4\\xb1ndan \\xc3\\xa7ifte kumru olarak g\\xc3\\xb6r\\xc3\\xbclmelerine ra\\xc4\\x9fmen ili\\xc5\\x9fkilerinde %1\\'lik bir eksiklik vard\\xc4\\xb1r. Bu eksikli\\xc4\\x9fi tamamlayabilecekler mi?"
            ]
        }
    ]
}

Answer 1

Why does this happen?

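The problem comes from using str to convert the bytes object to a str. Called this way, str does not decode the bytes; it returns their printable representation, escape sequences and all.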

a = re.findall(r'<b><a[^>]* href="([^"]*)"',str(response))
#                                           ^^^

For example, if the response were the word "Tanıtım", it would be represented in UTF-8 as b'Tan\xc4\xb1t\xc4\xb1m'. If you then call str on it, you get:

In [1]: response = b'Tan\xc4\xb1t\xc4\xb1m'

In [2]: str(response)
Out[2]: "b'Tan\\xc4\\xb1t\\xc4\\xb1m'"

If you convert this to JSON, you will see double backslashes (which are really just ordinary backslashes, escaped as JSON requires).

In [3]: import json

In [4]: print(json.dumps(str(response)))
"b'Tan\\xc4\\xb1t\\xc4\\xb1m'"

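To make the "just ordinary backslashes" point concrete, here is a minimal sketch that counts the backslashes before and after JSON encoding and then reads the value back. The variable names are only for illustration.

```python
import json

response = b"Tan\xc4\xb1t\xc4\xb1m"

text = str(response)          # the repr of the bytes, not the decoded word
encoded = json.dumps(text)    # JSON escapes every backslash as two
decoded = json.loads(encoded) # reading it back restores the original

print(text.count("\\"))     # 4
print(encoded.count("\\"))  # 8
print(decoded == text)      # True
```

So the doubling is only how JSON spells a backslash on disk. The real issue is that the string contained backslashes in the first place.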
The correct way to convert a bytes object back to str is to call its decode method with the appropriate encoding:

In [5]: response.decode('UTF-8')
Out[5]: 'Tanıtım'
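If you are not sure what the "appropriate encoding" is, one option is to look at the charset the server declares in its Content-Type header. The sketch below assumes a hypothetical helper, decode_body, and builds the headers by hand so it runs offline. With urlopen, the response's headers attribute is a Message object of the same kind.

```python
from email.message import Message


def decode_body(body, headers, default="utf-8"):
    """Decode bytes with the declared charset, or fall back to a default."""
    charset = headers.get_content_charset() or default
    return body.decode(charset, "replace")


# Hand-built headers standing in for urlopen(req).headers
headers = Message()
headers["Content-Type"] = "text/html; charset=ISO-8859-9"

print(decode_body("Tanıtım".encode("iso-8859-9"), headers))  # Tanıtım
print(decode_body(b"Tan\xc4\xb1t\xc4\xb1m", Message()))      # Tanıtım
```

The second call has no Content-Type at all, so the helper falls back to UTF-8.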

Note that, unfortunately, the response from this site is not valid UTF-8. The site operator appears to be serving corrupted data.
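Here is a small sketch of what invalid bytes do to decode. The sample data is made up (the stray \xff byte is my assumption, not something taken from the site). Strict decoding raises an error, while the 'replace' error handler substitutes U+FFFD for the bad byte and carries on.

```python
# Hypothetical corrupted input: \xff can never appear in valid UTF-8
data = b"Tan\xc4\xb1t\xff\xc4\xb1m"

try:
    data.decode("utf-8")          # strict mode is the default
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

lenient = data.decode("utf-8", "replace")

print(strict_ok)  # False
print(lenient)    # Tanıt�ım
```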

Quick fix

Replace every call to str(response) with response.decode('UTF-8', 'replace'), and update the regular expressions to match the decoded text.

a = re.findall(
    # "r" prefix to string is unnecessary
    '<b><a[^>]* href="([^"]*)"',
    response.decode('UTF-8', 'replace'))
sub_req = Request('https://www.manga-tr.com/'+a[3], 
                  headers=headers)
sub_response = urlopen(sub_req).read()
manga = {}
manga['manga'] = []
manga_subject = re.findall(
    # "r" prefix to string is unnecessary
    '<h3>Tanıtım</h3>([^<]*)',
    sub_response.decode('UTF-8', 'replace'))
manga['manga'].append({'msubject': manga_subject })
# io.open is the same as open
with open('allmanga.json', 'w', encoding='utf-8-sig') as fp:
    # json.dumps is unnecessary
    json.dump(manga, fp, indent=4)
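One detail the quick fix leaves implicit: by default, json.dump and json.dumps write non-ASCII characters as \uXXXX escapes. That is valid JSON and reads back correctly, but if you want the Turkish letters to appear literally in the file, you can pass ensure_ascii=False. A minimal offline sketch, using a made-up sub_response in place of the real page:

```python
import json
import re

# Hypothetical stand-in for urlopen(sub_req).read()
sub_response = "<h3>Tanıtım</h3>\n\tMinami Ria 16 yaşındadır.\t<br>".encode("utf-8")

html = sub_response.decode("utf-8", "replace")
manga_subject = re.findall("<h3>Tanıtım</h3>([^<]*)", html)
manga = {"manga": [{"msubject": [s.strip() for s in manga_subject]}]}

escaped = json.dumps(manga)                        # non-ASCII as \uXXXX
readable = json.dumps(manga, ensure_ascii=False)   # literal characters

print(escaped)
print(readable)
```

Both strings load back to the same data; the only difference is how the file looks in a text editor.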

A better fix

Use Requests

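The Requests library is much easier to use than urlopen. You will have to install it (with pip, apt, dnf, or whatever you use), because it does not ship with Python. It looks like this: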

import requests

response = requests.get(
    'https://www.manga-tr.com/manga-list.html')

After that, response.text contains the decoded string, so you don't need to decode it yourself. Much easier!

Use BeautifulSoup

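The Beautiful Soup library can search HTML documents, and it is more reliable and easier to use than regular expressions. It also needs to be installed. For example, you could use it to find the summary on a manga page: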

from bs4 import BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')
# newer Beautiful Soup versions call the "text" argument "string"
subject = soup.find('h3', string='Tanıtım').next_sibling.string

Summary

Here is a Gist with a more complete example of what the scraper might look like.

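Keep in mind that scraping websites can be a bit tricky. You might scrape 100 pages and then suddenly discover that your scraper has a bug, or that you are hitting the website too hard, or that something crashed or failed and you need to start over. So good scraping usually involves rate limiting, saving progress and caching responses, and (ideally) parsing robots.txt.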
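As a rough illustration of three of those ideas (rate limiting, caching, and robots.txt), here is a standard-library sketch. PoliteFetcher, fake_fetch, and the example.com URLs are all hypothetical names of mine, not part of the Gist. The robots.txt rules are passed in as lines so the example runs offline, and saving progress to disk is left out.

```python
import time
import urllib.robotparser


class PoliteFetcher:
    """Rate-limited, caching fetcher that respects robots.txt rules."""

    def __init__(self, fetch, robots_lines, user_agent="example-bot", delay=1.0):
        self.fetch = fetch          # callable: url -> bytes
        self.user_agent = user_agent
        self.delay = delay          # minimum seconds between real requests
        self.cache = {}             # url -> bytes already downloaded
        self.last_request = None
        self.robots = urllib.robotparser.RobotFileParser()
        self.robots.parse(robots_lines)

    def get(self, url):
        if url in self.cache:       # cached pages cost no request
            return self.cache[url]
        if not self.robots.can_fetch(self.user_agent, url):
            raise PermissionError(f"robots.txt disallows {url}")
        if self.last_request is not None:
            wait = self.delay - (time.monotonic() - self.last_request)
            if wait > 0:
                time.sleep(wait)
        body = self.fetch(url)
        self.last_request = time.monotonic()
        self.cache[url] = body
        return body


calls = []

def fake_fetch(url):                # stand-in for a real HTTP request
    calls.append(url)
    return b"<html></html>"

rules = ["User-agent: *", "Disallow: /private/"]
fetcher = PoliteFetcher(fake_fetch, rules, delay=0.0)
fetcher.get("https://example.com/manga-list.html")
fetcher.get("https://example.com/manga-list.html")   # served from the cache
print(len(calls))  # 1
```

In a real scraper you would pass a function that performs the HTTP request, use a non-zero delay, and probably persist the cache to disk so a crash does not force you to start over.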

But Requests + BeautifulSoup will at least get you started. Again, see the Gist.

Double backslash problem when writing a JSON file in Python 3
