Question

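First of all, I'm new to Python, so if you're thinking of downvoting, please leave a comment.

I have a URL, for example:

http://example.com/here/there/index.html

Now I want to save the file and its contents in a directory. I want the file's name to be:

http://example.com/here/there/index.html

But I get an error, and I guess it is caused by the / characters in the URL.

This is what I'm doing right now: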

        with open('~/' + response.url, 'w') as f:
            f.write(response.body)

Any ideas on how I should do this?

Answer 1

You can use base64 encoding, which is reversible.

>>> import base64
>>> # Python 3 needs bytes; the URL-safe variant never emits '/' (plain b64encode can)
>>> base64.urlsafe_b64encode(b'http://example.com/here/there/index.html')
b'aHR0cDovL2V4YW1wbGUuY29tL2hlcmUvdGhlcmUvaW5kZXguaHRtbA=='
>>> base64.urlsafe_b64decode(b'aHR0cDovL2V4YW1wbGUuY29tL2hlcmUvdGhlcmUvaW5kZXguaHRtbA==')
b'http://example.com/here/there/index.html'

Or binascii:

>>> import binascii
>>> binascii.hexlify(b'http://example.com/here/there/index.html')
b'687474703a2f2f6578616d706c652e636f6d2f686572652f74686572652f696e6465782e68746d6c'
>>> binascii.unhexlify('687474703a2f2f6578616d706c652e636f6d2f686572652f74686572652f696e6465782e68746d6c')
b'http://example.com/here/there/index.html'
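To make the round trip concrete, here is a minimal sketch that wraps the base64 idea in two helpers. The names url_to_filename and filename_to_url are made up for illustration. It assumes the URL-safe alphabet, because standard base64 output can itself contain "/", which would bring back the original problem.

```python
import base64

def url_to_filename(url: str) -> str:
    """Encode a URL into a string that contains no '/' (hypothetical helper)."""
    return base64.urlsafe_b64encode(url.encode("utf-8")).decode("ascii")

def filename_to_url(name: str) -> str:
    """Reverse url_to_filename (hypothetical helper)."""
    return base64.urlsafe_b64decode(name.encode("ascii")).decode("utf-8")

url = "http://example.com/here/there/index.html"
name = url_to_filename(url)

# The encoded name is safe to use as a single path component...
assert "/" not in name
# ...and the original URL can be recovered from it.
assert filename_to_url(name) == url

# Standard base64 is not safe in general: these bytes encode to slashes.
assert base64.b64encode(b"\xff\xff\xff") == b"////"
assert base64.urlsafe_b64encode(b"\xff\xff\xff") == b"____"
```

Note that the encoded name is about a third longer than the URL, so long URLs can still exceed file name length limits.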

Answer 2

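You have several problems. One of them is that Unix shell abbreviations (~) are not automatically interpreted by Python the way they are in a Unix shell.

The second problem is that you won't have much luck writing file paths with embedded slashes on Unix. You will need to convert them to something else if you want to be able to retrieve them later. You could do that with something as simple as response.url.replace('/','_'), but that still leaves many other characters that could be problematic. You probably want to "sanitize" all of them in one go. For example: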

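import os
from urllib.parse import quote  # Python 3; in Python 2 this was urllib.quote

def write_response(response, filedir='~'):
    # expand "~" to the user's home directory
    filedir = os.path.expanduser(filedir)
    # percent-encode every reserved character, including "/"
    filename = quote(response.url, '')
    filepath = os.path.join(filedir, filename)
    with open(filepath, "w") as f:
        f.write(response.body)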

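This uses the os.path functions to clean up the file path, and urllib.parse.quote (urllib.quote in Python 2) to turn the URL into something usable as a file name. There is a corresponding unquote to reverse the process.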
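As a quick illustration of the quote/unquote round trip, here is a small Python 3 sketch using the example URL from the question. Passing an empty safe argument is what causes "/" to be escaped as well.

```python
from urllib.parse import quote, unquote

url = "http://example.com/here/there/index.html"

# safe="" means no reserved character is left unescaped, not even "/"
filename = quote(url, safe="")
print(filename)  # http%3A%2F%2Fexample.com%2Fhere%2Fthere%2Findex.html

# unquote recovers the original URL from the file name
assert unquote(filename) == url
```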

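Finally, when you write the file you may need to tweak things a bit, depending on what the responses contain and how you want them written. If you want them written in binary form, you need "wb" rather than just "w" as the file mode. Or, if they are text, they may first need some kind of encoding (for example, utf-8). It depends on what your responses are and how they are encoded.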
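One way to handle both cases is sketched below. The helper name write_body and the utf-8 default are assumptions for illustration, not something the question specifies. It writes bytes in binary mode and str in text mode with an explicit encoding.

```python
import os
import tempfile

def write_body(path, body, encoding="utf-8"):
    """Write bytes with "wb", or str with "w" and an explicit encoding (hypothetical helper)."""
    if isinstance(body, bytes):
        with open(path, "wb") as f:
            f.write(body)
    else:
        with open(path, "w", encoding=encoding) as f:
            f.write(body)

# Demo in a temporary directory: the same page as raw bytes and as text
demo_dir = tempfile.mkdtemp()
bin_path = os.path.join(demo_dir, "page.bin")
txt_path = os.path.join(demo_dir, "page.txt")

write_body(bin_path, b"<html>caf\xc3\xa9</html>")  # already utf-8 encoded bytes
write_body(txt_path, "<html>caf\u00e9</html>")     # text, encoded on write
```

Both files end up with identical bytes on disk; the difference is only who performs the encoding.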

Answer 3

Use urllib.request.urlretrieve (urllib.urlretrieve in Python 2):

    import urllib.request

    # download the URL and save it under the given local file name
    urllib.request.urlretrieve("http://example.com/here/there/index.html", "/tmp/index.txt")

Answer 4

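This is a bad idea: file names are limited to 255 bytes, and URLs are often very long and get even longer once they are b64-encoded!

You could compress and b64-encode, but that won't get you very far: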

from base64 import b64encode
import zlib
import bz2
from urllib.parse import quote

def url_strategies(url):
    url = url.encode('utf8')
    print(url.decode())
    print(f'normal  : {len(url)}')
    print(f'quoted  : {len(quote(url, ""))}')
    b64url = b64encode(url)
    print(f'b64     : {len(b64url)}')
    url = b64encode(zlib.compress(b64url))
    print(f'b64+zlib: {len(url)}')
    url = b64encode(bz2.compress(b64url))
    print(f'b64+bz2 : {len(url)}')

Here is an average URL I found on angel.co:


URL = 'https://angel.co/job_listings/browse_startups_table?startup_ids%5B%5D=972887&startup_ids%5B%5D=365478&startup_ids%5B%5D=185570&startup_ids%5B%5D=32624&startup_ids%5B%5D=134966&startup_ids%5B%5D=722477&startup_ids%5B%5D=914250&startup_ids%5B%5D=901853&startup_ids%5B%5D=637842&startup_ids%5B%5D=305240&tab=find&page=1'

Even with b64 + zlib, it does not fit within the 255-byte limit:

normal  : 316
quoted  : 414
b64     : 424
b64+zlib: 304
b64+bz2 : 396

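Even with the best strategy, zlib compression plus b64encode, you still run into trouble.

The right solution

What you should do instead is hash the URL and attach the URL to the file as a file attribute: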

import os
from hashlib import sha256

def save_file(url, content, char_limit=13):
    # hash the url with sha256 and keep the first 13 hex characters as the file name
    digest = sha256(url.encode()).hexdigest()[:char_limit]
    filename = f'{digest}.html'
    # e.g. 93fb17b5fb81b.html
    with open(filename, 'w') as f:
        f.write(content)
    # store the url as an extended attribute (os.setxattr is only available on Linux)
    os.setxattr(filename, 'user.url', url.encode())
    return filename

The url attribute can then be retrieved:

>>> os.getxattr(filename, 'user.url').decode()
'https://angel.co/job_listings/browse_startups_table?startup_ids%5B%5D=972887&startup_ids%5B%5D=365478&startup_ids%5B%5D=185570&startup_ids%5B%5D=32624&startup_ids%5B%5D=134966&startup_ids%5B%5D=722477&startup_ids%5B%5D=914250&startup_ids%5B%5D=901853&startup_ids%5B%5D=637842&startup_ids%5B%5D=305240&tab=find&page=1'

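Note: setxattr and getxattr need the user. prefix on the attribute name in Python (on Linux, user-defined attributes live in the user. namespace).
For more on file attributes in Python, see the related question here: https://stackoverflow.com/a/56399698/3737009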

Answer 5

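Take a look at the restricted characters.

I would use a typical folder structure for this task. If you use this with a lot of URLs, it will get messy one way or another, and you will also run into file system performance problems or limits.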
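The answer does not spell out what such a folder structure looks like, so the following is only one possible sketch. The helper name url_to_path, the root directory "pages", and the "index.html" fallback are all assumptions. It maps the host and each path segment of the URL to its own directory level, escaping anything unsafe inside a segment.

```python
import os
from urllib.parse import quote, urlsplit

def url_to_path(url, root="pages"):
    """Map a URL to a nested relative path under root (hypothetical helper)."""
    parts = urlsplit(url)
    # One directory level per path segment; drop empty, "." and ".." segments
    # so the result can never climb out of root.
    segments = [
        quote(s, safe="")
        for s in parts.path.split("/")
        if s and s not in (".", "..")
    ]
    # Assumption: URLs with no path or a trailing "/" are stored as index.html
    if not segments or parts.path.endswith("/"):
        segments.append("index.html")
    # Keep the query string in the file name, percent-encoded
    if parts.query:
        segments[-1] += quote("?" + parts.query, safe="")
    return os.path.join(root, quote(parts.netloc, safe=""), *segments)

path = url_to_path("http://example.com/here/there/index.html")
# Create the directories before writing the file
os.makedirs(os.path.dirname(path), exist_ok=True)
```

Each individual segment is still subject to the file system's name length limit, so a very long query string would need a different treatment, such as hashing.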

Saving a URL as a file name in Python
