Question

How can I prevent the slugify filter from stripping out non-ASCII alphanumeric characters? (I'm using Django 1.0.2.)

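cnprog.com has Chinese characters in its question URLs, so I looked at their code. They are not using slugify in templates; instead they call this method in the Question model to get permalinks: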

def get_absolute_url(self):
    return '%s%s' % (reverse('question', args=[self.id]), self.title)

Are they slugifying the URLs or not?

Answer 1

There is a Python package called unidecode that I've adopted for the askbot Q&A forum. It works well for Latin-based alphabets and even looks reasonable for Greek:

>>> from unidecode import unidecode
>>> unidecode(u'διακριτικός')
'diakritikos'

It does something odd with Asian languages:

>>> unidecode(u'影師嗎')
'Ying Shi Ma '
>>>

Does this make sense?

In askbot, we compute slugs like so:

from unidecode import unidecode
from django.template import defaultfilters
slug = defaultfilters.slugify(unidecode(input_text))

Answer 2

The Mozilla website team has been working on an implementation: https://github.com/mozilla/unicode-slugify Sample code: http://davedash.com/2011/03/24/how-we-slug-at-mozilla/

Answer 3

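Also, the Django version of slugify doesn't use the re.UNICODE flag, so it wouldn't even attempt to understand the meaning of \w and \s as they pertain to non-ASCII characters.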
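
To illustrate the effect of the flag, here is a minimal sketch. It assumes Python 3, where the default is reversed: str patterns are Unicode-aware unless re.ASCII is given, so re.ASCII is used here to imitate the behaviour described above. The sample title is made up for illustration.

```python
import re

title = '影師嗎 photo-tips?'  # hypothetical sample title

# ASCII-only matching: CJK characters do not count as \w and are stripped.
ascii_only = re.sub(r'[^\w\s-]', '', title, flags=re.ASCII)

# Unicode-aware matching (the Python 3 default for str patterns): they are kept.
unicode_aware = re.sub(r'[^\w\s-]', '', title)

print(repr(ascii_only))     # ' photo-tips'
print(repr(unicode_aware))  # '影師嗎 photo-tips'
```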

This custom version is working well for me:

import re

def u_slugify(txt):
    """A custom version of slugify that retains non-ascii characters. The purpose of this
    function in the application is to make URLs more readable in a browser, so there are
    some added heuristics to retain as much of the title meaning as possible while
    excluding characters that are troublesome to read in URLs. For example, question marks
    will be seen in the browser URL as %3F and are therefore unreadable. Although non-ascii
    characters will also be hex-encoded in the raw URL, most browsers will display them
    as human-readable glyphs in the address bar -- those should be kept in the slug."""
    # Flags are passed by keyword: the fourth positional argument of re.sub is count, not flags.
    txt = txt.strip()  # remove leading and trailing whitespace
    txt = re.sub(r'\s*-\s*', '-', txt, flags=re.UNICODE)  # remove spaces before and after dashes
    txt = re.sub(r'[\s/]', '_', txt, flags=re.UNICODE)  # replace remaining spaces and slashes with underscores
    txt = re.sub(r'(\d):(\d)', r'\1-\2', txt, flags=re.UNICODE)  # replace colons between numbers with dashes
    txt = re.sub('"', "'", txt, flags=re.UNICODE)  # replace double quotes with single quotes
    txt = re.sub(r'[?,:!@#~`+=$%^&\\*()\[\]{}<>]', '', txt, flags=re.UNICODE)  # remove some characters altogether
    return txt

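Note the last regex substitution. It is a workaround for a problem with the more robust expression r'\W', which seems to either strip out some non-ASCII characters or incorrectly re-encode them, as illustrated in the following Python interpreter session: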

Python 2.5.1 (r251:54863, Jun 17 2009, 20:37:34) 
[GCC 4.0.1 (Apple Inc. build 5465)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> # Paste in a non-ascii string (simplified Chinese), taken from http://globallives.org/wiki/152/
>>> str = '您認識對全球社區感興趣的中國攝影師嗎'
>>> str
'\xe6\x82\xa8\xe8\xaa\x8d\xe8\xad\x98\xe5\xb0\x8d\xe5\x85\xa8\xe7\x90\x83\xe7\xa4\xbe\xe5\x8d\x80\xe6\x84\x9f\xe8\x88\x88\xe8\xb6\xa3\xe7\x9a\x84\xe4\xb8\xad\xe5\x9c\x8b\xe6\x94\x9d\xe5\xbd\xb1\xe5\xb8\xab\xe5\x97\x8e'
>>> print str
您認識對全球社區感興趣的中國攝影師嗎
>>> # Substitute all non-word characters with X
>>> re_str = re.sub('\W', 'X', str, re.UNICODE)
>>> re_str
'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX\xa3\xe7\x9a\x84\xe4\xb8\xad\xe5\x9c\x8b\xe6\x94\x9d\xe5\xbd\xb1\xe5\xb8\xab\xe5\x97\x8e'
>>> print re_str
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX?的中國攝影師嗎
>>> # Notice above that it retained the last 7 glyphs, ostensibly because they are word characters
>>> # And where did that question mark come from?
>>> 
>>> 
>>> # Now do the same with only the last three glyphs of the string
>>> str = '影師嗎'
>>> print str
影師嗎
>>> str
'\xe5\xbd\xb1\xe5\xb8\xab\xe5\x97\x8e'
>>> re.sub('\W','X',str,re.U)
'XXXXXXXXX'
>>> re.sub('\W','X',str)
'XXXXXXXXX'
>>> # Huh, now it seems to think those same characters are NOT word characters

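I am unsure what the problem is above, but I'm guessing that it stems from "whatever is classified as alphanumeric in the Unicode character properties database," and how that is implemented. I have heard that Python 3.x puts a high priority on better Unicode handling, so this may be fixed already. Or maybe this is correct Python behaviour, and I am misusing Unicode and/or the Chinese language.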

For now, a workaround is to avoid character classes and make substitutions based on explicitly defined character sets.
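
A possible explanation for the session above, offered as a sketch rather than a confirmed diagnosis: re.sub is declared as re.sub(pattern, repl, string, count=0, flags=0), so a flag passed in fourth position is read as count, and re.UNICODE has the integer value 32, which matches the 32 replaced bytes. The strings in the session are also byte strings, so \W is tested against individual UTF-8 bytes rather than characters. The snippet below assumes Python 3 (unlike the Python 2.5 session) and spells out count=32 to reproduce the effect.

```python
import re

text = '您認識對全球社區感興趣的中國攝影師嗎'
encoded = text.encode('utf-8')  # 18 characters become 54 bytes

# re.UNICODE is the integer 32, so passing it as the fourth positional
# argument of re.sub means "replace at most 32 matches".
print(int(re.UNICODE))  # 32

# On bytes, \W is tested against each byte; count=32 stops after 32 of them,
# leaving the remaining bytes (starting mid-character) untouched.
mangled = re.sub(rb'\W', b'X', encoded, count=32)
print(mangled[:32])  # b'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'

# On text, with the flag passed by keyword, every character is a word
# character and nothing is replaced.
kept = re.sub(r'\W', 'X', text, flags=re.UNICODE)
print(kept == text)  # True
```

If this reading is right, the leftover byte that starts mid-character would also account for the stray question mark in the printed output.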

Answer 4

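I'm afraid Django's definition of slug means ASCII, though the Django docs don't explicitly state this. This is the source of the slugify default filter; you can see that the value is being converted to ASCII, with the 'ignore' option in case of errors: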

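import re
import unicodedata
from django.utils.safestring import mark_safe
def slugify(value):
    # Python 2 code: encode('ascii', 'ignore') silently drops every non-ASCII character
    value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore')
    value = unicode(re.sub(r'[^\w\s-]', '', value).strip().lower())
    return mark_safe(re.sub(r'[-\s]+', '-', value))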

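Based on that, I'd guess that cnprog.com is not using an official slugify function. You may wish to adapt the Django snippet above if you want different behaviour.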

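Having said that, the RFC for URLs does state that non-US-ASCII characters (or, more specifically, anything other than the alphanumerics and $-_.+!*'()) should be encoded using the %hex notation. If you look at the actual raw GET request that your browser sends (say, using Firebug), you'll see that the Chinese characters are in fact encoded before being sent; the browser just makes them look pretty in the display. I suspect this is why slugify insists on ASCII only, fwiw.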
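
The percent-encoding described here can be reproduced with the standard library. This is a small sketch assuming Python 3 and UTF-8 encoding; the path is a hypothetical example, not a real cnprog.com URL.

```python
from urllib.parse import quote, unquote

path = '/question/152/影師嗎'  # hypothetical permalink path

# quote() encodes the text as UTF-8 and escapes each non-ASCII byte as %XX;
# '/' is left unescaped by default.
encoded = quote(path)
print(encoded)  # /question/152/%E5%BD%B1%E5%B8%AB%E5%97%8E

# unquote() reverses the process, which is roughly what the address bar displays.
print(unquote(encoded) == path)  # True
```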

Answer 5

With Django >= 1.9, django.utils.text.slugify has an allow_unicode parameter:

>>> from django.utils.text import slugify
>>> slugify("你好 World", allow_unicode=True)
'你好-world'

If you use Django <= 1.8 (which you should not since April 2018), you can pick up the code from Django 1.9.
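
For a rough idea of what such a backport could look like, here is a standalone sketch. It is not Django's actual code: the name unicode_slugify is made up, it assumes Python 3, and it leaves out Django-specific details such as marking the result as safe for templates.

```python
import re
import unicodedata


def unicode_slugify(value, allow_unicode=False):
    """Hypothetical stand-in for a slugify with an allow_unicode switch."""
    value = str(value)
    if allow_unicode:
        # Keep non-ASCII letters, only normalizing compatibility forms.
        value = unicodedata.normalize('NFKC', value)
    else:
        # Decompose accents, then drop whatever has no ASCII representation.
        value = unicodedata.normalize('NFKD', value)
        value = value.encode('ascii', 'ignore').decode('ascii')
    # Remove everything except word characters, whitespace and hyphens.
    value = re.sub(r'[^\w\s-]', '', value).strip().lower()
    # Collapse runs of whitespace and hyphens into a single hyphen.
    return re.sub(r'[-\s]+', '-', value)


print(unicode_slugify('你好 World', allow_unicode=True))  # 你好-world
print(unicode_slugify('你好 World'))                      # world
```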

Answer 6

You might want to look at: https://github.com/un33k/django-uuslug

It will take care of both "U"s for you: the U in unique and the U in Unicode.

It will do the job for you hassle-free.

Answer 7

I was interested in allowing only ASCII characters in the slug, which is why I benchmarked some of the available tools on the same string:

Unicode Slugify:

In [5]: %timeit slugify('Παίζω τρέχω %^&*@# και γ%^(λώ la fd/o', only_ascii=True)
37.8 µs ± 86.7 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

'paizo-trekho-kai-glo-la-fdo'

Django Uuslug:

In [3]: %timeit slugify('Παίζω τρέχω %^&*@# και γ%^(λώ la fd/o')
35.3 µs ± 303 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

'paizo-trekho-kai-g-lo-la-fd-o'

Awesome Slugify:

In [3]: %timeit slugify('Παίζω τρέχω %^&*@# και γ%^(λώ la fd/o')
47.1 µs ± 1.94 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

'Paizo-trekho-kai-g-lo-la-fd-o'

Python Slugify:

In [3]: %timeit slugify('Παίζω τρέχω %^&*@# και γ%^(λώ la fd/o')
24.6 µs ± 122 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

'paizo-trekho-kai-g-lo-la-fd-o'

django.utils.text.slugify with Unidecode:

In [15]: %timeit slugify(unidecode('Παίζω τρέχω %^&*@# και γ%^(λώ la fd/o'))
36.5 µs ± 89.7 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

'paizo-trekho-kai-glo-la-fdo'
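
The timings above come from IPython's %timeit. Outside IPython, a comparable measurement can be sketched with the standard timeit module. The ascii_slugify function below is a hypothetical stdlib-only stand-in, not one of the benchmarked libraries; unlike unidecode it drops Greek letters instead of transliterating them, so its output differs from the slugs above. It reports the best run rather than %timeit's mean and standard deviation.

```python
import re
import timeit
import unicodedata


def ascii_slugify(value):
    """Hypothetical stdlib-only slugifier: drops anything without an ASCII form."""
    value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore').decode('ascii')
    value = re.sub(r'[^\w\s-]', '', value).strip().lower()
    return re.sub(r'[-\s]+', '-', value)


sample = 'Παίζω τρέχω %^&*@# και γ%^(λώ la fd/o'

# 7 runs of 10,000 calls each, mirroring the %timeit settings shown above.
runs = timeit.repeat(lambda: ascii_slugify(sample), number=10000, repeat=7)
best_us = min(runs) / 10000 * 1e6
print('%.1f µs per call (best of 7 runs)' % best_us)

print(ascii_slugify(sample))  # la-fdo
```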

How to make Django slugify work properly with Unicode strings?
