Question

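I am building a forum application in Django and I want to make sure that users don't enter certain characters in their forum posts. I need an efficient way to scan the whole post for invalid characters. What I have so far is below; it does not work correctly, and I don't think the approach is very efficient anyway.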

def clean_topic_message(self):
    topic_message = self.cleaned_data['topic_message']
    words = topic_message.split()
    if (topic_message == ""):
        raise forms.ValidationError(_(u'Please provide a message for your topic'))
    for word in words:
        if (re.match(r'[^<>/\{}[]~`]$',topic_message)):
            raise forms.ValidationError(_(u'Topic message cannot contain the following: <>/\{}[]~`'))
    return topic_message

Thanks for your help.

Answer 1

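For a regex solution, there are two ways to go here:

1. Find one invalid char anywhere in the string.
2. Validate every char in the string.

Here is a script that implements both: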

import re
topic_message = 'This topic is a-ok'

# Option 1: Invalidate one char in string.
re1 = re.compile(r"[<>/{}[\]~`]")
if re1.search(topic_message):
    print ("RE1: Invalid char detected.")
else:
    print ("RE1: No invalid char detected.")

# Option 2: Validate all chars in string.
re2 = re.compile(r"^[^<>/{}[\]~`]*$")
if re2.match(topic_message):
    print ("RE2: All chars are valid.")
else:
    print ("RE2: Not all chars are valid.")

Take your pick.

Note: the original regex erroneously has an unescaped right square bracket inside the character class; it needs to be escaped.
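To make the note concrete, here is a minimal sketch of how Python's re module reads the pattern from the question. The name `question_re` and the sample strings are assumptions made up for illustration.

```python
import re

# The pattern from the question, unchanged.
question_re = re.compile(r'[^<>/\{}[]~`]$')

# The first ] ends the character class early, so the pattern really means:
# one character that is not < > / { } [, followed by the literal text ~`]
# at the end of the string.
print(bool(question_re.match('a~`]')))         # True
print(bool(question_re.match('<b>bold</b>')))  # False: the bad input slips through
print(bool(question_re.match('plain text')))   # False
```

Because the validator in the question raises only when the pattern matches, a post full of angle brackets passes unchallenged.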

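Benchmark: after seeing gnibbler's interesting solution using set(), I was curious which of these methods is actually fastest, so I decided to measure them. Here are the benchmark data, the statements measured, and the timeit results: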

Test data:

r"""
TEST topic_message STRINGS:
ok:  'This topic is A-ok.     This topic is     A-ok.'
bad: 'This topic is <not>-ok. This topic is {not}-ok.'

MEASURED PYTHON STATEMENTS:
Method 1: 're1.search(topic_message)'
Method 2: 're2.match(topic_message)'
Method 3: 'set(invalid_chars).intersection(topic_message)'
"""

Results:

r"""
Seconds to perform 1000000 Ok-match/Bad-no-match loops:
Method  Ok-time  Bad-time
1        1.054    1.190
2        1.830    1.636
3        4.364    4.577
"""

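The benchmark shows that Option 1 is faster than Option 2 (roughly 1.1 s versus 1.6 to 1.8 s per million loops), and both are much faster than the set().intersection() method (about 4.4 s). This holds for both the matching and the non-matching string.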
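For anyone who wants to repeat the measurement, the following is a rough sketch of how it could be set up with timeit. It is not the original benchmark script: the `invalid_chars` string, the `run_benchmark` helper and the loop count are assumptions, and absolute timings will differ by machine and Python version.

```python
import re
import timeit

invalid_chars = '<>/{}[]~`'  # assumed; the answer does not show this value
re1 = re.compile(r"[<>/{}[\]~`]")
re2 = re.compile(r"^[^<>/{}[\]~`]*$")

samples = {
    'ok':  'This topic is A-ok.     This topic is     A-ok.',
    'bad': 'This topic is <not>-ok. This topic is {not}-ok.',
}
statements = [
    're1.search(topic_message)',
    're2.match(topic_message)',
    'set(invalid_chars).intersection(topic_message)',
]

def run_benchmark(number=100000):
    """Return {(statement, label): seconds} for every statement/sample pair."""
    results = {}
    for stmt in statements:
        for label, topic_message in samples.items():
            namespace = {'re1': re1, 're2': re2,
                         'invalid_chars': invalid_chars,
                         'topic_message': topic_message}
            results[(stmt, label)] = timeit.timeit(
                stmt, globals=namespace, number=number)
    return results

if __name__ == '__main__':
    for (stmt, label), seconds in run_benchmark().items():
        print(f'{label:>3}  {seconds:.3f}s  {stmt}')
```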

Answer 2

If efficiency is a major concern, I would re.compile() the regex string, since you are going to use the same regex many times.
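A minimal sketch of that idea, outside of Django: the pattern is compiled once at module level and reused on every call. The names `INVALID_CHARS_RE` and `has_invalid_chars` are hypothetical, and the character set is the one used in Answer 1.

```python
import re

# Compiled once at import time; every call below reuses the same pattern object.
INVALID_CHARS_RE = re.compile(r"[<>/{}[\]~`]")

def has_invalid_chars(topic_message):
    """Return True if topic_message contains any forbidden character."""
    return INVALID_CHARS_RE.search(topic_message) is not None

print(has_invalid_chars('This topic is a-ok'))      # False
print(has_invalid_chars('This topic is <not> ok'))  # True
```

The re module also keeps an internal cache of recently compiled patterns, so the gain from compiling by hand is usually modest; the main benefit is that the pattern gets a name and is defined in one place.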

Answer 3

re.match and re.search behave differently. With a regex search there is no need to split the message into words.

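import re
# No leading ^ (that would negate the class), and the ] inside the class is escaped.
symbols_re = re.compile(r"[<>/\{}[\]~`]")

# cleaned_data is a dict, so index it with [] rather than calling it.
if symbols_re.search(self.cleaned_data['topic_message']):
    raise forms.ValidationError(_(u'Topic message contains invalid characters'))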
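The difference between re.match and re.search can be seen in a short standalone sketch. The sample message is made up, and the pattern is the escaped character class from Answer 1.

```python
import re

pattern = re.compile(r"[<>/{}[\]~`]")
message = 'fine at first, then <bad>'

# match only tries position 0, so the < later in the string goes unnoticed.
print(pattern.match(message))            # None
# search scans the whole string and stops at the first offending character.
print(pattern.search(message).group())   # <
print(pattern.search(message).start())   # 20
```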

Answer 4

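You have to be careful when using regular expressions - they are full of traps.

In the case of [^<>/\{}[]~] the first ] closes the character class, which is probably not what you want. If you want a literal ] in the class, it has to be the first character after the [ (or be escaped as \]), e.g. []^<>/\{}[~]

A simple test confirms this: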

>>> import re
>>> re.search("[[]]","]")
>>> re.search("[][]","]")
<_sre.SRE_Match object at 0xb7883db0>

Either way, regex is overkill for this problem:

def clean_topic_message(self):
    topic_message = self.cleaned_data['topic_message']
    invalid_chars = r'^<>/\{}[]~`$'
    if (topic_message == ""):
        raise forms.ValidationError(_(u'Please provide a message for your topic'))
    if set(invalid_chars).intersection(topic_message):
        raise forms.ValidationError(_(u'Topic message cannot contain the following: %s') % invalid_chars)
    return topic_message
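The set intersection also tells you which characters were found, which can make for a more helpful error message. A minimal sketch without Django follows; `find_invalid_chars` is a hypothetical helper, and its default character list is taken from the error message in the question rather than from this answer.

```python
def find_invalid_chars(topic_message, invalid_chars='<>/\\{}[]~`'):
    """Return the forbidden characters present in topic_message, sorted."""
    return sorted(set(invalid_chars).intersection(topic_message))

print(find_invalid_chars('This topic is a-ok'))        # []
print(find_invalid_chars('This topic is {not} <ok>'))  # ['<', '>', '{', '}']
```

An empty list is falsy, so the result can be used directly in an if statement, just like the set in the validator above.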

Answer 5

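I can't say what would be more efficient, but you should certainly get rid of the $ (unless it is an invalid character for the message)... right now the re only matches if the characters are at the end of topic_message, because $ anchors the match to the right-hand side of the line.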
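The effect of the trailing $ can be checked with a small sketch. The two pattern names and the sample strings are made up for illustration, and the character class is the escaped one from Answer 1.

```python
import re

anchored = re.compile(r"[<>/{}[\]~`]$")    # can only match at the end of the string
unanchored = re.compile(r"[<>/{}[\]~`]")   # can match anywhere

message = 'a <b> tag in the middle'
print(bool(anchored.search(message)))        # False
print(bool(unanchored.search(message)))      # True
print(bool(anchored.search('ends with >')))  # True
```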

Answer 6

is_valid = not any(k in text for k in '<>/{}[]~`')
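Wrapped in a function, the any() one-liner might look like the sketch below. `INVALID_CHARS` and the sample strings are assumptions; any() stops at the first forbidden character it finds in the text.

```python
INVALID_CHARS = '<>/{}[]~`'

def is_valid(text):
    """True when none of the forbidden characters occur in text."""
    return not any(k in text for k in INVALID_CHARS)

print(is_valid('This topic is a-ok'))    # True
print(is_valid('No [brackets] please'))  # False
```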

Answer 7

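I agree with gnibbler, regex is overkill for this situation. After removing the unwanted characters you will probably want to remove unwanted words as well; here is a basic way to do it: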

def remove_bad_words(title):
    '''Helper to remove bad words from a sentence based on a collection of words.
    '''
    # BAD_WORDS is a list of unwanted words, defined elsewhere.
    # Build a new list instead of removing items from the list being iterated.
    word_list = [word for word in title.split(' ') if word not in BAD_WORDS]
    # let's build the string again
    return u' '.join(word_list)
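One pitfall is worth knowing when filtering a word list like this: calling list.remove() inside a for loop over the same list skips the element that follows each removal. The sketch below shows the effect next to a list comprehension that avoids it; the contents of `BAD_WORDS` and the sample sentence are made up.

```python
BAD_WORDS = {'spam', 'junk'}  # hypothetical word list

words = 'spam junk good'.split(' ')
for word in words:
    if word in BAD_WORDS:
        words.remove(word)  # mutates the list while it is being iterated
print(words)     # ['junk', 'good']: 'junk' was skipped

# Building a new list visits every word exactly once.
filtered = [w for w in 'spam junk good'.split(' ') if w not in BAD_WORDS]
print(filtered)  # ['good']
```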

Answer 8

In any case, you have to scan the whole message. So wouldn't something simple like this do the job?

def checkMessage(topic_message):
    for char in topic_message:
        if char in r"<>/\{}[]~`":
            return False
    return True

Answer 9

Example: tailor it to your needs.

### valid chars: 0-9 , a-z, A-Z only
import re
REGEX_FOR_INVALID_CHARS=re.compile( r'[^0-9a-zA-Z]+' )
list_of_invalid_chars_found=REGEX_FOR_INVALID_CHARS.findall( topic_message )
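A short usage sketch of this whitelist approach, with a made-up `topic_message`. Note that under this pattern spaces and punctuation count as invalid too, so the character class would need extending for real forum posts.

```python
import re

REGEX_FOR_INVALID_CHARS = re.compile(r'[^0-9a-zA-Z]+')

topic_message = 'Hello, world! 123'
# findall returns each run of characters that fall outside the whitelist.
print(REGEX_FOR_INVALID_CHARS.findall(topic_message))  # [', ', '! ']
```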

Efficient way to search for invalid characters in Python
