Question

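I have several functions that process strings. Although they take different kinds of arguments, they all take a common parameter called tokenizer_func (defaulting to str.split), which splits the input string into a list of tokens using the supplied function. The returned list of strings is then modified in each function. Since tokenizer_func is a common parameter, and applying it is the first line of code in every function, I was wondering whether it would be easier to use a decorator on the string-modifying functions. Basically, the decorator would take tokenizer_func, apply it to the incoming string, and call the appropriate string-modifying function.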

Edit 2

I was able to find a solution (maybe a hacky one?):

def tokenize(f):
  def _split(text, tokenizer=SingleSpaceTokenizer()):      
    return tokenizer.decode(f(tokenizer.encode(text)))
  return _split

@tokenize
def change_first_letter(token_list, *_):
  return [random.choice(string.ascii_letters) + token[1:] for token in token_list]

This way I can call change_first_letter(text) to use the default tokenizer, and change_first_letter(text, new_tokenizer) to use new_tokenizer. If there is a better way, please let me know.

Edit 1:

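After looking at the first answer to this question, I think I can generalize the question to better handle more involved tokenizers. Specifically, I now have this: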

from abc import ABC, abstractmethod
from typing import Any, List

class Tokenizer(ABC):
  """ 
  Base class for Tokenizer which provides the encode and decode methods
  """
  def __init__(self, tokenizer: Any) -> None:
    self.tokenizer = tokenizer

  @abstractmethod
  def encode(self, text: str) -> List[str]:
    """
    Tokenize a string into list of strings

    :param text: Text to be tokenized
    :return: List of tokens
    """

  @abstractmethod
  def decode(self, token_list : List[str]) -> str:
    """
    Creates a string from a tokens list using the tokenizer

    :param token_list: List of tokens
    :return: Reconstructed string from token list
    """

  def encode_many(self, texts: List[str]) -> List[List[str]]:
    """
    Encode multiple strings

    :param texts: List of strings to be tokenized
    :return: List of tokenized strings
    """    
    return [self.encode(text) for text in texts]

  def decode_many(self, token_lists: List[List[str]]) -> List[str]:
    """
    Decode multiple strings

    :param token_lists: List of tokenized strings
    :return: List of reconstructed strings
    """        
    return [self.decode(token_list) for token_list in token_lists]

class SingleSpaceTokenizer(Tokenizer):
  """ 
  Simple tokenizer that splits a string on whitespace using str.split
  """
  def __init__(self, tokenizer=None) -> None:
    super(SingleSpaceTokenizer, self).__init__(tokenizer)

  def encode(self, text: str) -> List[str]:
    return text.split()    

  def decode(self, token_list: List[str]) -> str:
    return ' '.join(token_list)

Based on the answers and some searching, I wrote a decorator function:

def tokenize(tokenizer):
  def _tokenize(f):
    def _split(text):      
      response = tokenizer.decode(f(tokenizer.encode(text)))
      return response
    return _split
  return _tokenize

Now I can do this:

@tokenize(SingleSpaceTokenizer())
def change_first_letter(token_list):
  return [random.choice(string.ascii_letters) + token[1:] for token in token_list]

This works fine. But what if I, as a user, want to use another tokenizer:

class AtTokenizer(Tokenizer):
  def __init__(self, tokenizer=None):
    super(AtTokenizer, self).__init__(tokenizer)
  
  def encode(self, text):
    return text.split('@')

  def decode(self, token_list):
    return '@'.join(token_list)

new_tokenizer = AtTokenizer()

How do I call my text function with this new_tokenizer?

I found that I can use new_tokenizer like this:

tokenize(new_tokenizer)(change_first_letter)(text)

provided that I do not decorate the change_first_letter function. This seems tedious. Is there a more concise way to do this?

Original:

Here are two examples of such functions (the first one is a dummy function):

def change_first_letter(text: str, tokenizer_func: Callable[[str], List[str]]=str.split) -> str:
 words = tokenizer_func(text)
 return ' '.join([random.choice(string.ascii_letters) + word[1:] for word in words])

def spellcheck(text: str, tokenizer_func: Callable[[str], List[str]]=str.split) -> str:
 words = tokenizer_func(text)
 return ' '.join([SpellChecker().correction(word) for word in words])

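As you can see, the first line of both functions applies the tokenizer function. If the tokenizer function were always str.split, I could create a decorator that does this for me: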

def tokenize(func):
 def _split(text):
  return func(text.split())
 return _split

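Then I could decorate the other functions with @tokenize and it would work. In that case the functions would take a List[str] directly. However, tokenizer_func is supplied by the function's caller. How do I pass it to the decorator? Can it be done?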

Answer 1

def tokenize(tokenizer):
  def _tokenize(f):
    def _split(text, tokenizer=tokenizer):      
      response = tokenizer.decode(f(tokenizer.encode(text)))
      return response
    return _split
  return _tokenize

This way you can call change_first_letter in two ways:

change_first_letter(text) uses the default tokenizer
change_first_letter(text, new_tokenizer) uses new_tokenizer

MyPy does not like it when a decorator changes the arguments a function accepts, so if you are using MyPy you may want to write a plugin for it.
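
For reference, here is a minimal, self-contained sketch of the approach in this answer. The two tokenizer classes are simplified stand-ins for the ones in the question, upper_first_letter is a hypothetical deterministic replacement for change_first_letter, and functools.wraps is an addition that is not part of the answer above (it only preserves the decorated function's name and docstring).

```python
import functools


class SpaceTokenizer:
    """Simplified stand-in for the question's SingleSpaceTokenizer."""

    def encode(self, text):
        return text.split()

    def decode(self, token_list):
        return " ".join(token_list)


class AtTokenizer:
    """Simplified stand-in for the question's AtTokenizer."""

    def encode(self, text):
        return text.split("@")

    def decode(self, token_list):
        return "@".join(token_list)


def tokenize(default_tokenizer):
    def _tokenize(f):
        @functools.wraps(f)  # keep the wrapped function's name and docstring
        def _split(text, tokenizer=default_tokenizer):
            # The caller may override the tokenizer on each call.
            return tokenizer.decode(f(tokenizer.encode(text)))
        return _split
    return _tokenize


@tokenize(SpaceTokenizer())
def upper_first_letter(token_list):
    """Upper-case the first character of every token."""
    return [token[:1].upper() + token[1:] for token in token_list]


default_result = upper_first_letter("hello big world")
at_result = upper_first_letter("alice@example.com", AtTokenizer())
print(default_result)  # Hello Big World
print(at_result)       # Alice@Example.com
```

The decorated function itself only ever sees a token list; the optional second argument exists on the wrapper alone, which is the signature change that type checkers may complain about.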

Answer 2

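The decorator @ syntax simply evaluates the rest of the line as a function, calls that function on the function defined immediately afterwards, and replaces that function with the result. So you can build a "decorator with arguments" (tokenize()) by having it return a regular decorator, which in turn wraps the original function.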

def tokenize(method):
    def decorator(function):
        def wrapper(text):
            return function(method(text))
        return wrapper
    return decorator

@tokenize(method=str.split)
def strfunc(text):
    print(text)

strfunc('The quick brown fox jumped over the lazy dog')
# ['The', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']
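
To make the "rest of the line is just a function call" point concrete, the sketch below applies the same tokenize() once with @ syntax and once by hand. The names count_tokens, count_decorated and count_manual are hypothetical and only serve the illustration.

```python
def tokenize(method):
    def decorator(function):
        def wrapper(text):
            return function(method(text))
        return wrapper
    return decorator


def count_tokens(tokens):
    return len(tokens)


# Decorator syntax ...
@tokenize(method=str.split)
def count_decorated(tokens):
    return len(tokens)


# ... does the same thing as calling the pieces by hand.
count_manual = tokenize(method=str.split)(count_tokens)

sentence = "The quick brown fox"
print(count_decorated(sentence))  # 4
print(count_manual(sentence))     # 4
```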

The catch is that if you want to give it a default argument (e.g. def tokenize(method=str.split):), you still need to call it like a function when applying the decorator:

@tokenize()
def strfunc(text):
    ...

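so it may be best not to give a default argument, or to find a creative way around the problem. One possible solution is to change the decorator's behavior depending on whether it is called with a function (in which case it decorates that function) or with a string (in which case it calls str.split()):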

def tokenize(method):
    def decorator(arg):
        # if argument is a function, then apply another decorator
        # otherwise, assume str.split()
        if type(arg) == type(tokenize):
            def wrapper(text):
                return arg(method(text))
            return wrapper
        else:
            return method(str.split(arg))
    return decorator

This should allow both of the following:

@tokenize             # default to str.split
def strfunc(text):
    ...

@tokenize(str.split)  # or another function of your choice
def strfunc(text):
    ...

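The downside is that this is a bit hacky. Playing with type() always is, and the thing to keep in mind here is that all functions are functions; you could check for "is callable" instead if you want it to work for classes as well. It also becomes hard to tell which parameter does what inside tokenize(), because the parameters change purpose depending on how the decorator is called.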
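
A commonly used alternative that avoids type() checks is sketched below. It is not from the answer above: it assumes the tokenizer is passed by keyword only, so that a bare @tokenize can be recognized by its single positional argument. The names count_words and count_parts are hypothetical.

```python
import functools


def tokenize(func=None, *, method=str.split):
    """Usable as @tokenize, @tokenize() or @tokenize(method=...)."""

    def decorator(function):
        @functools.wraps(function)
        def wrapper(text):
            return function(method(text))
        return wrapper

    if func is None:
        # Called with parentheses: return the decorator to be applied next.
        return decorator
    # Called bare: func is the function being decorated.
    return decorator(func)


@tokenize  # defaults to str.split
def count_words(tokens):
    return len(tokens)


@tokenize(method=lambda text: text.split("@"))
def count_parts(tokens):
    return len(tokens)


print(count_words("the lazy dog"))       # 3
print(count_parts("alice@example.com"))  # 2
```

The trade-off is that the tokenizer can no longer be passed positionally, which keeps each parameter's role fixed regardless of how the decorator is applied.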

How do I set up a decorator that accepts arguments supplied by the function's caller?
