Question

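I'm working on some NLP functions that take a piece of text as input and return the indexes of its words in a dictionary generated from word2vec embeddings.

Most of the inputs are repeated, so I thought of using functools.lru_cache to speed things up. This is the code I tried: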

import functools
import re

@functools.lru_cache(maxsize=None)
def clean_description(text, stopwords=[]):
    """
    :param text: str: Text string to be cleaned
    :param stopwords: list: List of stopwords to be removed
    :return tokens: list: List of clean and valid tokens
    """
    text = str(text)
    text = text.encode('ascii', 'ignore').decode('utf-8')
    text = re.sub(r'\u001A', '_', text)
    text = re.sub(r'_x005F_x001A_', '_', text)
    text = text.upper()
    text = re.sub(r'X001A', '_', text)
    text = re.sub(r'.*CONCEPTO', '', text)
    text = re.sub(r'<NUM>', '', text)
    text = re.sub(r'(^|\s)S\.?[AL]\.?($|\s)', ' ', text)
    text = re.sub(r"['`´\"]", '', text)
    text = re.sub(r'[\\\t\.\*\+\-,_/;:\(\)\{\}\[\]\^º\|#~\<\>@\\=\?\¿\!\¡]', ' ', text)
    text = re.sub(r'(?<!\w)(\d*)(?!\w)', ' ', text)
    text = re.sub(r'\ {2,}', ' ', text)
    tokens = text.split()
    tokens = [token for token in tokens if token not in stopwords]

    print(type(tokens), tokens, tokens.__hash__)

    return tokens

@functools.lru_cache(maxsize=None)
def get_word_indexes(tokens, word2index):
    """
    :param tokens: list: List of tokens
    :param word2index: dict: Key => token, value => index
    :return indexes: list: List of indexes for each token, in order
    """
    indexes = [word2index[token] if token in word2index else 0 for token in tokens]

    print(type(indexes), indexes, indexes.__hash__)

    return indexes

As far as I understand, lists are unhashable in Python, so both of these functions should fail, because both of them return lists.

If I try to run this:
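For reference, here is a minimal sketch of what `functools.lru_cache` appears to hash. As far as the documentation describes it, the cache key is built from the call's arguments, so the arguments must be hashable, while the return value is simply stored. The two functions below (`returns_list` and `takes_list`) are hypothetical names used only for illustration, not part of the code above.

```python
import functools


@functools.lru_cache(maxsize=None)
def returns_list(text):
    # Hashable argument (str), unhashable return value (list): caching works.
    return text.split()


@functools.lru_cache(maxsize=None)
def takes_list(tokens):
    # Never reached when tokens is a list: building the cache key fails first.
    return len(tokens)


first = returns_list("TARGET KITOPAMA")
second = returns_list("TARGET KITOPAMA")  # served from the cache
info = returns_list.cache_info()

try:
    takes_list(["TARGET", "KITOPAMA"])
    error_message = None
except TypeError as exc:
    error_message = str(exc)
```

If this reading is right, the type of the elements in the returned list (strings or integers) should make no difference.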

tokens = clean_description(df_exp['description'][50])
indexes = get_word_indexes(tokens, word2index)

The output is:

<class 'list'> ['TARGET', 'KITOPAMA'] None

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-51-b88c19eb38a2> in <module>
     39 
     40 tokens = clean_description(df_exp['description'][50])
---> 41 indexes = get_word_indexes(tokens)
     42 
     43 clean_description.cache_info()

TypeError: unhashable type: 'list'

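As you can see, the first function, clean_description, works, and the other one doesn't. The only explanation I could find is that the one that works returns a list of strings, while the other returns a list of integers. So I tried converting those integers to strings before returning them: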

@functools.lru_cache(maxsize=None)
def get_word_indexes(tokens, word2index):
    """
    :param tokens: list: List of tokens
    :param word2index: dict: Key => token, value => index
    :return indexes: list: List of indexes for each token, in order
    """
    indexes = [str(word2index[token]) if token in word2index else 0 for token in tokens]

    print(type(indexes), indexes, indexes.__hash__)

    return indexes

The output is the same...

If I try to access the cache info of clean_description, it works!:

tokens = clean_description(df_exp['description'][50])
clean_description.cache_info()

<class 'list'> ['TARGET', 'KITOPAMA'] None

CacheInfo(hits=0, misses=1, maxsize=None, currsize=1)
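One possible reason `clean_description` is cacheable in this call: it receives only a string, and the default `stopwords` is never passed, so no list ends up among the arguments. The sketch below uses a simplified, hypothetical stand-in named `clean` (not the original function) to suggest that passing a list explicitly would fail as well, whereas a tuple would not.

```python
import functools


@functools.lru_cache(maxsize=None)
def clean(text, stopwords=()):
    # Simplified stand-in for clean_description: uppercase, split, filter.
    return [t for t in text.upper().split() if t not in stopwords]


# Only a str takes part in the call: cacheable.
tokens = clean("target kitopama")

# A tuple of stopwords is hashable, so this call is cacheable too.
filtered = clean("target kitopama", stopwords=("TARGET",))

try:
    # A list of stopwords cannot be hashed into a cache key.
    clean("target kitopama", stopwords=["TARGET"])
    failed = False
except TypeError:
    failed = True
```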

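I also tried changing the input to a string instead of a list of strings, and the output is the same.

Can someone give me an explanation? Why does one work while the other doesn't? How can I fix it?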
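One possible way around this, sketched under assumptions: convert the tokens to a tuple before the cached call, and keep the `word2index` dict out of the cache key by closing over it. The helper name `make_cached_indexer` and the tiny dictionary are hypothetical and exist only for illustration.

```python
import functools


def make_cached_indexer(word2index):
    """Build a cached lookup bound to one word2index dict (hypothetical helper)."""

    @functools.lru_cache(maxsize=None)
    def cached(tokens):
        # tokens must be a tuple here so it can serve as the cache key.
        return tuple(word2index.get(token, 0) for token in tokens)

    def get_word_indexes(tokens):
        # Accept any iterable of tokens; hand a fresh list back to the caller.
        return list(cached(tuple(tokens)))

    # Expose the cache statistics of the inner cached function.
    get_word_indexes.cache_info = cached.cache_info
    return get_word_indexes


# Illustrative data, not the real embedding vocabulary.
word2index = {"TARGET": 7, "KITOPAMA": 42}
get_word_indexes = make_cached_indexer(word2index)

indexes = get_word_indexes(["TARGET", "KITOPAMA", "UNKNOWN"])
indexes_again = get_word_indexes(["TARGET", "KITOPAMA", "UNKNOWN"])
```

This sketch assumes `word2index` does not change after the indexer is built; if it did, the cache would keep returning stale indexes.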

Thanks! This is my first post :)

Caching and unhashables - a cached function that returns a list works, but the other one doesn't, why?

0 answers: