Caching and unhashables: a cached function that returns a list works, but another one does not. Why?

Time: 2019-11-18 15:34:24

Tags: python list caching hashable

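I am working on some NLP functions that take a piece of text as input and return the indexes of its words, looked up in a dictionary generated from word2vec embeddings.

Most of the inputs are identical, so I thought about using functools.lru_cache to speed things up. This is the code I tried: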

import functools
import re

@functools.lru_cache(maxsize=None)
def clean_description(text, stopwords=[]):
    """
    :param text: str: Text string to be cleaned
    :param stopwords: list: List of stopwords to be removed
    :return tokens: list: List of clean and valid tokens
    """
    text = str(text)
    text = text.encode('ascii', 'ignore').decode('utf-8')
    text = re.sub(r'\u001A', '_', text)
    text = re.sub(r'_x005F_x001A_', '_', text)
    text = text.upper()
    text = re.sub(r'X001A', '_', text)
    text = re.sub(r'.*CONCEPTO', '', text)
    text = re.sub(r'<NUM>', '', text)
    text = re.sub(r'(^|\s)S\.?[AL]\.?($|\s)', ' ', text)
    text = re.sub(r"['`´\"]", '', text)
    text = re.sub(r'[\\\t\.\*\+\-,_/;:\(\)\{\}\[\]\^º\|#~\<\>@\\=\?\¿\!\¡]', ' ', text)
    text = re.sub(r'(?<!\w)(\d*)(?!\w)', ' ', text)
    text = re.sub(r'\ {2,}', ' ', text)
    tokens = text.split()
    tokens = [token for token in tokens if token not in stopwords]

    print(type(tokens), tokens, tokens.__hash__)

    return tokens

@functools.lru_cache(maxsize=None)
def get_word_indexes(tokens, word2index):
    """
    :param tokens: list: List of tokens
    :param word2index: dict: Key => token, value => index
    :return indexes: list: List of indexes for each token, in order
    """
    indexes = [word2index[token] if token in word2index else 0 for token in tokens]

    print(type(indexes), indexes, indexes.__hash__)

    return indexes

As far as I understand, the list type is unhashable in Python, so both of these functions should fail, because both of them return lists.
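The premise that lists are unhashable can be checked in isolation. The snippet below is only a minimal sketch: the `tokens` value is copied from the output shown in this post, and `tokens_key` and `error_message` are illustrative names that do not appear in the original code. It also suggests why the `print` calls in the functions above show `None` for `tokens.__hash__`.

```python
# Lists are mutable, so Python deliberately leaves them unhashable.
tokens = ['TARGET', 'KITOPAMA']

# The list type sets __hash__ to None, which is what the print() calls
# in the functions above display as their third value.
print(tokens.__hash__)

try:
    hash(tokens)
except TypeError as exc:
    error_message = str(exc)
    print(error_message)

# An immutable tuple holding the same strings can be hashed.
tokens_key = tuple(tokens)
print(isinstance(hash(tokens_key), int))
```

Note that this only says something about hashing a list object itself; it does not by itself say at which point a cached function would need to hash one.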

If I try to run this:

tokens = clean_description(df_exp['description'][50])
indexes = get_word_indexes(tokens, word2index)

The output is:

<class 'list'> ['TARGET', 'KITOPAMA'] None

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-51-b88c19eb38a2> in <module>
     39 
     40 tokens = clean_description(df_exp['description'][50])
---> 41 indexes = get_word_indexes(tokens)
     42 
     43 clean_description.cache_info()

TypeError: unhashable type: 'list'

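As you can see, the first function, clean_description, works, while the other one does not. The only explanation I could come up with is that the one that works returns a list of strings, while the other one returns a list of integers. So I tried converting those integers to strings before returning them: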

@functools.lru_cache(maxsize=None)
def get_word_indexes(tokens, word2index):
    """
    :param tokens: list: List of tokens
    :param word2index: dict: Key => token, value => index
    :return indexes: list: List of indexes for each token, in order
    """
    indexes = [str(word2index[token]) if token in word2index else 0 for token in tokens]

    print(type(indexes), indexes, indexes.__hash__)

    return indexes

The output is the same...

If I query the cache info for clean_description, it works!:

tokens = clean_description(df_exp['description'][50])
clean_description.cache_info()
<class 'list'> ['TARGET', 'KITOPAMA'] None

CacheInfo(hits=0, misses=1, maxsize=None, currsize=1)

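I also tried changing the input to a string instead of a list of strings, and the output is the same.

Can someone give me an explanation? Why does one work while the other does not? How can I fix this?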

Thanks! First post :)
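On the question of why one call works and the other fails: functools.lru_cache builds its cache key from the call's arguments, not from the return value, so the arguments must be hashable while the return value can be any object. clean_description is called with a string, which is hashable; get_word_indexes is called with a list (and a dict), which is not. The snippet below is a minimal sketch of one possible workaround, not a definitive fix. The small `word2index` dictionary and the extra 'UNKNOWN' token are hypothetical stand-ins for the real data, and reading the dictionary from module scope instead of passing it in is an assumption about how the code could be reorganized.

```python
import functools

# Hypothetical stand-in for the word2vec vocabulary from the question.
word2index = {'TARGET': 7, 'KITOPAMA': 42}

@functools.lru_cache(maxsize=None)
def get_word_indexes(tokens):
    """
    :param tokens: tuple: Tuple of tokens (hashable, unlike a list)
    :return indexes: tuple: Index for each token, in order; 0 if unknown
    """
    # word2index is read from module scope instead of being passed in,
    # because a dict argument could not be hashed either.
    return tuple(word2index.get(token, 0) for token in tokens)

tokens = ['TARGET', 'KITOPAMA', 'UNKNOWN']

# Convert the list to a tuple at the call site so the argument is hashable.
indexes = get_word_indexes(tuple(tokens))   # first call: cache miss
indexes = get_word_indexes(tuple(tokens))   # second call: cache hit

print(indexes)                        # (7, 42, 0)
print(get_word_indexes.cache_info())  # hits=1, misses=1, currsize=1
```

Returning a tuple rather than a list also keeps callers from accidentally mutating the cached result. One caveat of this sketch: if `word2index` changes after results have been cached, the cache would be stale until `get_word_indexes.cache_clear()` is called.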

0 answers:

No answers yet.