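I am working on some NLP functions that take text as input and return the indexes of its words in a dictionary generated from word2vec embeddings.
Most of the inputs are identical, so I thought of using functools.lru_cache to improve run time. This is the code I am trying to use: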
import functools
import re


@functools.lru_cache(maxsize=None)
def clean_description(text, stopwords=[]):
    """
    :param text: str: Text string to be cleaned
    :param stopwords: list: List of stopwords to be removed
    :return tokens: list: List of clean and valid tokens
    """
    text = str(text)
    # Drop every non-ASCII character
    text = text.encode('ascii', 'ignore').decode('utf-8')
    text = re.sub(r'\u001A', '_', text)
    text = re.sub(r'_x005F_x001A_', '_', text)
    text = text.upper()
    text = re.sub(r'X001A', '_', text)
    text = re.sub(r'.*CONCEPTO', '', text)
    text = re.sub(r'<NUM>', '', text)
    # Remove standalone company suffixes such as SA, S.A., SL, S.L.
    text = re.sub(r'(^|\s)S\.?[AL]\.?($|\s)', ' ', text)
    text = re.sub(r"['`´\"]", '', text)
    # Replace punctuation and symbols with spaces
    text = re.sub(r'[\\\t\.\*\+\-,_/;:\(\)\{\}\[\]\^º\|#~\<\>@\\=\?\¿\!\¡]', ' ', text)
    text = re.sub(r'(?<!\w)(\d*)(?!\w)', ' ', text)
    # Collapse runs of spaces
    text = re.sub(r'\ {2,}', ' ', text)
    tokens = text.split()
    tokens = [token for token in tokens if token not in stopwords]
    print(type(tokens), tokens, tokens.__hash__)
    return tokens

@functools.lru_cache(maxsize=None)
def get_word_indexes(tokens, word2index):
    """
    :param tokens: list: List of tokens
    :param word2index: dict: Key => token, value => index
    :return indexes: list: List of indexes for each token, in order
    """
    indexes = [word2index[token] if token in word2index else 0 for token in tokens]
    print(type(indexes), indexes, indexes.__hash__)
    return indexes
As far as I understand, the list type is unhashable in Python, so I would expect both functions to fail, since they both return lists.
If I try to run this:
tokens = clean_description(df_exp['description'][50])
indexes = get_word_indexes(tokens, word2index)
The output is:
<class 'list'> ['TARGET', 'KITOPAMA'] None
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-51-b88c19eb38a2> in <module>
39
40 tokens = clean_description(df_exp['description'][50])
---> 41 indexes = get_word_indexes(tokens)
42
43 clean_description.cache_info()
TypeError: unhashable type: 'list'
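As you can see, the first function, clean_description, works, and the other one does not. The only explanation I could come up with is that the one that works returns a list of strings, while the other returns a list of integers. So I tried converting those integers to strings before returning them: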
@functools.lru_cache(maxsize=None)
def get_word_indexes(tokens, word2index):
    """
    :param tokens: list: List of tokens
    :param word2index: dict: Key => token, value => index
    :return indexes: list: List of indexes for each token, in order
    """
    indexes = [str(word2index[token]) if token in word2index else 0 for token in tokens]
    print(type(indexes), indexes, indexes.__hash__)
    return indexes
The output is the same...
If I try to access the cache info of clean_description, it works:
tokens = clean_description(df_exp['description'][50])
clean_description.cache_info()
<class 'list'> ['TARGET', 'KITOPAMA'] None
CacheInfo(hits=0, misses=1, maxsize=None, currsize=1)
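I also tried changing the input to a string instead of a list of strings, and the output is the same.
Can someone explain this to me? Why does one function work while the other does not? And how can I fix it?
Thanks! First post :)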
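A note that may help narrow this down: functools.lru_cache builds its cache key from the call arguments, so the arguments have to be hashable, while the return value can be any object. That would be consistent with the behaviour above, since clean_description receives a string, whereas get_word_indexes receives a list and a dict. The sketch below is only an illustration under assumptions: WORD2INDEX, get_word_indexes_cached and returns_list are hypothetical names, and the tiny vocabulary stands in for the real word2vec dictionary.

```python
import functools

# Hypothetical stand-in for the real word2vec vocabulary (assumption).
WORD2INDEX = {"TARGET": 7, "KITOPAMA": 42}


@functools.lru_cache(maxsize=None)
def get_word_indexes_cached(tokens):
    """Cached lookup; `tokens` must be a tuple so it can be hashed."""
    # The dict is read from module scope rather than passed in,
    # because a dict argument would be unhashable too.
    return tuple(WORD2INDEX.get(token, 0) for token in tokens)


def get_word_indexes(tokens):
    """Accept a list, convert it to a tuple, and return a fresh list."""
    return list(get_word_indexes_cached(tuple(tokens)))


@functools.lru_cache(maxsize=None)
def returns_list(text):
    # Returning a list is fine: only the arguments are hashed.
    return text.split()
```

Calling get_word_indexes(['TARGET', 'KITOPAMA']) twice should register one miss and one hit in get_word_indexes_cached.cache_info(), while passing a list straight to get_word_indexes_cached raises the same TypeError as in the traceback. Returning a new list from the wrapper also keeps callers from mutating the object stored in the cache.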