Question

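As a learning exercise, I wrote my own tokenization, bag-of-words, and tf-idf vectorization functions. They worked fine when defined in a Jupyter Notebook. To tidy things up, I saved these functions and a few helper functions to .py files that can be imported from a simple module in ./src/features/ in my project directory.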

Here is an example of a few of them:

# extract_tokens.py
def extract_tokens(text,ngrams,stem):

    """This function takes selftext from a Reddit post, strips URLs, newline formatting strings,
    and non alphabetic characters, removes stopwords, and returns a lower case list of words """

    import re
    from nltk.corpus import stopwords as sw
    from nltk.stem.snowball import SnowballStemmer

    ...

    return tokens

# get_tokens.py
def get_tokens(texts,ngrams,stem):

    """extracts unique relevant word tokens from a list of texts using extract_tokens()"""

    ...

    return tokens


# gen_bow_vecs.py
def gen_bow_vecs(texts,ngrams=1,stem=0,**vocab):

    """Create bag-of-words vector representations from a Series of strings"""
    from nltk.corpus import stopwords 
    import numpy as np
    from scipy.sparse import csr_matrix

    ...

    return bvecs

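I ran into a problem where these functions could not "see" other imported modules, such as re or nltk, when they were loaded, so I rather crudely imported the dependencies inside the functions. Unfortunately, the functions above are called separately on every document in my corpus, and I think this is really slowing things down.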
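For context on the cost of function-level imports: Python caches every imported module in sys.modules, so a repeated `import re` inside a function is a cache lookup plus a local name binding, not a fresh load. The sketch below illustrates this using only the standard library. The function name `tokenize_with_local_import` is a hypothetical stand-in, not one of the functions above, and whether the remaining per-call overhead matters for a given corpus would need profiling.

```python
import sys


def tokenize_with_local_import(text):
    """Hypothetical stand-in for a function that imports its dependency locally."""
    # This statement runs on every call, but once the module is cached
    # it only looks up sys.modules and binds the local name `re`.
    import re
    return re.findall(r"[a-z]+", text.lower())


tokens = tokenize_with_local_import("Hello, World!")

# The module object bound by a later import is the same cached object.
import re
cached_module = sys.modules["re"]
```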

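My functions are defined in groups in /src/foo/custom.py. Can I add a few lines at the top of that file so that the dependencies are loaded whenever the module, or any individual function in it, is loaded?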

The structure of the custom.py file is:

{{1}}

etc.

How can I make dependencies (NumPy, etc.) available to my custom modules/functions?
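A minimal sketch of the usual layout, assuming the dependencies are imported once at the top of the module file. The function names `simple_tokens` and `simple_bow` are hypothetical, and standard-library modules stand in for NumPy, nltk, and scipy so that the example runs anywhere. The same pattern should apply to third-party packages.

```python
# custom.py (hypothetical contents; stdlib stand-ins for the real dependencies)
import re
from collections import Counter

# Module-level objects are created once, when the module is first imported,
# rather than on every function call.
_WORD_RE = re.compile(r"[a-z]+")


def simple_tokens(text):
    """Return the lower-case alphabetic tokens of one text."""
    return _WORD_RE.findall(text.lower())


def simple_bow(texts):
    """Return one token-count mapping per text."""
    return [Counter(simple_tokens(text)) for text in texts]


if __name__ == "__main__":
    print(simple_bow(["Hello hello world"]))
```

Each module has its own namespace, so names imported in the notebook are not visible inside custom.py. This is likely why the functions could not "see" re or nltk until the imports were placed in the file that defines them. Code that imports `simple_bow` from this module does not need to import `re` itself.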

0 answers: