Question

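I have a class that defines a function define_stop_words, which returns a list of string tokens. I then apply another function called remove_stopwords, which takes raw UTF-8 text as input, to a pandas DataFrame df that contains the text. The code looks like this: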

class ProcessText:

    def __init__(self, flag):
        self.flag = flag # not important for this question

    def define_stop_words(self):
        names = ['john','sally','billy','sarah']
        stops = ['if','the','in','then','an','a']
        return stops+names

    def remove_stopwords(self, text):
        return [word for word in text if word not in self.define_stop_words()]


import pandas as pd
df = pd.read_csv('data.csv')
parse = ProcessText(flag=True)
# pass the method itself (no parentheses); apply calls it once per row
df['text'] = df['text'].apply(parse.remove_stopwords)

My question is: will remove_stopwords call define_stop_words, and rebuild the list it returns, every time, that is, for every word in every row of text in df (basically on every iteration)?

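If so, I don't want it to run like this, because it would be very slow and inefficient. I would like to define the value returned by define_stop_words once, almost like a "global variable" within the ProcessText class, and then use it many times in remove_stopwords (for every word and row in df).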

Is there a way to do this, and should it be done this way? What is best practice in this case?
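To make the question concrete, here is a minimal sketch that counts how often define_stop_words runs. The class name CountingProcessText and the calls counter are hypothetical additions for illustration, and the input is assumed to be a list of tokens rather than a raw string (iterating over a string would yield single characters).

```python
class CountingProcessText:
    """Hypothetical variant of the question's class that counts calls."""

    def __init__(self):
        self.calls = 0  # how many times define_stop_words has run

    def define_stop_words(self):
        self.calls += 1
        names = ['john', 'sally', 'billy', 'sarah']
        stops = ['if', 'the', 'in', 'then', 'an', 'a']
        return stops + names

    def remove_stopwords(self, words):
        # The `if` clause of a comprehension is re-evaluated for every element.
        return [word for word in words if word not in self.define_stop_words()]


parse = CountingProcessText()
tokens = ['the', 'cat', 'sat', 'in', 'a', 'hat']
result = parse.remove_stopwords(tokens)
print(result)       # ['cat', 'sat', 'hat']
print(parse.calls)  # 6, one call per token
```

So with the code as written, the list is rebuilt once per word, which is the cost the answers below try to avoid.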

Answer 1

You can assign these lists to class variables:

class ProcessText:
   names = ['john','sally','billy','sarah']
   stops = ['if','the','in','then','an','a']

   def __init__(self, flag):
       self.flag = flag # not important for this question

   def remove_stopwords(self, text):
       return [word for word in text if word not in self.names + self.stops]


import pandas as pd
df = pd.read_csv('data.csv')
parse = ProcessText(flag=True)
df['text'] = df['text'].apply(parse.remove_stopwords)  # pass the method, do not call it

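These class variables are shared by all instances. Assigning the lists in the __init__() method instead would repeat the assignment every time a new instance is created.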
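As a rough sketch of the same idea taken one step further, the combined collection can also be built once in the class body, so that nothing is concatenated per word. The stop_words attribute is a hypothetical addition, and the input is assumed to be a list of tokens.

```python
class ProcessText:
    names = ['john', 'sally', 'billy', 'sarah']
    stops = ['if', 'the', 'in', 'then', 'an', 'a']
    # Hypothetical extra attribute: built once, when the class body executes.
    stop_words = frozenset(names + stops)

    def __init__(self, flag):
        self.flag = flag  # not important for this question

    def remove_stopwords(self, words):
        # Only an attribute lookup and a hash lookup happen per word.
        return [word for word in words if word not in self.stop_words]


first = ProcessText(flag=True)
second = ProcessText(flag=False)
print(first.stop_words is second.stop_words)            # True: one shared object
print(first.remove_stopwords(['if', 'sally', 'runs']))  # ['runs']
```

A frozenset is used here so the shared collection cannot be mutated in place by accident; a plain set would behave the same for lookups.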

Answer 2

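You can cache the word lists by setting them in __init__, so that the operation runs only once per instance. Then, instead of using a define_stop_words() function, make it a property.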

class ProcessText:

    def __init__(self, flag):
        self.flag = flag # not important for this question
        self._names = ['john','sally','billy','sarah']
        self._stops = ['if','the','in','then','an','a']

    @property
    def define_stop_words(self):
        return self._stops + self._names

    def remove_stopwords(self, text):
        return [word for word in text if word not in self.define_stop_words]

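Note that Python has no real concept of private variables (which I think you would want here: presumably you don't want users to be able to overwrite these lists after creation?). This means an unscrupulous user of your code can still update the _names and _stops attributes of a ProcessText object after initialization, which would give you unexpected results.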
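A short sketch of what that looks like in practice, reusing the class from this answer and assuming the input is a list of tokens. The leading underscore is only a naming convention, so the reassignment below succeeds silently.

```python
class ProcessText:

    def __init__(self, flag):
        self.flag = flag  # not important for this question
        self._names = ['john', 'sally', 'billy', 'sarah']
        self._stops = ['if', 'the', 'in', 'then', 'an', 'a']

    @property
    def define_stop_words(self):
        return self._stops + self._names

    def remove_stopwords(self, words):
        return [word for word in words if word not in self.define_stop_words]


parse = ProcessText(flag=True)
before = parse.remove_stopwords(['the', 'dog'])
parse._stops = []  # nothing in the language prevents this
after = parse.remove_stopwords(['the', 'dog'])
print(before)  # ['dog']
print(after)   # ['the', 'dog']
```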

Another thing to consider is using sets rather than lists (especially if performance is a concern), because hash-based lookups are faster.

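Of course, if you want to be even pickier, it is faster still to combine the collections once and cache the combined set, rather than performing the addition every time the property is accessed (so that the property just returns the cached set).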

e.g.

class ProcessText:

    def __init__(self, flag):
        self.flag = flag # not important for this question
        _names = {'john','sally','billy','sarah'}
        _stops = {'if','the','in','then','an','a'}
        self._stops_and_names = _names.union(_stops)

    @property
    def define_stop_words(self):
        return self._stops_and_names

    def remove_stopwords(self, text):
        return [word for word in text if word not in self.define_stop_words]
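The claim that a set is faster than a list for membership tests can be checked roughly with timeit. This is an illustrative sketch only: the sample words and repetition counts are made up, and the absolute timings depend on the machine, so none are shown.

```python
import timeit

stop_list = ['if', 'the', 'in', 'then', 'an', 'a',
             'john', 'sally', 'billy', 'sarah']
stop_set = set(stop_list)

# Made-up sample input: 7,000 tokens, 4,000 of which are not stop words.
words = ['the', 'quick', 'brown', 'fox', 'in', 'a', 'hat'] * 1000


def filter_with(container):
    """Keep the words that are not in `container` (a list or a set)."""
    return [w for w in words if w not in container]


list_time = timeit.timeit(lambda: filter_with(stop_list), number=20)
set_time = timeit.timeit(lambda: filter_with(stop_set), number=20)
print(f"list: {list_time:.4f}s  set: {set_time:.4f}s")
```

Both containers give the same filtered result; only the lookup cost differs, and the gap widens as the stop-word collection grows.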

Answer 3

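In the question's code, define_stop_words is called once for every word on each call to remove_stopwords, because the condition of a list comprehension is re-evaluated for each element.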

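One way to have the value computed only once per instance, but not at instance initialization (because you may have many such methods, all of them expensive, and you don't always need all of them), is to use something like this: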

class ProcessText:

    def __init__(self, flag):
        self.flag = flag # not important for this question
        self._stop_words = None

    @property
    def stop_words(self):
        if self._stop_words is None:
            self._stop_words = set(['john','sally','billy','sarah'])
            self._stop_words |= set(['if','the','in','then','an','a'])
        return self._stop_words

    def remove_stopwords(self, text):
        return [word for word in text if word not in self.stop_words]
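On Python 3.8 and later, the same lazy, once-per-instance pattern can be sketched with functools.cached_property from the standard library. This is an alternative to the hand-written None check, not what the answer itself uses, and the input is again assumed to be a list of tokens.

```python
from functools import cached_property


class ProcessText:

    def __init__(self, flag):
        self.flag = flag  # not important for this question

    @cached_property
    def stop_words(self):
        # Runs on first access only; the result is then stored on the instance.
        names = {'john', 'sally', 'billy', 'sarah'}
        stops = {'if', 'the', 'in', 'then', 'an', 'a'}
        return names | stops

    def remove_stopwords(self, words):
        return [word for word in words if word not in self.stop_words]


parse = ProcessText(flag=True)
cached_before = 'stop_words' in vars(parse)  # False: nothing computed yet
kept = parse.remove_stopwords(['then', 'john', 'left'])
cached_after = 'stop_words' in vars(parse)   # True: set is cached on the instance
print(kept)  # ['left']
```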

How to define a variable that will be used in every iteration of a function within a class in Python?

3 answers: