Question

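I have a bunch of tweets in a pandas column. I've written a class that handles all the text-cleaning steps, e.g. removing punctuation, expanding contractions, removing special characters, etc. I've successfully used the class on individual rows, but I don't know how to apply the method to the entire text column.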

import re
import string

import contractions  # third-party package: pip install contractions


class ProcessTweetText:
    def __init__(self, text):
        self.text = text

    def remove_web_link(self):
        self.text = re.sub(r"http\S+", "", self.text)
        return self.text

    def remove_html(self):
        self.text = self.text.replace('\n', ' ')
        return self.text

    def replace_contractions(self):
        # Store the result so that later steps see the expanded text
        self.text = contractions.fix(self.text)
        return self.text

    def remove_hyphen(self):
        self.text = self.text.replace('—', ' ')
        self.text = self.text.replace('-', ' ')
        return self.text

    def remove_mentions(self):
        self.text = re.sub(r'@[A-Za-z0-9_]\S+', '', self.text)
        return self.text

    def remove_hashtags(self):
        self.text = re.sub(r'#[A-Za-z0-9_]\S+', '', self.text)
        return self.text

    def remove_punctuation(self):
        self.text = ''.join([c for c in self.text if c not in string.punctuation])
        return self.text

    def remove_special_characters(self):
        self.text = re.sub('[^a-zA-Z0-9 -]', '', self.text)
        return self.text

    def process_text(self):
        # Call each step on this instance (self), not on the global `example`
        self.remove_web_link()
        self.remove_html()
        self.replace_contractions()
        self.remove_hyphen()
        self.remove_mentions()
        self.remove_hashtags()
        self.remove_punctuation()
        self.remove_special_characters()
        return self.text


example = ProcessTweetText(df['original_tweets'][100])
example.process_text()
example.text

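Maybe this isn't the right way to approach the problem, since I'm still new to using classes. Still, any help with applying these changes to the pandas column would be greatly appreciated. Thanks, everyone!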

Answer 1

If you want to keep your structure, you can use the following:

def foo(text):
    example = ProcessTweetText(text)
    example.process_text()
    return example.text

df['original_tweets'].apply(foo)
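
Note that apply returns a new Series and leaves the original column untouched, so the result has to be assigned somewhere to be kept. A minimal sketch, assuming a small made-up DataFrame; the function name clean and the column name clean_tweets are hypothetical, and clean is only a stand-in that strips links:

```python
import re

import pandas as pd


def clean(text):
    # Stand-in for the full cleaning function: removes web links only
    return re.sub(r"http\S+", "", text).strip()


# Hypothetical sample data; the column name matches the question
df = pd.DataFrame(
    {"original_tweets": ["check this http://t.co/abc", "no link here"]}
)

# apply calls clean once per row and returns a new Series,
# so assign it to a column to keep the cleaned text
df["clean_tweets"] = df["original_tweets"].apply(clean)
```

Writing to a separate column keeps the raw tweets available for comparison; assigning back to df['original_tweets'] would overwrite them instead.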

That said, I don't really see the point of using a class for this. You can do it more simply like this:

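import re
import string

def foo(text):
    # The contractions step from the class is left out here;
    # add text = contractions.fix(text) if it is needed.
    text = re.sub(r"http\S+", "", text)              # web links
    text = text.replace('\n', ' ')                   # newlines
    text = text.replace('—', ' ')                    # em dashes
    text = text.replace('-', ' ')                    # hyphens
    text = re.sub(r'@[A-Za-z0-9_]\S+', '', text)     # mentions
    text = re.sub(r'#[A-Za-z0-9_]\S+', '', text)     # hashtags
    text = ''.join([c for c in text if c not in string.punctuation])
    text = re.sub('[^a-zA-Z0-9 -]', '', text)        # remaining special characters
    return text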

df['original_tweets'].apply(foo)
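
If the appeal of the class was having each cleaning step separately named and reorderable, one possible middle ground is a list of small functions run in sequence. This is only a sketch of that idea, not something from the question: the names STEPS and clean_tweet are hypothetical, the patterns are copied from the function above, and the contractions step is again left out.

```python
import re
import string

# Each step is a plain function from str to str
STEPS = [
    lambda t: re.sub(r"http\S+", "", t),               # web links
    lambda t: t.replace("\n", " "),                    # newlines
    lambda t: t.replace("\u2014", " ").replace("-", " "),  # em dashes, hyphens
    lambda t: re.sub(r"@[A-Za-z0-9_]\S+", "", t),      # mentions
    lambda t: re.sub(r"#[A-Za-z0-9_]\S+", "", t),      # hashtags
    lambda t: t.translate(str.maketrans("", "", string.punctuation)),
    lambda t: re.sub(r"[^a-zA-Z0-9 -]", "", t),        # remaining special chars
]


def clean_tweet(text, steps=STEPS):
    # Run the steps in order, feeding each result into the next
    for step in steps:
        text = step(text)
    return text
```

It is used the same way, df['original_tweets'].apply(clean_tweet), and a step can be dropped or reordered by editing the list rather than the function body.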

How to apply a class method to a pandas column
