How to nest several functions into one function? Python

Date: 2019-03-18 20:18:41

Tags: python pandas function stringtokenizer

For example, the DataFrame is:

import pandas as pd

df = pd.DataFrame(data={'id': ['393848', '30495'],
                        'text': ['This is Gabanna. @RT Her human Jose rushed past firefighters into his burning home to rescue her. She suffered burns on her nose and paws, but will be just fine. The family lost everything else. You can help them rebuild below. 14/10 for both (via @KUSINews)',
                                 'Meet Milo. He’s a smiley boy who tore a ligament in his back left zoomer. The surgery to fix it went well, but he’s still at the hospital being monitored. He’s going to work very hard to fetch at full speed again, and you can help him do it below. 13/10']
                        })

I wrote some functions:

import re

def tokenize(df):
    def process_tokens(df):  # add a column with lists of tokens
        def process_reg(text):  # return plain text, non-letters stripped
            return " ".join(re.sub(r'[^a-zA-Z\s]', "", str(text)).split())
        df['tokens'] = [process_reg(text).split() for text in df['text']]
    return process_tokens(df)

tokenize(df)

def process(df):  # add a column with dicts
    def process_group(token):  # convert a list of tokens into a dictionary
        return pd.DataFrame(token, columns=["term"]).groupby('term').size().to_dict()
    df['dic'] = [process_group(token) for token in df['tokens']]

process(df)
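As an aside (not part of the question's code), the per-row term counting that `process_group` does with a throwaway DataFrame and `groupby` can also be expressed with `collections.Counter`; a minimal sketch on a small demo frame:

```python
from collections import Counter

import pandas as pd

def process_counter(df):
    """Count term frequencies per row of token lists (Counter-based sketch)."""
    df['dic'] = [dict(Counter(tokens)) for tokens in df['tokens']]

demo = pd.DataFrame({'tokens': [['a', 'b', 'a'], ['c']]})
process_counter(demo)
# demo['dic'] is now [{'a': 2, 'b': 1}, {'c': 1}]
```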

Running them one after another works, and I get the desired result.

I am looking for a way to nest all of these functions into a single function, so that the DataFrame is passed only once.

I couldn't find one.

Please help.

1 Answer:

Answer 0 (score: 0)


import re

import pandas as pd

def ad(df):
    def tokenize(df):  # add a column with lists of tokens
        def process_tokens(df):
            def process_reg(text):  # return plain text, non-letters stripped
                return " ".join(re.sub(r'[^a-zA-Z\s]', "", str(text)).split())
            df['tokens'] = [process_reg(text).split() for text in df['text']]
        return process_tokens(df)

    tokenize(df)

    def process(df):
        def process_dic(df):  # add a column with dicts
            def process_group(token):  # convert a list of tokens into a dictionary
                return pd.DataFrame(token, columns=["term"]).groupby('term').size().to_dict()
            df['dic'] = [process_group(token) for token in df['tokens']]
        return process_dic(df)

    return process(df)

It works well, although I suspect there is another way of writing it that would be faster... a challenge for another day.
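For what it's worth, the two steps can also be flattened into one function with no nesting at all; a sketch (the name `tokenize_and_count` is my own, the column names `text`, `tokens`, and `dic` follow the question's code):

```python
import re
from collections import Counter

import pandas as pd

def tokenize_and_count(df):
    """One pass over the frame: strip non-letters, split into tokens,
    then count term frequencies per row."""
    df['tokens'] = [re.sub(r'[^a-zA-Z\s]', '', str(t)).split() for t in df['text']]
    df['dic'] = [dict(Counter(tokens)) for tokens in df['tokens']]
    return df

frame = pd.DataFrame({'id': ['1'], 'text': ['Meet Milo. 13/10']})
tokenize_and_count(frame)
# frame['tokens'][0] == ['Meet', 'Milo']
# frame['dic'][0] == {'Meet': 1, 'Milo': 1}
```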

Thank you for your support, @Goyo!