Question

I want to remove ambiguity from my training data, so I need to clean it up. How can I delete all rows with fewer than 3 words in Python?

Answer 1

Hello, world! This will be my first contribution to SO :-)

Let's create some data:

import pandas as pd

data = {'Source': ['Hello all Im Happy', 'Its a lie, dont trust him', 'Oops', 'foo', 'bar']}
df = pd.DataFrame(data, columns=['Source'])

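My approach is very simple, almost "crude", and inefficient, but I ran it on a large dataframe (1,013,952 rows) and the running time was acceptable. Let's find the indices of the rows that have fewer than n tokens: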

from nltk.tokenize import word_tokenize  # needs the NLTK tokenizer data, installed via nltk.download


def get_indices(df, col, n):
    """
    Get the index labels of the rows that have fewer than n tokens in a specific column

    Parameters:

       df(pandas dataframe)
       col(string): column name
       n(int): threshold value for minimum words

    """
    tmp = []
    for i in range(len(df)):  # df.iterrows() wasn't working for me
        # Look the row up by position so a non-default index also works
        if len(word_tokenize(df[col].iloc[i])) < n:
            tmp.append(df.index[i])
    return tmp

Next, we just need to call the function and drop the rows at those indices:

tmp = get_indices(df, 'Source', 3)
df_clean = df.drop(tmp)
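As an alternative to the loop, the same filtering can probably be done with pandas string methods alone. This is only a sketch, not what I ran on the large dataframe: it assumes that splitting on whitespace is an acceptable definition of a "word", and the name min_words is just an illustrative variable. Keep in mind that word_tokenize treats punctuation such as commas as separate tokens, so the two approaches can give different counts for the same row.

```python
import pandas as pd

df = pd.DataFrame({'Source': ['Hello all Im Happy', 'Its a lie, dont trust him', 'Oops', 'foo', 'bar']})

min_words = 3  # hypothetical threshold: keep rows with at least this many words

# Split each string on whitespace and count the resulting pieces per row
word_counts = df['Source'].str.split().str.len()

# Boolean indexing keeps only the rows that meet the threshold
df_clean = df[word_counts >= min_words]
print(df_clean)
```

Because this avoids a Python-level loop and the tokenizer, it should scale better on large dataframes, at the cost of a cruder notion of what counts as a word.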

Best!
