Question

I have a data frame with 25,000 rows and two columns (text, class). The class column holds many values drawn from [A, B, C].

import pandas as pd

# Raw string so the backslash in the Windows path is not read as an escape.
data = pd.read_csv(r'E:\mydata.txt', sep="*")
data.columns = ["text", "class"]

For example, I need to remove 10 rows of class A and 15 rows of class B.

Answer 1

You can do this with conditional slicing and the data frame's index attribute.

remove_n = 10
remove_class = 'A'  # class label whose rows should be removed
# First find the index labels where class equals the class you want to drop,
# then keep only the first n of those labels.
index_to_drop = data.index[data['class'] == remove_class][:remove_n]
# Finally, drop those rows by index label.
data = data.drop(index_to_drop)
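To remove rows from several classes at once (10 from A and 15 from B, as in the question), the same idea can be wrapped in a loop. The following is a minimal sketch, not part of the original answer: the sample frame, the `to_remove` mapping, and the helper name `drop_first_n_per_class` are all hypothetical, and it assumes the index labels are unique (as with the default RangeIndex), since `drop` removes every row that shares a label.

```python
import pandas as pd

# Hypothetical sample data standing in for the 25,000-row frame.
data = pd.DataFrame({
    "text": [f"t{i}" for i in range(60)],
    "class": ["A", "B", "C"] * 20,
})

# Hypothetical mapping: class label -> number of rows to remove.
to_remove = {"A": 10, "B": 15}


def drop_first_n_per_class(df, counts, column="class"):
    """Drop the first n rows of each listed class (assumes a unique index)."""
    for label, n in counts.items():
        # Index labels of the first n rows belonging to this class.
        idx = df.index[df[column] == label][:n]
        df = df.drop(idx)
    return df


result = drop_first_n_per_class(data, to_remove)
print(result["class"].value_counts())
```

If a class has fewer rows than requested, the slice simply returns what is there, so all rows of that class are removed and no error is raised.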

Answer 2

You can build a Boolean mask with groupby.cumcount and np.logical_or.reduce, then apply it to your data frame via iloc:

import numpy as np
import pandas as pd

# example dataframe
df = pd.DataFrame({'group': np.random.randint(0, 3, 100),
                   'value': np.random.random(100)})

print(df.shape)  # (100, 2)

# criteria input
criteria = {1: 10, 2: 15}

# cumulative count by group
cum_count = df.groupby('group').cumcount()

# Boolean mask, negative via ~
conditions = [(df['group'].eq(k) & cum_count.lt(v)) for k, v in criteria.items()]
mask = ~np.logical_or.reduce(conditions)

# apply Boolean mask
res = df.iloc[mask]

print(res.shape)  # (75, 2), provided groups 1 and 2 have at least 10 and 15 rows
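Because the example above uses random data, it can be hard to see what cumcount contributes. Here is a small deterministic sketch with a made-up five-row frame (the labels and the threshold of 2 are assumptions for illustration only): cumcount numbers the rows within each group starting from 0, so comparing it with n selects the first n rows of that group.

```python
import pandas as pd

# Hypothetical toy frame: three rows of group A, two of group B.
df = pd.DataFrame({"group": ["A", "B", "A", "A", "B"]})

# Position of each row within its own group, counted from 0.
cum_count = df.groupby("group").cumcount()
print(cum_count.tolist())  # [0, 0, 1, 2, 1]

# Mark the first 2 rows of group A for removal, then invert to keep the rest.
keep = ~(df["group"].eq("A") & cum_count.lt(2))
print(df[keep].index.tolist())  # [1, 3, 4]
```

Here the mask is a pandas Series, so it is applied with plain `df[keep]`. The answer's code can use iloc because `np.logical_or.reduce` returns a NumPy array.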

Removing a number of rows from a data frame
