Question

How can I split a DataFrame evenly, based on the size of the smallest class?

Suppose I have 625 rows in total (300 spam, 325 not spam).
I need to drop 25 random not-spam rows so that I end up with 300 spam and 300 not spam.

Important:
I am using the pandas library
The total amount of data is not fixed
The ratio of spam to not-spam data is not fixed

What I am doing now:

import pandas as pd

df = pd.read_csv('directory/dataset.csv')
df['label'].value_counts()  # show total spam, total not spam

This is the head of my DataFrame:

sentence	label
FU bro	spam
Well thats kinda cool	not spam
Haha thats so funny	not spam
cant u make somethin else mtfk	spam
what a shame	spam

Answer 1

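This approach uses DataFrame.sample to downsample the rows of each label to the count of the smallest label, then concatenates the results back together. It also works for any number of labels.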


import pandas as pd

df = pd.DataFrame({'sentence': {0: 'FU bro',
  1: 'Well thats kinda cool',
  2: 'Haha thats so funny',
  3: 'cant u make somethin else mtfk',
  4: 'what a shame'},
 'label': {0: 'spam', 1: 'not spam', 2: 'not spam', 3: 'spam', 4: 'spam'}})


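# Size of the smallest class; every label is downsampled to this many rows.
n = df['label'].value_counts().min()
# Draw n random rows per label and stack the pieces back into one DataFrame.
df = pd.concat([df.loc[df['label'] == l].sample(n) for l in df['label'].unique()])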

print(df)
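
As a variation on the same idea, here is a minimal sketch that does the per-label sampling with groupby instead of a list comprehension. It assumes pandas 1.1 or newer (for GroupBy.sample). The helper name balance_by_label, the seed parameter, and the synthetic 625-row frame are made up for illustration and do not come from the question.

```python
import pandas as pd


def balance_by_label(df, label_col="label", seed=0):
    """Downsample every class in `label_col` to the size of the smallest class."""
    # Row count of the rarest label.
    n = df[label_col].value_counts().min()
    # GroupBy.sample draws n rows from each group; random_state makes the draw repeatable.
    return df.groupby(label_col).sample(n=n, random_state=seed)


# Hypothetical data matching the counts in the question: 300 spam, 325 not spam.
demo = pd.DataFrame({
    "sentence": [f"msg {i}" for i in range(625)],
    "label": ["spam"] * 300 + ["not spam"] * 325,
})

balanced = balance_by_label(demo)
print(balanced["label"].value_counts())
```

Passing a fixed seed keeps the dropped rows the same between runs. Note that the result is grouped by label, so shuffle it (for example with balanced.sample(frac=1)) if row order matters downstream.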
