Disproportionate stratified sampling in pandas

Time: 2020-02-01 15:08:50

Tags: python pandas dataframe random sampling

How can I randomly select one row from each group (column Name) in the following dataframe?

   Distance   Name  Time  Order
1        16   John     5      0
4        31   John     9      1
0        23   Kate     3      0
3        15   Kate     7      1
2        32  Peter     2      0
5        26  Peter     4      1

Expected result:

   Distance   Name  Time  Order
4        31   John     9      1
0        23   Kate     3      0
2        32  Peter     2      0

6 answers:

Answer 0 (score: 5)

You can use groupby on the Name column and apply sample:

df.groupby('Name',as_index=False).apply(lambda x:x.sample()).reset_index(drop=True)
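
On pandas 1.1 or later the same draw can be made without apply, because the groupby object has its own sample method. Below is a minimal sketch, assuming the question's data is rebuilt by hand; random_state=0 is an arbitrary seed added here only so the draw can be repeated.

```python
import pandas as pd

# The question's data frame, rebuilt by hand with its original index labels.
df = pd.DataFrame(
    {
        "Distance": [16, 31, 23, 15, 32, 26],
        "Name": ["John", "John", "Kate", "Kate", "Peter", "Peter"],
        "Time": [5, 9, 3, 7, 2, 4],
        "Order": [0, 1, 0, 1, 0, 1],
    },
    index=[1, 4, 0, 3, 2, 5],
)

# Draw one random row per Name; the original index labels are kept.
# random_state is an arbitrary seed: drop it to get a different draw each run.
picked = df.groupby("Name").sample(n=1, random_state=0)
print(picked)
```

Unlike the apply version above, this keeps the original index labels, which matches the expected result in the question.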

Answer 1 (score: 2)

You can use the numpy function random.permutation to shuffle all the rows, then group by Name and take the first N rows of each group (here N = 1):

df.iloc[np.random.permutation(len(df))].groupby('Name').head(1)
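
The shuffle-then-head idea can be wrapped in a small function. This is only a sketch: sample_per_group and its seed parameter are hypothetical names introduced for illustration, and it uses numpy's Generator API (np.random.default_rng) rather than the legacy np.random.permutation call shown above.

```python
import numpy as np
import pandas as pd


def sample_per_group(df, key, n=1, seed=None):
    """Shuffle all rows, then keep the first n rows of each group."""
    rng = np.random.default_rng(seed)
    # A random ordering of the row positions 0..len(df)-1.
    shuffled = df.iloc[rng.permutation(len(df))]
    # head(n) keeps the first n rows of each group in shuffled order.
    return shuffled.groupby(key).head(n)


# The question's data frame, rebuilt by hand.
df = pd.DataFrame(
    {
        "Distance": [16, 31, 23, 15, 32, 26],
        "Name": ["John", "John", "Kate", "Kate", "Peter", "Peter"],
        "Time": [5, 9, 3, 7, 2, 4],
        "Order": [0, 1, 0, 1, 0, 1],
    },
    index=[1, 4, 0, 3, 2, 5],
)

result = sample_per_group(df, "Name", n=1, seed=0)
print(result)
```

The rows come back in shuffled order, so append .sort_index() or .sort_values("Name") if a stable ordering matters.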

Answer 2 (score: 1)

You can use unique:

df['Name'].unique()

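Answer 3 (score: 0)

Shuffle the dataframe:

df = df.sample(frac=1)

Then drop the rows with a duplicate Name, which keeps the first row of each group in the shuffled order:

df.drop_duplicates(subset=['Name'])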
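
The two steps only work together if the second one operates on the shuffled frame, so they can be chained into one expression. A minimal sketch, assuming the question's data is rebuilt by hand; random_state=42 is an arbitrary seed, and the final sort_values call is an optional addition, not part of the answer:

```python
import pandas as pd

# The question's data frame, rebuilt by hand.
df = pd.DataFrame(
    {
        "Distance": [16, 31, 23, 15, 32, 26],
        "Name": ["John", "John", "Kate", "Kate", "Peter", "Peter"],
        "Time": [5, 9, 3, 7, 2, 4],
        "Order": [0, 1, 0, 1, 0, 1],
    },
    index=[1, 4, 0, 3, 2, 5],
)

one_per_name = (
    df.sample(frac=1, random_state=42)  # shuffle every row
    .drop_duplicates(subset=["Name"])   # keep the first row seen per Name
    .sort_values("Name")                # optional: restore a stable order
)
print(one_per_name)
```

Because drop_duplicates keeps the first occurrence by default, shuffling first is what makes the kept row random.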

Answer 4 (score: 0)

df.drop_duplicates(subset='Name')

   Distance   Name  Time  Order
1        16   John     5      0
0        23   Kate     3      0
2        32  Peter     2      0

This should help, but it is not a random selection: it keeps the first row of each group.

Answer 5 (score: 0)

How about using random?

Like this:

Import the data you provided,

df=pd.read_csv('random_data.csv', header=0)

which looks like this:

   Distance  Name  Time  Order
1        16  John     5      0
4         3  John     9      1
0        23  Kate     3      0
3        15  Kate     7      1

Then get a random column name,

colname = df.columns[random.randint(1, 3)]

and in this run "Name" was selected:

print(df[colname])
1    John
4    John
0    Kate
3    Kate
Name: Name, dtype: object

Of course, I could condense this to

print(df[df.columns[random.randint(1, 3)]])