PySpark - specify the actual size of the train/test split instead of a ratio?

Asked: 2020-08-16 08:06:42

Tags: pyspark apache-spark-sql apache-spark-mllib pyspark-dataframes

Is it possible to split a DataFrame into a training set and a test set by specifying the actual sizes I want, rather than using a ratio? Most of the examples I have seen use randomSplit.

463715 training samples

51630 test samples
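
For reference, this is a rough sketch of the ratio-based randomSplit pattern most examples use (the DataFrame `df` and the 0.9/0.1 weights are placeholders, not from the question); note that randomSplit yields only approximate, not exact, row counts:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(515345)  # placeholder DataFrame (463715 + 51630 rows)

# randomSplit takes relative weights, not absolute row counts,
# so train_df / test_df sizes are only approximately 90% / 10%
train_df, test_df = df.randomSplit([0.9, 0.1], seed=42)
```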

In scikit-learn, I am able to do this, for example:
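
A minimal sketch of the kind of scikit-learn call meant here, assuming train_test_split with absolute (integer) train_size and test_size values; the X and y arrays are placeholders:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(515345).reshape(-1, 1)  # placeholder features (463715 + 51630 rows)
y = np.zeros(515345)                  # placeholder labels

# train_size / test_size accept absolute sample counts as well as fractions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=463715, test_size=51630, random_state=42
)
```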


0 Answers:

No answers yet.