Python machine learning: variable-length string qualifier

Time: 2016-12-19 15:58:56

Tags: python machine-learning

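I am trying to incorporate a variable-length string into my Python machine learning data. The string is made up of 21 possible uppercase characters and ranges in length from 3 to over 1000, though it is typically 50 to 500 characters long. I want to add this data to an existing machine learning system because the other numeric data in the system is derived from this string. I hope that incorporating this information will improve prediction accuracy.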

Machine learning systems used: SVR from scikit-learn, gradient-boosted random forests from xgboost, and neural networks built with a combination of Theano and Keras.

Example data (spaces added for clarity; one of several thousand sets): 0.20783132530120485,0.0,0.14759036144578314,0.0,20.500779795353044,-0.012854043345111421,20.856396736982024,-0.019526697858776032,0.17055840352519377 MLKQLLTVVLLAICLINVQAQQLTPPAGTFRLGISKGTDSHWLAPQEKVKGIAFRWKALPDTRGFILEVAVTSLQQADTLFWSFGNCQPDMDINVFSVEGQAFTCYYGESMKLRTLQAVTPTDDIRLSNGRQDKTPLLLYESGKRTDRPVLAGRCPLAANSKLYFCFYEQNARADYNYFMLPDLFAKIDESKHSKK ,3907.222610216657,0.0,12.957234316695068,260.35949614307845,70.22897891511785,0.0,3600.1557026363694, 6.5695226674325005,8.875805301569174E-9,9.435201047407471E-8,-805.7695207777524,-0.386030775564303,2.4360867449746193E-4,0.001535275768898734,-899.103861896121,0.37012002714844283,41.30865237441297,0.6880193813262029,0.07901855928913903,0.36786993202927,0.027022889508663273,0.20983595671723698,0.004272043781893587,2.6548618772402452,0.8298948072745838,0.4297709789614357, 0.6592421241850477,0.7323455585665695,0.0036084082526088635,1235.9608595043105,-686.3410939120973,517.5695296420 419,0.0,1383.9587599495007,137.6709125154875,48.15897140522527,11.169320592630035,0.0017212126730760488,390.0,576.0,162.0,425.0,-2337.586240324919,-1216.645095553551,-220.7658611143325,-254.87026759361316,-120.44151020211892,-262.1549293391522,-262.70857652215483,-119.78950303227985,-14.056523664351944, - 16.03338970562135,-15.397779250982714,-4.190420980506957,-52.306453723320466,-17.804935707496412,-1602.015046949609,-695.3200007491427,-282.2011792651323,-624.4938669353348,319.12737432671895,-91.65456051126749,190.69831510254096,220.08361973544459,2971.554863316476,262.57174547648316,2708.983117839995,0.0,5.482741129097017,-132.68200592716775,-4341.712499207881 ,9.524948063475861,4.203276705216416,-4.307639899059003,3.1644632985485313,2.81419659034428,2.963504627059134,3.4913480163824713,0.0031707417031467916,0.0,0.0,0.0,0.0,0.0,0.0,0.0015698345827278798,0.0016205522602160554,-1.9645139797143648E5,0.9504047512545211,0.98335286768
85283,0.9597468652322548,0.9865496952192033,0.9175964036143727,16312.662271951838,15062.220268073073,1250.4420038787648,0.0,2.7244897959183674,0.0,0.0,0.0,10.306122448979592,0.0,29.26530612244898,0.0,7.797822706065319,0.0,228.06859068818272,0.4027714206386829,1652.1493757294986,3410.905281836304,0.5612244897959183,0.844845002268259,0.5834395722203105, 1.0,1.0,1797.0,362.119,196.0,1.0,-307.795,0.000,-847.358,202.875,-73.825,2.064,79.019,452.437,-10.090,-45.351,-9.292,-36.652,10.749,-38.050,23.004, -18.505,0.837,0.344

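[The first 9 fields (in italics) are derived data that evaluate the rest of the numeric data (possible "Y" values), the next field (in bold) is the string data that needs to be incorporated, and the remainder is the main input to the machine learning ("X").]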
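The record layout described above can be pulled apart with a small parser. This is only a sketch under assumptions the question does not state: that the whitespace in the sample is purely cosmetic and can be dropped, and that the string is the only run of three or more uppercase letters in a record. The helper name `split_record` and its `n_targets` parameter are hypothetical, chosen for illustration.

```python
import re

# Assumption: the string field is the only run of 3+ uppercase letters.
# A lone "E" in scientific notation (e.g. 8.87E-9) will not match.
SEQ_RE = re.compile(r"[A-Z]{3,}")


def split_record(line, n_targets=9):
    """Split one raw record into (targets, string, inputs).

    Hypothetical helper: the first n_targets numbers are the derived
    "Y" values, then the string, then the numeric "X" inputs.
    """
    # Assumption: spaces were added only for readability, so drop them all.
    compact = re.sub(r"\s+", "", line)
    match = SEQ_RE.search(compact)
    if match is None:
        raise ValueError("no uppercase string found in record")

    head = compact[:match.start()]
    tail = compact[match.end():]

    # Skip empty pieces left by commas adjacent to the string.
    y = [float(v) for v in head.split(",") if v]
    x = [float(v) for v in tail.split(",") if v]
    if len(y) != n_targets:
        raise ValueError(f"expected {n_targets} targets, found {len(y)}")
    return y, match.group(), x
```

For example, `split_record("0.1,0.2 ABCDE ,3.5,1E-3, - 2.0", n_targets=2)` separates the two leading numbers, the string `ABCDE`, and the three trailing numbers.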

1 answer:

Answer 0 (score: 0)

You need to consider what information the string provides. Can it be quantified in some way?

If you cannot read any information from the string yourself, why would a machine be able to?
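One possible way to quantify the string, offered as an illustration and not as something this answer prescribes: map each variable-length string to a fixed-length vector holding its length and the relative frequency of each character. The alphabet is inferred from the data because the question does not list the 21 characters. `build_alphabet` and `composition_features` are hypothetical helper names, and whether such features improve prediction would have to be tested.

```python
from collections import Counter


def build_alphabet(sequences):
    """Collect the distinct characters seen across all strings, sorted."""
    return sorted(set("".join(sequences)))


def composition_features(seq, alphabet):
    """Return [length, freq(char_1), ..., freq(char_k)] for one string.

    The output length depends only on the alphabet, not on len(seq),
    so strings of 3 or 1000 characters yield vectors of the same size.
    """
    if not seq:
        raise ValueError("empty string has no composition")
    counts = Counter(seq)
    n = len(seq)
    # Characters absent from seq get a frequency of 0.0.
    return [float(n)] + [counts[ch] / n for ch in alphabet]


# Toy usage with made-up strings (not taken from the question's data).
toy = ["AAB", "BCC"]
alphabet = build_alphabet(toy)
rows = [composition_features(s, alphabet) for s in toy]
```

Because every string maps to the same number of columns, the resulting rows can be appended to the existing numeric inputs. Note that this representation discards character order, so it captures only part of what the string may encode.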