我如何使用Python statsmodels作为R的lm.fit?得到“ValueError:endog和exog的索引没有对齐”

时间:2015-09-21 11:38:38

标签: python statsmodels

我想使用statsmodels计算与R lm.fit相同的结果,但是遇到了一些麻烦。我收到错误ValueError: The indices for endog and exog are not aligned

这些是我的数据,应该是可复制和粘贴的:

import pandas as pd
try:
    from StringIO import StringIO
except ImportError:
    from io import StringIO

design_str = """(Intercept) celltypeB
1 1 0
2 1 0
3 1 1
4 1 1"""

logcounts_str = """A_1.bam A_2.bam B_1.bam B_2.bam
653635 10.620219493055 10.6004806223872 10.2558655809612 10.2517277004859
729737 7.67498885090687 7.79906883969567 6.03916081991983 5.75239067063238
100507658 5.08044030135652 4.92805528041574 4.06340836616651 4.59011924173351
100132287 7.74946706686615 7.44831209174908 5.71110662223563 5.87022716092624
100131754 14.9504967198295 15.046928067964 16.615773268211 16.6324137588478
100133331 8.14394324366268 7.97474851586378 6.75643061274512 6.85601330170654
100288069 4.96014606763881 4.38056748511325 5.20091188991645 4.84165800872947
100287934 3.60650911302411 3.00205586185952 3.61594938919529 3.36772682039706
79854 5.72198633044404 5.63432407735903 5.51406977517608 5.4072551845837
643837 7.0659407316614 7.21928657808019 7.23333336760882 7.31674289167825
100506327 5.29456510670937 4.92805528041574 3.3264427720003 3.00515674101235
284600 6.31922716094364 5.88958113260111 2.47844586544535 2.51972991384211
100130417 5.48097823094025 5.80941078391712 0.156517770557991 3.36772682039706
148398 7.78531226581287 7.79906883969567 4.06340836616651 4.84165800872947
26155 10.6251729572891 10.6034552523913 9.36353209073552 9.29846415795994
339451 8.95622036442501 8.90412944117026 7.29606912295679 7.16358610361683
84069 6.54510856835996 6.90894645746804 0.156517770557991 1.7827643196759
84808 5.48097823094025 5.80941078391712 0.156517770557991 0.197801818954747
57801 8.52934125250165 8.40122695567934 7.39492250988307 7.33735317135354
9636 7.40236839624388 7.21928657808019 6.49636777344262 6.85601330170654
375790 12.319609180022 12.2617991255503 10.3720507703036 10.2076304363229
401934 4.34347470719031 3.85005276841447 0.156517770557991 0.197801818954747
54991 8.19813534402627 7.93751560966481 7.05133553386594 7.11666505622934
8784 7.25676307399698 7.4215947533733 4.24398061180833 3.00515674101235
51150 11.167054248766 11.132369007403 10.1802721238574 10.1822202777559
126792 8.95622036442501 8.64012969904024 8.18442376712788 7.80513213270436
388581 4.13702382972289 4.58701836258068 3.85695748869908 2.51972991384211
118424 9.37616332589649 9.50985050205822 8.78222661362246 8.66132619222593
6339 7.2309999779319 6.74621695742993 8.17332605824454 8.16934537290552
116983 10.4038103704178 10.3718712861434 11.6438553857652 11.6941563075194
126789 8.14394324366268 7.81967911937095 6.56590870669569 6.73696063006278
54973 10.1318124030775 10.0702967231723 9.92965697727768 9.95769000217658
80772 8.69867895695197 8.60494027057794 8.28063908238718 8.40237296320395
83756 4.52404695283213 2.26509026769331 3.61594938919529 1.7827643196759
1855 10.4266880754393 10.3365526302499 10.653371547946 10.5739272082309
54587 8.13007106908112 7.83999910375055 7.6878992310743 7.69764770603795
54998 10.0568039370893 9.88714208714969 9.48967312086861 9.47159741816901
81669 10.162802270914 10.2095583211183 9.81294263383577 9.64695046433018
148413 8.28833315299785 7.97474851586378 7.07538100783259 7.31674289167825
55052 8.96406111764219 8.95159079487653 8.54453505590313 8.65312903925931
441869 4.13702382972289 4.58701836258068 8.23866681191186 8.34246006178663
643965 0.436584111581795 0.680127766972157 7.89122739078383 7.49242256784637
64856 8.61150979408247 8.78341557538418 10.2212605353082 10.1091938067982
219293 5.92843720791147 6.23471661864979 4.80037396033272 5.32708483589971
83858 9.25995135162803 9.31312296411511 7.05133553386594 7.31674289167825
55210 9.6533299697771 9.27631752311657 7.31638910733638 7.681617596219
339453 3.60650911302411 3.85005276841447 7.74897480782607 7.87728191846019
29101 10.2615428521103 10.1938553629246 10.0558746934811 9.87552146059575
643988 8.32532736048005 8.61676570597473 8.20636632000855 7.84885351013367
142678 9.62393618478229 9.39093420067151 10.0828137653391 9.97752117409815
984 8.89191133188636 8.53187680838822 8.02070391521227 8.23672080824705
728661 9.41672368922095 9.49711139022754 9.71685060477043 9.87199408710043
728642 8.53987191999382 8.13133887880449 7.98306625784891 7.74469627884238
100506482 7.44781136700505 7.4215947533733 8.37083689135876 8.2584977506423
9906 6.21794382510645 5.88958113260111 6.30626489006267 6.30632627573292
65220 9.79633367190412 9.96321611999616 8.72257180872908 8.36270874563044
2782 12.228967726517 12.2071159724042 12.8462973999353 12.8611375424331
339456 5.64603747721074 4.92805528041574 3.85695748869908 4.28526466020509
85452 3.60650911302411 3.85005276841447 6.03916081991983 5.55535382357283
2563 0.436584111581795 3.48748268902976 11.4562981734165 11.3993131634549
5590 7.71270851685603 7.86003685698709 10.7758207259737 10.7243010580913
100506504 5.56586712852676 5.63432407735903 7.64033354782225 7.33735317135354
199990 8.43093754844065 8.27258480424024 7.48743464867261 7.5106847742391
100128003 7.15082962924792 6.82987488647684 5.64837086688767 5.92572227351795
6497 10.1422164989432 10.1168393091094 11.6548683784876 11.7700283317484
79906 5.29456510670937 5.53810876209973 7.25454985351852 7.43620655827983
100129534 5.08044030135652 4.58701836258068 6.49636777344262 6.26389100941252
11079 10.1931404341059 10.1251426128406 9.61389864963053 9.51747393990174
5192 7.40236839624388 8.02885592120323 7.21180020605918 7.56412403320056
9651 4.96014606763881 4.58701836258068 10.2532329250465 10.2104263578198
55229 8.65090323238256 8.33117945815109 8.81115379908596 8.62825437062028
388585 0.436584111581795 0.680127766972157 4.40444528400158 4.10469241456327
115110 5.92843720791147 5.43501526913563 0.156517770557991 2.51972991384211
100133445 5.39078042196867 5.07244518975092 2.9638726926156 2.51972991384211
8764 8.98733089696504 9.0420715407074 6.72637337888894 6.46458835964965
127281 8.85864887775461 8.70803376354204 8.98624050564405 9.11964275602924
440556 2.02154661230295 0.680127766972157 7.09903227589723 6.96598614373167
63976 5.08044030135652 4.58701836258068 7.0268824901414 7.01798078136993
27237 7.92039988884605 7.86003685698709 5.28580078750296 5.2421959383132
1953 9.405250904777 9.4249616044717 6.81472925330979 6.57284125030167
127262 8.52934125250165 8.84503469364785 10.6947067026104 10.7837032686455
49856 8.44220866077567 8.58099457495291 7.52283998480381 7.209029074378
7161 6.10900945355329 5.72452188633061 6.08725510812088 6.46458835964965
57212 8.641155255831 8.71904675626446 8.76014411554418 8.62825437062028
388588 6.10900945355329 5.63432407735903 3.3264427720003 2.51972991384211
57470 9.7010267118084 9.67730724790978 9.65037321979881 9.76575789437021
9731 8.78973093707988 8.78341557538418 8.59530962313625 8.89476934518903
1677 6.66540280207768 6.562770816334 6.56590870669569 6.53765182183937
339448 8.33745091956254 8.28745808072177 7.0268824901414 7.25308425445594
100133612 4.13702382972289 4.38056748511325 1.74148027127915 1.7827643196759
55966 7.95228394986584 8.11475599460888 7.29606912295679 6.96598614373167
261734 8.4755031008741 8.31675238751581 7.90471062014745 7.69764770603795
8514 9.47824326321901 9.3701257383916 11.7096670461565 11.5852806847971
26038 5.08044030135652 4.38056748511325 11.3604774737975 11.2792852570484
6146 11.5006531919673 11.5604765751282 10.6046340759675 10.7780603864545
388591 7.65575263204396 8.01104464508678 7.6563636576412 7.18648650572691
23463 10.4713830741591 10.3113048226761 9.90638719795484 9.80698055709673
387509 7.69397195427445 7.86003685698709 6.87076328822411 6.99221768530485
11332 9.91028986120121 9.79126343720687 10.3913352016753 10.5365382015287
"""

design = pd.read_table(StringIO(design_str), sep=" ")
logcounts = pd.read_table(StringIO(logcounts_str), sep=" ")

在这些数据上使用R lm.fit(design, logcounts.T)可以得到结果(这里只显示头部和尾部):

          (Intercept) celltypeB
653635           10.6     -0.36
729737            7.7     -1.84
100507658         5.0     -0.68
100132287         7.6     -1.81
100131754        15.0      1.63
100133331         8.1     -1.25
       (Intercept) celltypeB
26038          4.7      6.59
6146          11.5     -0.84
388591         7.8     -0.41
23463         10.4     -0.53
387509         7.8     -0.85
11332          9.9      0.61

如何使用Python statsmodels获得相同的结果?

聚苯乙烯。这对于sklearn来说是微不足道的,但是sklearn并没有给我遗留物和我需要的其他东西:

from sklearn import linear_model
clf = linear_model.LinearRegression()
clf.fit(design, logcounts)
print(clf.coef_[0:3,1])
print(clf.coef_[-3:,1])
print(clf.intercept_)

PSPS。我是一名非统计学家,所以在这里可能会有完全琐碎的错误/误解;我只是将一些R代码移植到Python。

1 个答案:

答案 0 :(得分:1)

相关错误是因为两个Dataframes / Series没有相同的行索引。

我们可以为design分配相同的索引(虽然我不知道这是否是在熊猫中执行此操作的官方方式)

>>> design.index = logcounts.T.index
>>> design
         (Intercept)  celltypeB
A_1.bam            1          0
A_2.bam            1          0
B_1.bam            1          1
B_2.bam            1          1

主要问题是statsmodels OLS仅适用于单变量因变量(endog)。参数估计的协方差矩阵的计算失败,因为完整矩阵将是三维的(或具有适当整形的块对角线)。

然而,有些事情是因为默认的numpy广播(但没有进行单元测试而且没有得到官方支持)

>>> res = sm.OLS(logcounts.T, design).fit()
>>> #print(res.summary())  #raises exception

>>> res.params.T.head()
   (Intercept)  celltypeB
0    10.610350  -0.356553
1     7.737029  -1.841253
2     5.004248  -0.677484
3     7.598890  -1.808223
4    14.998712   1.625381

>>> res.resid.T.head()
            A_1.bam   A_2.bam   B_1.bam   B_2.bam
653635     0.009869 -0.009869  0.002069 -0.002069
729737    -0.062040  0.062040  0.143385 -0.143385
100507658  0.076193 -0.076193 -0.263355  0.263355
100132287  0.150577 -0.150577 -0.079560  0.079560
100131754 -0.048216  0.048216 -0.008320  0.008320

>>> res.fittedvalues.T.head()
             A_1.bam    A_2.bam    B_1.bam    B_2.bam
653635     10.610350  10.610350  10.253797  10.253797
729737      7.737029   7.737029   5.895776   5.895776
100507658   5.004248   5.004248   4.326764   4.326764
100132287   7.598890   7.598890   5.790667   5.790667
100131754  14.998712  14.998712  16.624094  16.624094

>>> logcounts.head()
             A_1.bam    A_2.bam    B_1.bam    B_2.bam
653635     10.620219  10.600481  10.255866  10.251728
729737      7.674989   7.799069   6.039161   5.752391
100507658   5.080440   4.928055   4.063408   4.590119
100132287   7.749467   7.448312   5.711107   5.870227
100131754  14.950497  15.046928  16.615773  16.632414

如果因变量是多变量的,那么大多数其他结果将不起作用,并且要获得当前所有这些变量的完整结果。