I'm able to use statsmodels' WLS (weighted least squares regression) fine when I have lots of data points. However, I seem to have a problem with the numpy arrays when I try to use WLS for a single sample from the dataset.
What I mean is, if I have a dataset X which is a 2D array, with lots of rows, WLS works fine. But not if I try to work it on a single row. You'll get what I mean in the code below:
    import sys
    from sklearn.externals.six.moves import xrange
    from sklearn.metrics import accuracy_score
    import pylab as pl
    from sklearn.externals.six.moves import zip
    import numpy as np
    import statsmodels.api as sm
    from statsmodels.sandbox.regression.predstd import wls_prediction_std

    # this is my dataset X, with 10 rows
    X = np.array([[1,2,3],[1,2,3],[4,5,6],[1,2,3],[4,5,6],[1,2,3],[1,2,3],[4,5,6],[4,5,6],[1,2,3]])
    # this is my response vector, y, also with 10 rows
    y = np.array([1, 1, 0, 1, 0, 1, 1, 0, 0, 1])
    # weights, 10 rows
    weights = np.array([0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1])

    # the line below, using all 10 rows of X, gives no errors but is commented out
    # mod_wls = sm.WLS(y, X, weights)

    # and this is the line I need, which is giving errors:
    mod_wls = sm.WLS(np.array(y), np.array([X]),np.array([weights]))
The last line above was initially just
    mod_wls = sm.WLS(y, X, weights)

But that gave me errors like "object of type 'numpy.float64' has no len()", hence I turned them into arrays.
But now I keep getting this error:
    Traceback (most recent call last):
      File "C:\Users\app\Documents\Python Scripts\test.py", line 53, in <module>
        mod_wls = sm.WLS(np.array(y), np.array([X]),np.array([weights]))
      File "C:\Users\app\Anaconda\lib\site-packages\statsmodels\regression\linear_model.py", line 383, in __init__
        weights=weights, hasconst=hasconst)
      File "C:\Users\app\Anaconda\lib\site-packages\statsmodels\regression\linear_model.py", line 79, in __init__
        super(RegressionModel, self).__init__(endog, exog, **kwargs)
      File "C:\Users\app\Anaconda\lib\site-packages\statsmodels\base\model.py", line 136, in __init__
        super(LikelihoodModel, self).__init__(endog, exog, **kwargs)
      File "C:\Users\app\Anaconda\lib\site-packages\statsmodels\base\model.py", line 52, in __init__
        self.data = handle_data(endog, exog, missing, hasconst, **kwargs)
      File "C:\Users\app\Anaconda\lib\site-packages\statsmodels\base\data.py", line 401, in handle_data
        return klass(endog, exog=exog, missing=missing, hasconst=hasconst, **kwargs)
      File "C:\Users\app\Anaconda\lib\site-packages\statsmodels\base\data.py", line 78, in __init__
        self._check_integrity()
      File "C:\Users\app\Anaconda\lib\site-packages\statsmodels\base\data.py", line 249, in _check_integrity
        print len(self.endog)
    TypeError: len() of unsized object
So in order to see what was wrong with the lengths, I did this:
    print "y size: "
    print len(np.array([y]))
    print "X size"
    print len(np.array([X]))
    print "weights size"
    print len(np.array([weights]))
And got this output:
    y size:
    1
    X size
    1
    weights size
    1
I then tried this:
    print "x shape"
    print X.shape
    print "y shape"
    print y.shape
And the output was:
    x shape
    (3L,)
    y shape
    ()
Line 249 in data.py, which the error referred to, has this function, where I added a bunch of "print sizes" in order to see what was happening:
    def _check_integrity(self):
        if self.exog is not None:
            print "exog size: "
            print len(self.exog)
            print "endog size"
            print len(self.endog)  # <-- this, and the line below are causing the error
            if len(self.exog) != len(self.endog):
                raise ValueError("endog and exog matrices are different sizes")
It appears there's something wrong with len(self.endog). Although when I tried printing out len(np.array([y])), it simply gave the output 1. But somehow when y goes into the _check_integrity function and becomes endog, it doesn't behave the same... or is something else going on?
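One thing worth noting about the shapes printed above: the failing call passes np.array(y) without the extra brackets, whereas the size check used np.array([y]). With a single response value, the first is a zero-dimensional array (shape `()`), which has no length, while the second has shape `(1,)`. A minimal sketch of the difference using NumPy only; the names `x_row`, `y_val`, `endog` and `exog` are illustrative and not taken from the original script:

```python
import numpy as np

x_row = np.array([1, 2, 3])   # one row of X, shape (3,)
y_val = np.array(1)           # one response value, shape ()

# len() is undefined for a zero-dimensional array
try:
    len(y_val)
    zero_dim_has_len = True
except TypeError:
    zero_dim_has_len = False

# Promote both to the shapes a regression routine would expect:
# one observation (row) with three regressors (columns)
endog = np.atleast_1d(y_val)  # shape (1,)
exog = np.atleast_2d(x_row)   # shape (1, 3)
```

With these shapes, len(endog) and len(exog) both equal 1, which is the condition the integrity check compares.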
What should I do? I'm using an algorithm where I really do need to run WLS for each row of
There's no such thing as WLS for one observation. The single weight would simply become 1 once the weights are normalized to sum to 1. If you want to do this, though I suspect you don't, just use OLS. The solution will be a consequence of the SVD, though, not of any actual relationship in the data.
OLS solution using pinv/svd
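A minimal sketch of what the pseudoinverse route gives for a single row, using NumPy directly rather than statsmodels. It assumes the single observation is the row [1, 2, 3] with response 1, which matches the shapes reported in the question; `np.linalg.pinv` computes the pseudoinverse through the SVD:

```python
import numpy as np

X = np.array([[1.0, 2.0, 3.0]])   # one observation, three regressors
y = np.array([1.0])               # its response

# Least-squares coefficients via the SVD-based pseudoinverse
beta = np.linalg.pinv(X).dot(y)

# With one equation and three unknowns the fit is exact
fitted = X.dot(beta)
```

Here `fitted` reproduces `y` exactly, which says nothing about the data: one equation in three unknowns has infinitely many exact solutions, and the pseudoinverse simply picks one of them.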
Though you could just make up any answer that works and get the same result. I'm not sure offhand what exactly the properties of the SVD solution are vs. the other non-unique solutions.
    [~/] : beta = [-.5, .25, 1/3.]
    [~/] : np.dot(beta, X)
         : 1.0