R-Squared, Adjusted R-Squared
# R-Squared Intuition
# interesting parameter
# we've talked about simple linear regression:
# constructed by the ordinary least squares method
"""
[Figure: scatter plot with $ on the vertical axis and experience on the horizontal axis, plus the fitted regression line.
 For each point, Yi is the real (observed) value and Yi^ is the predicted value on the line.]

SUM( (Yi - Yi^)^2 ) -> min
The line with the smallest sum is the best fitting line, i.e. the simple linear regression model.
This value is called the Residual Sum of Squares:

    SS_res = SUM( (Yi - Yi^)^2 )

Now, instead of drawing the regression line, we draw the average line (Yavg).

[Figure: the same points with a horizontal line at Yavg, each point projected vertically onto that line.]

We do basically the same with the distances to this new line, and the result is called the Total Sum of Squares:

    SS_tot = SUM( (Yi - Yavg)^2 )

R-Squared is then:

    R^2 = 1 - SS_res / SS_tot

It says how well your model fits the data compared with simply predicting the average.
In an ideal scenario SS_res goes to 0, so R^2 equals 1. The closer to 1, the better.
If R^2 is negative, the model fits worse than the average line.
"""
"""
# Adjusted R^2 Intuition
We talked about R^2 for a simple linear regression.
The same concept applies to a multiple linear regression (R^2 has the same formula),
and the ordinary least squares method is still used (SS_res -> min).
[The best fitting multiple linear regression is the one with the smallest sum of squared residuals.]
We use R^2 as a "goodness of fit" measure (greater is better; the closer to 1, the better).

PROBLEM: it starts occurring when you add more vars to your model.
After adding a new var that you hope will improve the predictions,
you look at R^2 to see if the new var helps.
But because of its formula, and because SS_res -> min, R^2 will never decrease.
When we add a new var, the model again finds SS_res -> min, so SS_res will be less than or equal to what it was before
(in the worst case the new coefficient is 0 and nothing changes),
while SS_tot won't change, because it depends only on the observed values and their average, not on the model.
So R^2 increases or stays the same; it never decreases as you add variables.
THE PROBLEM IS: with R^2 alone you can't tell whether the new vars are helping your model or not.
This is where ADJUSTED R^2 comes in handy:

    Adj R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)

    p - number of regressors (independent vars)
    n - sample size

It has a penalization factor for adding indep vars that don't help your model:
when p increases, the denominator (n - p - 1) decreases, so the ratio (n - 1) / (n - p - 1) increases;
as the ratio increases, the product (1 - R^2) * ratio increases;
as the product increases, 1 - product decreases.
So, on its own, adding regressors pushes Adj R^2 away from 1.
On the other hand, when R^2 increases, 1 - R^2 decreases, which pushes Adj R^2 up.
A new var therefore raises Adj R^2 only if its gain in R^2 outweighs the penalty.
Adj R^2 helps you figure out whether models are robust.
"""
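The formulas for SS_res, SS_tot, R^2 and Adj R^2 can be checked by hand on a few numbers. The following is a minimal sketch in plain Python; the function names `r_squared` and `adjusted_r_squared` and the toy values are made up for illustration and are not part of the original notes.

```python
def r_squared(y, y_hat):
    """R^2 = 1 - SS_res / SS_tot."""
    y_avg = sum(y) / len(y)
    ss_res = sum((yi - yhi) ** 2 for yi, yhi in zip(y, y_hat))
    ss_tot = sum((yi - y_avg) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n, p):
    """Adj R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


# Toy data (made up): observed values and the predictions of some model.
y = [1.0, 2.0, 3.0, 4.0, 5.0]
y_hat = [1.1, 1.9, 3.2, 3.8, 5.0]

r2 = r_squared(y, y_hat)           # SS_res = 0.10, SS_tot = 10, so about 0.99
adj_p1 = adjusted_r_squared(r2, n=len(y), p=1)
adj_p2 = adjusted_r_squared(r2, n=len(y), p=2)  # same R^2, one more regressor

# Predicting the average everywhere gives R^2 = 0;
# predictions worse than the average give a negative R^2.
r2_avg_line = r_squared(y, [3.0] * 5)
r2_bad = r_squared(y, [5.0, 4.0, 3.0, 2.0, 1.0])

print(r2, adj_p1, adj_p2, r2_avg_line, r2_bad)
```

With the same R^2, the value for p=2 comes out lower than for p=1, which is the penalty at work.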
Tricks Evaluating Model Performances
Tricks to make models more robust.
If you remember backward elimination,
you remove the vars whose p-value is above 0.05 (or whatever significance level you chose).
But sometimes you want to understand whether you'd
be better off keeping, say, a var with a p-value of 0.06 (only slightly above the threshold).
There is a way.
Watching R-Squared and Adj R-Squared
R^2: how well your model has been fitted
- can never be greater than 1
- you want it to be as close to 1 as possible
- the more vars you add, the more it will increase (it never decreases), even if you throw in random vars that have nothing to do with the prediction
Adj R^2: same as R^2, but with a penalization factor
- each added var lowers it unless that var improves the fit enough to outweigh the penalty
We will use Adj R^2 to see how well the model is fitted.
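The behaviour described here can be tried on synthetic data. The sketch below uses only the standard library: `fit_ols` is a small hypothetical helper that solves the normal equations, and the data are generated for this example, so y depends on one real var and the extra columns are pure noise.

```python
import random


def fit_ols(X, y):
    """Return OLS coefficients [b0, b1, ...] by solving (X'X) b = X'y.

    X is a list of rows of regressors; the intercept column is added here.
    """
    rows = [[1.0] + list(r) for r in X]
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        piv = max(range(c, k), key=lambda i: abs(A[i][c]))
        A[c], A[piv] = A[piv], A[c]
        for i in range(k):
            if i != c:
                f = A[i][c] / A[c][c]
                A[i] = [a - f * b for a, b in zip(A[i], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]


def r2_and_adj_r2(X, y):
    """Fit by least squares and return (R^2, Adj R^2)."""
    b = fit_ols(X, y)
    y_hat = [b[0] + sum(bj * xj for bj, xj in zip(b[1:], r)) for r in X]
    y_avg = sum(y) / len(y)
    ss_res = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
    ss_tot = sum((yi - y_avg) ** 2 for yi in y)
    r2 = 1 - ss_res / ss_tot
    n, p = len(y), len(X[0])
    return r2, 1 - (1 - r2) * (n - 1) / (n - p - 1)


random.seed(0)  # fixed seed so the run is repeatable
n = 50
x = [random.uniform(0, 10) for _ in range(n)]
y = [3 + 2 * xi + random.gauss(0, 1) for xi in x]  # y really depends on x only

X = [[xi] for xi in x]
results = [r2_and_adj_r2(X, y)]
for _ in range(3):
    # Append a column of pure noise that has nothing to do with y.
    X = [row + [random.gauss(0, 1)] for row in X]
    results.append(r2_and_adj_r2(X, y))

for p, (r2, adj) in enumerate(results, start=1):
    print(f"p={p}  R^2={r2:.6f}  Adj R^2={adj:.6f}")
```

R^2 should never go down from one row to the next. Adj R^2 will usually drop for a noise column, though a noise column can raise it by chance. The same comparison, with and without a borderline var, is one way to decide whether that var is worth keeping.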
Observe the following img (upper left is the beginning, bottom right is the end)
The third one is the best. We're predicting Profit based on:
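                   Estimate    Std. Error  t value  Pr(>|t|)
(Intercept)        4.698e+04   2.690e+03   17.464   <2e-16   ***
R.D.Spend          7.966e-01   4.135e-02   19.266   <2e-16   ***
Marketing.Spend    2.991e-02   1.552e-02    1.927   0.06     .

Look at the coefficients in the Estimate column (b1, b2 for R.D.Spend and Marketing.Spend).
If the sign is positive, the indep var (R.D.Spend, Marketing.Spend) is positively associated with the dep var (Profit): Profit goes up as the var goes up, with the other vars held constant.
If the sign is negative, the association is inverse: Profit goes down as the var goes up. A negative sign does not mean the vars are unrelated.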
Let's see the MAGNITUDE (Estimate)
always tricky with regressions:
here you could say that R.D.Spend has a greater impact than Marketing.Spend.
But the size of an estimate depends on the units of its var: if, say, Marketing.Spend were measured in cents and R.D.Spend in dollars, the two estimates would not be directly comparable. So it's better to say that R.D.Spend has a greater impact on Profit per unit of R.D.Spend than Marketing.Spend has per unit of Marketing.Spend.
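The unit problem is easy to see on a small example. This sketch fits a simple (one-variable) regression in closed form twice, once with spend in dollars and once with the same spend in cents; the numbers and the helper `slope_intercept` are invented for illustration, and a one-variable fit is used only to keep the example short.

```python
def slope_intercept(x, y):
    """Closed-form simple linear regression: returns (slope, intercept)."""
    n = len(x)
    x_avg, y_avg = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_avg) * (yi - y_avg) for xi, yi in zip(x, y))
    sxx = sum((xi - x_avg) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, y_avg - slope * x_avg


# Made-up numbers: spend in dollars and the resulting profit in dollars.
spend_dollars = [100.0, 200.0, 300.0, 400.0, 500.0]
profit = [1030.0, 1065.0, 1085.0, 1125.0, 1150.0]

# The same spend expressed in cents.
spend_cents = [100 * s for s in spend_dollars]

slope_d, intercept_d = slope_intercept(spend_dollars, profit)
slope_c, intercept_c = slope_intercept(spend_cents, profit)

# The slope per cent is 100 times smaller than the slope per dollar,
# although both fits describe exactly the same line.
print(slope_d, slope_c, intercept_d, intercept_c)
```

The estimate shrinks by a factor of 100 just because the unit changed, which is why raw magnitudes of different vars should only be compared per unit.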
This leads us to the actual interpretation of these coefficients.
In this case it means that, with this model, if you keep everything else constant and change only R.D.Spend, every additional unit of R.D.Spend increases Profit by 7.966e-01 units of Profit (with both in dollars: 0.7966 dollars, i.e. about 80 cents, per extra dollar of R.D.Spend).
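As a quick check of this reading, the fitted equation can be evaluated at two spend levels that differ by one unit of R&D spend. The coefficients below are copied from the summary table; the function name `predict_profit` and the spend levels are hypothetical and chosen only for illustration.

```python
# Coefficients copied from the summary table.
INTERCEPT = 4.698e+04
B_RD_SPEND = 7.966e-01
B_MARKETING_SPEND = 2.991e-02


def predict_profit(rd_spend, marketing_spend):
    """Predicted profit from the fitted linear equation."""
    return INTERCEPT + B_RD_SPEND * rd_spend + B_MARKETING_SPEND * marketing_spend


# Hypothetical spend levels; marketing spend is held constant.
base = predict_profit(rd_spend=100_000, marketing_spend=200_000)
plus_one = predict_profit(rd_spend=100_001, marketing_spend=200_000)

# The difference equals the R&D coefficient, up to floating-point rounding.
print(plus_one - base)
```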