rsquared
R-Squared, Adjusted R-Squared
# R-Squared Intuition
# an interesting parameter
# we've talked about simple linear regression:
# the line is constructed by the ordinary least squares method
"""
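[Plot: salary ($) on the vertical axis against experience on the horizontal axis; the observations are drawn as "+" and the fitted regression line as "/".]
+  Y_i        (real, observed value)
/  \hat{Y}_i  (value predicted by the line)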
# \sum_i (Y_i - \hat{Y}_i)^2 -> min
The line with the smallest sum of squared differences
is the best-fitting line, i.e. the
simple linear regression model.
# this value is called the Sum of Squares of RESiduals
# SS_{res} = \sum_i (Y_i - \hat{Y}_i)^2
Now instead of drawing the regression line,
we'll draw the AVG line (Yavg)
[Plot: the same observations with a horizontal line drawn at Yavg.]
if we project our values on this new line
[Plot: each observation projected vertically onto the Yavg line.]
now we do basically the same, but it's called the Total Sum of Squares
SS_{tot} = \sum_i (Y_i - Y_{avg})^2
and what R-Squared is, is:
R^2 = 1 - SS_{res} / SS_{tot}
It says how well your model fits the data,
and how well it does compared with just using the avg line.
In an ideal scenario SSres goes to 0,
so R^2 is equal to 1. The closer to 1, the better.
If R^2 is negative, SSres is larger than SStot: the model fits worse than the avg line.
"""
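A minimal sketch of the R^2 computation described above, in plain Python. The experience and salary numbers are made up for illustration, and `fit_line` and `r_squared` are hypothetical helper names, not part of any library.

```python
# Hypothetical toy data: years of experience vs. salary (made-up numbers).
experience = [1.0, 2.0, 3.0, 4.0, 5.0]
salary = [30.0, 35.0, 41.0, 44.0, 50.0]


def fit_line(x, y):
    """Ordinary least squares for the line y = b0 + b1 * x."""
    n = len(x)
    x_avg = sum(x) / n
    y_avg = sum(y) / n
    sxy = sum((xi - x_avg) * (yi - y_avg) for xi, yi in zip(x, y))
    sxx = sum((xi - x_avg) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_avg - b1 * x_avg
    return b0, b1


def r_squared(y, y_hat):
    """R^2 = 1 - SS_res / SS_tot for observed y and predictions y_hat."""
    y_avg = sum(y) / len(y)
    ss_res = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
    ss_tot = sum((yi - y_avg) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


b0, b1 = fit_line(experience, salary)
predicted = [b0 + b1 * x for x in experience]

# The fitted line: R^2 close to 1 (about 0.992 for this toy data).
r2_line = r_squared(salary, predicted)

# Predicting the average everywhere: SS_res equals SS_tot, so R^2 is 0.
avg = sum(salary) / len(salary)
r2_avg = r_squared(salary, [avg] * len(salary))

# A model worse than the average line (always predicting 0): R^2 is negative.
r2_bad = r_squared(salary, [0.0] * len(salary))

print(r2_line, r2_avg, r2_bad)
```

The three cases line up with the notes: a good fit is near 1, the avg line gives exactly 0, and anything worse than the avg line goes negative.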
"""
# Adjusted R^2 Intuition
we talked about R^2 for a simple linear regression
the same concept applies to a multiple linear regression
(R^2 has the same formula)
meaning the ordinary least squares method is used (SSres -> min)
[The best-fitting multiple linear regression is the one that has the
smallest SUM of squares of residuals]
we use R^2 as a "goodness" of fit measure (greater is better; the closer to 1, the better)
PROBLEM: it starts when you add more vars to your model.
After adding a new var that you hope will improve the predictions,
you look at the R^2 value to see if the new var helps.
Because of its formula, and because SSres -> min, R^2 will never decrease:
when we add a new var, the model again looks for SSres -> min.
It can always give the new var a coefficient of 0 and get the old fit back,
so SSres is never larger than it was before, while SStot doesn't change because
it depends only on the avg of the data already there.
So R^2 stays the same or increases.
It never decreases as you add variables.
THE PROBLEM IS: with R^2 alone you can't say if new vars are helping your model or not.
This is where ADJUSTED R^2 comes in handy.
Adj R^2 = 1 - (1 - R^2) (n - 1) / (n - p - 1)
p = number of regressors (independent vars)
n = sample size
It has a penalization factor for
adding indep vars that don't help your model:
when p increases, the denominator (n - p - 1) decreases, so the ratio (n - 1) / (n - p - 1) increases
as the ratio increases, the product (1 - R^2) * ratio increases
as the product increases, [1 - (the product)] decreases
so as you add more regressors, Adj R^2 moves away from 1.
On the other hand, when R^2 increases, 1 - R^2 decreases,
causing the whole expression to increase.
So a new var raises Adj R^2 only if its gain in R^2 outweighs the penalty.
Adj R^2 helps you figure out whether models are robust.
"""
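A small sketch of the Adjusted R^2 formula above, to see the penalty at work. `adjusted_r_squared` is a hypothetical helper, and the R^2 values and sample size fed into it are invented for illustration.

```python
def adjusted_r_squared(r2, n, p):
    """Adj R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1).

    r2: plain R^2 of the fitted model
    n:  sample size
    p:  number of regressors (independent vars)
    """
    if n - p - 1 <= 0:
        raise ValueError("need n > p + 1 for the formula to be defined")
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


n = 50  # hypothetical sample size

# Same R^2, more regressors: the penalty pulls Adj R^2 down.
same_fit = [adjusted_r_squared(0.90, n, p) for p in (1, 2, 3)]
print([round(v, 4) for v in same_fit])  # [0.8979, 0.8957, 0.8935]

# A second regressor that really helps (R^2 goes from 0.90 to 0.95):
# the gain in R^2 outweighs the penalty, so Adj R^2 goes up.
before = adjusted_r_squared(0.90, n, 1)
after = adjusted_r_squared(0.95, n, 2)
print(before < after)  # True
```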
Tricks: Evaluating Model Performance
Tricks to make models more robust.
If you remember backward elimination,
you remove the vars with p-values > 0.05 (or another significance level),
but sometimes you want to understand whether you'd
better keep, let's say, a var that has a p-value of 0.06 (only slightly above the threshold).
There is a way:
watching R-Squared and Adj R-Squared.
R^2: how well your model has been fitted
 - can never be greater than 1
 - you want it to be as close to 1 as possible.
R^2: the more vars you add, the more it will increase
even if you throw in random vars that have nothing to do with the prediction
Adj R^2: same as R^2, but with a penalization factor: each added var pushes it down unless the var improves the fit enough
we will use Adj R^2 to see how well the model is fitted.
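To check the "random vars" claim, here is a rough sketch in plain Python that fits a model with one real regressor, then refits it with an extra column of pure noise. Everything in it is an assumption for illustration: the data is synthetic, and `solve` and `ols_r2` are hypothetical helpers (a tiny normal-equations solver, fine for a toy example but not what you'd use on real data).

```python
import random


def solve(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def ols_r2(columns, y):
    """Fit y on an intercept plus the given columns; return (R^2, Adj R^2)."""
    n, p = len(y), len(columns)
    rows = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = p + 1
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(k)] for a in range(k)]
    xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(k)]
    beta = solve(xtx, xty)
    y_hat = [sum(bj * rj for bj, rj in zip(beta, r)) for r in rows]
    y_avg = sum(y) / n
    ss_res = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
    ss_tot = sum((yi - y_avg) ** 2 for yi in y)
    r2 = 1 - ss_res / ss_tot
    adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    return r2, adj


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 50
x1 = [rng.uniform(0, 10) for _ in range(n)]           # a var that matters
y = [3.0 + 2.0 * v + rng.gauss(0, 1) for v in x1]     # synthetic dep var
noise = [rng.gauss(0, 1) for _ in range(n)]           # unrelated random var

r2_one, adj_one = ols_r2([x1], y)
r2_two, adj_two = ols_r2([x1, noise], y)

print("R^2     :", r2_one, "->", r2_two)    # never goes down
print("Adj R^2 :", adj_one, "->", adj_two)  # may go down
```

R^2 cannot drop when the noise column is added. Adj R^2 often drops for a useless var, but not on every random draw, so compare the two printed values rather than assuming the direction.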
Observe the following img (upper left is the beginning, bottom right is the end)
The third one is the best. We're predicting Profit based on:
                 Estimate   Std. Error  t value  Pr(>|t|)
(Intercept)      4.698e+04  2.690e+03   17.464   <2e-16 ***
R.D.Spend        7.966e-01  4.135e-02   19.266   <2e-16 ***
Marketing.Spend  2.991e-02  1.552e-02    1.927   0.06 .
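Look at the Estimate column:
these are the coefficients (b1, b2 for R.D.Spend and Marketing.Spend).
If the sign is positive, your indep var (rdspend, marketingspend)
is positively correlated with your dep var (profit): when it goes up, predicted profit goes up.
If it is negative, the var is negatively correlated: when it goes up, predicted profit goes down.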
Let's look at the MAGNITUDE (Estimate).
This is always tricky with regressions:
here you could say that rdspend has a greater impact than marketingspend.
But the magnitude depends on the units, e.g. if mrkspend were measured in cents and rdspend in dollars.
So it's better to say that:
rdspend has a greater impact on profit per unit of rdspend than marketingspend has per unit of marketingspend.
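A quick sketch of why units matter for the magnitude: the same spend expressed in dollars and in cents gives coefficients that differ by a factor of 100, although the fit is identical. The numbers are invented and `slope` is a hypothetical helper.

```python
def slope(x, y):
    """Least-squares slope b1 of the line y = b0 + b1 * x."""
    n = len(x)
    x_avg = sum(x) / n
    y_avg = sum(y) / n
    sxy = sum((xi - x_avg) * (yi - y_avg) for xi, yi in zip(x, y))
    sxx = sum((xi - x_avg) ** 2 for xi in x)
    return sxy / sxx


# Hypothetical data: the same spend, once in dollars and once in cents.
spend_dollars = [100.0, 200.0, 300.0, 400.0]
spend_cents = [100 * v for v in spend_dollars]
profit = [180.0, 260.0, 330.0, 420.0]

b_dollars = slope(spend_dollars, profit)  # profit per dollar of spend
b_cents = slope(spend_cents, profit)      # profit per cent of spend

# The coefficient shrinks 100x just because the unit changed.
print(b_dollars, b_cents)
```

So a small Estimate does not by itself mean a small effect; it may only mean the var is measured in small units.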
This leads us to the actual interpretation of these coefficients.
In this case it means that, with this model, if you keep everything else constant and change only rdspend, for every unit
of rdspend that you add, your profit will increase by 7.966e-01 units of profit (7.966e-01 dollars = 0.7966 dollars, about 80 cents).
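The same interpretation as a tiny worked example. The coefficients are the Estimate values from the summary table, reading the R.D.Spend estimate as 7.966e-01 and the Marketing.Spend estimate as 2.991e-02; the spend amounts and the `predict_profit` helper are made up for illustration.

```python
# Estimates taken from the summary table (exponents read as negative).
INTERCEPT = 4.698e+04
B_RD = 7.966e-01
B_MARKETING = 2.991e-02


def predict_profit(rd_spend, marketing_spend):
    """Predicted profit from the fitted two-variable linear model."""
    return INTERCEPT + B_RD * rd_spend + B_MARKETING * marketing_spend


# Hypothetical spend levels; only rdspend changes, by exactly one unit.
base = predict_profit(100_000, 200_000)
bumped = predict_profit(100_001, 200_000)

delta = bumped - base
print(round(delta, 4))  # 0.7966, i.e. the R.D.Spend coefficient
```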