r-squared

R-Squared, Adjusted R-Squared

Python
# R-Squared Intuition
# an interesting parameter for evaluating a model
# we've talked about simple linear regression:
# constructed by the ordinary least squares method

"""
salary ($)
^
|            +
|            |     /
|         +  |  /  +
|         |  /  |
|         /     +
|      /
|   /  |
|   +  +
 ----------------------> experience

+  Y_i  (real value, the observed data point)
|  the residual (vertical distance to the line)
/  Y^_i (predicted value, on the regression line)

#  SUM( (Y_i - Y^_i)^2 )  ->  min

the line that has the smallest sum
will be the best fitting line, i.e. the
simple linear regression model

# this value is called the Sum of Squares of Residuals

# SS_res = SUM( (Y_i - Y^_i)^2 )



Now instead of drawing the regression line,
we'll draw the AVG line (Yavg)

^
|            +
|        +  
|_____+______________ Yavg
|  +
--------------------->

if we project our values on this new line

^
|            +
|        +   |
|_____+__|___|_______ Yavg
|  +
--------------------->
now we do basically the same, but it's called the Total Sum of Squares
SS_tot = SUM( (Y_i - Y_avg)^2 )

and what R-Squared is, is:

R^2 = 1 - SS_res / SS_tot

it says how well your model is fitting the data,
i.e. how well it's doing with respect to the avg line.

in an ideal scenario you hope that SS_res goes to 0,
so R^2 will be equal to 1. The closer to 1, the better.
If R^2 is negative, the model is doing really badly
(fitting even worse than the plain avg line).
            
"""

"""
# Adjusted R^2 Intuition

we talked about R^2 for a simple linear regression;
the same concept applies to a multiple linear regression
(R^2 has the same formula),
meaning the ordinary least squares method is used (SS_res -> min)
[the best fitting multiple linear regression is the one that has the
least sum of squares of residuals]

we use R^2 as "goodness" of fit (greater is better; the closer to 1, the better)
A PROBLEM starts occurring when you add more vars to your model.

after adding a new var that might increase the accuracy of the predictions,
you check the R^2 value to see if the new var helps.

but because of its formula, and because OLS drives SS_res -> min,
R^2 will never decrease.

when we add a new var, the model will again find SS_res -> min,
so SS_res will be less than (or at worst equal to) before:
the model can always give the new var a coefficient of 0 and keep the old fit.
SS_tot won't change, because it only depends on the data and its avg,
not on the regressors.
So you'll end up with R^2 increasing.
It will never decrease as you add variables.

THE PROBLEM IS: with R^2 you can't say if vars are helping your model or not.

Here's where ADJUSTED R^2 comes in handy

Adj R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)

p - number of regressors (independent vars)
n - sample size

it has a penalization factor:
  it penalizes adding indep vars that don't help your model.

when p increases, the denominator (n - p - 1) decreases, so the ratio increases;
as the ratio increases, the product (1 - R^2) * (n-1)/(n-p-1) increases;
as the product increases, [1 - (the product)] decreases;
so as you add more regressors, Adj R^2 moves away from 1.

on the other hand, when R^2 increases, (1 - R^2) decreases,
causing the whole Adj R^2 to increase.
So it's a tug of war: a new var only raises Adj R^2 if the gain in R^2
outweighs the penalty for the extra regressor.

Adj R^2 helps figure out whether models are robust.

"""

Tricks for Evaluating Model Performance

Tricks to make models more robust: if you remember backward elimination, you remove the vars with p-values > 0.05 (or another significance level), but sometimes you want to understand whether you'd better keep, say, a var that has a 0.06 p-value (only slightly above the threshold).

There is a way.

Watching R-Squared and Adj R-Squared. R^2: how well your model has been fitted - can never be greater than 1 - you want it to be as close to 1 as possible.

R^2: the more vars you add, the more it will increase, even if you throw in random vars that have nothing to do with the prediction. Adj R^2: same as R^2, but with a penalization factor: the more vars you add, the more it gets reduced. We will use Adj R^2 to see how well the model is fitted.
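Here's a sketch of how to read all of this off a fitted model (statsmodels, with synthetic stand-in data for the regressors and Profit -- swap in your own X and y):

Python
import numpy as np
import statsmodels.api as sm

# synthetic stand-ins, just for illustration
rng = np.random.default_rng(0)
X = rng.random((50, 2)) * 1e5
y = 4.7e4 + 0.8 * X[:, 0] + 0.03 * X[:, 1] + rng.normal(0, 5e3, size=50)

model = sm.OLS(y, sm.add_constant(X)).fit()

print(model.rsquared)       # R^2
print(model.rsquared_adj)   # Adj R^2 -- compare models on this
print(model.pvalues)        # p-values, as used in backward elimination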

Observe the following img (upper left is the beginning, bottom right is the end)

[image: the backward elimination steps, from the Machine Learning Course on Udemy]

The third one is the best. We're predicting Profit based on:

R (summary output of the final regression)
                 Estimate   Std. Error  t value  Pr(>|t|)
(Intercept)     4.698e+04   2.690e+03    17.464    <2e-16 ***
R.D.Spend       7.966e-01   4.135e-02    19.266    <2e-16 ***
Marketing.Spend 2.991e-02   1.552e-02     1.927      0.06 .

these coefficients (b1, b2 -- the Estimate column for R.D.Spend and
Marketing.Spend): if the sign is positive, your indep var (rdspend,
marketingspend) is positively correlated with your dep var (profit);
if it is negative, it is negatively correlated.

Now let's look at the MAGNITUDE (Estimate) -- always tricky with regressions:

here you could say that rdspend has a greater impact than marketingspend.

But that can be misleading: imagine mrkspend were measured in cents and rdspend in dollars. So it's safer to say: rd-spend has a greater impact on profit per unit of rd-spend than marketing-spend has per unit of marketing-spend.

This leads us to the actual interpretation of these coefficients (or variables).

In this case it means that, with this model, if you keep everything else constant but adjust rd-spend, then for every unit of rd-spend that you increase, your profit will increase by 7.966e-01 units of profit (7.966e-01 dollars -> about 80 cents).
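A tiny numeric illustration of that per-unit reading (hypothetical helper; coefficients lifted from the summary above):

Python
# b0, b1, b2 from the Estimate column above
b0, b1, b2 = 4.698e+04, 7.966e-01, 2.991e-02

def predict_profit(rd_spend, marketing_spend):
    return b0 + b1 * rd_spend + b2 * marketing_spend

base = predict_profit(100_000, 200_000)
bumped = predict_profit(100_001, 200_000)   # +1 dollar of rd-spend
print(bumped - base)   # ~0.7966 -> about 80 cents more profit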
