Regression is a method for estimating the relationship between a dependent variable (Y) and one or more independent variables (X_{i}).

You may use linear regression when there is a linear relationship between the dependent variable (Y) and the independent variable (X): adding one unit to X changes Y by a constant amount, the slope coefficient **b_{1}**.

H_{0}: Y = b_{0}

H_{1}: Y = b_{0} + b_{1}X

The least squares method is used to estimate the coefficients b_{0} and b_{1}. The method chooses the line that minimizes the sum of the squared vertical distances between the observed values (Y_{i}) and the fitted line.

$$Min(\sum_{i=1 }^{n}(\hat y_i-y_i)^2)$$
$$b_1=\frac{\sum_{1}^{n}(x_i-\bar{x})(y_i-\bar{y}) }{\sum_{1}^{n}(x_i-\bar{x})^2}\\
b_0=\bar{y}-b_1\bar{x}$$
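As a minimal sketch, the formulas above can be applied directly in plain Python; the data below is made up purely for illustration:

```python
# Least-squares estimates for simple linear regression (pure Python).
# The data is made up for illustration only.
x = [1, 2, 3, 4, 5]
y = [2.0, 4.1, 5.9, 8.2, 9.8]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# b1 = sum((xi - x_bar)(yi - y_bar)) / sum((xi - x_bar)^2)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
     / sum((xi - x_bar) ** 2 for xi in x)
b0 = y_bar - b1 * x_bar   # b0 = y_bar - b1 * x_bar

# The correlation R equals b1 * s_x / s_y (sample standard deviations)
s_x = (sum((xi - x_bar) ** 2 for xi in x) / (n - 1)) ** 0.5
s_y = (sum((yi - y_bar) ** 2 for yi in y) / (n - 1)) ** 0.5
r = b1 * s_x / s_y

print(b0, b1, r)   # b0 ≈ 0.09, b1 ≈ 1.97, r close to 1
```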
R^{2} is the proportion of the total variance of Y that is explained by the regression on X.

R is the correlation between X and Y

$$R=b_1\frac{s_x}{s_y}$$

where s_{x} and s_{y} are the sample standard deviations of X and Y.

When there is more than one independent variable, multiple regression compares the following hypotheses using the F statistic:

H_{0}: Y = b_{0}

H_{1}: Y = b_{0}+b_{1}X_{1}+...+b_{p}X_{p}

This is an iterative process, as you should also check each coefficient independently with the following hypotheses:

H_{0}: b_{i} = 0

H_{1}: b_{i} ≠ 0

Each time, remove only the single most insignificant variable (the one with the largest p-value, provided p-value > α) by changing the **include** value from **√** to **χ**.

After removing one insignificant variable, other insignificant variables may become significant in the new model.

- **Linearity** - a linear relationship between the dependent variable, Y, and the independent variables, X_{i}.
- **Residual normality** - the tool runs the Shapiro-Wilk test for each variable, but for the regression the **only** normality assumption is about the residuals.
- **Homoscedasticity** (homogeneity of variance) - the variance of the residuals is constant and does not depend on the independent variables, X_{i}.
- **Variables** - the dependent variable, Y, should be a continuous variable, while the independent variables, X_{i}, should be continuous or ordinal variables (ordinal example: low, medium, high).
- **No perfect correlation** (multicollinearity) - between two or more independent variables, X_{i}.
- **Independent observations**.

- Too many independent variables (X_{i}) or too small a sample size.
  **Solution**: Reduce the number of independent variables or increase the sample size.
- Multicollinearity: two independent variables (X_{i}) have a perfect correlation (1).
  **Solution**: Remove one of the variables.
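A quick way to spot perfect multicollinearity is to check the correlation between independent variables, or the rank of X'X. A sketch with made-up data, assuming NumPy is available:

```python
import numpy as np

# Detecting perfect multicollinearity (made-up data):
# X2 is an exact multiple of X1, so the columns are perfectly correlated.
X1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
X2 = 2.0 * X1

r = np.corrcoef(X1, X2)[0, 1]
print(r)   # correlation of 1 -> drop one of the columns

# A design matrix containing both columns (plus an intercept) is singular,
# so (X'X)^{-1} in the least-squares formula does not exist.
X = np.column_stack([np.ones_like(X1), X1, X2])
rank = np.linalg.matrix_rank(X.T @ X)
print(rank)   # rank 2 < 3 columns -> X'X is not invertible
```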

Test for homoscedasticity (homogeneity of variance) using the following hypotheses:

$$ H_0: \hat\varepsilon_i^2=b_0\\
H_1: \hat\varepsilon_i^2=b_0+b_1\hat Y_i+b_2\hat Y_i^2 $$
where ε is the residual and Ŷ is the predicted Y. The test runs a **second** regression with the following variables:

Dependent variable: Y'_{i} = ε_{i}^{2}.

Independent variables: X'_{1} = Ŷ, X'_{2} = Ŷ^{2}.

The tool uses the F statistic of this **second** regression. Another option is to use the statistic χ^{2} = nR'^{2}, where n is the sample size and R'^{2} is the R^{2} of the second regression.
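A sketch of this two-stage test on simulated data (NumPy assumed; the variable names are illustrative, not taken from the tool):

```python
import numpy as np

# Heteroscedasticity check: regress the squared residuals on Yhat and
# Yhat^2, then use chi2 = n * R'^2 (toy homoscedastic data).
rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0, 10, n)
y = 1.0 + 2.0 * x + rng.normal(0, 1, n)   # constant error variance

# First regression: y on [1, x]
X = np.column_stack([np.ones(n), x])
b = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ b
e2 = (y - y_hat) ** 2                     # squared residuals

# Second regression: e^2 on [1, y_hat, y_hat^2]
Z = np.column_stack([np.ones(n), y_hat, y_hat ** 2])
g = np.linalg.solve(Z.T @ Z, Z.T @ e2)
e2_hat = Z @ g
r2 = 1 - np.sum((e2 - e2_hat) ** 2) / np.sum((e2 - e2.mean()) ** 2)

lm = n * r2   # compare against a chi-square with 2 degrees of freedom
print(lm)
```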

- Try to transform the independent variables X_{i}: square root for count variables, log for skewed variables, etc.
- You may be missing an independent variable or a combination (x_{i}, or x_{i}x_{j}, or x_{i}^{2}).
- Weighted regression.

Calculating the regression's parameters without matrices is very complex, but it is straightforward with matrix calculations.

p - number of independent variables.

n - sample size.

Y - dependent variable vector (n x 1).
Ŷ - predicted Y vector (n x 1).
X - independent matrix (n x p+1).
Ε - Residuals vector (n x 1).
B - Coefficient vector (p+1 x 1)
$$Y=\begin{bmatrix}
Y_1\\
Y_2\\
\vdots\\
Y_n
\end{bmatrix}
\quad
\hat Y=\begin{bmatrix}
\hat Y_1\\
\hat Y_2\\
\vdots\\
\hat Y_n
\end{bmatrix}
\quad
X=\begin{bmatrix}
1 &X_{11} &X_{12} &\cdots &X_{1p} \\
1 &X_{21} &X_{22} &\cdots &X_{2p} \\
\vdots &\vdots &\vdots &\ddots &\vdots \\
1 &X_{n1} &X_{n2} &\cdots &X_{np}
\end{bmatrix}
\quad
Ε=\begin{bmatrix}
\varepsilon_1\\
\varepsilon_2\\
\vdots\\
\varepsilon_n
\end{bmatrix}
\quad
B=\begin{bmatrix}
b_0\\
b_1\\
b_2\\
\vdots\\
b_p
\end{bmatrix}$$
**Y = XB + Ε** is equivalent to the equation **Y = b_{0} + b_{1}X_{1} + b_{2}X_{2} + ... + b_{p}X_{p} + ε**.

$$ B = (X'X)^{-1}X'Y\\ \hat Y=XB\\ Ε=Y-\hat Y$$

X1 | X2 | Y |
---|---|---|
1 | 1 | 2.1 |
2 | 2 | 3.9 |
3 | 3 | 6.3 |
4 | 1 | 4.95 |
5 | 2 | 7.1 |
6 | 3 | 8.5 |

__The same data in matrix form:__

$$Y=\begin{bmatrix}
2.1\\
3.9\\
6.3\\
4.95\\
7.1\\
8.5
\end{bmatrix} \quad
X=\begin{bmatrix}
1 &1 &1 \\
1 &2 &2 \\
1 &3 &3 \\
1 &4 &1 \\
1 &5 &2 \\
1 &6 &3
\end{bmatrix}
$$
The first column of the X matrix contains only the value 1, for the intercept b_{0}.
$$
B = (X'X)^{-1}X'Y\\\\
X'=\begin{bmatrix}
1 &1 &1 &1 &1 &1\\
1 &2 &3 &4 &5 &6\\
1 &2 &3 &1 &2 &3
\end{bmatrix} \quad
X'X=\begin{bmatrix}
6 &21 &12\\
21 &91 &46\\
12 &46 &28
\end{bmatrix} \quad
(X'X)^{-1}=\begin{bmatrix}
4/3 & -1/9 & -7/18\\
-1/9 & 2/27 & -2/27\\
-7/18 & -2/27 &35/108
\end{bmatrix} $$
$$ H=(X'X)^{-1}X'=\begin{bmatrix}
5/6 & 1/3 & -1/6 & 1/2 & 0 & -1/2\\
-1/9 & -1/9 & -1/9 & 1/9 & 1/9 & 1/9\\
-5/36 & 1/9 & 13/36 & -13/36 & -1/9 & 5/36
\end{bmatrix}$$
$$B=HY=\begin{bmatrix}
0.2250\\
0.9167\\
1.0208
\end{bmatrix} \quad
\hat Y=XB=\begin{bmatrix}
2.1625\\
4.1\\
6.0375\\
4.9125\\
6.85\\
8.7875
\end{bmatrix} \quad
Ε=Y-\hat Y=\begin{bmatrix}
-0.0625\\
-0.2\\
0.2625\\
0.0375\\
0.25\\
-0.2875
\end{bmatrix}
$$
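The computation above can be checked with a few lines of NumPy; this is a sketch reproducing the worked example, not the tool's implementation:

```python
import numpy as np

# Reproducing the worked example with the matrix formulas.
X = np.array([[1, 1, 1],
              [1, 2, 2],
              [1, 3, 3],
              [1, 4, 1],
              [1, 5, 2],
              [1, 6, 3]], dtype=float)
Y = np.array([2.1, 3.9, 6.3, 4.95, 7.1, 8.5])

B = np.linalg.inv(X.T @ X) @ X.T @ Y   # B = (X'X)^{-1} X'Y
Y_hat = X @ B                          # predicted values
E = Y - Y_hat                          # residuals

print(np.round(B, 4))   # approximately [0.225, 0.9167, 1.0208]
```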

Source | SS | DF | MS |
---|---|---|---|
Total (T) | 26.6187 | 5 | 5.3237 |
Residual (E) | 0.2594 | 3 | 0.08646 |
Regression (R) | 26.3594 | 2 | 13.1797 |

R^{2} = 0.9903

F = 152.4398
$$
Covariance(B)=\begin{bmatrix}
0.1153 & -0.0096 & -0.0336\\
-0.0096 & 0.0064 & -0.0064\\
-0.0336 & -0.0064 & 0.0280
\end{bmatrix} \quad
Var(B)=\begin{bmatrix}
0.1153\\
0.0064\\
0.0280
\end{bmatrix} \quad
T=\begin{bmatrix}
0.6627\\
11.4545\\
6.0986
\end{bmatrix}
$$
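The coefficient covariance matrix above is MSE·(X'X)^{-1}, and each t statistic is b_{i} divided by its standard error. A NumPy sketch of that step for the same example:

```python
import numpy as np

# Standard errors and t statistics for the worked example:
# Cov(B) = MSE * (X'X)^{-1}, t_i = b_i / se(b_i).
X = np.array([[1, 1, 1], [1, 2, 2], [1, 3, 3],
              [1, 4, 1], [1, 5, 2], [1, 6, 3]], dtype=float)
Y = np.array([2.1, 3.9, 6.3, 4.95, 7.1, 8.5])
n, k = X.shape                      # k = p + 1 parameters

XtX_inv = np.linalg.inv(X.T @ X)
B = XtX_inv @ X.T @ Y
E = Y - X @ B

sse = E @ E                         # residual sum of squares
mse = sse / (n - k)                 # residual mean square, df = n - p - 1
cov_B = mse * XtX_inv               # coefficient covariance matrix
t = B / np.sqrt(np.diag(cov_B))    # t statistics

print(np.round(t, 4))   # approximately [0.6627, 11.4545, 6.0986]
```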

When the dependent variable is a **binary variable**, also called a **dichotomous variable**, you should use logistic regression. The model calculates the probability that the category occurs, based on the independent variables, X_{j}.

The dependent variable Y may take only two values, 1 or 0; for example win or lose, success or failure, etc.
The following model requires the data to be **accumulated** (grouped) by each combination i of X: (x_{1}..x_{p}). There is also a similar, commonly used model based on single events,
where each data row is a single event; in that case i is a single event and Y_{i} may be only 1 or 0.

- **y(1)_{i}**: the total number of 1 occurrences for the i-th combination of X.
- **y(0)_{i}**: the total number of 0 occurrences for the i-th combination of X.
- **t_{i}**: the total number of events for the i-th combination of X, t_{i} = y(1)_{i} + y(0)_{i}.
- **p_{i}**: the observed probability for event = 1.
- **p̂_{i}**: the predicted probability for event = 1 based on the model.

**Odds** is the ratio between the probability that the event will happen and the probability that it won't happen.

The odds carry the same information as the probability, just from a different angle.
$$odds=\frac{p}{1-p}$$
__Examples__

When P = 1/3 the odds are 1:2 (odds = 0.5).

When P = 1/2 the odds are 1:1 (odds = 1).
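The probability-odds conversion can be sketched with two plain-Python helpers (the function names are my own, for illustration):

```python
# Converting between probability and odds, and back.
def odds(p):
    """Odds for an event with probability p."""
    return p / (1 - p)

def prob(odds_value):
    """Probability corresponding to the given odds."""
    return odds_value / (1 + odds_value)

print(odds(1/3))   # ≈ 0.5 (odds of 1:2)
print(odds(1/2))   # 1.0  (odds of 1:1)
print(prob(0.5))   # back to ≈ 1/3
```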

**H_{0}: ln(odds) = b_{0}**

**H_{1}: ln(odds) = b_{0} + b_{1}x_{1} + ... + b_{p}x_{p}**

$$ odds(x_1..x_p)=\frac{p(x_1..x_p)}{1-p(x_1..x_p)}=e^{b_0+b_1x_1+...+b_px_p}\quad\Rightarrow\quad p(x_1..x_p)=\frac{1}{1+e^{-(b_0+b_1x_1+...+b_px_p)}}$$ The maximum likelihood method, based on the binomial distribution, is used instead of the least squares method of linear regression.

The likelihood is the probability of observing the sample data under the model; the log-likelihood is its logarithm.

We use Newton's method to find the vector of coefficients B that maximizes the log-likelihood function, based on the following iteration formula.
The iteration loop runs until the difference between B_{R+1} and B_{R} approaches zero for each coefficient element.
B_{0} = [0, 0, .., 0]

$$ V=\begin{bmatrix}
t_1 \hat p_1(1- \hat p_1) & 0 & 0 & 0\\
0 & t_2 \hat p_2(1- \hat p_2) & 0 & 0 \\
0 & 0 & ...& 0\\
0 & 0 & 0 & t_n \hat p_n(1- \hat p_n)
\end{bmatrix}$$
R: iteration number.

T: the t_{i} vector.

P: the p_{i} vector (observed probabilities).

p̂: the p̂_{i} vector (predicted probabilities).

B: the b_{i} vector (coefficients).

$$ B_{R+1}=B_R+(X'V_RX)^{-1}X'\,(T⊙(P- \hat P_R))$$
All the multiplications are matrix multiplications, except the last one, which is an element-wise multiplication. Example: [2,3] ⊙ [4,5] = [8,15].
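The iteration above can be sketched as follows, on a made-up two-group dataset chosen so that the model can reproduce the observed probabilities exactly (NumPy assumed; the variable names mirror the notation above):

```python
import numpy as np

# Newton's method for grouped logistic regression, following the update
# B_{R+1} = B_R + (X'V X)^{-1} X' (T * (P - Phat)).
X  = np.array([[1.0, 0.0],
               [1.0, 1.0]])         # intercept + one predictor
T  = np.array([10.0, 10.0])        # total events per group (t_i)
y1 = np.array([2.0, 8.0])          # "1" outcomes per group (y(1)_i)
P  = y1 / T                        # observed probabilities (p_i)

B = np.zeros(2)                    # B_0 = [0, 0]
for _ in range(25):
    P_hat = 1.0 / (1.0 + np.exp(-(X @ B)))       # predicted probabilities
    V = np.diag(T * P_hat * (1.0 - P_hat))       # weight matrix V
    step = np.linalg.solve(X.T @ V @ X, X.T @ (T * (P - P_hat)))
    B = B + step
    if np.max(np.abs(step)) < 1e-10:             # convergence check
        break

P_fit = 1.0 / (1.0 + np.exp(-(X @ B)))
print(np.round(B, 4))    # b0 = ln(0.2/0.8), b1 = ln(0.8/0.2) - ln(0.2/0.8)
```

With two parameters and two groups the model is saturated, so the fitted probabilities match the observed ones exactly; on real data they would only approximate them.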