Regression is a method for estimating the relationship between a dependent variable (Y) and independent variables (X_{i}).

You may use linear regression when there is a linear relationship between the dependent variable (Y) and the independent variable (X). When X increases by one unit, Y changes by a constant amount, the **b_{1}** coefficient.

H_{0}: Y = b_{0}

H_{1}: Y = b_{0} + b_{1}X

The least squares method is used to calculate the coefficients b_{0} and b_{1}. The method chooses the line that minimizes the sum of the squared vertical distances between the observed values (Y_{i}) and the fitted line.

$$Min(\sum_{i=1 }^{n}(\hat y_i-y_i)^2)$$
$$b_1=\frac{\sum_{1}^{n}(x_i-\bar{x})(y_i-\bar{y}) }{\sum_{1}^{n}(x_i-\bar{x})^2}\\
b_0=\bar{y}-b_1\bar{x}$$
R^{2} is the proportion of the total variance of Y that is explained by the regression on X.

R is the correlation between X and Y

$$R=b_1\frac{s_x}{s_y}$$

where s_{x} and s_{y} are the standard deviations of X and Y.
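As a quick numeric check, the slope, intercept, and correlation formulas above can be computed directly with numpy (the data here is made up, purely for illustration):

```python
import numpy as np

# Made-up sample data (an assumption, for illustration only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

# Least squares coefficients b1 and b0
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# The correlation R equals b1 scaled by the ratio of standard deviations
r = b1 * x.std() / y.std()
print(b0, b1, r)
```

The last line recovers exactly the Pearson correlation, which you can verify against `np.corrcoef(x, y)`.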

When there is more than one independent variable, multiple regression compares the following hypotheses using the F statistic:

H_{0}: Y = b_{0}

H_{1}: Y = b_{0}+b_{1}X_{1}+...+b_{p}X_{p}

This is an iterative process, as you should also check each coefficient independently with the following hypotheses:

H_{0}: b_{i} = 0

H_{1}: b_{i} ≠ 0

Each time you should remove only the single most insignificant variable (p-value > α), changing the **include** value from **√** to **χ**.

After removing one insignificant variable, other insignificant variables may become significant in the new model.
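A minimal sketch of this backward-elimination loop, using made-up data and a fixed |t| > 2 cutoff as a rough stand-in for the exact p-value > α comparison (all names and data here are assumptions):

```python
import numpy as np

# Made-up data: x0 and x1 matter, x2 is pure noise (an assumption)
rng = np.random.default_rng(1)
n = 100
X = rng.normal(size=(n, 3))
y = 1.0 + 2.0 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(size=n)

cols = [0, 1, 2]                      # variables currently "included"
while cols:
    Xd = np.column_stack([np.ones(n), X[:, cols]])
    b = np.linalg.lstsq(Xd, y, rcond=None)[0]
    e = y - Xd @ b
    mse = e @ e / (n - Xd.shape[1])
    se = np.sqrt(mse * np.diag(np.linalg.inv(Xd.T @ Xd)))
    t = b / se
    worst = np.argmin(np.abs(t[1:]))  # skip the intercept
    if np.abs(t[1:])[worst] > 2.0:    # rough 5% cutoff; the tool uses p-values
        break                         # every remaining variable is significant
    cols.pop(worst)                   # drop only the single worst variable
print(cols)
```

Note that only one variable is dropped per iteration, matching the rule above.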

- **Linearity** - a linear relationship between the dependent variable Y and the independent variables X_{i}.
- **Residual normality** - the tool runs the Shapiro-Wilk test for each variable, but for the regression the **only** normality assumption regards the residuals.
- **Homoscedasticity** (homogeneity of variance) - the variance of the residuals is constant and doesn't depend on the independent variables X_{i}.
- **Variables** - the dependent variable, Y, should be a continuous variable, while the independent variables, X_{i}, should be continuous or ordinal variables (ordinal example: low, medium, high).
- **No perfect correlation** (multicollinearity) between two or more independent variables X_{i}.
- **Independent observations**

- Too many independent variables (X_{i}) or too small a sample size. **Solution**: reduce the number of independent variables or increase the sample size.
- Multicollinearity: two independent variables (X_{i}) have a perfect correlation (1). **Solution**: remove one of the variables.

Test for homoscedasticity (homogeneity of variance) using the following hypotheses:

$$ H_0: \hat\varepsilon_i^2=b_0\\
H_1: \hat\varepsilon_i^2=b_0+b_1\hat Y_i+b_2\hat Y_i^2 $$
where ε is the residual and Ŷ is the predicted Y. The test runs a **second** regression with the following variables:

Dependent variable: Y' = ε^{2}.

Independent variables: X'_{1}=Ŷ, X'_{2}=Ŷ^{2}.

The tool uses the F statistic that results from the **second** regression. Another option is to use the following statistic: χ^{2}=nR'^{2}, where n is the sample size and R'^{2} is the R^{2} of the second regression.
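A sketch of this auxiliary regression in numpy, using made-up homoscedastic data (in practice ε and Ŷ come from your fitted first regression):

```python
import numpy as np

# Made-up data for the FIRST regression (an assumption, for illustration)
rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)

# First regression: y on x, giving residuals e and predictions y_hat
X1 = np.column_stack([np.ones(n), x])
b = np.linalg.lstsq(X1, y, rcond=None)[0]
y_hat = X1 @ b
e = y - y_hat

# Second regression: e^2 on y_hat and y_hat^2
X2 = np.column_stack([np.ones(n), y_hat, y_hat ** 2])
b2 = np.linalg.lstsq(X2, e ** 2, rcond=None)[0]
resid2 = e ** 2 - X2 @ b2
r2_second = 1 - resid2 @ resid2 / np.sum((e ** 2 - np.mean(e ** 2)) ** 2)

chi2 = n * r2_second   # compare against a chi-square with 2 degrees of freedom
print(chi2)
```

A large χ² (relative to the chi-square critical value with 2 degrees of freedom) would reject H₀ and indicate heteroscedasticity.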

- Try to transform the variables: square root for a count variable, log for a skewed variable, etc.
- You may be missing an independent variable or a combination (x_{i}, x_{i}x_{j}, or x_{i}^{2}).
- Weighted regression.

Calculating the regression parameters without matrices is very complex, but it is very easy with matrix calculation.

p - number of independent variables.

n - sample size.

Y - dependent variable vector (n x 1).

Ŷ - predicted Y vector (n x 1).

X - independent variables matrix (n x p+1).

Ε - residuals vector (n x 1).

B - coefficients vector (p+1 x 1).
$$Y=\begin{bmatrix}
&Y_1\\
&Y_2\\
& :\\
&Y_n
\end{bmatrix}
\hat Y=\begin{bmatrix}
& \hat Y_1\\
& \hat Y_2\\
& :\\
& \hat Y_n
\end{bmatrix}
X=\begin{bmatrix}
&1 &X_{11} &X_{12} & .. &X_{1p} \\
&1 &X_{21} &X_{22} & .. &X_{2p} \\
& : & : & : & : & : \\
&1 &X_{n1} &X_{n2} & .. &X_{np}
\end{bmatrix}
Ε=\begin{bmatrix}
& \varepsilon_1\\
& \varepsilon_2\\
& :\\
& \varepsilon_n
\end{bmatrix}
B=\begin{bmatrix}
&b_0\\
&b_1\\
&b_2\\
& :\\
&b_p
\end{bmatrix}\\$$
**Y = XB + Ε** is equivalent to the following equation: **Y = b_{0} + b_{1}X_{1} + b_{2}X_{2}+...+b_{p}X_{p}+ε**

$$ B = (X'X)^{-1}X'Y\\ \hat Y=XB\\ Ε=Y-\hat Y$$

X1 | X2 | Y |
---|---|---|
1 | 1 | 2.1 |
2 | 2 | 3.9 |
3 | 3 | 6.3 |
4 | 1 | 4.95 |
5 | 2 | 7.1 |
6 | 3 | 8.5 |

__Following is the same data in matrix structure.__

$$Y=\begin{bmatrix}
&2.1\\
&3.9\\
&6.3\\
&4.95\\
&7.1\\
&8.5\\
\end{bmatrix} \quad
X=\begin{bmatrix}
&1 &1 &1 \\
&1 &2 &2 \\
&1 &3 &3 \\
&1 &4 &1 \\
&1 &5 &2 \\
&1 &6 &3
\end{bmatrix}
$$
The first column of the X matrix contains only the value 1, for the b_{0} intercept.
$$
B = (X'X)^{-1}X'Y\\\\
X'=\begin{bmatrix}
&1 &1 &1 &1 &1 &1\\
&1 &2 &3 &4 &5 &6\\
&1 &2 &3 &1 &2 &3
\end{bmatrix} \quad
X'X=\begin{bmatrix}
&6 &21 &12\\
&21 &91 &46\\
&12 &46 &28
\end{bmatrix} \quad
(X'X)^{-1}=\begin{bmatrix}
& 4/3 & -1/9 & -7/18\\
& -1/9 & 2/27 & -2/27\\
& -7/18 & -2/27 &35/108
\end{bmatrix} $$
$$ H=(X'X)^{-1}X'=\begin{bmatrix}
&5/6 &1/3 & -1/6 &1/2 &0 & -1/2\\
& -1/9 & -1/9 & -1/9 &1/9 &1/9 &1/9\\
& -5/36 &1/9 &13/36 & -13/36 & -1/9 &5/36
\end{bmatrix}$$
$$B=HY=\begin{bmatrix}
&0.2250\\
&0.9167\\
&1.0208
\end{bmatrix} \quad
\hat Y=XB=\begin{bmatrix}
&2.1625\\
&4.1\\
&6.0375\\
&4.9125\\
&6.85\\
&8.7875\\
\end{bmatrix} \quad
Ε=\begin{bmatrix}
& -0.0625\\
& -0.2\\
& 0.2625\\
& 0.0375\\
& 0.25\\
& -0.2875\\
\end{bmatrix}
$$
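The hand calculation above can be reproduced in a few lines of numpy:

```python
import numpy as np

# The example data from the table above
X = np.array([[1, 1, 1],
              [1, 2, 2],
              [1, 3, 3],
              [1, 4, 1],
              [1, 5, 2],
              [1, 6, 3]], dtype=float)
Y = np.array([2.1, 3.9, 6.3, 4.95, 7.1, 8.5])

B = np.linalg.inv(X.T @ X) @ X.T @ Y   # B = (X'X)^{-1} X'Y
Y_hat = X @ B                          # predicted values
E = Y - Y_hat                          # residuals
print(B)
```

In practice `np.linalg.lstsq(X, Y, rcond=None)` is preferred over forming the inverse explicitly, but the direct formula mirrors the derivation above.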

 | SS | DF | MS |
---|---|---|---|
Regression (R) | 26.3594 | 2 | 13.1797 |
Residual (E) | 0.2594 | 3 | 0.0865 |
Total (T) | 26.6187 | 5 | 5.3237 |

R^{2} = 0.9903

F = 152.4398
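The ANOVA quantities follow directly from the residuals; a numpy check using the same data as the worked example:

```python
import numpy as np

# Same example data as above
X = np.array([[1, 1, 1], [1, 2, 2], [1, 3, 3],
              [1, 4, 1], [1, 5, 2], [1, 6, 3]], dtype=float)
Y = np.array([2.1, 3.9, 6.3, 4.95, 7.1, 8.5])
n, p = 6, 2

B = np.linalg.lstsq(X, Y, rcond=None)[0]
E = Y - X @ B
sse = E @ E                          # residual sum of squares
sst = np.sum((Y - Y.mean()) ** 2)    # total sum of squares
ssr = sst - sse                      # regression sum of squares

r2 = ssr / sst
F = (ssr / p) / (sse / (n - p - 1))  # MSR / MSE
print(r2, F)
```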
$$
Covariance(B)=\begin{bmatrix}
& 0.1153 & -0.0096 & -0.0336\\
& -0.0096 & 0.0064 & -0.0064\\
& -0.0336 & -0.0064 & 0.0280
\end{bmatrix} \quad
Var(B)=\begin{bmatrix}
&0.1153\\
&0.0064\\
&0.0280
\end{bmatrix} \quad
T=\begin{bmatrix}
&0.6627\\
&11.4545\\
&6.0986
\end{bmatrix} \quad
$$
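The coefficient covariance matrix is MSE·(X'X)⁻¹, and each t statistic is b_{i} divided by its standard error; in numpy:

```python
import numpy as np

# Same example data as above
X = np.array([[1, 1, 1], [1, 2, 2], [1, 3, 3],
              [1, 4, 1], [1, 5, 2], [1, 6, 3]], dtype=float)
Y = np.array([2.1, 3.9, 6.3, 4.95, 7.1, 8.5])

XtX_inv = np.linalg.inv(X.T @ X)
B = XtX_inv @ X.T @ Y
E = Y - X @ B
mse = E @ E / (len(Y) - X.shape[1])  # SSE / (n - p - 1)

cov_B = mse * XtX_inv                # Covariance(B)
var_B = np.diag(cov_B)               # Var(B): the diagonal
t = B / np.sqrt(var_B)               # t statistic per coefficient
print(t)
```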

When the dependent variable is a **binary variable**, also called a **dichotomous variable**, you should use logistic regression. The model calculates the probability of the category occurring based on the independent variables, X_{j}.

The dependent variable Y may take only two values, 1 or 0, for example win or lose, success or failure, etc.
The following model requires the **accumulated** input data, based on combination i of X: (x_{1}..x_{p}). There is also a similar, commonly used model based on single events; in that case, every data row is a single event and Y_{i} may be only 1 or 0.

- **y(1)_{i}**: the total number of 1 occurrences for combination i of X.
- **y(0)_{i}**: the total number of 0 occurrences for combination i of X.
- **t_{i}**: the total number of events for combination i of X, t_{i}=y(1)_{i}+y(0)_{i}.
- **p_{i}**: the observed probability for event = 1.
- **p̂_{i}**: the predicted probability for event = 1, based on the model.

**Odds** is the ratio between the probability that the event will happen and the probability that it won't happen.

The odds is actually similar to the probability, but seen from a different angle.
$$odds=\frac{p}{1-p}$$
__Examples__

When P = 1/3 the odds are 1:2 (odds = 0.5).

When P = 1/2 the odds are 1:1 (odds = 1).

**H_{0}: ln(odds) = b_{0}**

**H_{1}: ln(odds) = b_{0} + b_{1}X_{1} + ... + b_{p}X_{p}**

$$ odds(x_1..x_p)=\frac{p(x_1..x_p)}{1-p(x_1..x_p)}=e^{b_0+b_1x_1+...+b_px_p}\quad\Rightarrow\quad p(x_1..x_p)=\frac{1}{1+e^{-(b_0+b_1x_1+...+b_px_p)}}$$ The maximum log-likelihood method is used instead of the least squares method that is used in linear regression. The method is based on the binomial distribution.
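A quick numeric check that the two forms above agree, with hypothetical coefficient values (b0, b1, and x below are assumptions, for illustration only):

```python
import math

# Hypothetical coefficients and input (assumptions, for illustration)
b0, b1 = -1.0, 0.5
x = 2.0

z = b0 + b1 * x
odds = math.exp(z)          # odds(x) = e^(b0 + b1*x)
p = 1 / (1 + math.exp(-z))  # p(x) = 1 / (1 + e^-(b0 + b1*x))
print(p, odds)
```

Dividing p by 1 − p recovers the odds, confirming the equivalence of the two expressions.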

Likelihood is the probability that the observed sample data will occur under the model, and the maximum log-likelihood method finds the coefficients B that maximize it.

We use Newton's method to find the vector of parameters B that maximizes the log-likelihood function, based on the following iteration formula.
The iteration loop runs until the difference between B_{R+1} and B_{R} approaches zero for each coefficient element.
B_{0}=[0,0,..0]

$$ V=\begin{bmatrix}
t_1 \hat p_1(1- \hat p_1) & 0 & 0 & 0\\
0 & t_2 \hat p_2(1- \hat p_2) & 0 & 0 \\
0 & 0 & ...& 0\\
0 & 0 & 0 & t_n \hat p_n(1- \hat p_n)
\end{bmatrix}$$
R: iteration number.

T: t_{i} vector.

P: p_{i} vector (observed probabilities).

p̂: p̂_{i} vector (predicted probabilities).

B: b_{i} vector (coefficients).

$$ B_{R+1}=B_R+(X'V_RX)^{-1}X'\,(T⊙(P- \hat P_R))$$
All the multiplications are matrix multiplications, except for T⊙(P − P̂_{R}), which is an element-wise multiplication. Example: [2,3] ⊙ [4,5] = [8,15].
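A minimal numpy sketch of this Newton iteration for the grouped-data model (the data itself is made up; t, y1, and the column of ones follow the definitions above):

```python
import numpy as np

# Made-up grouped data: row i is one combination of X with t_i events
X = np.array([[1, 1.0],
              [1, 2.0],
              [1, 3.0],
              [1, 4.0]])                # first column of ones for b0
t = np.array([10.0, 10.0, 10.0, 10.0])  # t_i: total events per combination
y1 = np.array([2.0, 4.0, 6.0, 9.0])     # y(1)_i: number of 1-outcomes
P = y1 / t                              # p_i: observed probabilities

B = np.zeros(X.shape[1])                # B_0 = [0, 0]
for _ in range(50):
    P_hat = 1 / (1 + np.exp(-X @ B))    # predicted probabilities
    V = np.diag(t * P_hat * (1 - P_hat))    # the V matrix above
    # step = (X'V X)^{-1} X' (T ⊙ (P - P_hat))
    step = np.linalg.solve(X.T @ V @ X, X.T @ (t * (P - P_hat)))
    B = B + step
    if np.max(np.abs(step)) < 1e-10:    # B stopped changing
        break
print(B)
```

At convergence the gradient X'(T ⊙ (P − P̂)) is zero, which is the maximum-likelihood condition.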