Logistic Regression


Logistic regression is a method for modeling the relationship between a nominal categorical dependent variable (Y) and continuous or categorical independent variables (X_i), the same kinds of independent variables used in linear regression. For example, the dependent variable Y may be a color with the values blue, red, or green.

Binary Logistic Regression

When the dependent variable is binary, also called dichotomous, you should use binary logistic regression. The model calculates the probability that a category occurs based on the independent variables, X_j.
The dependent variable Y may take only two values, 1 or 0, such as win or lose, or success or failure. The following model requires accumulated input data, with one row for each combination i of X: (x₁..x_p). A similar, commonly used model is based on single events: every data row is a single event, and the value of Y_i may be only 1 or 0.

y(1)_i: the total number of 1 occurrences for the i-th combination of X.
y(0)_i: the total number of 0 occurrences for the i-th combination of X.
e_i: the total number of events for the i-th combination of X, e_i=y(1)_i+y(0)_i
p_i: the observed probability of event=1, p_i=y(1)_i/e_i.
p̂_i: the predicted probability of event=1 based on the model.

Odds is the ratio between the probability that the event will happen and the probability that it won't happen.
The odds carry the same information as the probability, viewed from a different angle. $$odds=\frac{p}{1-p}$$ Examples
When P = 1/3 the odds are 1:2 (odds = 0.5).
When P = 1/2 the odds are 1:1 (odds = 1).
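The conversion between probability and odds can be sketched in a few lines of Python. The function names are illustrative assumptions, not part of the original text.

```python
def odds_from_probability(p):
    """Return odds = p / (1 - p) for a probability 0 <= p < 1."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must satisfy 0 <= p < 1")
    return p / (1.0 - p)


def probability_from_odds(odds):
    """Invert the relation: p = odds / (1 + odds)."""
    if odds < 0.0:
        raise ValueError("odds must be non-negative")
    return odds / (1.0 + odds)


print(odds_from_probability(1 / 3))  # approximately 0.5, i.e. odds of 1:2
print(odds_from_probability(1 / 2))  # 1.0, i.e. odds of 1:1
```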

H₀: ln(odds) = b₀
H₁: ln(odds) = b₀+b₁X₁+...+b_pX_p

$$ odds(x_1..x_p)=\frac{p(x_1..x_p)}{1-p(x_1..x_p)}=e^{b_0+b_1x_1+...+b_px_p}\\ p(x_1..x_p)=\frac{1}{1+e^{-(b_0+b_1x_1+...+b_px_p)}}$$ The maximum likelihood method is used instead of the least squares method used in linear regression. The likelihood is based on the binomial distribution.
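As a minimal sketch of the model equation above, the following Python function evaluates the predicted probability for one combination of X. The function name and the layout of the coefficient list (intercept first) are assumptions made for illustration.

```python
import math


def predicted_probability(b, x):
    """Evaluate p = 1 / (1 + exp(-(b0 + b1*x1 + ... + bp*xp))).

    b: coefficients [b0, b1, ..., bp], intercept first (assumed layout).
    x: predictor values [x1, ..., xp].
    """
    if len(b) != len(x) + 1:
        raise ValueError("b must hold one more value than x (the intercept)")
    t = b[0] + sum(bj * xj for bj, xj in zip(b[1:], x))
    return 1.0 / (1.0 + math.exp(-t))


# With all coefficients equal to zero, ln(odds) = 0, so the odds are 1:1.
print(predicted_probability([0.0, 0.0], [3.0]))  # 0.5
```

For very large negative values of the linear predictor, `math.exp` can overflow; a production implementation would guard against that.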

The likelihood is the probability of observing the sample data given the model, and the maximum likelihood method finds the coefficients, and hence the p̂_i, that maximize this probability.
n: The number of X combinations (x₁..x_p) $$L=\prod_{i=1}^n \hat p_i^{y_i(1)}(1- \hat p_i)^{y_i(0)}$$ $$LL=ln(L)=\sum_{i=1}^n{(y_i(1) ln(\hat p_i) + y_i(0)ln(1- \hat p_i))}$$
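The log-likelihood sum can be sketched directly from the formula. The function name and the small data set below are made up for illustration; the predicted probabilities are assumed to lie strictly between 0 and 1.

```python
import math


def log_likelihood(y1, y0, p_hat):
    """LL = sum_i ( y1_i * ln(p_hat_i) + y0_i * ln(1 - p_hat_i) ).

    y1, y0: counts of 1 and 0 occurrences for each combination of X.
    p_hat:  predicted probabilities, each strictly between 0 and 1.
    """
    return sum(a * math.log(p) + b * math.log(1.0 - p)
               for a, b, p in zip(y1, y0, p_hat))


# Hypothetical accumulated data with two combinations of X.
y1 = [20, 50]
y0 = [80, 50]
ll_observed = log_likelihood(y1, y0, [0.2, 0.5])   # p_hat equal to the observed p_i
ll_pooled = log_likelihood(y1, y0, [0.35, 0.35])   # one common probability, 70/200
print(ll_observed > ll_pooled)  # True: the better-fitting probabilities give the larger LL
```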

Newton's Method

We use Newton's method to find the vector of parameters B that maximizes the log-likelihood function, based on the following iteration formula. The iteration loop runs until the difference between B_R+1 and B_R approaches zero for every coefficient. The initial vector is B₀=[0,0,..0].
$$ V=\begin{bmatrix} e_1 \hat p_1(1- \hat p_1) & 0 & 0 & 0\\ 0 & e_2 \hat p_2(1- \hat p_2) & 0 & 0 \\ 0 & 0 & ...& 0\\ 0 & 0 & 0 & e_n \hat p_n(1- \hat p_n) \end{bmatrix}$$ R: iteration.
E: e_i vector (totals).
P: p_i vector (observed probabilities).
P̂: p̂_i vector (predicted probabilities).
B: b_j vector (coefficients).
$$ B_{R+1}=B_R+(X'V_RX)^{-1}X'(E\odot(P- \hat P_R))$$ All the multiplications are matrix multiplications, except for ⊙, which is an element-wise multiplication. Example: [2,3] ⊙ [4,5] =[8,15].
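The iteration above can be sketched in plain Python for accumulated data. This is an illustration under stated assumptions, not the calculator's actual implementation: `solve` is a hypothetical helper that applies Gaussian elimination instead of forming the inverse explicitly, the stopping parameters `delta` and `max_iter` are assumed names and values, and the totals-weighted residual e_i(p_i − p̂_i) is computed as y(1)_i − e_i p̂_i, which is the same quantity.

```python
import math


def solve(a, b):
    """Solve the linear system a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_binary_logistic(X, y1, y0, delta=1e-8, max_iter=25):
    """Newton iterations for binary logistic regression on accumulated data.

    X:  rows [1, x1, ..., xp], one per combination of X (leading 1 = intercept).
    y1: count of 1 occurrences per row; y0: count of 0 occurrences per row.
    """
    n, q = len(X), len(X[0])
    e = [a + b for a, b in zip(y1, y0)]            # totals e_i
    B = [0.0] * q                                   # B_0 = [0, ..., 0]
    for _ in range(max_iter):
        p_hat = [1.0 / (1.0 + math.exp(-sum(bj * xj for bj, xj in zip(B, row))))
                 for row in X]
        v = [e[i] * p_hat[i] * (1.0 - p_hat[i]) for i in range(n)]  # diagonal of V
        xvx = [[sum(X[i][r] * v[i] * X[i][c] for i in range(n)) for c in range(q)]
               for r in range(q)]                   # X'VX
        grad = [sum(X[i][r] * (y1[i] - e[i] * p_hat[i]) for i in range(n))
                for r in range(q)]                  # X' applied to e_i * (p_i - p_hat_i)
        step = solve(xvx, grad)                     # (X'VX)^-1 times the vector above
        B = [bj + sj for bj, sj in zip(B, step)]
        if max(abs(sj) for sj in step) < delta:
            break
    return B


# Hypothetical data: x=0 gives 20 ones out of 100, x=1 gives 50 ones out of 100.
B = fit_binary_logistic([[1.0, 0.0], [1.0, 1.0]], [20, 50], [80, 50])
print(B)  # close to [ln(0.25), ln(4)], since this two-group model fits exactly
```

Solving the linear system instead of inverting X'VX gives the same update and is usually the numerically safer choice. The helper will fail with a division by zero if X'VX is singular.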

Multinomial Logistic Regression

When the dependent variable can take more than two categorical values, you should use multinomial logistic regression. The model calculates the probability of each category based on the independent variables, X_j. For example, Y may be Red, Blue, or Green.

You may enter the input data in one of the following ways:
Single Y column - each data row is a single event, and the Y column contains one of the categorical values.
Several Y columns - each Y column represents one of the categorical values. You may enter a single event per row or the accumulated number of events per row.
When you use several Y columns with accumulated input data, each row holds the counts for one combination i of X: (x₁..x_p). When every row is a single event, each Y column value may be only 1 or 0.

k+1: the number of categorical outcomes, h=(0,1,..,k).
p: the number of independent variables, j=(0,1,..,p), where j=0 is the intercept.
n: the sample size or the number of X combinations (x₁..x_p), i=(1,..,n).
y_i,h: the observed occurrences of outcome y(h) for the i-th combination of X.
ŷ_i,h: the predicted occurrences of outcome y(h) for the i-th combination of X.
e_i: the total occurrences for the i-th combination of X, e_i=y_i,0+y_i,1+...+y_i,k

H₀: ln(P_h/P₀) = b_0,h
H₁: ln(P_h/P₀) = b_0,h+b_1,hX₁+...+b_p,hX_p

As you probably noticed, h=0 is missing above. This is because all the probabilities sum to one, so p₀ is calculated from the other probabilities. $$\frac{p_h}{p_0}=e^{t_h} =e^{b_{0,h}+b_{1,h}x_1+...+b_{p,h}x_p}\quad,\quad p_0+p_1+...+p_k=1 \quad\Rightarrow\\ p_0=\frac{1}{1+e^{t_1}+..+e^{t_k}}\quad p_h=\frac{e^{t_h}}{1+e^{t_1}+..+e^{t_k}} $$ The maximum likelihood method is used instead of the least squares method used in linear regression. The likelihood is based on the multinomial distribution.
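The probability formulas for p₀ and p_h can be sketched as a small Python function. The function name is an assumption; the input is the list of linear predictors t_1..t_k for one combination of X, with outcome 0 as the reference category.

```python
import math


def category_probabilities(t):
    """Return [p_0, p_1, ..., p_k] from the linear predictors [t_1, ..., t_k].

    p_0 = 1 / (1 + sum_h exp(t_h)),  p_h = exp(t_h) / (1 + sum_h exp(t_h)).
    """
    exp_t = [math.exp(th) for th in t]
    denom = 1.0 + sum(exp_t)
    return [1.0 / denom] + [v / denom for v in exp_t]


# When every t_h is zero, all k+1 outcomes are equally likely.
print(category_probabilities([0.0, 0.0]))
```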

The likelihood is the probability of observing the sample data given the model, and the maximum likelihood method finds the coefficients, and hence the p̂_i,h, that maximize this probability.
$$L=\prod_{i=1}^n \prod_{h=0}^k \hat p_{i,h}^{y_{i,h}}$$ $$LL=ln(L)=\sum_{i=1}^n \sum_{h=0}^k y_{i,h}ln(\hat p_{i,h})$$
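A minimal sketch of the multinomial log-likelihood follows. The function name is an assumption, and terms with a zero count are skipped, using the convention that 0·ln(p) contributes nothing. The counts are the favored-color data from the numeric example further down, evaluated at equal probabilities of 1/3, which is what the model predicts when all coefficients are zero.

```python
import math


def multinomial_log_likelihood(Y, P_hat):
    """LL = sum_i sum_h y[i][h] * ln(p_hat[i][h]), skipping zero counts."""
    return sum(y * math.log(p)
               for y_row, p_row in zip(Y, P_hat)
               for y, p in zip(y_row, p_row)
               if y > 0)


# Observed counts per row: [Y0, Y1, Y2].
Y = [[10, 30, 16], [11, 33, 11], [12, 36, 12], [13, 39, 13],
     [12, 31, 13], [13, 34, 14], [14, 37, 15], [15, 40, 16]]
P_equal = [[1.0 / 3.0] * 3 for _ in Y]
ll_start = multinomial_log_likelihood(Y, P_equal)
print(ll_start)  # equals -(total count) * ln(3)
```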

Newton's Method

We use Newton's method to find the matrix of parameters B that maximizes the log-likelihood function, based on the following iteration formula. delta - the tool runs the iteration until the change between each element of the B_R+1 matrix and the B_R matrix is smaller than delta, or until it reaches the maximum number of iterations. The default delta is 0.00000001 and the default maximum is 15 iterations.
Usually you won't need to change these parameters, since the changes become very small after a small number of iterations. For the binary case the iteration formula is: $$ B_{R+1}=B_R+(X'V_RX)^{-1}X'(E\odot(P- \hat P_R))$$ R: iteration.
E: e_i vector (totals).
P: p_i vector (observed probabilities).
P̂: p̂_i vector (predicted probabilities).
B: b_j vector (coefficients).

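$$ B_{R+1}=B_R+H^{-1}X'(Y-\hat Y_R)\\ B=\begin{bmatrix}b_{0,1} &b_{0,2} &...&b_{0,k}\\b_{1,1} &b_{1,2} &...&b_{1,k}\\...& ... & ... & ...\\b_{p,1} &b_{p,2} &...&b_{p,k}\\\end{bmatrix} \quad B_0=\begin{bmatrix}0 &0 &...&0\\0 &0 &...&0\\...& ... & ... & ...\\0 &0 &...&0\\\end{bmatrix}$$ Y and Ŷ_R hold the observed and predicted occurrences for the outcomes h=1..k. X'(Y-Ŷ_R) is a (p+1)*k matrix; it is stacked column by column into a vector of length k(p+1) before it is multiplied by H⁻¹, and the result is reshaped back to the order of B.
Matrix orders: X: n*(p+1), B: (p+1)*k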
$$X=\begin{bmatrix} 1 &X_{11} &X_{12} & .. &X_{1p} \\ 1 &X_{21} &X_{22} & .. &X_{2p} \\ : & : & : & : & : \\ 1 &X_{n1} &X_{n2} & .. &X_{np} \end{bmatrix}$$ For any h between 1 and k and i between 1 and n:
$$t_{i,1}=b_{0,1}+b_{1,1}x_{i,1}+...+b_{p,1}x_{i,p}\\ t_{i,2}=b_{0,2}+b_{1,2}x_{i,1}+...+b_{p,2}x_{i,p}\\ ...\\ t_{i,k}=b_{0,k}+b_{1,k}x_{i,1}+...+b_{p,k}x_{i,p}\\ T=XB \quad$$ T matrix order: n*k
$$T=\begin{bmatrix}t_{1,1} &t_{1,2} &...&t_{1,k}\\t_{2,1} &t_{2,2} &...&t_{2,k}\\...& ... & ... & ...\\t_{n,1} &t_{n,2} &...&t_{n,k}\\\end{bmatrix}$$ There are k*k W_h₁,h₂ diagonal sub-matrices, each of order n*n.
When h₁=h₂
$$W_{h_1,h_2}=\begin{bmatrix} e_1 \hat p_{1,h_1}(1- \hat p_{1,h_1}) & 0 & 0 & 0\\ 0 & e_2 \hat p_{2,h_1}(1- \hat p_{2,h_1}) & 0 & 0 \\ 0 & 0 & ...& 0\\ 0 & 0 & 0 & e_n \hat p_{n,h_1}(1- \hat p_{n,h_1}) \end{bmatrix}$$ When h₁≠h₂
$$W_{h_1,h_2}=\begin{bmatrix} -e_1 \hat p_{1,h_1}\hat p_{1,h_2} & 0 & 0 & 0\\ 0 & -e_2 \hat p_{2,h_1}\hat p_{2,h_2} & 0 & 0 \\ 0 & 0 & ...& 0\\ 0 & 0 & 0 & -e_n \hat p_{n,h_1}\hat p_{n,h_2} \end{bmatrix}$$ There are k*k H_h₁,h₂ sub-matrices, each of order (p+1)*(p+1).
$$H_{h_1,h_2}=X'W_{h_1,h_2}X=\begin{bmatrix}V_{0,0} &V_{0,1} &...&V_{0,p}\\V_{1,0} &V_{1,1} &...&V_{1,p}\\...& ... & ... & ...\\V_{p,0} &V_{p,1} &...&V_{p,p}\\\end{bmatrix}$$ H is a matrix of order k(p+1)*k(p+1), built from k*k sub-matrices of order (p+1)*(p+1). $$H=\begin{bmatrix}H_{1,1} &H_{1,2} &...&H_{1,k}\\H_{2,1} &H_{2,2} &...&H_{2,k}\\...& ... & ... & ...\\H_{k,1} &H_{k,2} &...&H_{k,k}\\\end{bmatrix}$$
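Assembling H from its sub-matrices can be sketched as follows. The function name and the argument layout are assumptions: `P[i][h]` holds the predicted probability of outcome h+1 for row i, so the reference outcome 0 is not included. Because every W is diagonal, the sketch never builds the n*n matrices and instead accumulates the weighted products row by row.

```python
def build_h(X, e, P):
    """Assemble H as nested lists of order k(p+1) x k(p+1).

    X: n rows of [1, x1, ..., xp].
    e: totals e_i for each row.
    P: n rows of predicted probabilities for outcomes 1..k (reference outcome excluded).
    The block at position (h1, h2) is X' W_{h1,h2} X.
    """
    n, q, k = len(X), len(X[0]), len(P[0])
    H = [[0.0] * (k * q) for _ in range(k * q)]
    for h1 in range(k):
        for h2 in range(k):
            for i in range(n):
                if h1 == h2:
                    w = e[i] * P[i][h1] * (1.0 - P[i][h1])   # diagonal of W when h1 = h2
                else:
                    w = -e[i] * P[i][h1] * P[i][h2]          # diagonal of W when h1 != h2
                for r in range(q):
                    for c in range(q):
                        H[h1 * q + r][h2 * q + c] += X[i][r] * w * X[i][c]
    return H


# Tiny hypothetical case: one row, intercept only, k = 2 non-reference outcomes.
H_small = build_h([[1.0]], [10], [[0.5, 0.25]])
print(H_small)  # [[2.5, -1.25], [-1.25, 1.875]]
```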

Numeric Example

Y - the favored color: Y₀ - Red, Y₁ - Blue, Y₂ - Green.
X₁ - age
X₂ - gender

X1	X2	Y0	Y1	Y2
10	0	10	30	16
11	0	11	33	11
12	0	12	36	12
13	0	13	39	13
10	1	12	31	13
11	1	13	34	14
12	1	14	37	15
13	1	15	40	16

1. Input Data - B₀ contains zeros or a best estimate. For a more illustrative example, we skip the first iteration and start from the following B₁ matrix. $$X=\begin{bmatrix}1&10&0\\1&11&0\\1&12&0\\1&13&0\\1&10&1\\1&11&1\\1&12&1\\1&13&1\end{bmatrix}\quad Y=\begin{bmatrix}10&30&16\\11&33&11\\12&36&12\\13&39&13\\12&31&13\\13&34&14\\14&37&15\\15&40&16\end{bmatrix}\quad B_0=\begin{bmatrix}0&0\\0&0\\0&0\end{bmatrix}\quad B_1=\begin{bmatrix}0.8816130281877883&0.629336209581317\\0.02488619976105243&-0.047810749253181455\\-0.13088367581011273&-0.02756271627471052\end{bmatrix}$$
2. Linear predictors - T=XB₁ and the element-wise exponent Exp(T). $$T=\begin{bmatrix}1.1305&0.1512\\1.1554&0.1034\\1.1802&0.05561\\1.2051&0.007796\\0.9996&0.1237\\1.0245&0.07586\\1.0494&0.02804\\1.0742&-0.01977\end{bmatrix}\quad Exp(T)=\begin{bmatrix}3.0971&1.1633\\3.1752&1.1090\\3.2552&1.0572\\3.3372&1.0078\\2.7172&1.1316\\2.7856&1.0788\\2.8558&1.0284\\2.9278&0.9804\end{bmatrix}$$
3. Weight matrices and blocks - each W is a diagonal matrix of order 8*8, shown here by its diagonal, and each block is C_h₁,h₂=X'W_h₁,h₂X. $$W_{1,1}=\operatorname{diag}(13.5588,\ 13.1902,\ 14.2372,\ 15.2448,\ 13.7958,\ 14.9280,\ 16.0265,\ 17.0887)\quad \color{blue}{C_{1,1}=\begin{bmatrix}118.0700&1366.3464&61.8391\\1366.3464&15960.0999&716.6381\\61.8391&716.6381&61.8391\end{bmatrix}}$$ $$W_{1,2}=\operatorname{diag}(-7.2910,\ -6.9358,\ -7.3165,\ -7.6521,\ -7.3239,\ -7.7470,\ -8.1256,\ -8.4599)\quad \color{green}{C_{1,2}=\begin{bmatrix}-60.8518&-702.4211&-31.6564\\-702.4211&-8184.7008&-365.9419\\-31.6564&-365.9419&-31.6564\end{bmatrix}}$$
4. H matrix and its inverse, S_inv=H⁻¹. $$H=\begin{bmatrix}C_{1,1} & C_{1,2} \\C_{2,1} & C_{2,2} \end{bmatrix}$$ $$H=\begin{bmatrix}\color{blue}{118.0700}&\color{blue}{1366.3464}&\color{blue}{61.8391}&\color{green}{-60.8518}&\color{green}{-702.4211}&\color{green}{-31.6564}\\ \color{blue}{1366.3464}&\color{blue}{15960.0999}&\color{blue}{716.6381}&\color{green}{-702.4211}&\color{green}{-8184.7008}&\color{green}{-365.9419}\\ \color{blue}{61.8391}&\color{blue}{716.6381}&\color{blue}{61.8391}&\color{green}{-31.6564}&\color{green}{-365.9419}&\color{green}{-31.6564}\\ \color{green}{-60.8518}&\color{green}{-702.4211}&\color{green}{-31.6564}&81.1422&936.0234&42.8676\\ \color{green}{-702.4211}&\color{green}{-8184.7008}&\color{green}{-365.9419}&936.0234&10899.6902&495.1943\\ \color{green}{-31.6564}&\color{green}{-365.9419}&\color{green}{-31.6564}&42.8676&495.1943&42.8676\end{bmatrix}$$ $$S_{inv}=\begin{bmatrix}1.5030&-0.1274&-0.02527&1.1256&-0.09541&-0.01952\\-0.1274&0.01101&-0.0003606&-0.09542&0.008253&-0.0002541\\-0.02527&-0.0003606&0.05546&-0.01936&-0.0002683&0.04167\\1.1256&-0.09542&-0.01936&2.1687&-0.1843&-0.03720\\-0.09541&0.008253&-0.0002683&-0.1843&0.01598&-0.0005251\\-0.01952&-0.0002541&0.04167&-0.03720&-0.0005251&0.08078\end{bmatrix}$$
5. Predicted occurrences and residuals - Ŷ holds the predicted occurrences, YP=Y-Ŷ for the outcomes Y₁ and Y₂, XTP=X'YP, and XTP_flat stacks the columns of XTP. $$\hat Y=\begin{bmatrix}10.6456&32.9708&12.3836\\10.4085&33.0489&11.5426\\11.2944&36.7653&11.9403\\12.1608&40.5832&12.2560\\11.5492&31.3812&13.0695\\12.5400&34.9318&13.5282\\13.5128&38.5902&13.8971\\14.4655&42.3521&14.1824\end{bmatrix}\quad YP=\begin{bmatrix}-2.9708&3.6164\\-0.04887&-0.5426\\-0.7653&0.05975\\-1.5832&0.7440\\-0.3812&-0.06954\\-0.9318&0.4718\\-1.5902&1.1029\\-2.3521&1.8176\end{bmatrix}$$ $$XTP=\begin{bmatrix}-10.6235&7.2003\\-123.7322&81.9424\\-5.2553&3.3228\end{bmatrix}\quad XTP_{flat}=\begin{bmatrix}-10.623468721743546\\-123.73216763608585\\-5.25532122412319\\7.200305308174219\\81.94239432654341\\3.3227876621204917\end{bmatrix}$$
6. Update - Change is the product of S_inv and XTP_flat, written with one row per outcome, and B₂ is B₁ plus the transpose of Change. $$Change=\begin{bmatrix}0.1477&-0.01891&-0.001247\\0.3391&-0.02530&-0.02263\end{bmatrix}\quad B_2=\begin{bmatrix}1.0293&0.9684\\0.005978&-0.07311\\-0.1321&-0.05019\end{bmatrix}$$ After iteration number 6: $$ B_6=\begin{bmatrix}1.0230&0.9536\\0.006523&-0.07203\\-0.1320&-0.04882\end{bmatrix}$$
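The whole numeric example can be retraced with a short pure-Python sketch. It is an illustration under stated assumptions rather than the calculator's own code: the helper names (`solve`, `probabilities`, `newton_step`, `fit`) are hypothetical, the linear system is solved by Gaussian elimination instead of computing the inverse of H, and the defaults for `delta` and `max_iter` follow the values quoted in the text.

```python
import math


def solve(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def probabilities(X, B):
    """Predicted probabilities of outcomes 1..k per row; B has order (p+1) x k."""
    P = []
    for row in X:
        ex = [math.exp(sum(row[j] * B[j][h] for j in range(len(row))))
              for h in range(len(B[0]))]
        denom = 1.0 + sum(ex)
        P.append([v / denom for v in ex])
    return P


def newton_step(X, Y, B):
    """One update B + H^-1 X'(Y - Y_hat); Y columns are [Y0, Y1, ..., Yk]."""
    n, q, k = len(X), len(X[0]), len(B[0])
    e = [sum(row) for row in Y]                       # totals e_i
    P = probabilities(X, B)
    H = [[0.0] * (k * q) for _ in range(k * q)]
    g = [0.0] * (k * q)                               # X'(Y - Y_hat), stacked by outcome
    for i in range(n):
        for h1 in range(k):
            resid = Y[i][h1 + 1] - e[i] * P[i][h1]
            for r in range(q):
                g[h1 * q + r] += X[i][r] * resid
            for h2 in range(k):
                w = e[i] * P[i][h1] * ((1.0 if h1 == h2 else 0.0) - P[i][h2])
                for r in range(q):
                    for c in range(q):
                        H[h1 * q + r][h2 * q + c] += X[i][r] * w * X[i][c]
    change = solve(H, g)
    return [[B[j][h] + change[h * q + j] for h in range(k)] for j in range(q)]


def fit(X, Y, delta=1e-8, max_iter=15):
    """Iterate from B_0 = 0 until every coefficient changes by less than delta."""
    q, k = len(X[0]), len(Y[0]) - 1
    B = [[0.0] * k for _ in range(q)]
    for _ in range(max_iter):
        B_new = newton_step(X, Y, B)
        diff = max(abs(B_new[j][h] - B[j][h]) for j in range(q) for h in range(k))
        B = B_new
        if diff < delta:
            break
    return B


X = [[1.0, 10.0, 0.0], [1.0, 11.0, 0.0], [1.0, 12.0, 0.0], [1.0, 13.0, 0.0],
     [1.0, 10.0, 1.0], [1.0, 11.0, 1.0], [1.0, 12.0, 1.0], [1.0, 13.0, 1.0]]
Y = [[10, 30, 16], [11, 33, 11], [12, 36, 12], [13, 39, 13],
     [12, 31, 13], [13, 34, 14], [14, 37, 15], [15, 40, 16]]
B1 = [[0.8816130281877883, 0.629336209581317],
      [0.02488619976105243, -0.047810749253181455],
      [-0.13088367581011273, -0.02756271627471052]]

B2 = newton_step(X, Y, B1)   # should be close to the B2 matrix of the example
B_final = fit(X, Y)          # should be close to the matrix reported after iteration 6
print(B2)
print(B_final)
```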
