Logistic Regression

Logistic regression models the relationship between a nominal categorical dependent variable (Y) and continuous or categorical independent variables (Xi), which play the same role as the independent variables in linear regression. For example, Y may be a color with the values blue, red, or green.

Binary Logistic Regression

When the dependent variable is binary, also called dichotomous, use binary logistic regression. The model calculates the probability that a category occurs, based on the independent variables Xj.
The dependent variable Y has only two possible values, 1 or 0, such as win or lose, success or failure. The model below takes accumulated input data: row i holds one combination of X values (x1..xp) together with the number of events of each outcome for that combination. A similar, commonly used model is based on single events: each data row is one event, and Yi can only be 1 or 0.

Odds is the ratio between the probability that the event happens and the probability that it does not happen.
Odds carry the same information as the probability, viewed from a different angle. $$odds=\frac{p}{1-p}$$ Examples:
When P = 1/3 the odds are 1:2 (odds = 0.5).
When P = 1/2 the odds are 1:1 (odds = 1).
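The conversion between probability and odds can be checked with a few lines of Python. This is only an illustrative sketch; the function names `odds_from_probability` and `probability_from_odds` are made up for this example and are not part of the calculator.

```python
from fractions import Fraction

def odds_from_probability(p):
    """Return p / (1 - p); p must satisfy 0 <= p < 1."""
    if not 0 <= p < 1:
        raise ValueError("p must satisfy 0 <= p < 1")
    return p / (1 - p)

def probability_from_odds(odds):
    """Inverse relation: p = odds / (1 + odds)."""
    return odds / (1 + odds)

# Exact fractions reproduce the two examples above.
print(odds_from_probability(Fraction(1, 3)))   # 1/2
print(odds_from_probability(Fraction(1, 2)))   # 1
print(probability_from_odds(Fraction(1, 2)))   # 1/3
```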

H0: ln(odds) = b0
H1: ln(odds) = b0+b1X1+...+bpXp

$$ odds(x_1..x_p)=\frac{p(x_1..x_p)}{1-p(x_1..x_p)}=e^{b_0+b_1x_1+...+b_px_p}\quad\Rightarrow\quad p(x_1..x_p)=\frac{1}{1+e^{-(b_0+b_1x_1+...+b_px_p)}}$$ The maximum likelihood method is used instead of the least squares method that is used in linear regression. The method is based on the binomial distribution.
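The relation between the coefficients, the odds, and the probability in the formula above can be verified numerically. The sketch below uses hypothetical coefficients and predictor values chosen only for illustration; `log_odds` and `probability` are names invented for this example.

```python
import math

def log_odds(b, x):
    """Linear predictor b0 + b1*x1 + ... + bp*xp (b has p+1 entries, x has p)."""
    return b[0] + sum(bj * xj for bj, xj in zip(b[1:], x))

def probability(b, x):
    """p(x) = 1 / (1 + exp(-(b0 + b1*x1 + ... + bp*xp)))."""
    return 1.0 / (1.0 + math.exp(-log_odds(b, x)))

# Hypothetical coefficients and inputs (not taken from the article).
b = [-1.0, 0.5, 0.25]
x = [2.0, 4.0]

t = log_odds(b, x)      # -1 + 0.5*2 + 0.25*4 = 1
p = probability(b, x)
print(t, p)

# The odds computed from p equal exp(t), as the formula states.
print(math.isclose(p / (1 - p), math.exp(t)))   # True
```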

The likelihood is the probability of observing the sample data. The maximum likelihood method finds the coefficients, and therefore the predicted probabilities p̂i, that maximize this probability.
n: the number of X combinations (x1..xp); yi(1) and yi(0): the number of events in row i with Y=1 and Y=0. $$L=\prod_{i=1}^n \hat p_i^{y_i(1)}(1- \hat p_i)^{y_i(0)}$$ $$LL=\ln(L)=\sum_{i=1}^n{(y_i(1) \ln(\hat p_i) + y_i(0)\ln(1- \hat p_i))}$$
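As a rough illustration of the log-likelihood formula, the sketch below evaluates LL for accumulated data under two candidate sets of predicted probabilities. The data and the function name `log_likelihood` are hypothetical; the point is only that probabilities closer to the observed proportions give a larger LL.

```python
import math

def log_likelihood(y1, y0, p_hat):
    """LL = sum_i ( y_i(1)*ln(p_i) + y_i(0)*ln(1 - p_i) ).

    y1[i], y0[i]: number of events with Y=1 and Y=0 in row i.
    p_hat[i]: predicted probability of Y=1 for row i (strictly between 0 and 1).
    """
    return sum(a * math.log(p) + b * math.log(1.0 - p)
               for a, b, p in zip(y1, y0, p_hat))

# Hypothetical accumulated data: 3 rows (X combinations), 10 events each.
y1 = [2, 5, 8]
y0 = [8, 5, 2]

ll_good = log_likelihood(y1, y0, [0.2, 0.5, 0.8])   # equals the observed proportions
ll_flat = log_likelihood(y1, y0, [0.5, 0.5, 0.5])   # ignores the rows
print(ll_good, ll_flat)
print(ll_good > ll_flat)   # True
```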

Newton's Method

We use Newton's method to find the vector of parameters B that maximizes the log-likelihood function, using the iteration formula below. The loop runs until the difference between BR+1 and BR approaches zero for every coefficient. The starting point is B0=[0,0,..0].
$$ V=\begin{bmatrix} e_1 \hat p_1(1- \hat p_1) & 0 & 0 & 0\\ 0 & e_2 \hat p_2(1- \hat p_2) & 0 & 0 \\ 0 & 0 & ...& 0\\ 0 & 0 & 0 & e_n \hat p_n(1- \hat p_n) \end{bmatrix}$$ R: iteration.
E: ei vector (totals: the number of events in row i).
P: pi vector (observed probabilities).
p̂: p̂i vector (predicted probabilities).
B: bi vector (coefficients).
$$ B_{R+1}=B_R+(X'V_RX)^{-1}X'(E\odot(P- \hat P_R))$$ All the multiplications are matrix multiplications, except for ⊙, which is an element-wise multiplication. Example: [2,3] ⊙ [4,5] = [8,15].
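The iteration above can be sketched with NumPy as follows. This is a minimal sketch, not the calculator's actual code: the function name, the `tol` and `max_iter` parameters, and the sample data are assumptions made for the example, and the linear system is solved directly instead of forming the inverse explicitly.

```python
import numpy as np

def fit_binary_logistic(X, y1, e, max_iter=15, tol=1e-8):
    """Newton iterations for accumulated binary data.

    X  : (n, p+1) design matrix whose first column is all ones.
    y1 : (n,) number of events with Y=1 in each row.
    e  : (n,) total number of events in each row.
    """
    X = np.asarray(X, dtype=float)
    y1 = np.asarray(y1, dtype=float)
    e = np.asarray(e, dtype=float)
    P = y1 / e                              # observed probabilities
    B = np.zeros(X.shape[1])                # B0 = [0, 0, ..., 0]
    for _ in range(max_iter):
        P_hat = 1.0 / (1.0 + np.exp(-(X @ B)))   # predicted probabilities
        v = e * P_hat * (1.0 - P_hat)            # diagonal of V
        # Solve (X'VX) step = X'(E * (P - P_hat)); "*" is element-wise here.
        step = np.linalg.solve(X.T @ (v[:, None] * X),
                               X.T @ (e * (P - P_hat)))
        B = B + step
        if np.max(np.abs(step)) < tol:      # stop when B no longer changes
            break
    return B

# Hypothetical data: one predictor, four X values, 20 events per row.
X = np.array([[1, 0], [1, 1], [1, 2], [1, 3]])
y1 = np.array([4, 8, 12, 16])
e = np.array([20, 20, 20, 20])

B = fit_binary_logistic(X, y1, e)
print(B)
```

At the solution, X'(E⊙(P - p̂)) is numerically zero, which is a convenient check that the loop has converged.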

Multinomial Logistic Regression

When the dependent variable can take more than two categorical values, use multinomial logistic regression. The model calculates the probability of each category, based on the independent variables Xj. For example, Y may take the values Red, Blue, or Green.

You may use one of the following input data methods:
Single Y column - each data row is a single event, and the Y column contains one of the categorical values.
Several Y columns - each Y column represents one of the categorical values. Each row may hold a single event or the accumulated number of events.
With several Y columns and accumulated data, row i holds one combination of X values (x1..xp) together with the number of events in each category. With single events, each data row is one event, and each Yi,h may only be 1 or 0.

H0: ln(Ph/P0) = b0,h
H1: ln(Ph/P0) = b0,h+b1,hX1+...+bp,hXp

As you probably noticed, h=0 is missing above. This is because the probabilities sum to one, so p0 is calculated from the other probabilities; category 0 serves as the reference category. $$\frac{p_h}{p_0}=e^{t_h} =e^{b_{0,h}+b_{1,h}x_1+...+b_{p,h}x_p}\quad,\quad p_0+p_1+...+p_k=1 \quad\Rightarrow\quad p_0=\frac{1}{1+e^{t_1}+...+e^{t_k}}\quad p_h=\frac{e^{t_h}}{1+e^{t_1}+...+e^{t_k}} $$ The maximum likelihood method is used instead of the least squares method that is used in linear regression. The method is based on the multinomial distribution.
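The formulas for p0 and ph can be sketched in Python as follows. The coefficients and predictor values are hypothetical, and `category_probabilities` is a name invented for this example; the check at the end confirms that the probabilities sum to one and that p1/p0 equals e^{t1}.

```python
import math

def category_probabilities(B, x):
    """Return [p_0, p_1, ..., p_k] for one combination of X values.

    B : list of k coefficient lists, B[h-1] = [b_{0,h}, b_{1,h}, ..., b_{p,h}].
    x : list of p predictor values [x_1, ..., x_p].
    """
    t = [b[0] + sum(bj * xj for bj, xj in zip(b[1:], x)) for b in B]
    exp_t = [math.exp(th) for th in t]
    denom = 1.0 + sum(exp_t)                 # 1 + e^{t_1} + ... + e^{t_k}
    return [1.0 / denom] + [v / denom for v in exp_t]

# Hypothetical coefficients: k = 2 non-reference categories, p = 2 predictors.
B = [[0.5, 0.1, -0.2],
     [-0.3, 0.05, 0.4]]
probs = category_probabilities(B, [10.0, 1.0])

print(probs)
print(math.isclose(sum(probs), 1.0))                       # True
print(math.isclose(probs[1] / probs[0], math.exp(1.3)))    # True, t_1 = 0.5 + 1.0 - 0.2
```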

The likelihood is the probability of observing the sample data. The maximum likelihood method finds the coefficients, and therefore the predicted probabilities p̂i,h, that maximize this probability. yi,h is the number of events in row i that fall in category h.
$$L=\prod_{i=1}^n \prod_{h=0}^k \hat p_{i,h}^{y_{i,h}}$$ $$LL=\ln(L)=\sum_{i=1}^n \sum_{h=0}^k y_{i,h}\ln(\hat p_{i,h})$$

Newton's Method

We use Newton's method to find the matrix of parameters B that maximizes the log-likelihood function, using the iteration formulas below. delta - the tool repeats the iteration until the change in every element between the BR+1 matrix and the BR matrix is smaller than delta, or until it reaches the maximum number of iterations. The default delta is 0.00000001 and the default maximum is 15 iterations.
Usually you won't need to change these parameters, since the change becomes very small after a small number of iterations. With a single non-reference category (k=1), the update is the binary formula: $$ B_{R+1}=B_R+(X'V_RX)^{-1}X'(E\odot(P- \hat P_R))$$ R: iteration.
E: ei vector (totals).
P: pi vector (observed probabilities).
p̂: p̂i vector (predicted probabilities).
B: bi vector (coefficients).

For k non-reference categories the update is $$ B_{R+1}=B_R+H^{-1}X'(Y-\hat Y_R)$$ Only the columns of Y-Ŷ for categories 1..k are used. X'(Y-Ŷ) is flattened, category by category, into a vector of length k(p+1) before it is multiplied by H^{-1}, and the resulting change is reshaped back to the order of B. $$B=\begin{bmatrix}b_{0,1} &b_{0,2} &...&b_{0,k}\\b_{1,1} &b_{1,2} &...&b_{1,k}\\...& ... & ... & ...\\b_{p,1} &b_{p,2} &...&b_{p,k}\\\end{bmatrix}\quad B_0=\begin{bmatrix}0 &0 &...&0\\0 &0 &...&0\\...& ... & ... & ...\\0 &0 &...&0\\\end{bmatrix}$$ Matrix orders: X: n*(p+1), B: (p+1)*k
$$X=\begin{bmatrix} 1 &x_{1,1} &x_{1,2} & .. &x_{1,p} \\ 1 &x_{2,1} &x_{2,2} & .. &x_{2,p} \\ : & : & : & : & : \\ 1 &x_{n,1} &x_{n,2} & .. &x_{n,p} \end{bmatrix}$$ For any h between 1 and k and i between 1 and n:
$$t_{i,1}=b_{0,1}+b_{1,1}x_{i,1}+...+b_{p,1}x_{i,p}\\ t_{i,2}=b_{0,2}+b_{1,2}x_{i,1}+...+b_{p,2}x_{i,p}\\ ...\\ t_{i,k}=b_{0,k}+b_{1,k}x_{i,1}+...+b_{p,k}x_{i,p}\\ T=XB \quad$$ T matrix order: n*k
$$T=\begin{bmatrix}t_{1,1} &t_{1,2} &...&t_{1,k}\\t_{2,1} &t_{2,2} &...&t_{2,k}\\...& ... & ... & ...\\t_{n,1} &t_{n,2} &...&t_{n,k}\\\end{bmatrix}$$ There are k*k diagonal sub-matrices Wh1h2, each of order n*n.
When h1=h2
$$W_{h_1,h_2}=\begin{bmatrix} e_1 \hat p_{1,h_1}(1- \hat p_{1,h_1}) & 0 & 0 & 0\\ 0 & e_2 \hat p_{2,h_1}(1- \hat p_{2,h_1}) & 0 & 0 \\ 0 & 0 & ...& 0\\ 0 & 0 & 0 & e_n \hat p_{n,h_1}(1- \hat p_{n,h_1}) \end{bmatrix}$$ When h1≠h2
$$W_{h_1,h_2}=\begin{bmatrix} -e_1 \hat p_{1,h_1}\hat p_{1,h_2} & 0 & 0 & 0\\ 0 & -e_2 \hat p_{2,h_1}\hat p_{2,h_2} & 0 & 0 \\ 0 & 0 & ...& 0\\ 0 & 0 & 0 & -e_n \hat p_{n,h_1}\hat p_{n,h_2} \end{bmatrix}$$ There are k*k sub-matrices Hh1h2, each of order (p+1)*(p+1).
$$H_{h_1,h_2}=X'W_{h_1,h_2}X=\begin{bmatrix}V_{0,0} &V_{0,1} &...&V_{0,p}\\V_{1,0} &V_{1,1} &...&V_{1,p}\\...& ... & ... & ...\\V_{p,0} &V_{p,1} &...&V_{p,p}\\\end{bmatrix}\\$$ H is a matrix of order k(p+1)*k(p+1), built from k*k sub-matrices of order (p+1)*(p+1): $$H=\begin{bmatrix}H_{1,1} &H_{1,2} &...&H_{1,k}\\H_{2,1} &H_{2,2} &...&H_{2,k}\\...& ... & ... & ...\\H_{k,1} &H_{k,2} &...&H_{k,k}\\\end{bmatrix}\\$$
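The construction of H from the W sub-matrices and the resulting update can be sketched with NumPy as follows. This is a minimal sketch under stated assumptions, not the calculator's code: the names `newton_step` and `fit`, the demo data, and the use of `np.linalg.solve` in place of an explicit inverse are all choices made for this example. Column 0 of Y is taken as the reference category.

```python
import numpy as np

def newton_step(X, Y, B):
    """One update B_{R+1} = B_R + H^{-1} X'(Y - Y_hat).

    X : (n, p+1) design matrix, first column all ones.
    Y : (n, k+1) event counts; column 0 is the reference category.
    B : (p+1, k) current coefficients.
    """
    q = X.shape[1]                                   # q = p + 1
    k = B.shape[1]
    e = Y.sum(axis=1)                                # totals e_i
    exp_T = np.exp(X @ B)                            # (n, k), T = XB
    P_hat = exp_T / (1.0 + exp_T.sum(axis=1, keepdims=True))
    Y_hat = e[:, None] * P_hat                       # predicted counts, h = 1..k
    H = np.zeros((k * q, k * q))
    for h1 in range(k):
        for h2 in range(k):
            if h1 == h2:
                w = e * P_hat[:, h1] * (1.0 - P_hat[:, h1])
            else:
                w = -e * P_hat[:, h1] * P_hat[:, h2]
            # Block H_{h1,h2} = X' W_{h1,h2} X, with W diagonal.
            H[h1 * q:(h1 + 1) * q, h2 * q:(h2 + 1) * q] = X.T @ (w[:, None] * X)
    # Flatten X'(Y - Y_hat) category by category, solve, reshape back.
    grad = (X.T @ (Y[:, 1:] - Y_hat)).T.reshape(-1)
    change = np.linalg.solve(H, grad).reshape(k, q).T
    return B + change, change

def fit(X, Y, delta=1e-8, max_iter=15):
    """Repeat newton_step from B0 = zeros until every change is below delta."""
    B = np.zeros((X.shape[1], Y.shape[1] - 1))
    for _ in range(max_iter):
        B, change = newton_step(X, Y, B)
        if np.max(np.abs(change)) < delta:
            break
    return B

# Hypothetical accumulated data: one predictor, three categories.
X_demo = np.array([[1, 0], [1, 1], [1, 2], [1, 3]], dtype=float)
Y_demo = np.array([[10, 5, 5], [8, 6, 6], [6, 7, 7], [4, 9, 7]], dtype=float)
B_demo = fit(X_demo, Y_demo)
print(B_demo)
```

Solving the linear system gives the same change as multiplying by H^{-1}, and avoids forming the inverse explicitly.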

Numeric Example

Y - the favored color: Y0 - Red, Y1 - Blue, Y2 - Green.
X1 - age
X2 - gender

X1  X2  Y0  Y1  Y2
10  0   10  30  16
11  0   11  33  11
12  0   12  36  12
13  0   13  39  13
10  1   12  31  13
11  1   13  34  14
12  1   14  37  15
13  1   15  40  16

1. Input data - B0 contains zeros or a best initial estimate. To make the example more informative, we skip the first iteration and start from the following B1 matrix. $$X=\begin{bmatrix}1&10&0&\\1&11&0&\\1&12&0&\\1&13&0&\\1&10&1&\\1&11&1&\\1&12&1&\\1&13&1&\end{bmatrix}\quad Y=\begin{bmatrix}10&30&16&\\11&33&11&\\12&36&12&\\13&39&13&\\12&31&13&\\13&34&14&\\14&37&15&\\15&40&16&\end{bmatrix}\quad B_0=\begin{bmatrix}0&0&\\0&0&\\0&0&\end{bmatrix}\quad B_1=\begin{bmatrix}0.8816130281877883&0.629336209581317&\\0.02488619976105243&-0.047810749253181455&\\-0.13088367581011273&-0.02756271627471052&\end{bmatrix}$$ $$T=\begin{bmatrix}1.1305&0.1512&\\1.1554&0.1034&\\1.1802&0.05561&\\1.2051&0.007796&\\0.9996&0.1237&\\1.0245&0.07586&\\1.0494&0.02804&\\1.0742&-0.01977&\end{bmatrix}\quad Exp(T)=\begin{bmatrix}3.0971&1.1633&\\3.1752&1.1090&\\3.2552&1.0572&\\3.3372&1.0078&\\2.7172&1.1316&\\2.7856&1.0788&\\2.8558&1.0284&\\2.9278&0.9804&\end{bmatrix}\\
W_{1,1}=\begin{bmatrix}13.5588&0.000&0.000&0.000&0.000&0.000&0.000&0.000&\\0.000&13.1902&0.000&0.000&0.000&0.000&0.000&0.000&\\0.000&0.000&14.2372&0.000&0.000&0.000&0.000&0.000&\\0.000&0.000&0.000&15.2448&0.000&0.000&0.000&0.000&\\0.000&0.000&0.000&0.000&13.7958&0.000&0.000&0.000&\\0.000&0.000&0.000&0.000&0.000&14.9280&0.000&0.000&\\0.000&0.000&0.000&0.000&0.000&0.000&16.0265&0.000&\\0.000&0.000&0.000&0.000&0.000&0.000&0.000&17.0887&\end{bmatrix}\quad \color{blue}{C_{1,1}=\begin{bmatrix}118.0700&1366.3464&61.8391&\\1366.3464&15960.0999&716.6381&\\61.8391&716.6381&61.8391&\end{bmatrix}}\\ W_{1,2}=\begin{bmatrix}-7.2910&0.000&0.000&0.000&0.000&0.000&0.000&0.000&\\0.000&-6.9358&0.000&0.000&0.000&0.000&0.000&0.000&\\0.000&0.000&-7.3165&0.000&0.000&0.000&0.000&0.000&\\0.000&0.000&0.000&-7.6521&0.000&0.000&0.000&0.000&\\0.000&0.000&0.000&0.000&-7.3239&0.000&0.000&0.000&\\0.000&0.000&0.000&0.000&0.000&-7.7470&0.000&0.000&\\0.000&0.000&0.000&0.000&0.000&0.000&-8.1256&0.000&\\0.000&0.000&0.000&0.000&0.000&0.000&0.000&-8.4599&\end{bmatrix}\quad \color{green}{C_{1,2}=\begin{bmatrix}-60.8518&-702.4211&-31.6564&\\-702.4211&-8184.7008&-365.9419&\\-31.6564&-365.9419&-31.6564&\end{bmatrix}}\\ H=\begin{bmatrix}C_{1,1} & C_{1,2} \\C_{2,1} & C_{2,2} \end{bmatrix}\\ H=\begin{bmatrix}\color{blue}{118.0700}&\color{blue}{1366.3464}&\color{blue}{61.8391}&\color{green}{-60.8518}&\color{green}{-702.4211}&\color{green}{-31.6564}&\\ \color{blue}{1366.3464}&\color{blue}{15960.0999}&\color{blue}{716.6381}&\color{green}{-702.4211}&\color{green}{-8184.7008}&\color{green}{-365.9419}&\\ \color{blue}{61.8391}&\color{blue}{716.6381}&\color{blue}{61.8391}&\color{green}{-31.6564}&\color{green}{-365.9419}&\color{green}{-31.6564}&\\ \color{green}{-60.8518}&\color{green}{-702.4211}&\color{green}{-31.6564}&81.1422&936.0234&42.8676&\\ \color{green}{-702.4211}&\color{green}{-8184.7008}&\color{green}{-365.9419}&936.0234&10899.6902&495.1943&\\ 
\color{green}{-31.6564}&\color{green}{-365.9419}&\color{green}{-31.6564}&42.8676&495.1943&42.8676&\end{bmatrix}\\ S_{inv}=\begin{bmatrix}1.5030&-0.1274&-0.02527&1.1256&-0.09541&-0.01952&\\-0.1274&0.01101&-0.0003606&-0.09542&0.008253&-0.0002541&\\-0.02527&-0.0003606&0.05546&-0.01936&-0.0002683&0.04167&\\1.1256&-0.09542&-0.01936&2.1687&-0.1843&-0.03720&\\-0.09541&0.008253&-0.0002683&-0.1843&0.01598&-0.0005251&\\-0.01952&-0.0002541&0.04167&-0.03720&-0.0005251&0.08078&\end{bmatrix}\\ \hat Y=\begin{bmatrix}10.6456&32.9708&12.3836&\\10.4085&33.0489&11.5426&\\11.2944&36.7653&11.9403&\\12.1608&40.5832&12.2560&\\11.5492&31.3812&13.0695&\\12.5400&34.9318&13.5282&\\13.5128&38.5902&13.8971&\\14.4655&42.3521&14.1824&\end{bmatrix}\\ YP=\begin{bmatrix}-2.9708&3.6164&\\-0.04887&-0.5426&\\-0.7653&0.05975&\\-1.5832&0.7440&\\-0.3812&-0.06954&\\-0.9318&0.4718&\\-1.5902&1.1029&\\-2.3521&1.8176&\end{bmatrix}\\ XTP=\begin{bmatrix}-10.6235&7.2003&\\-123.7322&81.9424&\\-5.2553&3.3228&\end{bmatrix}\\ XTP_{flat}=\begin{bmatrix}-10.623468721743546\\-123.73216763608585\\-5.25532122412319\\7.200305308174219\\81.94239432654341\\3.3227876621204917\end{bmatrix}\\ Change=\begin{bmatrix}0.1477&-0.01891&-0.001247&\\0.3391&-0.02530&-0.02263&\end{bmatrix}\\ B_2=\begin{bmatrix}1.0293&0.9684&\\0.005978&-0.07311&\\-0.1321&-0.05019&\end{bmatrix}\\ $$ After iteration number 6: $$ B_6=\begin{bmatrix}1.0230&0.9536&\\0.006523&-0.07203&\\-0.1320&-0.04882&\end{bmatrix}\\ $$
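The numeric example can be retraced with a short NumPy script. The X, Y, and B1 values are copied from the example; everything else is an assumption made for illustration. In particular, `newton_step` is a hypothetical helper, and it builds H with a single `np.einsum` call rather than block by block, which should yield the same matrix. The printed XTP, B2, and final B can be compared with the XTP, B2, and B6 matrices shown above.

```python
import numpy as np

X = np.array([[1, 10, 0], [1, 11, 0], [1, 12, 0], [1, 13, 0],
              [1, 10, 1], [1, 11, 1], [1, 12, 1], [1, 13, 1]], dtype=float)
Y = np.array([[10, 30, 16], [11, 33, 11], [12, 36, 12], [13, 39, 13],
              [12, 31, 13], [13, 34, 14], [14, 37, 15], [15, 40, 16]], dtype=float)
B1 = np.array([[0.8816130281877883, 0.629336209581317],
               [0.02488619976105243, -0.047810749253181455],
               [-0.13088367581011273, -0.02756271627471052]])

def newton_step(X, Y, B):
    """Return (next B, X'(Y - Y_hat), H) for the current coefficients B."""
    q, k = B.shape
    e = Y.sum(axis=1)                                 # totals per row
    exp_T = np.exp(X @ B)
    P = exp_T / (1.0 + exp_T.sum(axis=1, keepdims=True))
    XTP = X.T @ (Y[:, 1:] - e[:, None] * P)           # (p+1, k)
    # M[i] = e_i * (diag(p_i) - p_i p_i'): the k*k weights of row i.
    M = e[:, None, None] * (P[:, :, None] * np.eye(k)
                            - P[:, :, None] * P[:, None, :])
    # H[(h, a), (g, b)] = sum_i M[i, h, g] * X[i, a] * X[i, b]
    H = np.einsum('ihg,ia,ib->hagb', M, X, X).reshape(k * q, k * q)
    change = np.linalg.solve(H, XTP.T.reshape(-1)).reshape(k, q)
    return B + change.T, XTP, H

B2, XTP, H = newton_step(X, Y, B1)
print(np.round(XTP, 4))     # compare with XTP in the example
print(np.round(B2, 4))      # compare with B2 in the example

B = B2
for iteration in range(3, 7):            # iterations 3 to 6
    B, _, _ = newton_step(X, Y, B)
print(np.round(B, 4))       # compare with B6 in the example
```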