Propensity Score Matching Generator

The PSM calculator matches treated subjects with control-group subjects to reduce selection bias.

Enter data in columns
Enter data from Excel
Header: you may rename 'Name-1', 'Name-2', etc.
Data: use Enter as delimiter; you may change the delimiters on 'More options'.

The input data must contain a numerical 'Outcome' column. The PSM calculator also performs a paired t-test on the matched data, comparing the control subjects to the treated subjects.
If you only need the matched data, you may choose any numerical data as the outcome and ignore the paired t-test results.
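
As a rough illustration of the paired t-test step, the sketch below computes the t statistic from matched outcome pairs. The function name and the sample values are hypothetical, and this is not the calculator's own code.

```python
import math
from statistics import mean, stdev

def paired_t_statistic(treated, control):
    """Return (t, degrees of freedom) for matched pairs treated[i] vs control[i]."""
    if len(treated) != len(control) or len(treated) < 2:
        raise ValueError("need at least two matched pairs of equal length")
    diffs = [t - c for t, c in zip(treated, control)]
    sd = stdev(diffs)  # sample standard deviation of the pair differences
    if sd == 0:
        raise ValueError("all pair differences are identical; t is undefined")
    n = len(diffs)
    return mean(diffs) / (sd / math.sqrt(n)), n - 1

# Hypothetical matched outcomes, one matched pair per position.
treated_outcomes = [12.0, 15.0, 11.0, 14.0]
control_outcomes = [10.0, 14.0, 10.0, 11.0]
t_stat, df = paired_t_statistic(treated_outcomes, control_outcomes)
print(round(t_stat, 3), df)
```

The p-value would then come from the t distribution with `df` degrees of freedom.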

When to use the PSM?

You may use propensity score analysis when you could not randomize the treatment. Propensity score matching helps to reduce the effect of confounding variables (covariates) by matching similar subjects between the treatment group and the control group.

[Figure: Propensity Score Matching diagram]

Balance Estimation

Compares the distribution of each covariate between the treatment group and the control group.
A balanced propensity score does not imply balanced covariates (Austin, 2009), and vice versa.
For example, when using logistic regression to calculate the score, the score represents the probability of treatment based on the covariates.
Different combinations of covariates may lead to similar treatment probabilities (scores).
Since the data is matched on the score alone, the matched groups may still have unbalanced covariates.
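
The point about different covariate combinations producing the same score can be sketched with a small logistic model. The coefficients and the two covariates (age and a smoker flag) are made up for illustration.

```python
import math

def propensity(intercept, coefs, x):
    """Logistic model: P(treatment = 1 | x)."""
    z = intercept + sum(b * v for b, v in zip(coefs, x))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients for two covariates: age (years) and smoker (0/1).
intercept, coefs = -3.0, [0.05, 1.0]

older_non_smoker = [60, 0]  # z = -3 + 0.05*60 + 1*0 = 0
younger_smoker = [40, 1]    # z = -3 + 0.05*40 + 1*1 = 0

score_a = propensity(intercept, coefs, older_non_smoker)
score_b = propensity(intercept, coefs, younger_smoker)
print(score_a, score_b)
```

Both subjects get essentially the same score, so they would be a perfect match on the score even though they differ by 20 years of age and in smoking status.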

Standardized Mean Difference

There is no universally accepted threshold for the standardized difference (SB) of a balanced covariate. As a rule of thumb, |SB| should be smaller than 0.2, and preferably smaller than 0.1 for important covariates.

Numerical Covariates

Calculates the standardized difference between the estimated mean of the treatment group and the estimated mean of the control group.

$$SB = \frac{\mu_T - \mu_C}{S_T}$$
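
A minimal Python sketch of the numerical standardized difference follows. It assumes the denominator is the sample standard deviation of the treatment group; the function name and the data are hypothetical.

```python
from statistics import mean, stdev

def standardized_bias_numeric(treated, control):
    """SB = (mean_T - mean_C) / S_T.

    Assumption: S_T is the sample standard deviation of the treatment group.
    """
    s_t = stdev(treated)
    if s_t == 0:
        raise ValueError("treatment group has zero variance; SB is undefined")
    return (mean(treated) - mean(control)) / s_t

# Hypothetical ages in the matched groups.
treated_age = [50.0, 60.0, 70.0]
control_age = [48.0, 58.0, 68.0]
sb_age = standardized_bias_numeric(treated_age, control_age)
print(round(sb_age, 3))
```

Here the means differ by 2 years and the treatment standard deviation is 10, so |SB| sits right at the looser 0.2 guideline.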

Categorical Covariates

For each value of the categorical covariate, calculates the standardized difference between the estimated proportion of the treatment group and the estimated proportion of the control group.

$$SB = \frac{\hat{P}_T - \hat{P}_C}{\sqrt{\hat{P}_T (1 - \hat{P}_T)}}$$
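
The per-level calculation for a categorical covariate might look like the sketch below. It assumes the denominator is the square root of P_T(1 - P_T), by analogy with the standard deviation in the numerical case; names and data are hypothetical.

```python
import math

def standardized_bias_categorical(treated, control):
    """Return {level: SB}, with SB = (p_T - p_C) / sqrt(p_T * (1 - p_T))."""
    result = {}
    for level in sorted(set(treated) | set(control)):
        p_t = treated.count(level) / len(treated)
        p_c = control.count(level) / len(control)
        denom = math.sqrt(p_t * (1 - p_t))
        # SB is undefined when the level is absent from, or universal in, the treated group.
        result[level] = (p_t - p_c) / denom if denom > 0 else float("nan")
    return result

# Hypothetical blood-type covariate in the matched groups.
treated_type = ["A", "A", "B", "B"]
control_type = ["A", "B", "B", "B"]
sb_by_level = standardized_bias_categorical(treated_type, control_type)
print(sb_by_level)
```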

Variance ratios balance

The variance ratio is the ratio of the variance of a covariate in the treatment group to the variance of the same covariate in the control group.
For a well-balanced matching, we expect these variances to be similar, with the ratio close to 1.
The variance ratio should fall between 0.5 and 2 (Rubin, 2001).
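
The variance ratio check can be sketched in a few lines of Python. The helper names and the data are illustrative; the 0.5 to 2 range is the one quoted above.

```python
from statistics import variance

def variance_ratio(treated, control):
    """Sample variance of the treatment group divided by that of the control group."""
    return variance(treated) / variance(control)

def is_variance_balanced(ratio, low=0.5, high=2.0):
    """True when the ratio falls inside the range quoted above."""
    return low <= ratio <= high

# Hypothetical covariate values in the matched groups.
treated_values = [2.0, 4.0, 6.0, 8.0]
control_values = [2.0, 5.0, 5.0, 8.0]
ratio = variance_ratio(treated_values, control_values)
print(round(ratio, 3), is_variance_balanced(ratio))
```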

Input Data Structure

  1. ID - The first column contains the unique ID. The input data must include the ID column for data completeness, but the PSM process will not use it.
  2. Covariates - The subsequent columns represent covariates, which can be either numerical or categorical variables. These covariates are independent variables that are not of primary interest.
  3. Treatment - The second-to-last column represents the treatment. This is the key independent variable, and it should contain only values of 1 or 0 (1 indicates treatment, 0 indicates no treatment).
  4. Outcome - The last column contains numerical data, representing the dependent variable.
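
To make the column layout concrete, here is a small sketch that splits a comma-separated block by position: ID first, treatment second to last, outcome last, covariates in between. The column names and values are invented for the example.

```python
import csv
import io

# Hypothetical input: ID, two covariates, treatment, outcome.
raw = """id,age,sex,treatment,outcome
s1,34,F,1,12.5
s2,51,M,0,9.0
s3,47,F,1,11.2
"""

rows = list(csv.reader(io.StringIO(raw)))
header, records = rows[0], rows[1:]

id_col = header[0]            # first column: unique ID
covariate_cols = header[1:-2] # everything between ID and treatment
treatment_col = header[-2]    # second-to-last column: 0 or 1
outcome_col = header[-1]      # last column: numerical outcome

parsed = [
    {
        "id": r[0],
        "covariates": dict(zip(covariate_cols, r[1:-2])),
        "treatment": int(r[-2]),
        "outcome": float(r[-1]),
    }
    for r in records
]
print(covariate_cols, parsed[0])
```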

How to use the PSM calculator

The PSM calculator accepts data entered directly or loaded from an Excel file. It estimates the propensity scores using logistic regression and matches subjects using the nearest neighbor method, with an optional caliper.

How to enter data?

  • Enter raw data directly - usually you have the raw data.
    a. Enter the name of the group.
    b. Enter the raw data separated by a comma, a space, or Enter. (You may copy only the data from Excel.)
  • Enter raw data from Excel

    Enter the header on the first row.

    1. Copy Paste
      • Copy the raw data with the header from Excel, Google Sheets, or any tool that separates data with tabs and line feeds. Copy the entire block, including the header.
      • Paste the data in the input field.
    2. Import data from an Excel or CSV file.
      When you select an Excel file, the calculator will automatically load the first sheet and display it in the input field. You can choose either an Excel file (.xlsx or .xls) or a CSV file (.csv).
      To upload your file, use one of the following methods:
      1. Browse and select – Click the 'Browse' button and choose the file from your computer.
      2. Drag and drop – Drag your file and drop it into the 'Drop your .xlsx, .xls, or .csv file here!' area.
      Once the file is uploaded, the 'Select sheet' dropdown is populated with the names of your sheets, and you can choose any sheet.
    3. Filter Data
      When using the 'Enter data from Excel' option, you can filter the data by clicking the filter icon above the header.
      You may select one or more values from the dropdown. Note that the filter keeps any value that contains one of the values you choose.
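
The "contains" behavior of the filter can be pictured with the sketch below, which parses a tab-separated block (as pasted from a spreadsheet) and keeps rows by substring match. The case-sensitive comparison is an assumption, and the data and function name are hypothetical.

```python
import csv
import io

# Hypothetical block pasted from a spreadsheet: tab-separated, header on the first row.
pasted = "ID\tCity\tTreatment\nA1\tNew York\t1\nA2\tYork\t0\nA3\tBoston\t1\n"
rows = list(csv.reader(io.StringIO(pasted), delimiter="\t"))
header, data = rows[0], rows[1:]

def filter_contains(data, header, column, selected):
    """Keep rows whose value in `column` contains any of the selected strings."""
    idx = header.index(column)
    return [row for row in data if any(s in row[idx] for s in selected)]

kept = filter_contains(data, header, "City", ["York"])
print([row[0] for row in kept])
```

Selecting "York" keeps both "York" and "New York", because the match is on containment rather than equality.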

Assumptions

  1. The treated subjects and the control subjects have a similar probability of receiving the treatment.
  2. All subjects receive the same type and amount of treatment.
  3. No general equilibrium effect - the control subjects don't get the treatment indirectly.
  4. Sufficient overlap between the treated group and the control group.
  5. Conditional independence - given the covariates, the potential outcomes (y) are independent of the treatment assignment.
  6. No hidden bias - all confounding variables are included in the model.
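
One common way to inspect the overlap assumption is to compare the score ranges of the two groups. The sketch below is a generic diagnostic and not necessarily something the calculator reports; the scores are made up.

```python
def common_support(treated_scores, control_scores):
    """Return the (low, high) score interval covered by both groups, or None."""
    low = max(min(treated_scores), min(control_scores))
    high = min(max(treated_scores), max(control_scores))
    return (low, high) if low <= high else None

# Hypothetical propensity scores.
treated_scores = [0.35, 0.60, 0.90]
control_scores = [0.10, 0.40, 0.70]

support = common_support(treated_scores, control_scores)
outside = [s for s in treated_scores
           if support is None or not (support[0] <= s <= support[1])]
print(support, outside)
```

Treated subjects whose scores fall outside the shared interval have no comparable controls, which is the situation a caliper is meant to catch.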

Logistic regression parameters

  1. Learning rate (α): The step size of each gradient update; common values range from 0.001 to 0.1. It controls how much the coefficients are adjusted in each iteration of the optimization process. A smaller alpha means smaller steps in gradient descent, which can lead to more precise convergence but may require more iterations.
  2. Penalty (λ): Controls the amount of regularization applied to the model. It is a shrinkage parameter that penalizes large coefficients to prevent overfitting. When lambda is set to zero, no regularization is applied, and the model is an ordinary (unpenalized) maximum-likelihood logistic regression. As lambda increases, the penalty grows and the coefficients shrink towards zero. This reduces model complexity and can improve generalization on unseen data.
  3. Maximum Iterations: On each iteration, the algorithm changes the coefficients in a direction that increases the log-likelihood. More iterations improve the result until the log-likelihood reaches its maximum; beyond that point, further iterations do not help.
  4. Maximum run time (Minutes): Limits the calculation time, even if the number of iterations does not reach the 'Maximum iterations' or the epsilon does not reach 0.
  5. Epsilon: For each coefficient, the absolute distance is calculated between its value in the current iteration and its value in the previous iteration.
    Epsilon is defined as the maximum absolute distance among all coefficients.
    The algorithm calculates epsilon every 100 iterations. When epsilon equals 0, the algorithm stops the calculation, as further iterations will not change the result.
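
The way these parameters interact can be sketched as a plain gradient-ascent loop. This is an illustration under stated assumptions, not the calculator's code: the penalty is taken to be an L2 (ridge) penalty that skips the intercept, the gradient is averaged over subjects, and all names and data are hypothetical.

```python
import math
import time

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(X, y, alpha=0.1, lam=0.0, max_iter=5000,
                 max_minutes=1.0, check_every=100):
    """Return [intercept, b1, b2, ...] fitted by gradient ascent."""
    n, p = len(X), len(X[0])
    beta = [0.0] * (p + 1)  # beta[0] is the intercept
    deadline = time.monotonic() + max_minutes * 60
    for it in range(1, max_iter + 1):
        prev = beta[:]  # coefficients from the previous iteration
        grad = [0.0] * (p + 1)
        for xi, yi in zip(X, y):
            err = yi - sigmoid(prev[0] + sum(b * v for b, v in zip(prev[1:], xi)))
            grad[0] += err
            for j, v in enumerate(xi, start=1):
                grad[j] += err * v
        for j in range(p + 1):
            # Assumed L2 penalty; the intercept is left unpenalized.
            penalty = lam * prev[j] if j > 0 else 0.0
            beta[j] = prev[j] + alpha * (grad[j] / n - penalty)
        if it % check_every == 0:
            # Epsilon: largest absolute coefficient change since the previous iteration.
            epsilon = max(abs(b - q) for b, q in zip(beta, prev))
            if epsilon == 0 or time.monotonic() > deadline:
                break
    return beta

# Hypothetical data: one covariate, treatment more likely at higher values.
X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
y = [0, 0, 1, 0, 1, 1]
beta = fit_logistic(X, y)
scores = [sigmoid(beta[0] + beta[1] * row[0]) for row in X]
print([round(s, 3) for s in scores])
```

The loop ends at whichever limit is hit first: the iteration cap, the time limit, or an epsilon of 0 at one of the periodic checks.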

Options

  1. Score order:
    Larger first - Sort the treatment scores in descending order and start matching from the highest treatment score.
    Smaller first - Sort the treatment scores in ascending order and start matching from the lowest treatment score.
  2. Caliper bandwidth
    No caliper - Use all treatment subjects; do not discard any treatment subjects.
    Distance - Discard a treatment subject if the distance to the nearest control subject is greater than the 'caliper distance'.
    Threshold = Caliper distance.
    Standardized distance - Discard a treatment subject if the distance to the nearest control subject exceeds the 'caliper distance' multiplied by the sample standard deviation of the treatment scores.
    Threshold = Caliper distance * S(treatment scores).
  3. Caliper distance - Used to calculate the threshold value in Caliper bandwidth.
  4. Logistic regression - Choose whether the report labels the coefficients with the covariate names or with generic labels (x1, x2, x3, etc.).
  5. Matching report
    Show only ID and Score - Displays 'Treatment ID', 'Treatment score', 'Control ID', 'Control score', and 'Distance'.
    Show all columns - Also includes covariates and any other variable not included in the PSM process.
  6. Clean - Clean the data automatically before running the PSM process.
  7. Missing data values - Define the values that will be treated as missing data, such as NA, "", or N/A.
    You may add more comma delimited values.
  8. Clean variables
    Numerical - remove subjects only if missing values are found in numerical variables.
    All - remove subjects if missing values are found in categorical variables or numerical variables.
  9. Excel pagination display - Specifies the number of rows per tab. When you load a large Excel file, it will be displayed across multiple tabs.
  10. Rounding - Controls how the results are rounded.
    When a resulting value is larger than one, the tool rounds it; when it is less than one, the tool displays its significant figures.
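
The score order and caliper options can be illustrated with a greedy nearest neighbor sketch. It assumes 1:1 matching without replacement, which the text above does not state, and the function name, IDs, and scores are hypothetical.

```python
from statistics import stdev

def match_nearest(treated, controls, order="larger", caliper=None, standardized=False):
    """Greedy 1:1 nearest neighbor matching on the propensity score.

    treated, controls: dicts mapping subject ID -> score.
    Assumption: each control is used at most once (matching without replacement).
    Returns a list of (treated_id, control_id, distance).
    """
    threshold = None
    if caliper is not None:
        # 'Standardized distance' scales the caliper by S(treatment scores).
        scale = stdev(list(treated.values())) if standardized else 1.0
        threshold = caliper * scale
    available = dict(controls)
    pairs = []
    ordered = sorted(treated.items(), key=lambda kv: kv[1],
                     reverse=(order == "larger"))
    for t_id, t_score in ordered:
        if not available:
            break
        c_id = min(available, key=lambda cid: abs(available[cid] - t_score))
        dist = abs(available[c_id] - t_score)
        if threshold is not None and dist > threshold:
            continue  # no control within the caliper: discard this treated subject
        pairs.append((t_id, c_id, dist))
        del available[c_id]
    return pairs

# Hypothetical scores.
treated = {"T1": 0.80, "T2": 0.55, "T3": 0.30}
controls = {"C1": 0.78, "C2": 0.50, "C3": 0.10, "C4": 0.57}

with_caliper = match_nearest(treated, controls, order="larger", caliper=0.1)
no_caliper = match_nearest(treated, controls, order="larger")
print([(t, c) for t, c, _ in with_caliper], len(no_caliper))
```

With a caliper distance of 0.1, T3 is discarded because its nearest remaining control is about 0.2 away; with no caliper, all three treated subjects are matched.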

Clean Data

If you choose to clean the data, data cleaning will occur automatically before running the PSM process.
If there are duplicate IDs, you will receive a warning, but the PSM process will not remove the record. However, the PSM process will remove records in the following cases:

  1. Subjects with missing values as defined in the 'Missing data values' field (e.g., "NA", "").
  2. Treatment values that are not 0 or 1.
  3. Outcome values that are not numerical.
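
The cleaning rules above could be sketched as follows. The sketch assumes the 'All' setting for 'Clean variables' (a missing value in any non-ID column drops the row) and treats non-finite outcomes as non-numerical; the names and rows are hypothetical.

```python
import math
from collections import Counter

MISSING = {"NA", "", "N/A"}

def is_finite_number(text):
    """True if text parses as a finite number."""
    try:
        return math.isfinite(float(text))
    except ValueError:
        return False

def clean(rows, missing=MISSING):
    """rows: lists of strings [id, covariates..., treatment, outcome].

    Returns (kept_rows, duplicate_ids). Duplicate IDs are reported, not removed.
    """
    counts = Counter(r[0] for r in rows)
    duplicates = sorted(i for i, c in counts.items() if c > 1)
    kept = []
    for r in rows:
        values = [v.strip() for v in r]
        if any(v in missing for v in values[1:]):
            continue  # rule 1: missing value
        if values[-2] not in ("0", "1"):
            continue  # rule 2: treatment is not 0 or 1
        if not is_finite_number(values[-1]):
            continue  # rule 3: outcome is not numerical
        kept.append(values)
    return kept, duplicates

# Hypothetical rows: id, age, sex, treatment, outcome.
rows = [
    ["a", "34", "F", "1", "12.5"],
    ["b", "NA", "M", "0", "9.0"],
    ["c", "47", "F", "2", "11.2"],
    ["d", "29", "M", "0", "high"],
    ["a", "51", "M", "0", "8.1"],
]
kept, duplicates = clean(rows)
print([r[0] for r in kept], duplicates)
```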

Covariate Types

The calculator checks each covariate. If it finds even one non-numerical value, it defines the variable as categorical.
Please check the "Covariates" table to ensure that all categorical variables are intended to be categorical. If not, correct any non-numerical values in numerical variables.
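
The type detection rule can be sketched as below, which also shows how a single typo turns a numerical covariate into a categorical one. The function names and values are hypothetical.

```python
def is_number(text):
    """True if text parses as a number."""
    try:
        float(text)
    except ValueError:
        return False
    return True

def covariate_type(values):
    """'numerical' only if every value parses as a number, else 'categorical'."""
    return "numerical" if all(is_number(v) for v in values) else "categorical"

# Hypothetical columns; the last age contains the letter 'o' instead of a zero.
age_clean = ["34", "51", "40"]
age_typo = ["34", "51", "4o"]
sex = ["F", "M", "F"]

print(covariate_type(age_clean), covariate_type(age_typo), covariate_type(sex))
```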