Methods

Data

Updated information regarding what data is used can be found in the changelog page.

Data pre-processing

30 day time-window

Events for a given individual and a given phenocode are merged if they are 30 days or less apart. For example, if an individual has K11_APPENDACUT events on the following dates: 2000-01-01, 2000-01-20, 2000-02-10, 2000-02-28, then each event falls within 30 days of the previous one, so all of them are merged into a single event dated 2000-01-01.

This is done as an attempt to remove events that are follow-ups rather than initial diagnoses.
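The merging rule can be sketched as follows. This is a minimal illustration, not the pipeline's code: the function name merge_events is hypothetical, and the sketch assumes each event is compared with the event immediately before it, which is the reading the example above implies (the last date is 58 days after the first, yet it is still merged).

```python
from datetime import date, timedelta


def merge_events(dates, window_days=30):
    """Collapse events that occur at most `window_days` after the previous event.

    Returns the first date of each merged group (hypothetical helper).
    """
    window = timedelta(days=window_days)
    kept = []
    previous = None
    for current in sorted(dates):
        # Start a new group only when the gap to the previous event exceeds the window.
        if previous is None or current - previous > window:
            kept.append(current)
        previous = current
    return kept


events = [date(2000, 1, 1), date(2000, 1, 20), date(2000, 2, 10), date(2000, 2, 28)]
merged = merge_events(events)
print(merged)
```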

Statistics

Unadjusted prevalence

Number of individuals having at least one event for a given phenocode, divided by the total number of individuals in the FinnGen study. No adjustment is made for the difference between the age distribution of the FinnGen cohort and that of the Finnish population.
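As a small illustration of this ratio, the sketch below counts individuals with at least one event for a phenocode. The data layout (a mapping from individual ID to a set of phenocodes) and the function name are assumptions made for the example, and the toy numbers are invented.

```python
def unadjusted_prevalence(events_by_individual, phenocode, n_total):
    """Share of the cohort with at least one event for `phenocode`.

    `n_total` is passed separately because individuals without any
    event still count in the denominator.
    """
    cases = sum(1 for codes in events_by_individual.values() if phenocode in codes)
    return cases / n_total


# Toy data: 3 individuals with events, in a cohort of 8.
toy_events = {
    "id1": {"K11_APPENDACUT"},
    "id2": {"K11_APPENDACUT", "OTHER_CODE"},
    "id3": {"OTHER_CODE"},
}
prevalence = unadjusted_prevalence(toy_events, "K11_APPENDACUT", n_total=8)
print(prevalence)
```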

Mortality

The implementation of the mortality statistics makes use of:

Hazard Ratios (HR), p & N

The model used is: y ~ prior endpoint + birth year + sex

If the endpoint is sex-specific, then the sex covariate is removed from the model.

Lagged hazard ratios are computed by considering only up to 1, 5, and 15 years of exposed time.

The regressions are done using the lifelines library.
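One possible reading of the lagged analysis is that follow-up after the exposure is capped at the lag, and deaths after the cap are treated as censored. The sketch below illustrates that reading only; the function name, its arguments, and the censoring choice are assumptions, not a description of the actual pipeline.

```python
def truncate_exposed_time(exposure_age, exit_age, died, lag_years):
    """Cap exposed follow-up at `lag_years` after the exposure.

    Returns (new_exit_age, new_died). A death that happens after the cap
    is treated as censored at the cap (assumption of this sketch).
    """
    cap = exposure_age + lag_years
    if exit_age <= cap:
        return exit_age, died
    return cap, False


# Same individual under the three lags named in the text.
lagged = {
    lag: truncate_exposed_time(exposure_age=50.0, exit_age=58.0, died=True, lag_years=lag)
    for lag in (1, 5, 15)
}
print(lagged)
```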

Absolute Risk (AR)

The absolute risk represents the probability of dying. It is defined as AR = 1 - survival_probability. The survival probability is derived from the fitted Cox model with the following parameters:

  • year of birth: 1959
  • sex ratio: 50%
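In the standard Cox proportional hazards formulation, this definition can be written as below. The symbols are introduced here for illustration: S_0 is the baseline survival function, \beta the fitted coefficients, and x the covariate vector with birth year 1959 and sex 0.5. The time horizon t and the value used for the prior endpoint covariate are not stated in this section, so they are left generic.

```latex
\mathrm{AR}(t \mid x) = 1 - S(t \mid x),
\qquad
S(t \mid x) = S_0(t)^{\exp(\beta^{\top} x)}
```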

Survival analyses between phenocodes

Most of the methodology follows that of the NB-COMO study.

Data pre-processing

  • Start of study: 1998-01-01
  • End of study: 2018-12-31
  • Prevalent cases removed from the study.
  • Ignore time before start of study for individuals having the prior-phenocode before the study starts.
  • Split time in unexposed and exposed periods.
  • Only consider endpoint pairs:
    • with at least 10 individuals for each cell of the contingency table of this endpoint pair.
    • with at least 25 individuals having the outcome endpoint.
    • where the ICDs of both endpoints, as well as those of their parents, don't overlap.
    • where endpoints are not descendants of one another in the endpoint tree hierarchy.
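The two count-based criteria in the list above can be sketched as a simple filter. The function names and the way counts are passed in are assumptions for the example; the ICD-overlap and endpoint-tree checks are not shown because they depend on data not described here.

```python
def contingency_table(n_total, n_prior, n_outcome, n_both):
    """2x2 table of individuals by prior endpoint and outcome endpoint."""
    return {
        "prior_and_outcome": n_both,
        "prior_only": n_prior - n_both,
        "outcome_only": n_outcome - n_both,
        "neither": n_total - n_prior - n_outcome + n_both,
    }


def keep_pair(n_total, n_prior, n_outcome, n_both, min_cell=10, min_outcome=25):
    """Apply the two count thresholds to one endpoint pair."""
    table = contingency_table(n_total, n_prior, n_outcome, n_both)
    # Every cell needs at least `min_cell` individuals,
    # and the outcome endpoint needs at least `min_outcome` individuals.
    return min(table.values()) >= min_cell and n_outcome >= min_outcome


print(keep_pair(n_total=1000, n_prior=100, n_outcome=50, n_both=12))
```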

Cox regression

The model used is: y ~ prior + birth_year + sex
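Because follow-up time is split into unexposed and exposed periods, the prior covariate can change during follow-up for a given individual. The sketch below shows one way to produce such rows in a start/stop layout; the function name, the use of ages as the time scale, and the row layout are assumptions for illustration. How the actual pipeline feeds these periods to lifelines is not stated here.

```python
def split_follow_up(entry_age, exit_age, prior_age, had_outcome):
    """Split one individual's follow-up into unexposed and exposed rows.

    Each row is (start, stop, prior, event). `prior_age` is None when the
    individual never has the prior endpoint.
    """
    if prior_age is None or prior_age >= exit_age:
        # Never exposed during follow-up: a single unexposed row.
        return [(entry_age, exit_age, 0, had_outcome)]
    if prior_age <= entry_age:
        # Exposed before the study starts: time before entry is ignored.
        return [(entry_age, exit_age, 1, had_outcome)]
    return [
        (entry_age, prior_age, 0, False),       # unexposed period, no event yet
        (prior_age, exit_age, 1, had_outcome),  # exposed period
    ]


rows = split_follow_up(entry_age=40.0, exit_age=60.0, prior_age=50.0, had_outcome=True)
print(rows)
```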

If the endpoint is sex-specific, then the sex covariate is removed from the model.

Lagged hazard ratios are computed by considering only up to 1, 5, and 15 years of exposed time.

The regressions are done using the lifelines library.

Notes

Due to the sensitive nature of the data, the age when entering and leaving the study has an accuracy of 1 year.

Drug Statistics

The drug score is computed in a 2-step process:

  1. Fit the data to the logistic model:
    y ~ sex + year-of-birth + year-of-birth^2 + year-at-endpoint + year-at-endpoint^2
  2. Use the fitted model to predict the probability for the following data:
    • sex = 0.5, assume an even number of females and males.
    • year-of-birth = 1960, the mean year of birth of the FinnGen cohort.
    • year-at-endpoint = 2018, predict the probability at the end of the study.

The resulting probability value is the drug score. The higher the drug score, the more likely the drug is to be taken after the given endpoint.
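The prediction step can be sketched as evaluating the logistic function at the reference values listed above. The coefficient names and values below are placeholders, not fitted results, and the presence of an intercept term is an assumption based on the usual reading of the formula notation.

```python
import math


def drug_score(coef, sex=0.5, year_of_birth=1960, year_at_endpoint=2018):
    """Predicted probability from a fitted logistic model (hypothetical helper).

    `coef` maps term names to coefficients; the names are illustrative.
    """
    linear = (
        coef["intercept"]
        + coef["sex"] * sex
        + coef["yob"] * year_of_birth
        + coef["yob2"] * year_of_birth ** 2
        + coef["yae"] * year_at_endpoint
        + coef["yae2"] * year_at_endpoint ** 2
    )
    # Logistic (inverse logit) transform of the linear predictor.
    return 1.0 / (1.0 + math.exp(-linear))


# Placeholder coefficients: with every term at zero the score is 0.5.
zero_coef = {"intercept": 0.0, "sex": 0.0, "yob": 0.0, "yob2": 0.0, "yae": 0.0, "yae2": 0.0}
score = drug_score(zero_coef)
print(score)
```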

Source code

Available on GitHub for both the data processing pipeline and the website.