R learning / survival analysis

Posted by Sir Jos on Wed, 12 Jan 2022 10:44:02 +0100

1. General

1.1 basic concepts

survival time: generally refers to the time from the start event to the end event, such as the time from onset to death of a patient suffering from a disease.
Failure event and start event: failure event generally refers to death event or end event. Initial events are events that reflect the initial characteristics of survival time, such as the diagnosis of disease, the beginning of treatment, etc. Both shall be clearly specified in the design.
censored data: refers to the failure to observe the definite outcome (end event) of the patient for some reason during the follow-up, or amputation. It may be loss of follow-up, withdrawal or termination, and its survival time is generally represented by "+".
Distribution characteristics of survival time data: generally obtained through follow-up. Due to the long observation time and difficult to control confounding factors, coupled with the existence of censored data, the law is difficult to estimate, which is generally normal skew distribution.

1.2 main research contents

Describe the survival process: estimate the survival rate and average survival time, draw the survival curve to describe the distribution characteristics of survival time. The commonly used methods are Kaplan Meier method (product limit method), life method and other nonparametric methods (without assuming the distribution type of survival time).
Compare the survival process: compare the survival rate and standard error of each sample, and explore whether there are differences in the overall survival process. Common methods: log rank test, compare the survival curves of two or more groups.
Analysis of factors affecting survival time: usually taking survival time and outcome as dependent variables and influencing factors as independent variables, fit the survival analysis model to explore the factors affecting survival time. Common methods: cox model (semi parametric method), log logistic regression analysis, Weibull distribution method and other parameter methods (parameter distribution type of survival time to be assumed).
Competitive risk model (to be supplemented when available)

2. Cox model | R

2.1 R package

The R package of survival analysis generally uses survival package and survivminer package. The survival package is used for analysis and the survivminer package is used for drawing.

Core functions of survival package:

Surv(): create a living object
survfit(): fit the survival curve with the formula or cox model
coxph(): fitting cox proportional hazards regression model
survdiff(): log rank or mantel Haenszel test was used to test the survival difference
cox.zph(): testing the proportional risk hypothesis of Cox model

data$surv <- Surv(time, status)
Surv(time, time2, event, type=c('right', 'left', 'interval', 'counting', 'interval2', 'mstate'),origin=0)
# Time: for right deletion, it refers to the follow-up time. If it is interval data, it is the start time of interval data
# time2: end time of interval data
# Event: outcome variable. By default, 0 is deleted and 1 is the end event. You can specify the end event yourself. data$status == 2 indicates that the endpoint event occurs when the value is considered to be 2
# Type: specify deletion type. However, if the type is not specially specified, it will be automatically selected by default according to the data
lung$surv <- Surv(time = lung$time, lung$status == 2)

2.2 case analysis

Draw the survival curve and compare the survival rate

ggsurvplot(): draw survival curve

library("survival") # cox
library("survminer") # data visualization
head(lung)
# Fit Kaplan Meier curve
sfit <- survfit(formula = Surv(time, status)~1,data = lung) #population
summary(sfit)
sfit <- survfit(formula = Surv(time, status)~sex, data = lung) # grouping
summary(sfit)
summary(sfit, times=seq(0, 1000, 100)) # Set the time parameter to select the time interval
ggsurvplot(sfit, data = lung)

res.sum <- surv_summary(sfit)
head(res.sum[res.sum$sex==1,])
head(res.sum[res.sum$sex==2,])
ggsurvplot(sfit,pval = TRUE, conf.int = TRUE,
           risk.table = TRUE) #Visual K-M survival curve

surv_diff <- survdiff(Surv(time, status) ~ sex, data = lung) # log-rank test
surv_diff
p.val = 1 - pchisq(surv_diff$chisq, length(surv_diff$n) - 1)
p.val

lung$age_cut <- cut(lung$age, breaks = c(0, 70, Inf), labels = c("young", "old"))
fit <- survfit(Surv(time, status)~age_cut, data = lung)
## If you directly use surffit (), it will treat all ages as a factor
summary(fit)
ggsurvplot(fit, data = lung)

In the new version of survminer 0.2.4, the function surv which can determine the best partition point of one or more continuous variables at one time is added_ Cutpoint() and surv_categorize()

Cox regression analysis

cox regression model: coxph (surv (time, status) ~ X1 + x2 +, data = )
Single factor analysis: batch analysis and display of results can be performed with lap(), sapply(), function()
Multivariate analysis: ggforest() shows the HR of each factor with forest map

fit <- coxph(Surv(time, status)~sex, data=lung) #univariable 
summary(fit)
cox.zph(fit)
plot(cox.zph(fit))
#Visualize the performance of each variable under the cox model. When the data is obviously separated, it indicates that the performance is not very ideal

# univariable 
covariates <- c("age", "sex", "ph.karno", "ph.ecog", "wt.loss")
univ_formulas <- sapply(covariates,function(x) as.formula(paste('Surv(time, status)~', x)))
univ_models <- lapply( univ_formulas, function(x){coxph(x, data = lung)})
hr_results <- function(x){
		x <- summary(x)
        p.value<-signif(x$wald["pvalue"], digits=2)
        wald.test<-signif(x$wald["test"], digits=2)
        beta<-signif(x$coef[1], digits=2) # coeficient beta
        HR <-round(x$coef[2], 2) # exp(beta)
        HR.confint.lower <- round(x$conf.int[,"lower .95"], 2)
        HR.confint.upper <- round(x$conf.int[,"upper .95"],2)
        HR.with.CI <- paste0(HR, " (", HR.confint.lower, "-", HR.confint.upper, ")")
        res<-c(beta, HR.with.CI, wald.test, p.value)
        #names(res)<-c("beta", "HR (95% CI for HR)", "wald.test", "p.value")
        return(res)}
univ_results <- lapply(univ_models,hr_results)
res <- t(as.data.frame(univ_results, check.names = FALSE)) # Transpose
res1 <- as.data.frame(res)

# multi-vars
res.cox <- coxph(Surv(time, status) ~ age + sex + ph.ecog, data = lung)
summary(res.cox)
ggforest(res.cox, data = lung, 
         main = "Hazard ratio",
         cpositions = c(0.10, 0.22, 0.4),
         fontsize = 1.0)

Important assumption of Cox regression: the influence of variables on hazard rate does not change with time (equal proportion), which can be determined by Cox Zph (FIT) test

multiple imputation

sum(!complete.cases(lung))

md.pattern(lung, 5)
lung_cmplt <- mice(lung, 5) # 5 kinds of optional interpolation
lung_cmplt_3 <- complete(lung_cmplt, 3) # Select a complete subset for subsequent analysis?
lung_cmplt_3$surv <- Surv(lung_cmplt_3$time, lung_cmplt_3$status == 2)
fit <- survfit(surv~age, data = lung_cmplt)
ggsurvplot(fit, pval = TRUE)

Multiple imputation method and cox regression were used for sensitivity analysis.

2.3 advanced version of survival curve

ggsurvplot() is a generic function to plot survival curves.
ggsurvplot_list() draws multiple objects
ggsurvplot_facet() facets to multiple panels
ggsurvplot_group_by() multiple groups in a picture
ggsurvplot_add_all() sum up all the cases
ggsurvplot_combine() combines multiple surffit objects in a graph

p1 <- ggsurvplot(fit,
           pval = TRUE, conf.int = TRUE,
           risk.table = TRUE, # Add risk table
           risk.table.col = "strata", # Change risk table color by groups
           linetype = "strata", # Change line type by groups
           surv.median.line = "hv", # Specify median survival
           ggtheme = theme_bw(), # Change ggplot2 theme
           palette = c("#E7B800", "#2E9FDF"))

p2 <- ggsurvplot(fit, 
                 pval = TRUE,               # show p-value of log-rank test.
                 conf.int = TRUE,           # show confidence intervals for point estimaes of survival curves
                 conf.int.style = "step",   # customize style of confidence intervals
                 xlab = "Time in days",     # customize X axis label.
                 break.time.by = 200,       # break X axis in time intervals by 200.
                 ggtheme = theme_light(),   # customize plot and risk table with a theme.
                 risk.table = "abs_pct",    # absolute number and percentage at risk.
                 risk.table.y.text.col = T, # colour risk table text annotations.
                 risk.table.y.text = FALSE, # show bars instead of names in text annotations in legend of risk table
                 ncensor.plot = TRUE,       # plot the number of censored subjects at time t
                 surv.median.line = "hv",   # add the median survival pointer.
                 legend.labs = c("Male", "Female"),    # change legend labels.
                 palette = c("#E7B800", "#2E9FDF")     # custom color palettes.
)
arrange_ggsurvplots(list(p1,p2))

The parameter fun="event" indicates the cumulative event occurrence rate
The parameter fun="cumhaz" indicates the cumulative risk level of cummulative hazard (at time t, the possibility of event occurrence)

p3 <- ggsurvplot(fit,
           conf.int = TRUE,
           risk.table.col = "strata", # Change risk table color by groups
           ggtheme = theme_bw(), # Change ggplot2 theme
           palette = c("#E7B800", "#2E9FDF"),
           fun = "event")
p4 <- ggsurvplot(fit,
              conf.int = TRUE,
              risk.table.col = "strata", # Change risk table color by groups
              ggtheme = theme_bw(), # Change ggplot2 theme
              palette = c("#E7B800", "#2E9FDF"),
              fun = "cumhaz") 
arrange_ggsurvplots(list(p3,p4))

The use of ggsurvplot() and ggforest() also needs to be improved in actual combat. After using R to get the vector diagram, we can debug various legend parameters and graphics proportions in AI.

3. References

Topics: R Language

Programmer Think