Survival analysis with lifelines [Python version]

Posted by jmgrhm on Thu, 24 Feb 2022 13:06:29 +0100

Opening remarks

Survival analysis accounts for a large proportion in medical research, and when conducting survival analysis, R language, SPSS and other tools are often used for survival analysis, while Python is rarely used for survival analysis. Because we found a python version of survival analysis tool - lifelines, this library has provided relatively complete tools related to survival analysis. I recently studied survival analysis, and then began to write this project in combination with lifelines. While writing the code, we also show some conceptual nouns in survival analysis according to our own understanding. Because you write while learning, please correct any mistakes.

#Install the python library for survival analysis ----- lifelines
#Lifelines related document address https://lifelines.readthedocs.io/en/latest/
!pip install lifelines
!pip install  numpy==1.19.2
!pip install pyzmq==18.1.1
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

1. What is survival analysis

In medicine, survival analysis is one of the very common and important analysis methods. It is a statistical method that combines the results of events and the time experienced by such results.

Case 1:
To study what factors affect the survival time of patients with rectal cancer. Will collect patient information. For example, there are age, gender, marital status, TMN stage, T stage, tumor diameter, tumor differentiation, treatment methods, etc. In addition, it will collect the time of death events in a certain period of time, that is, the survival time. Kaplan Meier curve was used for univariate analysis, which factor was the survival time of patients with rectal cancer, or Cox proportional hazards regression was used for multivariate analysis. The survival time of new patients with rectal cancer was predicted.

Case 2: in order to make a clinical trial of an anticancer drug, cancer patients were randomly divided into two groups. One group took the anticancer drug and the other group took the control drug. The survival time from the beginning of medication to death was recorded. And analyze whether there are differences in the relationship between the occurrence and time of death events in the two groups to judge whether the drugs are effective.

It can be seen from the two cases that survival analysis is to study the relationship between the occurrence of events and time under different conditions, not only considering whether events occur, but also considering the length of time of events.

1.1 what is the data of survival analysis

The data of survival analysis is somewhat similar to the classified data. However, the survival analysis adds a column of time data. This column of time data is from the occurrence of specific events, recording how long the study object had failure events (such as death events) during the whole observation period. Survival analysis is to study the relationship between survival time and outcome and many influencing factors.

For example, the following table records the survival time and influencing factors of interest of 6 cancer patients during the 2-year observation period. The failure event here is a death event, which we are concerned about. The lifetime corresponds to the lifetime of the table. The influencing factors were age, gender, smoking and tumor size.

In the table, there is a patient whose survival time is only 21 months, but the outcome is survival. It may be that the patient withdrew from observation or changed his phone number, so he can't contact the patient and continue the research. This kind of data belongs to deleted data. It can be seen that in real life, the survival analysis data usually contains deleted data. For another example, in the table, there is a patient whose survival time is 19 months and the outcome is death. This kind of data belongs to complete data. However, these deleted data will also be analyzed together during survival analysis.

AgeGenderDo you smokeTumor size (mm)Survival time (months)event
55maleyes1723survival
65femaleyes2024survival
36femaleno1619death
38maleyes2521survival
69maleno1623survival
70maleno2820death

1.2 common technical terms in survival analysis

Survival analysis has some common technical terms. For example, what is the survival rate, what is the survival curve, etc.

1.3 main purpose of survival analysis

  1. Estimate the survival rate of the subjects
  2. The survival rates of 2 different groups (influencing factors) were compared
  3. Analyze the factors and contribution that affect the survival time of the subjects

1.4 main methods of survival analysis

  1. Life table method and Kaplan Meier method (belonging to statistical description)
  2. Log rank test (belonging to statistical inference)
  3. Cox proportional hazards regression model (belonging to statistical inference)

[1] Life table and KM method belong to nonparametric methods, which are used to statistically describe survival data, such as estimating survival function in each time period and drawing survival curve. Life span method is mostly used for data with large samples and large follow-up time span. For example, the study object is followed up only once every 1 year. The product limit method is used to calculate the survival function by KM method and life span method. However, in practice, life table analysis is not very commonly used, and KM method is often used for analysis. Km method can calculate the median survival time, group the data, draw the survival curve respectively and conduct single factor analysis (single factor analysis can only analyze classified variables. If continuous variables are analyzed, continuous variables need to be grouped to become classification).

[2] Log rank test is a nonparametric test, which is used to compare the survival curves or survival time of two or more groups. It can compare the survival curves drawn after grouping in KM method.

[3] Cox proportional hazard regression model belongs to the semi parametric method, which can estimate the survival function, compare two or more groups of survival functions, single factor analysis or multi factor analysis. Single factor analysis includes classified variables and continuous variables. It can establish the relationship model between survival time and influencing factors and calculate the risk ratio of influencing factors. Note that Cox regression requires to meet the proportional risk assumption, that is, the influencing factors do not change with time.

1.5 evaluation indicators of survival analysis

On the one hand, the goodness of fit of the model, such as R-square, AIC, etc. On the one hand, there are model prediction accuracy, such as mean square error, AUC, etc. More attention is paid to the model prediction accuracy in clinic. The commonly used evaluation indicators of model prediction accuracy in survival analysis modeling are (here only talk about C-index index index...):

C-index (consistency index)
c-index reflects the consistency between the predicted results of the model and the actual situation.
The calculation method is to randomly form a pair of all research objects. The object with longer survival time predicted by the model has longer actual survival time, or the survival time with high predicted survival probability has longer actual survival time.

2. Data processing and analysis

2.1 view the baseline data for breast cancer data in this project.

The invalidation of this breast cancer dataset is "Dead" in "Status", and the survival time is Survival_Months.

The characteristics of this breast cancer dataset are as follows:

  • Age
  • Race
  • Marital_ Status (marital status)
  • T_ Stage (primary tumor size)
  • N_ Stage (lymph nodes with cancer cell spread)
  • Sixth_Stage
  • Grade
  • A_Stage
  • Tumor_ Size (tumor size)
  • Estrogen_ Status (estrogen receptor positive)
  • Progesterone_ Status (progesterone receptor positive)
  • Regional_Node_Examined
  • Reginol_Node_Positive
  • Survival_Months (survival time in months)

By dividing the baseline data table into two groups according to "Alive" and "Dead", we can see the population distribution and proportion of each feature.

2.2 reading and processing data

Read csv files through pandas to obtain data. Through head(), you can see that some columns are composed of strings. For example, each value in the "Race" column is composed of "Black" and "White" equivalents, so you need to encode them into numbers 0, 1, 2, etc. during survival analysis.

"""
Read data and process classified variable data, otherwise use cox The model will report an error
 When encoding the string of column values, a symbol is compiled into a digital code
 Because some features do not have size, such as Race,That's 0, that's 1, no problem, you can pass pd.factorize Simple processing.
for example T_Stage Obviously progressive, can be handled as T1 Is 0, T2 It's 1.
for example['T2', 'T1', 'T3', 'T4'] ->[1,0,2,3]
The failure event we are concerned about here is Dead,censored data  Alive,So“ Status"Need to be processed into Alive Is 0, Dead Is 1
"""
df = pd.read_csv('SEER Breast Cancer Dataset .csv')

for name in df.columns.values:
    if df[name].dtype == object:
        print(f'Column name:{name}')
        print(f"Value:{df[name].unique().tolist()}\n")

df['Race'] = pd.factorize(df['Race'])[0].astype(np.uint16)
df['Marital_Status'] = pd.factorize(df['Marital_Status'])[0].astype(np.uint16)

df['A_Stage'] = pd.factorize(df['A_Stage'])[0].astype(np.uint16)
df['T_Stage'] = df['T_Stage'].map({'T1':1, 'T2':2,'T3':3,'T4':4})
df['N_Stage'] = df['N_Stage'].map({'N1':0, 'N2':1,'N3':2})
df['Sixth_Stage'] = df['Sixth_Stage'].map({'IIA':0, 'IIB':1,'IIIA':2,'IIIB':3,'IIIC':4})
df['Grade'] = df['Grade'].map({'Well differentiated; Grade I':0, 
                                'Moderately differentiated; Grade II':1,
                                'Poorly differentiated; Grade III':2,
                                'Undifferentiated; anaplastic; Grade IV':3})
df['Estrogen_Status'] = df['Estrogen_Status'].map({'Positive':0, 'Negative':1})
df['Progesterone_Status'] = df['Progesterone_Status'].map({'Positive':0, 'Negative':1})


df['Status'] = df['Status'].map({'Alive':0, 'Dead':1})
df.head()

Column name:Race
Value:['Other (American Indian/AK Native, Asian/Pacific Islander)', 'White', 'Black']

Column name:Marital_Status
Value:['Married (including common law)', 'Divorced', 'Single (never married)', 'Widowed', 'Separated']

Column name:T_Stage
Value:['T2', 'T1', 'T3', 'T4']

Column name:N_Stage
Value:['N3', 'N2', 'N1']

Column name:Sixth_Stage
Value:['IIIC', 'IIIA', 'IIB', 'IIA', 'IIIB']

Column name:Grade
Value:['Moderately differentiated; Grade II', 'Poorly differentiated; Grade III', 'Well differentiated; Grade I', 'Undifferentiated; anaplastic; Grade IV']

Column name:A_Stage
Value:['Regional', 'Distant']

Column name:Estrogen_Status
Value:['Positive', 'Negative']

Column name:Progesterone_Status
Value:['Positive', 'Negative']

Column name:Status
Value:['Alive', 'Dead']
AgeRaceMarital_StatusT_StageN_StageSixth_StageGradeA_StageTumor_SizeEstrogen_StatusProgesterone_StatusRegional_Node_ExaminedReginol_Node_PositiveSurvival_MonthsStatus
04300224104000191110
1470021210450025920
267102012025004121
3461110010190026121
4631021210350021531

2.3 simple data analysis

Draw histogram and box diagram for data to show the relationship between some data and final events

"""
There were 4024 subjects, 3408 cases of deletion and 616 cases of death
"""
print(len(df))
print(df['Status'].value_counts(),'\n')

4024
0    3408
1     616
Name: Status, dtype: int64 
#Columnar distribution of different T (tumor size) stages, deletions and deaths
sns.countplot(x="T_Stage",hue="Status",data=df,)
<AxesSubplot:xlabel='T_Stage', ylabel='count'>

[the external chain picture transfer fails. The source station may have an anti-theft chain mechanism. It is recommended to save the picture and upload it directly (img-mJ86lnlJ-1645703648858)(output_15_1.png)]

#Columnar distribution of different lymph node metastasis, deletion and death
sns.countplot(x="N_Stage",hue="Status",data=df,)
<AxesSubplot:xlabel='N_Stage', ylabel='count'>

[the external chain picture transfer fails. The source station may have an anti-theft chain mechanism. It is recommended to save the picture and upload it directly (img-MVpYq555-1645703648859)(output_16_1.png)]

"""
Draw the age box diagram of deletion group and death group
 The median age of the death group was a little higher than that of the deletion group
"""
ax = df.boxplot(column='Age',by="Status",figsize=(6,6))
ax.set_xticklabels(["censor","death"])
[Text(1, 0, 'censor'), Text(2, 0, 'death')]

[the external chain picture transfer fails. The source station may have anti-theft chain mechanism. It is recommended to save the picture and upload it directly (img-52x0zxw0-1645703648860)(output_17_1.png)]

2.4 drawing life cycle diagram

30 patients were randomly selected in the data set and the life cycle diagram was drawn. Suppose the observation time is 96 months. In 96 months, 5 patients had failure events (death), and most of the remaining patients lost contact / could not be followed up before 96 months (this data belongs to deleted data). For example, patient 1347 in the figure lost contact with the patient at the 62nd month of follow-up and did not know the real survival time of the patient.

Through the life cycle diagram, we can see that many data are deleted data. Whether the average survival time of patients with failure events is calculated separately without combining the deleted data, or the average survival time of patients is calculated together with the deleted data, this calculation underestimates the real survival time, It can be recognized that the right censored data has a great impact on the estimation of survival time, so survival analysis is to solve this kind of problem.

from lifelines.plotting import plot_lifetimes
df_sample = df.sample(n=30,random_state=10)
df_sample['Status'] = np.where(df_sample['Status']== 1,True,False)
plt.figure(figsize=[8,6])
ax = plot_lifetimes(df_sample['Survival_Months'], event_observed=df_sample['Status'],sort_by_duration=False)
ax.vlines(96, 0, 30, lw=2, linestyles='--',colors='black')
plt.xlabel("Months")

3. Start survival analysis

3.1 life table analysis method

The life table divides the events into time periods (months and years), and counts the occurrence of events according to the time period, so as to calculate the survival rate. It is generally used for follow-up data with large sample and large time span (such as 1 year and 2 years). For some reasons, the research object is not contacted every day, but the data of the object is obtained in a certain time interval, such as a year later. Therefore, it is not accurate to record the occurrence time of the outcome, but only know that there is an end event in a certain interval. In this case, the life table can be drawn and the survival rate can be calculated for analysis.

It is assumed that there is a life table (similar frequency table) as follows:

Years of diagnosisdeath tollNumber of deletedInitial number
0~110100
1~21099
2~33198
3~41194
4~52092
5~38790

The death probability and survival probability and survival rate can be calculated through the life table

Probability of death = number of deaths / (initial number - number of deleted / 2)

Survival probability = 1-death probability

Survival rate = survival probability 1 X survival probability 2 X survival probability 3..

For example, the death probability of interval 2-3 = 3 / (98-1 / 2) = 0.0307.

Survival probability = 1-0.0307 = 0.969

Survival rate = 0.99 * 0.9899 * 0.969 = 0.949

The survival curve can be drawn by the calculated survival rate

Years of diagnosisdeath tollNumber of deletedInitial numberProbability of deathSurvival probabilitysurvival rate
0~1101000.010.990.99
1~210990.01010.98990.9799
2~331980.03070.9690.949
3~411940.0170.9890.938
4~520920.0220.9780.918
5~387900.06450.93550.859

In the single factor analysis, we can group according to the influencing factors, make the life table respectively, calculate each index and draw the survival curve for comparison.

3.2 breast cancer dataset generating life table

Calculate the survival rate and draw the survival curve

from lifelines.utils import survival_table_from_events

#The original data is that the survival time is in months, but now it is calculated in years, and then the life table is made
def months_to_year(x):
    if x % 12 ==0:
        return x /12
    else:
        return int(x/12)+1 
df['year'] = df['Survival_Months'].map(months_to_year)
T = df['year']
E = df['Status']
survival_table_from_events(T, E,columns=['removed', 'observed', 'censored', 'entrance', 'at_risk'])
removedobservedcensoredentranceat_risk
event_at
0.000040244024
1.071472404024
2.0114882603953
3.01171001703839
4.022511810703722
5.073910563403497
6.07256066502758
7.07315467702033
8.06743364101302
9.0628116170628
event_atremovedobservedcensoredentranceat_risk
0.000040244024
1.071472404024
2.0114882603953
3.01171001703839
4.022511810703722
5.073910563403497
6.07256066502758
7.07315467702033
8.06743364101302
9.0628116170628

The death probability and survival probability and survival rate are calculated through the above life table

event_atremovedobservedcensoredentranceat_riskprobability of deathprobability of survivingsurvival rate
0.000040244024011
1.0714724040240.01170.98830.9883
2.01148826039530.02230.97770.9663
3.011710017038390.02610.97390.941
4.0225118107037220.032170.96780.911
5.0739105634034970.0330.9670.881
6.072560665027580.02470.9750.858
7.073154677020330.03190.96810.8313
8.067433641013020.03360.96640.803
9.06281161706280.03440.96560.776
#Then draw the survival curve through the calculated survival rate
x = [0,1,2,3,4,5,6,7,8,9]
y = [1, 0.9883, 0.9663, 0.941,0.911,0.881,0.858,0.8313,0.803,0.776]
plt.figure(figsize=(8,8))
plt.title("survival curve") 
plt.step(x,y,color="#8dd3c7", where="pre",lw=2)
plt.xlim(0,9)
plt.ylim(0.7,1)
plt.xlabel("year")
plt.ylabel('S(t)')
plt.show()

[the external chain picture transfer fails. The source station may have anti-theft chain mechanism. It is recommended to save the picture and upload it directly (img-m5vjpVYP-1645703648861)(output_25_0.png)]

3.3 Kaplan Meier method

The above attempts to calculate the survival rate and draw the survival curve through the life table. When estimating the survival rate, the product limit method is actually used, that is, Kaplan Meier method, referred to as K-M method, and K-M method is a nonparametric method

Survival rate S(t), when t=3, means the survival rate of the research object when it has lived for more than 3 years, which is calculated on the premise that the object has lived for more than 2 years, while knowing the survival rate of more than 2 years is calculated on the premise that the research object has lived for more than 1 year.

So s (t = 3) = P1 * P2 * P3, P is the survival probability of each year.

The survival time of the original breast cancer data set is in the monthly unit, and the sample size is more than 4000. Now using the Kaplan meierfitter tool in the lifelines library, you can draw the survival curve using the K-M method.

Draw the survival curve by grouping to show the survival curve of the research object under different factors, so as to conduct single factor analysis.

The KM method can also calculate the median survival time. The median survival time is the time t corresponding to the ordinate = 0.5 (i.e. the cumulative survival rate = 0.5). When drawing multiple groups of survival curves, the differences between groups can be simply compared through the median survival time.


For example, in this figure, the survival time of the blue curve is about 1200, and that of the Yellow curve is 2100. The survival time of the Yellow curve is longer than that of the blue curve.

#KM analysis was used, and the survival curve was drawn and the median survival time was calculated
from lifelines import KaplanMeierFitter
from lifelines.utils import median_survival_times
kmf = KaplanMeierFitter()
kmf.fit(durations=df['Survival_Months'],
       event_observed=df['Status'])

print("Generation probability:")
print(kmf.survival_function_)
print("Intermediate survival probability=0.5 When, 95%section")  
median_confidence_interval = median_survival_times(kmf.confidence_interval_)
print(median_confidence_interval)  #Because the survival probability was not lower than 0.5 during the study period, there was no time corresponding to the intermediate survival probability
plt.figure(figsize=[10,10])
kmf.plot_survival_function(at_risk_counts=True)#show_ Licensors = true, whether to display deletion, at_risk_counts=True show risk count

Generation probability:
          KM_estimate
timeline             
0.0          1.000000
1.0          1.000000
2.0          0.999503
3.0          0.999006
4.0          0.996767
...               ...
103.0        0.785709
104.0        0.785709
105.0        0.785709
106.0        0.785709
107.0        0.785709

[108 rows x 1 columns]
Intermediate survival probability=0.5 When, 95%section
     KM_estimate_lower_0.95  KM_estimate_upper_0.95
0.5                     inf                     inf





<AxesSubplot:xlabel='timeline'>

[the external link image transfer fails. The source station may have an anti-theft chain mechanism. It is recommended to save the image and upload it directly (img-qkNumVLL-1645703648863)(output_27_2.png)]

#Print the short statistical content of the survival curve (life table). Including observations, number of events, number of deletions, survival rate and its confidence interval.
pd.concat([kmf.event_table,kmf.survival_function_,kmf.confidence_interval_],axis=1)
removedobservedcensoredentranceat_riskKM_estimateKM_estimate_lower_0.95KM_estimate_upper_0.95
0.0000402440241.0000001.0000001.000000
1.0101040241.0000001.0000001.000000
2.0321040230.9995030.9980140.999876
3.0422040200.9990060.9973530.999627
4.01091040160.9967670.9944380.998121
...........................
103.05005002510.7857090.7658720.804086
104.04804802010.7857090.7658720.804086
105.04504501530.7857090.7658720.804086
106.04704701080.7857090.7658720.804086
107.0610610610.7857090.7658720.804086

108 rows × 8 columns

3.4 the data were grouped and the single factor survival analysis was carried out by KM method

KM method can be used to analyze the survival rate of the research object of the whole data. Sometimes, in order to study whether the drug brings therapeutic effect to the disease, the experimental group and the control group will be set. When using KM analysis method, the survival curves of the two groups can be drawn to show the difference between the two groups, or the median survival time can be calculated. Assuming that the median survival time of the experimental group is greater than that of the control group, it shows that the drug can delay the life of patients.

However, KM method can only group classified variables. Corresponding to continuous values, continuous values need to be partitioned and then grouped

Comparing the survival differences between the two groups can be compared by median survival time. You can also specify a time point to calculate whether there is a significant difference between the two groups of curves. The area under the survival curve in T time can also be calculated. The differences between different survival curves can be compared by the area under the curve (restricted mean survival time / RMST). Survival differences between the two groups can also be compared by calculating the p-th percentile survival time

"""
yes T-Stage Tumor size was grouped and survival curves were drawn
 It is obvious from the survival curve T4 Patients with large tumors)The survival rate of decreased rapidly
"""
from lifelines import KaplanMeierFitter
plt.figure(figsize=(8,8))
kmf = KaplanMeierFitter()
for name, grouped_df in df.groupby('T_Stage'):
    kmf.fit(grouped_df["Survival_Months"], grouped_df["Status"], label=name)
    kmf.plot_survival_function()


[the external chain picture transfer fails. The source station may have an anti-theft chain mechanism. It is recommended to save the picture and upload it directly (img-8mLgXVID-1645703648864)(output_30_0.png)]

"""
Because age is a continuous value, km The method is only applicable to categorical variables, because the ages of continuous values need to be grouped in intervals
 Age was grouped and survival curves were drawn
"""
from lifelines import KaplanMeierFitter
df['age_'] = pd.cut(df['Age'],[0,40,50,60])
plt.figure(figsize=(8,8))
kmf = KaplanMeierFitter()
for name, grouped_df in df.groupby('age_'):
    kmf.fit(grouped_df["Survival_Months"], grouped_df["Status"], label=name)
    kmf.plot_survival_function()

[the external chain picture transfer fails. The source station may have anti-theft chain mechanism. It is recommended to save the picture and upload it directly (img-NVKc0lxQ-1645703648865)(output_31_0.png)]

"""
The difference in survival between the two groups can be compared by median survival time
(Because of this data, the survival rate of some groups is greater than 0 in the end.5,Unable to calculate the health storage time in, so it is displayed inf)
"""

from lifelines.utils import median_survival_times
kmf = KaplanMeierFitter()
for name, grouped_df in df.groupby('T_Stage'):
    kmf_temp = KaplanMeierFitter().fit(grouped_df["Survival_Months"], grouped_df["Status"])
    median = median_survival_times(kmf_temp)
    print(f"{name}Median survival time of group:{median}")
T1 Median survival time of group:inf
T2 Median survival time of group:inf
T3 Median survival time of group:inf
T4 Median survival time of group:inf
"""
Some data showed that the survival rate was greater than 0 at the end of follow-up.5,That is, the median survival time cannot be calculated.
However, you can specify a third party qth Quantile survival time to obtain the corresponding survival time
"""

from lifelines.utils import qth_survival_time
qth = 0.9
kmf = KaplanMeierFitter()
for name, grouped_df in df.groupby('T_Stage'):
    kmf_temp = KaplanMeierFitter().fit(grouped_df["Survival_Months"], grouped_df["Status"])
    time = qth_survival_time(qth,kmf_temp)
    print(f"{name}Group No{qth*100}Quantile survival time:{time}")
T1 Group 90.0 Quantile survival time:81.0
T2 Group 90.0 Quantile survival time:49.0
T3 Group 90.0 Quantile survival time:40.0
T4 Group 90.0 Quantile survival time:25.0


/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/lifelines/fitters/__init__.py:273: ApproximationWarning: Approximating using `survival_function_`. To increase accuracy, try using or increasing the resolution of the timeline kwarg in `.fit(..., timeline=timeline)`.

  exceptions.ApproximationWarning,
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/lifelines/fitters/__init__.py:273: ApproximationWarning: Approximating using `survival_function_`. To increase accuracy, try using or increasing the resolution of the timeline kwarg in `.fit(..., timeline=timeline)`.

  exceptions.ApproximationWarning,
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/lifelines/fitters/__init__.py:273: ApproximationWarning: Approximating using `survival_function_`. To increase accuracy, try using or increasing the resolution of the timeline kwarg in `.fit(..., timeline=timeline)`.

  exceptions.ApproximationWarning,
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/lifelines/fitters/__init__.py:273: ApproximationWarning: Approximating using `survival_function_`. To increase accuracy, try using or increasing the resolution of the timeline kwarg in `.fit(..., timeline=timeline)`.

  exceptions.ApproximationWarning,
"""
The survival difference corresponding to this time can be compared by specifying the time (for example, the survival difference in the 60th month I am concerned about))
"""
from lifelines.statistics import survival_difference_at_fixed_point_in_time_test
from lifelines.datasets import load_waltons

T = df['Survival_Months']
E = df['Status']
ix = df['A_Stage']== 0
kmf_1 = KaplanMeierFitter().fit(T[ix],  E[ix])
kmf_2 = KaplanMeierFitter().fit(T[~ix],  E[~ix])

point_in_time = 60. #Specify when to compare
results = survival_difference_at_fixed_point_in_time_test(point_in_time, kmf_1, kmf_2)
results.print_summary()
null_distributionchi squared
degrees_of_freedom1
point_in_time60
fitterA<lifelines.KaplanMeierFitter:"KM_estimate", fi...
fitterB<lifelines.KaplanMeierFitter:"KM_estimate", fi...
test_namesurvival_difference_at_fixed_point_in_time_test
test_statisticp-log2(p)
040.84<0.00532.49
"""
Restricted mean survival time/RMST
 The survival difference was compared by calculating the area under different survival curves
"""
from lifelines.utils import restricted_mean_survival_time
from lifelines.plotting import rmst_plot

T = df['Survival_Months']
E = df['Status']
ix = df['A_Stage']== 0 #0 is' Regional '

time_limit = 60#Within the specified time

kmf_1 = KaplanMeierFitter().fit(T[ix], E[ix])
rmst_1 = restricted_mean_survival_time(kmf_1, t=time_limit)
print(f"Area under curve:{rmst_1}")

kmf_2 = KaplanMeierFitter().fit(T[~ix], E[~ix])
rmst_2 = restricted_mean_survival_time(kmf_2, t=time_limit)
print(f"Area under curve:{rmst_2}")

#Plot the area of the curve
plt.figure(figsize=(10,12))
ax = plt.subplot(311)
rmst_plot(kmf_1, t=time_limit, ax=ax)
ax = plt.subplot(312)
rmst_plot(kmf_2, t=time_limit, ax=ax)
ax = plt.subplot(313)
rmst_plot(kmf_1, model2=kmf_2, t=time_limit, ax=ax)
Area under curve:57.22662022019032
 Area under curve:49.82516339891577





<AxesSubplot:xlabel='timeline'>

[the external chain picture transfer fails. The source station may have anti-theft chain mechanism. It is recommended to save the picture and upload it directly (img-lqgOVtUY-1645703648866)(output_35_2.png)]

3.5 log rank test

After using KM method to draw multiple survival curves according to groups, it is not sufficient to determine whether there are significant differences between multiple curves only through direct observation. Log rank test makes up for this.

There was no significant difference in the distribution of survival time between the zero hypothesis group and the group of log rank test. When the calculated P value is less than 0.05, the null hypothesis is overturned and it is considered that there is a significant difference in the distribution of survival time between the two groups.

"""
The survival curves of the two groups were compared
 result A_Stage Group P Value less than 0.005,There are significant differences(Direct observation of the graph also shows significant differences)
"""
from lifelines.statistics import logrank_test
from lifelines import KaplanMeierFitter
plt.figure(figsize=(8,8))
kmf = KaplanMeierFitter()
for name, grouped_df in df.groupby('A_Stage'):
    kmf.fit(grouped_df["Survival_Months"], grouped_df["Status"], label=name)
    kmf.plot_survival_function()

T = df['Survival_Months']
E = df['Status']
ix = df['A_Stage']== 'Distant'
result = logrank_test(T[ix],T[~ix], E[ix],E[~ix])
result.print_summary()

t_0-1
null_distributionchi squared
degrees_of_freedom0
test_namelogrank_test
test_statisticp-log2(p)
00.00nannan

[the external chain picture transfer fails. The source station may have an anti-theft chain mechanism. It is recommended to save the picture and upload it directly (img-R66tN956-1645703648868)(output_37_1.png)]

"""
Compare whether there is significant difference in the survival curve between the two groups
 result T_Stage Group P Value less than 0.005,There are significant differences(Direct observation of the graph also shows significant differences)
"""

from lifelines.statistics import multivariate_logrank_test
kmf = KaplanMeierFitter()
for name, grouped_df in df.groupby('N_Stage'):
    kmf.fit(grouped_df["Survival_Months"], grouped_df["Status"], label=name)
    kmf.plot_survival_function()
results = multivariate_logrank_test(df['Survival_Months'], df['N_Stage'], df['Status'])
results.print_summary()
print(results.p_value)
t_0-1
null_distributionchi squared
degrees_of_freedom2
test_namemultivariate_logrank_test
test_statisticp-log2(p)
0298.86<0.005215.58
1.2701885843852175e-65

[the external chain picture transfer fails, and the source station may have an anti-theft chain mechanism. It is recommended to save the picture and upload it directly (IMG rnjouauur-1645703648869) (output_38_2. PNG)]

3.6 parameter estimation

If the distribution of survival time is known in advance, the parameter method can be used: the parameters in the assumed distribution model can be estimated according to the sample observations to obtain the probability distribution model of survival time.
The distributions of survival time are exponential distribution, Weibull distribution, lognormal distribution, loglogistic distribution and Gamma distribution. The analysis process is similar to KM method. Because the distribution of survival time of real data is often unknown, all parameter estimation is not a common survival analysis method.

"""
Simple display if used linelifes To use the parameter method for survival analysis and draw the survival curve.
"""
from lifelines.datasets import load_waltons
from lifelines.utils import median_survival_times
import matplotlib.pyplot as plt
import numpy as np
from lifelines import *
# Data loading
T = df['Survival_Months']
E = df['Status']

fig, axes = plt.subplots(3, 3, figsize=(13.5, 7.5))

# Multi parameter model
kmf = KaplanMeierFitter().fit(T, E, label='KaplanMeierFitter')
wbf = WeibullFitter().fit(T, E, label='WeibullFitter')
exf = ExponentialFitter().fit(T, E, label='ExponentialFitter')
lnf = LogNormalFitter().fit(T, E, label='LogNormalFitter')
llf = LogLogisticFitter().fit(T, E, label='LogLogisticFitter')
pwf = PiecewiseExponentialFitter([40, 60]).fit(T, E, label='PiecewiseExponentialFitter')
ggf = GeneralizedGammaFitter().fit(T, E, label='GeneralizedGammaFitter')
sf = SplineFitter(np.percentile(T.loc[E.astype(bool)], [0, 50, 100])).fit(T, E, label='SplineFitter')

wbf.plot_survival_function(ax=axes[0][0])
exf.plot_survival_function(ax=axes[0][1])
lnf.plot_survival_function(ax=axes[0][2])
kmf.plot_survival_function(ax=axes[1][0])
llf.plot_survival_function(ax=axes[1][1])
pwf.plot_survival_function(ax=axes[1][2])
ggf.plot_survival_function(ax=axes[2][0])
sf.plot_survival_function(ax=axes[2][1])

<AxesSubplot:>

[the external chain picture transfer fails. The source station may have an anti-theft chain mechanism. It is recommended to save the picture and upload it directly (img-RJkcN4V3-1645703648869)(output_40_1.png)]

3.7 proportional hazards regression model – Cox model

Cox proportional hazards regression model was proposed by the British statistician D.R.Cox (unfortunately, D.R.Cox died in January 2022) in 1972 for the basic prognostic analysis of tumors.

The formula of Cox proportional regression is as follows:


Where, T represents the survival time, x is the covariate, b is the regression coefficient of the covariate, h0 is the benchmark risk, and represents the risk rate of death at time t when all covariates are 0. h(t) is the risk function of the research object, which can be understood as time t death risk. exp(bi) is the risk ratio of xi of the covariate, the covariate with the value of risk ratio greater than 1 is the risk factor, and the covariate with the value greater than 1 is the protective factor.

Cox proportional hazards regression model belongs to the semi parametric method, which can estimate the survival function, compare two or more groups of survival distribution functions, and can be single factor analysis or multi factor analysis. Univariate analysis includes classified variables and continuous variables (KM method can only analyze classified variables). The relationship model between survival time and influencing factors can be established to calculate the risk ratio of influencing factors. Note that Cox regression requires to meet the proportional risk assumption, that is, the influencing factors do not change with time. If the variables are not satisfied, the variables should be stratified and then Cox regression.

3.8 divide the data into training set and verification set

Firstly, delete the missing rows in the data, and then 8:2 divide the data into training set and verification set

#Load data
df = pd.read_csv('SEER Breast Cancer Dataset .csv')

df['Race'] = pd.factorize(df['Race'])[0].astype(np.uint16)
df['Marital_Status'] = pd.factorize(df['Marital_Status'])[0].astype(np.uint16)

df['A_Stage'] = pd.factorize(df['A_Stage'])[0].astype(np.uint16)
df['T_Stage'] = df['T_Stage'].map({'T1':1, 'T2':2,'T3':3,'T4':4})
df['N_Stage'] = df['N_Stage'].map({'N1':0, 'N2':1,'N3':2})
df['Sixth_Stage'] = df['Sixth_Stage'].map({'IIA':0, 'IIB':1,'IIIA':2,'IIIB':3,'IIIC':4})
df['Grade'] = df['Grade'].map({'Well differentiated; Grade I':0, 
                                'Moderately differentiated; Grade II':1,
                                'Poorly differentiated; Grade III':2,
                                'Undifferentiated; anaplastic; Grade IV':3})
df['Estrogen_Status'] = df['Estrogen_Status'].map({'Positive':0, 'Negative':1})
df['Progesterone_Status'] = df['Progesterone_Status'].map({'Positive':0, 'Negative':1})
df['Status'] = df['Status'].map({'Alive':0, 'Dead':1})
#Delete rows with missing values
df = df.dropna(axis=0)

#Split training set and verification set
df_train = df.sample(frac=0.8,random_state=2022)
df_val = df.drop(index = df_train.index)
print(f"Training set size:{len(df_train)}")
print(f"Validation set size:{len(df_val)}")
print("*"*20)
#Count the number of failure events and deleted data
print(df_train['Status'].value_counts())
print(df_val['Status'].value_counts())

Training set size: 3219
 Verification set size: 805
********************
0    2740
1     479
Name: Status, dtype: int64
0    668
1    137
Name: Status, dtype: int64

3.9 build Cox model and start training

Through print_ The summary () model outputs three tables

Table 1: duration col is the time column name, event col is the event column name, breslow is the model parameter estimation method, 3219 is the total data volume, and 479 is the number of failure events

modellifelines.CoxPHFitter
duration col'Survival_Months'
event col'Status'
baseline estimationbreslow
number of observations3219
number of events observed479
partial log-likelihood-3516.57
time fit was run2022-02-06 03:27:20 UTC

Table 2: coef is the regression coefficient of the covariate, exp(coef) is the risk ratio, se(coef) is the standard error of the coefficient, 95% confidence zone of the standard error, 95% confidence zone of the hazard ratio, Wald statistical value (z), whether the P value of Wald statistical value is significant.

How to interpret the risk ratio? When the risk ratio is greater than 1, it is a risk factor, the increase of value will increase the death risk of patients, and less than 1 is a protective factor to reduce the death risk of patients.

For example, the p value of Age is less than 0.005, the risk ratio HR = 1.02, and the 95% confidence interval is 1.01-1.03, indicating a relationship between Age value and increased risk of death. Keeping other covariates unchanged, Age increased by one unit and the patient's risk increased by 2%.

coefexp(coef)se(coef)coef lower 95%coef upper 95%exp(coef) lower 95%exp(coef) upper 95%zp-log2§
Age0.021.020.010.010.031.011.033.09<0.0058.95
Race0.361.430.120.130.591.141.803.09<0.0058.96
Marital_Status0.071.080.04-0.010.160.991.171.760.083.69

Table 3: Concordance is the consistency index. All covariates are used to train the cox model, and the consistency index is 0.75

Concordance0.75
Partial AIC7059.14
log-likelihood ratio test398.31 on 13 df
-log2§ of ll-ratio test253.44
"""
Univariate analysis
"""
from lifelines import CoxPHFitter
varList = df_train.columns.to_list()
varList.pop()#Remove time column
varList.pop()#Remove the status column

#Create Cox regression model
cph = CoxPHFitter()
for var in varList:
    cph.fit(df=df_train,duration_col='Survival_Months',event_col='Status',formula=var)
    print(f"{var} of P Value:{cph.summary.p[var]},<0.05 {cph.summary.p[var]<0.05}")

Age of P Value: 0.0248649940872201,<0.05 True
Race of P Value: 1.3550638342845673e-06,<0.05 True
Marital_Status of P Value: 8.68912013912815e-05,<0.05 True
T_Stage of P Value: 2.2304661752517098e-20,<0.05 True
N_Stage of P Value: 1.6617003098137667e-45,<0.05 True
Sixth_Stage of P Value: 1.0290156463874445e-47,<0.05 True
Grade of P Value: 6.022575977628071e-23,<0.05 True
A_Stage of P Value: 1.0633163182382273e-08,<0.05 True
Tumor_Size of P Value: 5.79647504244188e-16,<0.05 True
Estrogen_Status of P Value: 1.0415072293170037e-27,<0.05 True
Progesterone_Status of P Value: 5.007000090288116e-25,<0.05 True
Regional_Node_Examined of P Value: 0.1065243965415506,<0.05 False
Reginol_Node_Positive of P Value: 1.5269471157224244e-46,<0.05 True
"""
Multivariate analysis
"""
from lifelines import CoxPHFitter
#Create Cox regression model
cph = CoxPHFitter()
#df incoming DataFrame format data, duration_col column name of incoming time, event_col the column name of the incoming event. By default, all covariates are used
#Some covariates can be passed in through the parameter formula='Age+Race '
cph.fit(df=df_train,duration_col='Survival_Months',event_col='Status',show_progress=True)
#Print model
cph.print_summary()
Iteration 1: norm_delta = 1.07058, step_size = 0.9000, log_lik = -3715.72192, newton_decrement = 255.86131, seconds_since_start = 0.2
Iteration 2: norm_delta = 0.47681, step_size = 0.9000, log_lik = -3575.51461, newton_decrement = 52.18325, seconds_since_start = 0.4
Iteration 3: norm_delta = 0.12309, step_size = 0.9000, log_lik = -3520.88106, newton_decrement = 3.96398, seconds_since_start = 0.6
Iteration 4: norm_delta = 0.01909, step_size = 1.0000, log_lik = -3516.63264, newton_decrement = 0.06359, seconds_since_start = 0.9
Iteration 5: norm_delta = 0.00045, step_size = 1.0000, log_lik = -3516.56817, newton_decrement = 0.00003, seconds_since_start = 1.1
Iteration 6: norm_delta = 0.00000, step_size = 1.0000, log_lik = -3516.56814, newton_decrement = 0.00000, seconds_since_start = 1.3
Convergence success after 6 iterations.
modellifelines.CoxPHFitter
duration col'Survival_Months'
event col'Status'
baseline estimationbreslow
number of observations3219
number of events observed479
partial log-likelihood-3516.57
time fit was run2022-02-23 17:30:27 UTC
coefexp(coef)se(coef)coef lower 95%coef upper 95%exp(coef) lower 95%exp(coef) upper 95%zp-log2(p)
Age0.021.020.010.010.031.011.033.09<0.0058.95
Race0.361.430.120.130.591.141.803.09<0.0058.96
Marital_Status0.071.080.04-0.010.160.991.171.760.083.69
T_Stage0.371.450.110.160.581.171.793.42<0.00510.63
N_Stage0.361.430.180.020.701.022.022.050.044.63
Sixth_Stage-0.001.000.11-0.230.220.801.25-0.010.990.01
Grade0.471.600.080.320.631.371.876.02<0.00529.06
A_Stage0.001.000.21-0.410.410.661.510.010.990.01
Tumor_Size-0.001.000.00-0.010.000.991.00-0.640.520.93
Estrogen_Status0.541.720.150.240.841.282.313.58<0.00511.50
Progesterone_Status0.551.730.120.310.781.372.184.58<0.00517.71
Regional_Node_Examined-0.030.970.01-0.05-0.020.950.98-4.69<0.00518.49
Reginol_Node_Positive0.051.050.010.030.081.031.084.10<0.00514.58

Concordance0.75
Partial AIC7059.14
log-likelihood ratio test398.31 on 13 df
-log2(p) of ll-ratio test253.44
#Draw the HR value and 95% confidence interval of each covariate (forest diagram)
plt.figure(figsize=(8,8))
cph.plot(hazard_ratios=True)

"""
Combine the above print_summary()The table shows that 95% of covariates%The confidence interval crosses the covariate of 1 on the abscissa axis p Values are greater than 0.05
"""
'\n Combine the above print_summary()The table shows that 95% of covariates%The confidence interval crosses the covariate of 1 on the abscissa axis p Values are greater than 0.05\n'

[the external chain picture transfer fails. The source station may have an anti-theft chain mechanism. It is recommended to save the picture and upload it directly (img-hxYOl86A-1645703648871)(output_47_1.png)]

#Remove the insignificant variables and re fit the cox model
FORMULA = 'Age + Race  + T_Stage + N_Stage+Grade+Estrogen_Status+Progesterone_Status+Regional_Node_Examined+Reginol_Node_Positive'
cph.fit(df=df_train,duration_col='Survival_Months',event_col='Status',formula=FORMULA)
#Print model
cph.print_summary()
modellifelines.CoxPHFitter
duration col'Survival_Months'
event col'Status'
baseline estimationbreslow
number of observations3219
number of events observed479
partial log-likelihood-3518.31
time fit was run2022-02-23 17:30:49 UTC
coefexp(coef)se(coef)coef lower 95%coef upper 95%exp(coef) lower 95%exp(coef) upper 95%zp-log2(p)
Age0.021.020.010.010.031.011.033.32<0.00510.13
Estrogen_Status0.551.730.150.250.841.292.333.65<0.00511.87
Grade0.471.610.080.320.631.381.876.06<0.00529.43
N_Stage0.351.420.090.160.531.181.713.68<0.00512.05
Progesterone_Status0.551.730.120.310.781.372.184.57<0.00517.68
Race0.401.490.120.170.621.191.873.45<0.00510.78
Reginol_Node_Positive0.051.050.010.030.081.031.084.31<0.00515.89
Regional_Node_Examined-0.030.970.01-0.05-0.020.950.98-4.68<0.00518.42
T_Stage0.331.390.060.220.441.241.565.80<0.00527.13

Concordance0.75
Partial AIC7054.62
log-likelihood ratio test394.83 on 9 df
-log2(p) of ll-ratio test261.63
#Draw survival curve S(t)
cph.baseline_survival_.plot()
<AxesSubplot:>

[the external chain picture transfer fails. The source station may have an anti-theft chain mechanism. It is recommended to save the picture and upload it directly (img-AXbYqi6u-1645703648872)(output_49_1.png)]

#Cumulative risk curve H(t)
cph.baseline_cumulative_hazard_.plot()
<AxesSubplot:>

[the external chain picture transfer fails. The source station may have an anti-theft chain mechanism. It is recommended to save the picture and upload it directly (img-Ai7eyoCy-1645703648873)(output_50_1.png)]

#Because the cumulative risk function H(t) = -lnS(t)
#Now calculate H(t) and draw the curve. The result is consistent with the figure above.
(-np.log(cph.baseline_survival_)).plot()
<AxesSubplot:>

[the external chain picture transfer fails, and the source station may have an anti-theft chain mechanism. It is recommended to save the picture and upload it directly (img-glexcvsvj-1645703648874) (output_51_1. PNG)]

#Draw the risk function h(t) corresponding to each time point
cph.baseline_hazard_.plot()

<AxesSubplot:>

[the external chain picture transfer fails. The source station may have an anti-theft chain mechanism. It is recommended to save the picture and upload it directly (img-u17GVfeN-1645703648874)(output_52_1.png)]

print(cph.predict_partial_hazard(pd.DataFrame(df_val.iloc[0]).T))#Predicting individual risk
print(cph.predict_percentile(pd.DataFrame(df_val.iloc[10]).T,p=0.9))#Predict the survival time of the 90th quantile
print(cph.predict_median(pd.DataFrame(df_val.iloc[10]).T))#Predict the median survival time of individuals
print(cph.predict_cumulative_hazard(df_val.iloc[0]))#Predicting individual cumulative risk function
print(cph.predict_survival_function(df_val.iloc[0]))#Predicting individual survival function

1    0.573792
dtype: float64
96.0
inf
              1
1.0    0.000000
2.0    0.000118
3.0    0.000353
4.0    0.001059
5.0    0.001414
...         ...
103.0  0.103245
104.0  0.103245
105.0  0.103245
106.0  0.103245
107.0  0.103245

[107 rows x 1 columns]
              1
1.0    1.000000
2.0    0.999882
3.0    0.999647
4.0    0.998941
5.0    0.998587
...         ...
103.0  0.901906
104.0  0.901906
105.0  0.901906
106.0  0.901906
107.0  0.901906

[107 rows x 1 columns]
from lifelines.utils import concordance_index
#Calculate c-index
#Method 1
print(f"Training set c-index:{cph.score(df_train, scoring_method='concordance_index')}")
print(f"Validation set c-index:{cph.score(df_val, scoring_method='concordance_index')}")
#Method 2
print(concordance_index(df_train['Survival_Months'], -cph.predict_partial_hazard(df_train), df_train['Status']))
#Predict the survival rate of a study object in the validation set
cph.predict_survival_function(df_val.iloc[0])
Training set c-index:0.7460170801920486
 Validation set c-index:0.7108462839198003
0.7460166658656356
1
1.01.000000
2.00.999882
3.00.999647
4.00.998941
5.00.998587
......
103.00.901906
104.00.901906
105.00.901906
106.00.901906
107.00.901906

107 rows × 1 columns

"""
Can pass k Fold cross validation, try the combination of all covariates to find the best combination
 There are more than 8190 combinations here, and the running time will be very long. There is no need to run them and the whole project will not be affected
"""
from lifelines.utils import k_fold_cross_validation
from itertools import combinations

def combine(temp_list, n):
    '''according to n Get all possible combinations in the list( n (elements as a group)'''
    temp_list2 = []
    for c in combinations(temp_list, n):
        temp_list2.append(c)
    return temp_list2
varList = df_train.columns.to_list()
varList.pop()#Remove "Status"
varList.pop()#Remove "Survival_Months"

end_list = []
for i in range(len(varList)):
    if combine(varList, i)!=[()]:
        end_list.extend(combine(varList, i))
#Print combination quantity
print(len(end_list))
mean_score = []#Save the c-index score of k-fold crossover of each combination
for i in set(end_list):
    columns = list(i)+['Survival_Months','Status']
    df_temp = df_train.loc[:,columns]
    scores = k_fold_cross_validation(CoxPHFitter(), df_temp, 'Survival_Months', event_col='Status', k=10,scoring_method="concordance_index")
    mean_score.append(np.mean(scores))
#Print highest score
print(np.max(mean_score))
#Print the covariate combination with the highest score
print(end_list[mean_score.index(np.max(mean_score))])

3.10 how does the covariate value affect the survival curve

"""
The baseline survival curve of the model was compared with what happened when the values of covariates in a group changed.
When we change covariates(s)When other conditions are the same, this helps to compare the survival of subjects.
"""

cph.plot_partial_effects_on_outcome('T_Stage',values=[1,2,3,4])
<AxesSubplot:>

[the external link image transfer fails. The source station may have an anti-theft chain mechanism. It is recommended to save the image and upload it directly (img-Q7TFBpGb-1645703648875)(output_57_1.png)]

"""
Observe how the values of multiple covariates affect the survival curve
"""

cph.plot_partial_effects_on_outcome(['T_Stage', 'N_Stage'], values=[[0, 0], [1, 0], [1,1], [1,2]], cmap='coolwarm')
<AxesSubplot:>

[the external chain picture transfer fails. The source station may have an anti-theft chain mechanism. It is recommended to save the picture and upload it directly (img-oRVpOhzx-1645703648876)(output_58_1.png)]

#Draw a smooth calibration curve
from lifelines.calibration import survival_probability_calibration
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
#Calibration curve of the training set, t0 (float) – time to evaluate the probability of occurrence of the event
survival_probability_calibration(cph,df_train,t0=36,ax=axes[0])
#Calibration curve of validation set
survival_probability_calibration(cph,df_val,t0=36,ax=axes[1])
"""
ICI – Predicted and observed mean absolute difference
E50 – Median absolute difference between predicted and observed
"""
ICI =  0.0033427008106428234
E50 =  0.0022028810234795415
ICI =  0.008988198843408044
E50 =  0.008933834566536292





'\nICI – Predicted and observed mean absolute difference\nE50 – Median absolute difference between predicted and observed\n'

[the external chain picture transfer fails. The source station may have an anti-theft chain mechanism. It is recommended to save the picture and upload it directly (img-qDMleuTf-1645703648877)(output_59_2.png)]

3.11 testing proportional hazards (PH) assumptions

As mentioned before, the Cox regression model needs to meet the proportional risk assumption, that is, the covariate does not change with time, for example, whether a covariate is treated with a certain drug, but this treatment does not occur only once, but the amount of treatment will increase with time. Unlike marital status, race, gender and other covariates do not change over time.

[how to test PH hypothesis]

If PH assumption is satisfied or not, the cox regression model established needs special treatment, which can be checked here_ The assumptions () tool is used to test whether all covariates meet the PH hypothesis. This tool will give the P value of statistical test through the statistical test method, and judge through the P value. Generally, the covariate with P less than 0.05 does not meet the proportional risk sharing assumption.
check_ The output of the assets () tool has three parts

Part I:

test_statisticp-log2§
A_Stagekm4.910.035.23
rank4.430.044.83
Estrogen_Statuskm4.990.035.29
rank6.110.016.22

The null hypothesis variable satisfies the PH hypothesis. When the P value is less than 0.05, reject the null hypothesis, that is, this variable does not satisfy the PH hypothesis, such as estrogen in the table_ Status variable.

Part II:

2. Variable 'Estrogen_Status' failed the non-proportional test: p-value is 0.0134.

   Advice: with so few unique values (only 2), you can include `strata=['Estrogen_Status', ...]` in
the call in `.fit`. See documentation in link [E] below.

   Bootstrapping lowess lines. May take a moment...


It is pointed out that those covariates do not meet the PH assumption, and the solution is given.

Part III:

Estrogen_ Schoenfeld residuals diagram of status variable

If the variable satisfies the PH assumption, the black solid line (variation of variable influence coefficient with time) in the figure should be parallel to the blue horizontal dotted line. On the contrary, as shown in the figure, the black solid line also changes with time.

[how to solve the problem if PH assumption is not met]

  1. Method 1: [layering] layering the variable that does not meet the PH assumption, and cox modeling the remaining covariates

  2. Method 2: [time dependent variable covariate]

If the PH assumption is not satisfied, Cox regression with time-dependent covariates should be used to fit the model.

cph.check_assumptions(df_train, p_value_threshold=0.05, show_plots=True)
The ``p_value_threshold`` is set at 0.05. Even under the null hypothesis of no violations, some
covariates will be below the threshold by chance. This is compounded when there are many covariates.
Similarly, when there are lots of observations, even minor deviances from the proportional hazard
assumption will be flagged.

With that in mind, it's best to use a combination of statistical tests and visual tests to determine
the most serious violations. Produce visual plots using ``check_assumptions(..., show_plots=True)``
and looking for non-constant lines. See link [A] below for a full example.
null_distributionchi squared
degrees_of_freedom1
model<lifelines.CoxPHFitter: fitted with 3219 total...
test_nameproportional_hazard_test
test_statisticp-log2(p)
Agekm3.220.073.78
rank4.690.035.04
Estrogen_Statuskm4.990.035.29
rank6.110.016.22
Gradekm0.030.860.22
rank0.010.910.13
N_Stagekm1.310.251.99
rank1.260.261.93
Progesterone_Statuskm10.13<0.0059.42
rank9.77<0.0059.14
Racekm0.220.640.65
rank0.070.790.34
Reginol_Node_Positivekm0.850.361.49
rank0.840.361.48
Regional_Node_Examinedkm0.000.950.07
rank0.040.840.25
T_Stagekm0.000.950.07
rank0.030.860.21
1. Variable 'Age' failed the non-proportional test: p-value is 0.0304.

   Advice 1: the functional form of the variable 'Age' might be incorrect. That is, there may be
non-linear terms missing. The proportional hazard test used is very sensitive to incorrect
functional forms. See documentation in link [D] below on how to specify a functional form.

   Advice 2: try binning the variable 'Age' using pd.cut, and then specify it in `strata=['Age',
...]` in the call in `.fit`. See documentation in link [B] below.

   Advice 3: try adding an interaction term with your time variable. See documentation in link [C]
below.


   Bootstrapping lowess lines. May take a moment...


2. Variable 'Estrogen_Status' failed the non-proportional test: p-value is 0.0134.

   Advice: with so few unique values (only 2), you can include `strata=['Estrogen_Status', ...]` in
the call in `.fit`. See documentation in link [E] below.

   Bootstrapping lowess lines. May take a moment...


3. Variable 'Progesterone_Status' failed the non-proportional test: p-value is 0.0015.

   Advice: with so few unique values (only 2), you can include `strata=['Progesterone_Status', ...]`
in the call in `.fit`. See documentation in link [E] below.

   Bootstrapping lowess lines. May take a moment...


---
[A]  https://lifelines.readthedocs.io/en/latest/jupyter_notebooks/Proportional%20hazard%20assumption.html
[B]  https://lifelines.readthedocs.io/en/latest/jupyter_notebooks/Proportional%20hazard%20assumption.html#Bin-variable-and-stratify-on-it
[C]  https://lifelines.readthedocs.io/en/latest/jupyter_notebooks/Proportional%20hazard%20assumption.html#Introduce-time-varying-covariates
[D]  https://lifelines.readthedocs.io/en/latest/jupyter_notebooks/Proportional%20hazard%20assumption.html#Modify-the-functional-form
[E]  https://lifelines.readthedocs.io/en/latest/jupyter_notebooks/Proportional%20hazard%20assumption.html#Stratification






[[<AxesSubplot:xlabel='rank-transformed time\n(p=0.0304)'>,
  <AxesSubplot:xlabel='km-transformed time\n(p=0.0729)'>],
 [<AxesSubplot:xlabel='rank-transformed time\n(p=0.0134)'>,
  <AxesSubplot:xlabel='km-transformed time\n(p=0.0256)'>],
 [<AxesSubplot:xlabel='rank-transformed time\n(p=0.0018)'>,
  <AxesSubplot:xlabel='km-transformed time\n(p=0.0015)'>]]

[the external chain picture transfer fails. The source station may have an anti-theft chain mechanism. It is recommended to save the picture and upload it directly (img-GxkuF4CL-1645703648878)(output_61_4.png)]

[the external chain picture transfer fails. The source station may have an anti-theft chain mechanism. It is recommended to save the picture and upload it directly (img-guK9Oao6-1645703648879)(output_61_5.png)]

[the external chain picture transfer fails. The source station may have anti-theft chain mechanism. It is recommended to save the picture and upload it directly (img-mXXEf9qq-1645703648880)(output_61_6.png)]

Pass check_ Through the PH test with the tools of assessments (), it is found that there are three covariates that do not meet the PH assumption, namely Age (continuous variable) and Estrogen_Status (classification variable), progesterone_ Status (classification variable). These variables can be handled by "layering". For example, for the total sample, according to estrogen_ The status classification variable divides the data into sub samples, and then cox models all sub samples. The continuous variable Age can also be processed by layering. First, the continuous variable is divided into multiple levels through the interval (continuous variable - > classified variable), and then cox modeling is carried out for all sub samples.

"""
Process classification variables first Estrogen_Status and Progesterone_Status
"""
FORMULA = 'Age + Race  + T_Stage + N_Stage+Grade+Regional_Node_Examined+Reginol_Node_Positive'
cph_wexp = CoxPHFitter()
cph_wexp.fit(df_train, 'Survival_Months','Status',formula =FORMULA,strata=['Estrogen_Status','Progesterone_Status'])
#Now the remaining Age does not meet the PH assumption
cph_wexp.check_assumptions(df_train, show_plots=True)

The ``p_value_threshold`` is set at 0.01. Even under the null hypothesis of no violations, some
covariates will be below the threshold by chance. This is compounded when there are many covariates.
Similarly, when there are lots of observations, even minor deviances from the proportional hazard
assumption will be flagged.

With that in mind, it's best to use a combination of statistical tests and visual tests to determine
the most serious violations. Produce visual plots using ``check_assumptions(..., show_plots=True)``
and looking for non-constant lines. See link [A] below for a full example.
null_distributionchi squared
degrees_of_freedom1
model<lifelines.CoxPHFitter: fitted with 3219 total...
test_nameproportional_hazard_test
test_statisticp-log2(p)
Agekm3.270.073.82
rank6.040.016.16
Gradekm0.020.890.18
rank0.890.351.54
N_Stagekm1.410.242.09
rank0.050.820.28
Racekm0.390.530.91
rank0.720.401.34
Reginol_Node_Positivekm1.080.301.74
rank1.060.301.72
Regional_Node_Examinedkm0.000.970.04
rank0.340.560.84
T_Stagekm0.010.910.14
rank0.010.940.09
1. Variable 'Age' failed the non-proportional test: p-value is 0.0140.

   Advice 1: the functional form of the variable 'Age' might be incorrect. That is, there may be
non-linear terms missing. The proportional hazard test used is very sensitive to incorrect
functional forms. See documentation in link [D] below on how to specify a functional form.

   Advice 2: try binning the variable 'Age' using pd.cut, and then specify it in `strata=['Age',
...]` in the call in `.fit`. See documentation in link [B] below.

   Advice 3: try adding an interaction term with your time variable. See documentation in link [C]
below.


   Bootstrapping lowess lines. May take a moment...


---
[A]  https://lifelines.readthedocs.io/en/latest/jupyter_notebooks/Proportional%20hazard%20assumption.html
[B]  https://lifelines.readthedocs.io/en/latest/jupyter_notebooks/Proportional%20hazard%20assumption.html#Bin-variable-and-stratify-on-it
[C]  https://lifelines.readthedocs.io/en/latest/jupyter_notebooks/Proportional%20hazard%20assumption.html#Introduce-time-varying-covariates
[D]  https://lifelines.readthedocs.io/en/latest/jupyter_notebooks/Proportional%20hazard%20assumption.html#Modify-the-functional-form
[E]  https://lifelines.readthedocs.io/en/latest/jupyter_notebooks/Proportional%20hazard%20assumption.html#Stratification






[[<AxesSubplot:xlabel='rank-transformed time\n(p=0.0140)'>,
  <AxesSubplot:xlabel='km-transformed time\n(p=0.0707)'>]]

[the external chain picture transfer fails. The source station may have an anti-theft chain mechanism. It is recommended to save the picture and upload it directly (img-HoIfvY9C-1645703648880)(output_63_4.png)]

"""
Now deal with continuous variables age,Using interval pair age Divide and make into classification variables.
Display with box diagram age The minimum age distribution is 30 years old and the most is about 68 years old, which is concentrated between 47 and 62 years old
"""
df_train.boxplot(column='Age')
df_train2  = df_train.copy()
df_train2['age_strata'] = pd.cut(df_train2['Age'], np.array([30, 45, 60,75]))
df_train2['age_strata'] = pd.factorize(df_train2['age_strata'])[0].astype(np.uint16)

[the external chain picture transfer fails. The source station may have an anti-theft chain mechanism. It is recommended to save the picture and upload it directly (img-FQq8tKTO-1645703648881)(output_64_0.png)]

FORMULA = 'Race  + T_Stage + N_Stage+Grade+Regional_Node_Examined+Reginol_Node_Positive'
cph_strata = CoxPHFitter()
cph_strata.fit(df_train2, 'Survival_Months','Status',formula =FORMULA,strata=['age_strata','Estrogen_Status','Progesterone_Status'])
cph_strata.check_assumptions(df_train2, show_plots=True)
#All variables that do not conform to PH assumptions have been processed. Prompt: proportional hazard assessment looks okay
Proportional hazard assumption looks okay.





[]
#Finally, print the cox model after dealing with covariates that do not meet the PH assumption
#The c-index is found to be lower.
cph_strata.print_summary()
()
cph_strata.plot()
modellifelines.CoxPHFitter
duration col'Survival_Months'
event col'Status'
strata[age_strata, Estrogen_Status, Progesterone_Sta...
baseline estimationbreslow
number of observations3219
number of events observed479
partial log-likelihood-2590.82
time fit was run2022-02-23 17:46:56 UTC
coefexp(coef)se(coef)coef lower 95%coef upper 95%exp(coef) lower 95%exp(coef) upper 95%zp-log2(p)
Grade0.451.570.080.290.601.341.835.73<0.00526.54
N_Stage0.361.440.100.180.551.191.733.83<0.00512.95
Race0.421.530.120.200.651.221.923.64<0.00511.84
Reginol_Node_Positive0.051.050.010.030.081.031.084.14<0.00514.79
Regional_Node_Examined-0.040.970.01-0.05-0.020.950.98-4.70<0.00518.59
T_Stage0.321.380.060.210.431.231.545.59<0.00525.38

Concordance0.70
Partial AIC5193.64
log-likelihood ratio test257.89 on 6 df
-log2(p) of ll-ratio test172.99
<AxesSubplot:xlabel='log(HR) (95% CI)'>

[the external chain picture transfer fails. The source station may have an anti-theft chain mechanism. It is recommended to save the picture and upload it directly (img-Kb7FI84k-1645703648882)(output_66_2.png)]

3.12 do you really need to care about the proportional risk assumption

The author believes that:

Specific details can be viewed Original link Ha

epilogue

Lifelines is a very convenient survival analysis tool for people familiar with python. But for those familiar with survival analysis, lifelines has many shortcomings. Because this library lacks tools to generate various evaluation indicators, for example, it does not provide tools to draw ROC curve, nomogram, decision curve and so on. The R language provides many packages to draw these diagrams.

In addition, I feel that lifelines does not deal with multi classification variables very well. In R language, multi classification variables are generally processed into factor type, and each sub variable will be compared with a "basic" variable to generate risk ratio. Lifelines feels that it will process multi classified variables into continuous value variables.

After all, most of survival analysis is to study medicine, and the figure is a ready-made tool to solve the problems in scientific research. Lifelines does not provide tools, and people who use them generally do not deliberately build wheels. However, I still hope that more survival analysis tools can be provided in future updates of lifelines.

reference material

1.A concise tutorial on survival analysis

2.Survival Analysis, Cox proportional hazards model and C-index

3.Survival analysis was more than median survival time

4.AUC and C-index are commonly used in clinical research

5.survival analysis

6.How to understand the risk function in survival analysis

7.COX regression of survival analysis

8.Teach you three tricks Cox regression proportional hazards (PH) hypothesis test

8. Tan Ling, Liu Zilin, Tang Linghan, Xiao Jiangwei Construction and validation of long-term survival prediction model for rectal cancer
- Nomogram prediction model [J] Chinese Journal of basic and clinical medicine, 2021,28 (08): 1031-1038

Topics: Python Machine Learning