Abstract
Survival analysis examines and models the time it takes for events to occur, termed survival time. The Cox proportional-hazards regression model is the most common tool for studying the dependency of survival time on predictor variables. This appendix to Fox and Weisberg (2019) briefly describes the basis for the Cox regression model, and explains how to use the survival package in R to estimate Cox regressions.
1 Introduction
Survival analysis examines and models the time it takes for events to occur. The prototypical such event is death, from which the name “survival analysis” and much of its terminology derive, but the ambit of application of survival analysis is much broader. Essentially the same methods are employed in a variety of disciplines under various rubrics, for example, “event-history analysis” in sociology and “failure-time analysis” in engineering. In this appendix, therefore, terms such as survival are to be understood generically.
Survival analysis focuses on the distribution of survival times. Although there are well-known methods for estimating unconditional survival distributions, most interesting survival modeling examines the relationship between survival and one or more predictors, usually termed covariates in the survival-analysis literature. The subject of this appendix is the Cox proportional-hazards regression model, introduced in a seminal paper by Cox (1972), which is broadly applicable and the most widely used method of survival analysis. The survival package in R (Therneau, 1999; Therneau and Grambsch, 2000) fits Cox models, as we describe here, and most other commonly used survival methods.1
As is the case for the other on-line appendices to An R Companion to Applied Regression, we assume that you have read the R Companion and are therefore familiar with R.2 In addition, we assume familiarity with Cox regression. We nevertheless begin with a review of basic concepts, primarily to establish terminology and notation. The second section of the appendix takes up the Cox proportional-hazards model with time-independent covariates. Time-dependent covariates are introduced in the third section. A fourth and final section deals with diagnostics.
1 The survival package is one of the “recommended” packages that are included in the standard R distribution. The package must be loaded via the command library("survival").
2 Most R functions used but not described in this appendix are discussed in Fox and Weisberg (2019). All the R code used in this appendix can be downloaded from http://tinyurl.com/carbook or via the carWeb() function in the car package.
2 Basic Concepts and Notation
Let $T$ represent survival time. We regard $T$ as a random variable with cumulative distribution function $P(t) = \Pr(T \le t)$ and probability density function $p(t) = dP(t)/dt$.3 The more optimistic survival function $S(t)$ is the complement of the distribution function, $S(t) = \Pr(T > t) = 1 - P(t)$. A fourth representation of the distribution of survival times is the hazard function, which assesses the instantaneous risk of demise at time $t$, conditional on survival to that time:
$$h(t) = \lim_{\Delta t \to 0} \frac{\Pr[(t \le T < t + \Delta t) \mid T \ge t]}{\Delta t} = \frac{p(t)}{S(t)}$$
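As a quick numerical check of this definition, the following R sketch compares the ratio of density to survival function with a finite-difference version of the limit, using an exponential distribution. The rate of 0.5, the time points, and the step size are arbitrary illustrative choices, not values from the text.

```r
# Illustrative values (assumptions, not from the text)
rate <- 0.5
t <- c(0.5, 1, 2, 5)

dens <- dexp(t, rate = rate)                      # density p(t)
surv <- pexp(t, rate = rate, lower.tail = FALSE)  # survival S(t) = 1 - P(t)
hazard <- dens / surv                             # hazard as density over survival

# Finite-difference approximation to the limit definition:
# Pr(t <= T < t + dt | T >= t) / dt for a small dt
dt <- 1e-6
hazard_limit <- (pexp(t + dt, rate = rate) - pexp(t, rate = rate)) / (surv * dt)

print(hazard)
print(hazard_limit)
```

Both vectors should be essentially constant in t and close to the chosen rate, which is what a constant hazard means.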
Models for survival data usually employ the hazard function or the log hazard. For example, assuming a constant hazard, $h(t) = \nu$, implies an exponential distribution of survival times, with density function $p(t) = \nu e^{-\nu t}$. Other common hazard models include
$$\log h(t) = \nu + \rho t$$
which leads to the Gompertz distribution of survival times, and
$$\log h(t) = \nu + \rho \log(t)$$
which leads to the Weibull distribution of survival times. (See, for example, Cox and Oakes, 1984, Sec. 2.3, for these and other possibilities.) In both the Gompertz and Weibull distributions, the hazard can either increase or decrease with time; moreover, in both instances, setting $\rho = 0$ yields the exponential model, with the constant hazard then equal to $e^{\nu}$ because $\nu$ is here on the log-hazard scale.
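The two log-hazard models can be evaluated directly in R. The sketch below is only an illustration: the function names and the parameter values are hypothetical, and the constant hazard is written as exp(nu) because nu sits on the log-hazard scale in these two models.

```r
# Hazard functions implied by the log-hazard models in the text.
# Function names and parameter values are illustrative assumptions.
haz_constant <- function(t, nu) rep(exp(nu), length(t))
haz_gompertz <- function(t, nu, rho) exp(nu + rho * t)
haz_weibull  <- function(t, nu, rho) exp(nu + rho * log(t))

t <- c(0.5, 1, 2, 4)
nu <- -1

print(haz_gompertz(t, nu, rho = 0.3))   # positive rho: hazard increases with t
print(haz_weibull(t, nu, rho = -0.5))   # negative rho: hazard decreases with t

# With rho = 0, both models reduce to a constant hazard
same_gompertz <- all.equal(haz_gompertz(t, nu, rho = 0), haz_constant(t, nu))
same_weibull  <- all.equal(haz_weibull(t, nu, rho = 0), haz_constant(t, nu))
print(c(same_gompertz, same_weibull))
```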
A nearly universal feature of survival data is censoring, the most common form of which is right-censoring: here, the period of observation expires, or an individual is removed from the study, before the event occurs. For example, some individuals may still be alive at the end of a clinical trial, or may drop out of the study for various reasons other than death prior to its termination. A case is left-censored if its initial time at risk is unknown. Indeed, the same case may be both right- and left-censored, a circumstance termed interval-censoring. Censoring complicates the likelihood function, and hence the estimation, of survival models.
Moreover, conditional on the value of any covariates in a survival model and on an individual’s survival to a particular time, censoring must be independent of the future value of the hazard for the individual. If this condition is not met, then estimates of the survival distribution can be seriously biased. For example, if individuals tend to drop out of a clinical trial shortly before they die, and therefore their deaths go unobserved, survival time will be over-estimated. Censoring that meets this requirement is noninformative. A common instance of noninformative censoring occurs when a study terminates at a predetermined date.
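In the survival package, right-censored data are represented by pairing each observed time with an event indicator through the Surv() function. The sketch below uses made-up toy data (the variable names weeks and event are hypothetical) and then passes the result to survfit(), which by default computes the Kaplan-Meier estimate of the unconditional survival function.

```r
library(survival)  # recommended package shipped with the standard R distribution

# Hypothetical toy data: follow-up time in weeks and an event indicator
# (1 = event observed, 0 = right-censored)
weeks <- c(5, 8, 12, 12, 20)
event <- c(1, 0, 1, 0, 0)

y <- Surv(weeks, event)
print(y)  # censored times are printed with a trailing "+"

# Kaplan-Meier estimate of the survival function, with no covariates
km <- survfit(y ~ 1)
print(summary(km))
```

The estimate is meaningful only under the noninformative-censoring condition described above; the code itself cannot check that condition.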
3 The Cox Proportional-Hazards Model
Survival analysis typically examines the relationship of the survival distribution to covariates. Most commonly, this examination entails the specification of a linear-like model for the log hazard. For example, a parametric model based on the exponential distribution may be written as
$$\log h_i(t) = \alpha + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik}$$
that is, as a linear model for the log-hazard or as a multiplicative model for the hazard. Here, $i$ is a subscript for case, and the $x$s are the covariates. The constant $\alpha$ in this model represents a kind of baseline log-hazard, because $\log h_i(t) = \alpha$, or $h_i(t) = e^{\alpha}$, when all of the $x$s are zero. There are similar parametric regression models based on the other survival distributions described in the preceding section.4
3 If you’re unfamiliar with calculus, the essence of the matter here is that areas under the density function p(t) represent probabilities of death in a given time interval, while the distribution function P(t) represents the probability of death by time t.
The Cox model, in contrast, leaves the baseline log-hazard function $\alpha(t) = \log h_0(t)$ unspecified:
$$\log h_i(t) = \alpha(t) + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik}$$
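This model is semi-parametric because, while the baseline hazard can take any form, the covariates enter the model linearly. Consider, now, two cases $i$ and $i'$ that differ in their $x$-values, with the corresponding linear predictors
$$\eta_i = \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik}$$
and
$$\eta_{i'} = \beta_1 x_{i'1} + \beta_2 x_{i'2} + \cdots + \beta_k x_{i'k}$$
The hazard ratio for these two cases,
$$\frac{h_i(t)}{h_{i'}(t)} = \frac{h_0(t)\, e^{\eta_i}}{h_0(t)\, e^{\eta_{i'}}} = \frac{e^{\eta_i}}{e^{\eta_{i'}}}$$
is independent of time $t$. Consequently, the Cox model is a proportional-hazards model.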
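The cancellation of the baseline hazard can be seen numerically. In the R sketch below, the coefficients, the covariate values, and the baseline hazard function h0 are all hypothetical choices made for illustration; each linear predictor is the sum of coefficients times covariate values for the case in question.

```r
# Hypothetical coefficients and covariate values (assumptions for illustration)
beta <- c(0.4, -0.2)
x_i <- c(1, 1)       # covariates for case i
x_iprime <- c(0, 3)  # covariates for case i'

eta_i <- sum(beta * x_i)            # linear predictor for case i
eta_iprime <- sum(beta * x_iprime)  # linear predictor for case i'

# An arbitrary baseline hazard; any positive function of t would do
h0 <- function(t) 0.1 * t^1.5

t <- c(0.5, 1, 2, 10)
h_i <- h0(t) * exp(eta_i)
h_iprime <- h0(t) * exp(eta_iprime)

ratio <- h_i / h_iprime
print(ratio)                    # the same value at every t
print(exp(eta_i - eta_iprime))  # equals the ratio above
```

Replacing h0 with a different positive function changes both hazards but leaves the ratio unchanged, which is the proportional-hazards property.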
Remarkably, even though the baseline hazard is unspecified, the Cox model can still be estimated by the method of partial likelihood, developed by Cox (1972) in the same paper in which he introduced what came to be called the Cox model. Although the resulting estimates are not as efficient as maximum-likelihood estimates for a correctly specified parametric hazard regression model, not having to make arbitrary, and possibly incorrect, assumptions about the form of the baseline hazard is a compensating virtue of Cox’s specification. Having fit the model