Restricted Cubic Spline
Cubic Spline Function
Suppose a continuous variable x is divided into three intervals at cutpoints a and b. A piecewise linear model can then be built between y and x, with a linear relationship in each of the three intervals but a different slope in each.
A cubic spline is a piecewise cubic curve with continuous second derivatives. Compared with a straight line, a cubic polynomial is far better at modeling a non-linear relationship. A piecewise cubic spline is also smooth at the join points (cutpoints), because the first and second derivatives are forced to agree there.
A piecewise cubic spline with two cutpoints a and b on the variable x has five regression coefficients related to x.
\[Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_3 + \beta_4X_4 + \beta_5X_5\]
- \(X_1 = X\)
- \(X_2 = X^2\)
- \(X_3 = X^3\)
- \(X_4 = (X-a)^3_+\), which is zero for any \(X \le a\).
- \(X_5 = (X-b)^3_+\), which is zero for any \(X \le b\).
- Notation
- \(U_+ = 0\), if \(U \le 0\)
- \(U_+ = U\), if \(U > 0\)
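As a concrete illustration (not part of the original analysis), the five basis columns for two knots can be built directly in base R; the knot locations `a` and `b` below are arbitrary:

```r
## pos() implements the truncated-power notation U_+ defined above
pos <- function(u) ifelse(u > 0, u, 0)

x <- seq(0, 10, by = 0.1)
a <- 3; b <- 7          # arbitrary illustrative knot locations

## the five regressors of the unrestricted piecewise cubic spline
X <- cbind(
  X1 = x,
  X2 = x^2,
  X3 = x^3,
  X4 = pos(x - a)^3,    # zero for x <= a
  X5 = pos(x - b)^3     # zero for x <= b
)
```

A fit such as `lm(y ~ X)` would then estimate the five coefficients \(\beta_1, \ldots, \beta_5\).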
Restricted Cubic Spline
The restricted cubic spline constrains the function to be linear before the first knot and after the last knot. Only \(k - 1\) parameters must be estimated, as opposed to \(k + 3\) parameters with an unrestricted cubic spline, where \(k\) is the number of knots.
\[f(X) = \beta_0 + \beta_1X_1 + \beta_2X_2 + \cdots + \beta_{k-1}X_{k-1}\] where \(X_1 = X\) and, for \(j = 1,\ldots,k-2\), \[X_{j+1} = (X-t_j)_+^3 - (X-t_{k-1})_+^3(t_k - t_j)/(t_k - t_{k-1}) + (X - t_k)_+^3(t_{k-1} - t_j)/(t_k - t_{k-1})\]
With four knots \(t_1, t_2, t_3, t_4\) (four cutpoints, hence five intervals on the line), \(k = 4\), so \(k - 1 = 3\) variables \(X_1, X_2, X_3\) are generated and entered into the model, and three corresponding parameters are estimated with the restricted cubic spline.
- \(X_1 = X\)
- For \(j = 1,...,k-2\), when \(k = 4\), \(j\) has values of \(1\) and \(2\).
- When \(j = 1, k = 4\): \(X_2 = (X - t_1)_+^3 - (X - t_3)_+^3(t_4 - t_1)/(t_4 - t_3) + (X - t_4)_+^3(t_3 - t_1)/(t_4 - t_3)\)
- When \(j = 2, k = 4\): \(X_3 = (X - t_2)_+^3 - (X - t_3)_+^3(t_4 - t_2)/(t_4 - t_3) + (X - t_4)_+^3(t_3 - t_2)/(t_4 - t_3)\)
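The formula above can be wrapped in a small base-R helper (a sketch for illustration; `Hmisc::rcspline.eval`, used below, is the canonical implementation — `cube_pos`, `rcs_term`, and the knot vector `t4` are names invented here):

```r
## (u)_+^3, the truncated cube
cube_pos <- function(u) ifelse(u > 0, u^3, 0)

## j-th nonlinear restricted-cubic-spline term X_{j+1} for knot vector t
rcs_term <- function(x, t, j) {
  k <- length(t)
  cube_pos(x - t[j]) -
    cube_pos(x - t[k - 1]) * (t[k] - t[j]) / (t[k] - t[k - 1]) +
    cube_pos(x - t[k]) * (t[k - 1] - t[j]) / (t[k] - t[k - 1])
}

t4 <- c(5, 35, 65, 95)      # four illustrative knots
x  <- seq(0, 100, by = 1)
X2 <- rcs_term(x, t4, 1)
X3 <- rcs_term(x, t4, 2)
```

Each term is zero below its own knot and, by construction, linear beyond the last knot.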
- Reference: Frank Harrell, Hmisc Package, “rcspline.eval” Function
library(Hmisc)
x <- seq(0, 100, by = 1)
## default knot locations for nk = 4: the 5th, 35th, 65th and 95th percentiles
t <- quantile(x, c(0.05, 0.35, 0.65, 0.95))
## returns x itself (inclx = TRUE) plus the nk - 2 = 2 nonlinear terms
y <- rcspline.eval(x, nk = 4, inclx = TRUE)
y <- as.data.frame(y)
## str(y)
## dim(y)
## reconstruct the two nonlinear terms directly from the formula above
y$y2 <- (x >= t[1])*(x - t[1])^3 - (x >= t[3])*(x - t[3])^3*(t[4]-t[1])/(t[4]-t[3]) + (x >= t[4])*(x-t[4])^3*(t[3]-t[1])/(t[4]-t[3])
y$y3 <- (x >= t[2])*(x - t[2])^3 - (x >= t[3])*(x - t[3])^3*(t[4]-t[2])/(t[4]-t[3]) + (x >= t[4])*(x-t[4])^3*(t[3]-t[2])/(t[4]-t[3])
## rcspline.eval normalizes the nonlinear terms by default (norm = 2),
## dividing them by the square of the distance between the outer knots;
## multiplying by tau undoes that normalization
tau <- (t[4] - t[1])^2
y$V2b <- y$V2*tau
y$V3b <- y$V3*tau
# View(y)
## both sums should be (numerically) zero
## sum(y$V3b - y$y3)
## sum(y$V2b - y$y2)
plot(y$x, y$V2)
mgcv Package
library(data.table)
set.seed(123456)
dt <- data.table(x = y$x, y = y$V3)
## resample 250 rows with replacement
ids <- sample(1:nrow(dt), size = 250, replace = TRUE)
dt <- dt[ids, ]
## baseline probability and log-odds; log(y) is -Inf where y = 0
## (before the second knot), which maps to p = 0 below
p0 <- 0.2
dt <- dt[, odds0 := p0 / (1 - p0)
][, log_odds := log(odds0) + log(y)
][, p := exp(log_odds) / (1 + exp(log_odds))]
## draw one Bernoulli outcome per probability
vsample <- function(p){
sample(c(1, 0), size = 1, replace = TRUE, prob = c(p, 1 - p))
}
vsample <- Vectorize(vsample)
dt <- dt[, outcome := vsample(p)]
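The element-wise sampler above works, but base R's `rbinom` is already vectorized over its probability argument, so outcomes of this kind can be drawn in one call (a standalone sketch with toy probabilities, not the simulation above):

```r
## one Bernoulli draw per element of p, without Vectorize()
set.seed(123456)
p <- c(0.1, 0.5, 0.9)                      # toy probabilities
outcome <- rbinom(length(p), size = 1, prob = p)
```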
library(mgcv)
library(sjPlot)
## cubic regression spline basis ("cr") with up to k = 10 basis functions;
## REML chooses the degree of smoothing
m <- gam(
outcome ~ s(x, bs = "cr", k = 10)
, data = dt
, method = "REML"
, family = binomial
)
prt_coef.gam <- function(obj, exp = FALSE, digits = 2){
## tabulate the parametric coefficients of a gam fit with 95% CIs
obj_summary <- summary(obj)
rtn <- data.table(
coef_name = names(obj_summary$p.coeff)
, coef_value = obj_summary$p.coeff
, df = sum(obj$edf)
)
rtn <- merge(
x = rtn
, y = data.table(coef_name = names(obj_summary$se), se = obj_summary$se)
, by = "coef_name"
, all.x = TRUE
, all.y = FALSE
)
## with 999 df, qt is essentially qnorm
rtn[, coef_value_lower := coef_value + qt(0.025, 999) * se
][, coef_value_upper := coef_value + qt(0.975, 999) * se
]
if (exp == TRUE){
rtn[, coef_value := exp(coef_value)
][, coef_value_lower := exp(coef_value_lower)
][, coef_value_upper := exp(coef_value_upper)]
}
rtn[, ci := paste0(
format(round(coef_value, digits), nsmall = digits)
, "("
, format(round(coef_value_lower, digits), nsmall = digits)
, ","
, format(round(coef_value_upper, digits), nsmall = digits)
, ")")
]
## print(rtn)
if (obj$family$family == "multinom"){
## print(obj$family$family)
}
return(rtn)
}
get_coef.gam <- function(obj, exp=FALSE, digits=2){
obj_summary <- summary(obj)
rtn <- obj_summary$p.table
rtn <- as.data.table(rtn)
## for a binomial gam, the third column of p.table is a z statistic
colnames(rtn) <- c("coef_value", "se", "t_value", "p_value")
rtn$coef_name <- rownames(obj_summary$p.table)
rtn <- rtn[, list(coef_name, coef_value, se, t_value, p_value)]
rtn <- rtn[, coef_value_lower := coef_value + qnorm(0.025) * se
][, coef_value_upper := coef_value + qnorm(0.975) * se
]
if (exp == TRUE) {
rtn <- rtn[, coef_value := exp(coef_value)
][, coef_value_lower := exp(coef_value_lower)
][, coef_value_upper := exp(coef_value_upper)]
}
rtn <- rtn[, ci := paste0(
format(round(coef_value, digits), nsmall = digits)
, "("
, format(round(coef_value_lower, digits), nsmall = digits)
, ","
, format(round(coef_value_upper, digits), nsmall = digits)
, ")")
]
return(rtn)
}
tb <- get_coef.gam(m)
tb[, list(coef_name, ci, p_value)] %>%
prt(caption = "Fixed Effects", col.names = c("Variable", "Effect (95% CI)", "p-value"))
Variable | Effect (95% CI) | p-value |
---|---|---|
(Intercept) | -3.29(-5.82,-0.77) | 0.0105635 |
t2 <- anova(m)$s.table
t2 <- as.data.table(t2, keep.rownames = T)
t2[-2, -(3:4)] %>% prt(caption = "Wald tests of the smooth term s(x)")
rn | edf | p-value |
---|---|---|
s(x) | 2.618781 | 0 |
get_pred <- function(obj){
## predict at the observed x values on the link (log-odds) scale
newdata <- data.table(
x = dt$x
)
prd <- predict(obj, newdata, type = "link", se.fit = TRUE)
newdata$fit <- prd$fit
newdata$se <- prd$se.fit
newdata <- newdata[
, lower := fit - qnorm(0.975) * se
][, upper := fit + qnorm(0.975) * se]
return(newdata)
}
prd_m1 <- get_pred(m)
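`get_pred` returns the fit and CI on the link (log-odds) scale. `plogis`, the inverse logit, maps link-scale values to probabilities, and because it is monotone it can be applied directly to the interval endpoints. A standalone sketch with illustrative numbers (not values from the fitted model):

```r
## plogis(x) = exp(x) / (1 + exp(x)), the inverse logit
fit   <- c(-3.29, 0, 2)                    # illustrative link-scale fits
se    <- 0.5
lower <- fit - qnorm(0.975) * se
upper <- fit + qnorm(0.975) * se

p_fit   <- plogis(fit)
p_lower <- plogis(lower)                   # monotone: endpoints map directly
p_upper <- plogis(upper)
```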
## breaks and labels coincide here
x_breaks <- seq(0, 100, 25)
x_labels <- x_breaks
## rug marks for the observed x values along the bottom of the plot
dtm <- m$model
dtm$ystart <- -30
dtm$yend <- -28
xj <- jitter(dtm$x, 0.5)
bd <- ggplot(data = prd_m1
, aes(x = x
, y = fit)) +
geom_line(color = "#666666") +
geom_ribbon(aes(ymin = lower, ymax = upper, linetype = NA)
, color = "#666666"
, alpha = 0.2
, inherit.aes = TRUE
, show.legend = FALSE
) +
geom_segment(data=dtm
, x=xj
, xend=xj
, y=dtm$ystart
, yend=dtm$yend
, alpha=0.3
, color = "#666666"
) +
labs(x = "x"
, y = "Linear Predictor") +
scale_x_continuous(breaks = x_breaks
, labels = x_labels
, limits = c(0, 100)
, expand = c(0, 0)) +
scale_y_continuous(breaks = seq(-30, 5, by=5)
, limits = c(-30, 5)
, expand = c(0, 0)) +
theme(
## panel.grid.major = element_blank()
## , panel.grid.minor = element_blank()
panel.background = element_blank()
## , axis.ticks.x = element_blank()
)
bd
Plot Nonlinear Relationship
Computing Environment
R version 4.2.0 (2022-04-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.3 LTS

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
locale: [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
[4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
[7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
attached base packages: [1] stats graphics grDevices utils datasets methods base
other attached packages: [1] sjPlot_2.8.10 Hmisc_4.7-0 Formula_1.2-4
[4] survival_3.2-13 lattice_0.20-45 Wu_0.0.0.9000
[7] flexdashboard_0.5.2 lme4_1.1-29 Matrix_1.4-0
[10] mgcv_1.8-38 nlme_3.1-152 png_0.1-7
[13] scales_1.2.0 nnet_7.3-16 labelled_2.9.1
[16] kableExtra_1.3.4 plotly_4.10.0 gridExtra_2.3
[19] ggplot2_3.3.6 DT_0.23 tableone_0.13.2
[22] magrittr_2.0.3 lubridate_1.8.0 dplyr_1.0.9
[25] plyr_1.8.7 data.table_1.14.2 rmdformats_1.0.4
[28] knitr_1.39
loaded via a namespace (and not attached): [1] TH.data_1.1-1 minqa_1.2.4 colorspace_2.0-3
[4] ellipsis_0.3.2 sjlabelled_1.2.0 estimability_1.4
[7] htmlTable_2.4.0 parameters_0.18.1 base64enc_0.1-3
[10] rstudioapi_0.13 farver_2.1.1 fansi_1.0.3
[13] mvtnorm_1.1-3 xml2_1.3.3 codetools_0.2-18
[16] splines_4.2.0 sjmisc_2.8.9 jsonlite_1.8.0
[19] nloptr_2.0.3 ggeffects_1.1.2 broom_0.8.0
[22] cluster_2.1.2 effectsize_0.7.0 compiler_4.2.0
[25] httr_1.4.3 sjstats_0.18.1 emmeans_1.7.5
[28] backports_1.4.1 assertthat_0.2.1 fastmap_1.1.0
[31] lazyeval_0.2.2 survey_4.1-1 cli_3.3.0
[34] htmltools_0.5.3 tools_4.2.0 coda_0.19-4
[37] gtable_0.3.0 glue_1.6.2 Rcpp_1.0.9
[40] jquerylib_0.1.4 vctrs_0.4.1 svglite_2.1.0
[43] insight_0.18.0 xfun_0.31 stringr_1.4.0
[46] rvest_1.0.2 lifecycle_1.0.1 klippy_0.0.0.9500
[49] MASS_7.3-54 zoo_1.8-10 hms_1.1.1
[52] sandwich_3.0-2 RColorBrewer_1.1-3 yaml_2.3.5
[55] sass_0.4.1 rpart_4.1-15 latticeExtra_0.6-29
[58] stringi_1.7.8 highr_0.9 bayestestR_0.12.1
[61] checkmate_2.1.0 boot_1.3-28 rlang_1.0.4
[64] pkgconfig_2.0.3 systemfonts_1.0.4 evaluate_0.15
[67] purrr_0.3.4 labeling_0.4.2 htmlwidgets_1.5.4
[70] tidyselect_1.1.2 bookdown_0.27 R6_2.5.1
[73] generics_0.1.3 multcomp_1.4-19 DBI_1.1.2
[76] pillar_1.8.0 haven_2.5.0 foreign_0.8-81
[79] withr_2.5.0 datawizard_0.4.1 tibble_3.1.8
[82] performance_0.9.1 modelr_0.1.8 utf8_1.2.2
[85] rmarkdown_2.14 jpeg_0.1-9 grid_4.2.0
[88] forcats_0.5.1 digest_0.6.29 webshot_0.5.3
[91] xtable_1.8-4 tidyr_1.2.0 munsell_0.5.0
[94] viridisLite_0.4.0 bslib_0.3.1 mitools_2.4