Restricted Cubic Spline

Cubic Spline Function

Suppose a continuous variable x is divided into three intervals at cutpoints a and b, a model can be built between y and x by assuming piecewise linear in which there is a linear relationship for each of three intervals but with different slope.

A cubic spline curve is a piecewise cubic curve with continuous second derivative. Instead of a linear line, a cubic polynomial has a good ability to model a non-linear relationship by fitting a curve. Meanwhile, a piecewise cubic spline is smooth on the joint points (cutpoints) by forcing the first and second derivatives to be same on the joint points.

A piecewise cubic spline with two cutpoints a and b on the x variable has 5 regression coefficients in the model related to the variable x.

\[Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_3 + \beta_4X_4 + \beta_5X_5 + \cdots \beta_{\cdots}X_{\cdots}\]

  • \(X_1 = X\)
  • \(X_2 = X^2\)
  • \(X_3 = X^3\)
  • \(X_4 = (X-a)^3_+\), any value of X less than a is set as zero.
  • \(X_5 = (X-b)^3_+\), any value of X less than b is set as zero.
  • Notation
    • \(U_+ = 0\), if \(U <= 0\)
    • \(U_+ = U\), if \(U > 0\)

Restricted Cubic Spline

The restricted cubic spline restrains the functions to be linear before the first knot and after the last knot. It has only \(k - 1\) parameters must be estimated as opposed to \(k + 3\) parameters with unrestricted cubic spline where \(k\) is the number of knots.

\[f(X) = \beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_{[k-1]}X_{k-1}\] where \(X_1 = X\) and for \(j = 1,...,k-2\), \[X_{j+1} = (X-t_j)_+^3 - (X-t_{k-1})_+^3(t_k - t_j)/(t_k - t_{k-1}) + (X - t_k)_+^3(t_{k-1} - t_j)/(t_k - t_{k-1})\]

In case of four knots \(t_1, t_2, t_3, t_4\) which means that there are four cutpoints with five intervals on a line, then \(k = 4\), there should be \(k - 1 = 3\) variables \(X_1, X_2, X_3\) being generated and put into the model, and three corresponding parameters need to be estimated with the restricted cubic spline.

  • \(X_1 = X\)
  • For \(j = 1,...,k-2\), when \(k = 4\), \(j\) has values of \(1\) and \(2\).
    • When \(j = 1, k = 4\): \(X_2 = (X - t_1)_+^3 - (X - t_3)_+^3(t_4 - t_1)/(t_4 - t_3) + (X - t_4)_+^3(t_3 - t_1)/(t_4 - t_3)\)
    • When \(j = 2, k = 4\): \(X_3 = (X - t_2)_+^3 - (X - t_3)_+^3(t_4 - t_2)/(t_4 - t_3) + (X - t_4)_+^3(t_3 - t_2)/(t_4 - t_3)\)
  • Reference: Frank Harrel, Hmisc Package, “rscpline.eval” Function

gam Package

set.seed(123456)
dt <- data.table(x = y$x, y = y$V3)
ids <- sample(1:nrow(dt), size = 250, replace = TRUE)

dt <- dt[ids, ]
p0 <- 0.2
dt <- dt[, odds0 := p0 / (1 - p0)
         ][, log_odds := log(odds0) + log(y)
           ][, p := exp(log_odds) / (1 + exp(log_odds))]

vsample <- function(p){
    sample(c(1, 0), size = 1, replace = TRUE, prob = c(p, 1 - p))
}

vsample <- Vectorize(vsample)

dt <- dt[, outcome := vsample(p)]

library(mgcv)
library(sjPlot)
m <- gam(
    outcome ~ s(x, bs = "cr", k = 10)
  , data=dt
  , method="REML"
  , family = binomial
)


prt_coef.gam <- function(obj, exp = FALSE, digits = 2){
    obj_summary <- summary(obj)
    rtn <- data.table(
        coef_name = names(obj_summary$p.coeff)
      , coef_value = obj_summary$p.coeff
      , df = sum(obj$edf)
    )
    rtn <- merge(
        x = rtn
      , y = data.table(coef_name = names(obj_summary$se), se = obj_summary$se)
      , on = "coef_name"
      , all.x = TRUE
      , all.y = FALSE
    )
    rtn[, coef_value_lower := coef_value + qt(0.025, 999) * se
        ][, coef_value_upper := coef_value + qt(0.975, 999) * se
          ]
    if (exp == TRUE){
        rtn[, coef_value := exp(coef_value)
            ][, coef_value_lower := exp(coef_value_lower)
              ][, coef_value_upper := exp(coef_value_upper)]
    }
    rtn[, ci := paste0(
              format(round(coef_value, digits), nsmall = digits)
            , "("
            , format(round(coef_value_lower, digits), nsmall = digits)
            , ","
            , format(round(coef_value_upper, digits), nsmall = digits)
            , ")")
        ]
    ## print(rtn)

    if (obj$family$family == "multinom"){
        ## print(obj$family$family)
    }
    return(rtn)
}

get_coef.gam <- function(obj, exp=FALSE, digits=2){
    obj_summary <- summary(obj)
    rtn <- obj_summary$p.table
    rtn <- as.data.table(rtn)
    colnames(rtn) <- c("coef_value", "se", "t_value", "p_value")
    rtn$coef_name <- rownames(obj_summary$p.table)
    rtn <- rtn[, list(coef_name, coef_value, se, t_value, p_value)]
    rtn <- rtn[, coef_value_lower := coef_value + qnorm(0.025) * se
               ][, coef_value_upper := coef_value + qnorm(0.975) * se
                 ]
    if (exp == TRUE) {
        rtn <- rtn[, coef_value := exp(coef_value)
                   ][, coef_value_lower := exp(coef_value_lower)
                     ][, coef_value_upper := exp(coef_value_upper)]
    }
    rtn <- rtn[, ci := paste0(
                     format(round(coef_value, digits), nsmall = digits)
                   , "("
                   , format(round(coef_value_lower, digits), nsmall = digits)
                   , ","
                   , format(round(coef_value_upper, digits), nsmall = digits)
                   , ")")
               ]
}

t <- get_coef.gam(m)
t[, list(coef_name, ci, p_value)] %>%
    prt(caption="Fixed Effects", col.names=c("Variable", "Effect (95% CI)", "p-value"))
Fixed Effects
Variable Effect (95% CI) p-value
(Intercept) -3.29(-5.82,-0.77) 0.0105635
Wald tests of the smooth term (experience)
rn edf p-value
s(x) 2.618781 0

Plot Nonlinear Relationship

Computing Environment

R version 4.2.0 (2022-04-22) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.3 LTS

Matrix products: default BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0 LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale: [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
[4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
[7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C

attached base packages: [1] stats graphics grDevices utils datasets methods base

other attached packages: [1] sjPlot_2.8.10 Hmisc_4.7-0 Formula_1.2-4
[4] survival_3.2-13 lattice_0.20-45 Wu_0.0.0.9000
[7] flexdashboard_0.5.2 lme4_1.1-29 Matrix_1.4-0
[10] mgcv_1.8-38 nlme_3.1-152 png_0.1-7
[13] scales_1.2.0 nnet_7.3-16 labelled_2.9.1
[16] kableExtra_1.3.4 plotly_4.10.0 gridExtra_2.3
[19] ggplot2_3.3.6 DT_0.23 tableone_0.13.2
[22] magrittr_2.0.3 lubridate_1.8.0 dplyr_1.0.9
[25] plyr_1.8.7 data.table_1.14.2 rmdformats_1.0.4
[28] knitr_1.39

loaded via a namespace (and not attached): [1] TH.data_1.1-1 minqa_1.2.4 colorspace_2.0-3
[4] ellipsis_0.3.2 sjlabelled_1.2.0 estimability_1.4
[7] htmlTable_2.4.0 parameters_0.18.1 base64enc_0.1-3
[10] rstudioapi_0.13 farver_2.1.1 fansi_1.0.3
[13] mvtnorm_1.1-3 xml2_1.3.3 codetools_0.2-18
[16] splines_4.2.0 sjmisc_2.8.9 jsonlite_1.8.0
[19] nloptr_2.0.3 ggeffects_1.1.2 broom_0.8.0
[22] cluster_2.1.2 effectsize_0.7.0 compiler_4.2.0
[25] httr_1.4.3 sjstats_0.18.1 emmeans_1.7.5
[28] backports_1.4.1 assertthat_0.2.1 fastmap_1.1.0
[31] lazyeval_0.2.2 survey_4.1-1 cli_3.3.0
[34] htmltools_0.5.3 tools_4.2.0 coda_0.19-4
[37] gtable_0.3.0 glue_1.6.2 Rcpp_1.0.9
[40] jquerylib_0.1.4 vctrs_0.4.1 svglite_2.1.0
[43] insight_0.18.0 xfun_0.31 stringr_1.4.0
[46] rvest_1.0.2 lifecycle_1.0.1 klippy_0.0.0.9500
[49] MASS_7.3-54 zoo_1.8-10 hms_1.1.1
[52] sandwich_3.0-2 RColorBrewer_1.1-3 yaml_2.3.5
[55] sass_0.4.1 rpart_4.1-15 latticeExtra_0.6-29 [58] stringi_1.7.8 highr_0.9 bayestestR_0.12.1
[61] checkmate_2.1.0 boot_1.3-28 rlang_1.0.4
[64] pkgconfig_2.0.3 systemfonts_1.0.4 evaluate_0.15
[67] purrr_0.3.4 labeling_0.4.2 htmlwidgets_1.5.4
[70] tidyselect_1.1.2 bookdown_0.27 R6_2.5.1
[73] generics_0.1.3 multcomp_1.4-19 DBI_1.1.2
[76] pillar_1.8.0 haven_2.5.0 foreign_0.8-81
[79] withr_2.5.0 datawizard_0.4.1 tibble_3.1.8
[82] performance_0.9.1 modelr_0.1.8 utf8_1.2.2
[85] rmarkdown_2.14 jpeg_0.1-9 grid_4.2.0
[88] forcats_0.5.1 digest_0.6.29 webshot_0.5.3
[91] xtable_1.8-4 tidyr_1.2.0 munsell_0.5.0
[94] viridisLite_0.4.0 bslib_0.3.1 mitools_2.4