This vignette explains how to use the following functions:
* calc_futime() to calculate follow-up time from the index event until the next event, death or end of follow-up date
* pat_status() to determine patient status at end of follow-up
* renumber_time_id() to calculate a consecutive index of events per case ID
* reshape_long() to transpose a dataset in wide format to long format
* reshape_wide() to transpose a dataset in long format to wide format (the wide format is required for many package functions)
* sir_byfutime() to calculate standardized incidence ratios (SIRs) with custom grouping variables, stratified by follow-up time
* summarize_sir_results() to summarize detailed SIR results produced by sir_byfutime()
* vital_status() to determine vital status, i.e. whether the patient is alive or dead at end of follow-up
For some functions there are multiple variants of the same function using different frameworks, e.g. renumber_time_id() and its data.table variant renumber_time_id_dt(). They give the same results but differ in execution time and memory use.
It is recommended to run the following steps in the given order to obtain accurate follow-up time calculations:
1. Filter all cases in the long version of the dataset that are relevant for your analysis. Make sure that:
   * per case_id, the index event (e.g. First Cancer, FC) is still included and is the remaining row in the dataset with the smallest time_id (TUMID3 variable for ZfKD data, and SEQ_NUM for SEER data)
   * some case_ids might or might not get a countable incident event (e.g. Second Primary Cancer, SPC). This event should be the second entry per case_id (second smallest time_id) if it is to be counted
   * count_var indicates whether the countable incident event (SPC) has occurred or not. It is coded 0 for non-occurrence (or an event that is not counted) and 1 for a counted incident event.
2. Renumber filtered long dataset: In the filtered long dataset, run the helper function msSPChelpR::renumber_time_id_dt() (or the non-data.table variant msSPChelpR::renumber_time_id()), which renumbers all events per case_id and (if step 1 is fulfilled) assigns time_var_new = 1 to each index event and time_var_new = 2 to each second (possibly countable incident) event. Any SIR-related function will only count the second event if, in addition to time_var_new = 2, count_var = 1 is also true for this row.
3. Reshape dataset: Run msSPChelpR::reshape_wide_dt() or the non-data.table variant msSPChelpR::reshape_wide(), so that the dataset is transposed to wide format (1 row per case_id, creating variables such as count_var.2).
4. Set flag for Second Primary Cancer diagnosis: After filtering and reshaping it is essential to set p_spc again. This variable will be used by later steps of the analysis.
5. Determine patient status at a defined end of follow-up by using the msSPChelpR::pat_status() function. This date for end of follow-up must:
   * be in “YYYY-MM-DD” format; it is always defined via the fu_end = parameter
   * not be later than the end of data collection. E.g. if the last incident events for the dataset you are using were collected at the end of 2014, your fu_end must be fu_end = "2014-12-31" or earlier.
6. Based on the newly calculated patient status, you might want to exclude cases for which patient status cannot be determined.
7. Calculate follow-up times using the msSPChelpR::calc_futime() function and the same fu_end as for step 5. By default, all functions of the msSPChelpR package require follow-up times as numeric years.
In order to calculate SIRs using the package functions, the following data structure is needed:
* Wide format data wide_df
with one row per patient that has encountered the index event (i.e. diagnosed with a first primary cancer FC)
wide_df needs to contain the following variables (columns) per patient (row):
* region_var - variable in df that contains information on the region where the case was incident.
* agegroup_var - variable in df that contains information on age group.
* sex_var - variable in df that contains information on biological sex.
* year_var - variable in df that contains information on the year or year period when the case was incident.
* site_var - variable in df that contains information on the diagnosis of the case (count event). Cases are usually the second cancers. Diagnoses can use any coding system (e.g. ICD), but the coding system must be consistent between the dataset and the reference data.
* futime_var - variable in df that contains follow-up time per person (in years) between the date of first cancer and whichever of the following comes first: death, date of event (case), or end of FU date. In case you have not calculated the FU time yet, you can use the workflow described in the previous chapter.
If your data has the required structure, you can calculate and summarize SIRs with the following two steps:
1. Calculate SIRs using the msSPChelpR::sir_byfutime() function. For this calculation, a reference dataset that defines the population standard rates is usually required. refrates_df must use the same category coding of age, sex, region, year and cancer_site as agegroup_var, sex_var, region_var, year_var and site_var.
2. Summarize the results using the msSPChelpR::summarize_sir_results() function on the stratified SIR results produced by the previous step.
In the next version of this vignette, the theoretical considerations of how SIRs are calculated will be explained in this chapter.
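As a rough orientation until then: an SIR is the number of observed cases divided by the number of cases expected from the reference rates. The base R sketch below shows this ratio together with a confidence interval. The helper sir_with_ci() is hypothetical and not part of msSPChelpR, and the exact Poisson interval is only one common choice; that the package derives sir_lci and sir_uci this way is an assumption.

```r
# Hypothetical helper (not part of msSPChelpR): SIR = observed / expected,
# with an exact Poisson confidence interval for the observed count.
sir_with_ci <- function(observed, expected, alpha = 0.05) {
  stopifnot(length(observed) == 1, observed >= 0, expected > 0)
  sir <- observed / expected
  # Exact Poisson limits for the observed count via chi-square quantiles
  lower <- if (observed == 0) 0 else stats::qchisq(alpha / 2, df = 2 * observed) / 2
  upper <- stats::qchisq(1 - alpha / 2, df = 2 * (observed + 1)) / 2
  c(sir = sir, sir_lci = lower / expected, sir_uci = upper / expected)
}

# 20 observed second cancers where 10 were expected from the reference rates
res <- sir_with_ci(observed = 20, expected = 10)
round(res, 2)
```

An SIR above 1 whose lower limit also lies above 1 indicates more observed cases than expected at the chosen alpha.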
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(magrittr)
library(msSPChelpR)
#Load synthetic dataset of patients with cancer to demonstrate package functions
data("us_second_cancer")
#This dataset is in long format, so each tumor is a separate row in the data
us_second_cancer
#> # A tibble: 113,999 x 15
#> fake_id SEQ_NUM registry sex race datebirth t_datediag t_site_icd t_dco
#> <chr> <int> <chr> <chr> <chr> <date> <date> <chr> <chr>
#> 1 100004 1 SEER Reg ~ Male White 1926-01-01 1992-07-15 C50 hist~
#> 2 100004 2 SEER Reg ~ Male White 1926-01-01 2004-01-15 C54 hist~
#> 3 100004 3 SEER Reg ~ Male White 1926-01-01 2006-06-15 C34 hist~
#> 4 100004 4 SEER Reg ~ Male White 1926-01-01 2018-06-15 C14 DCO ~
#> 5 100034 1 SEER Reg ~ Male White 1979-01-01 2000-06-15 C50 hist~
#> 6 100037 1 SEER Reg ~ Fema~ White 1938-01-01 1996-01-15 C54 hist~
#> 7 100038 1 SEER Reg ~ Male White 1989-01-01 1991-04-15 C50 hist~
#> 8 100038 2 SEER Reg ~ Male White 1989-01-01 2000-03-15 C80 hist~
#> 9 100039 1 SEER Reg ~ Fema~ White 1946-01-01 2003-08-15 C50 hist~
#> 10 100039 2 SEER Reg ~ Fema~ White 1946-01-01 2011-04-15 C34 hist~
#> # ... with 113,989 more rows, and 6 more variables: fc_age <int>,
#> # datedeath <date>, p_alive <chr>, p_dodmin <date>, fc_agegroup <chr>,
#> # t_yeardiag <chr>
#filter for lung cancer
ids <- us_second_cancer %>%
  #detect ids with any lung cancer
  filter(t_site_icd == "C34") %>%
  select(fake_id) %>%
  as.vector() %>%
  unname() %>%
  unlist()

filtered_usdata <- us_second_cancer %>%
  #filter according to above detected ids with any lung cancer diagnosis
  filter(fake_id %in% ids) %>%
  arrange(fake_id)

filtered_usdata
#> # A tibble: 62,661 x 15
#> fake_id SEQ_NUM registry sex race datebirth t_datediag t_site_icd t_dco
#> <chr> <int> <chr> <chr> <chr> <date> <date> <chr> <chr>
#> 1 100004 1 SEER Reg ~ Male White 1926-01-01 1992-07-15 C50 hist~
#> 2 100004 2 SEER Reg ~ Male White 1926-01-01 2004-01-15 C54 hist~
#> 3 100004 3 SEER Reg ~ Male White 1926-01-01 2006-06-15 C34 hist~
#> 4 100004 4 SEER Reg ~ Male White 1926-01-01 2018-06-15 C14 DCO ~
#> 5 100039 1 SEER Reg ~ Fema~ White 1946-01-01 2003-08-15 C50 hist~
#> 6 100039 2 SEER Reg ~ Fema~ White 1946-01-01 2011-04-15 C34 hist~
#> 7 100039 3 SEER Reg ~ Fema~ White 1946-01-01 2018-01-15 C80 hist~
#> 8 100073 1 SEER Reg ~ Male White 1960-01-01 1993-11-15 C44 hist~
#> 9 100073 2 SEER Reg ~ Male White 1960-01-01 2003-12-15 C34 hist~
#> 10 100143 1 SEER Reg ~ Male White 1944-01-01 1992-03-15 C50 hist~
#> # ... with 62,651 more rows, and 6 more variables: fc_age <int>,
#> # datedeath <date>, p_alive <chr>, p_dodmin <date>, fc_agegroup <chr>,
#> # t_yeardiag <chr>
#renumber time_id per case_id
renumbered_usdata <- filtered_usdata %>%
  renumber_time_id(new_time_id_var = "t_tumid",
                   dattype = "seer",
                   case_id_var = "fake_id")

renumbered_usdata %>%
  select(fake_id, sex, t_site_icd, t_datediag, t_tumid)
#> # A tibble: 62,661 x 5
#> fake_id sex t_site_icd t_datediag t_tumid
#> <chr> <chr> <chr> <date> <int>
#> 1 100004 Male C50 1992-07-15 1
#> 2 100004 Male C54 2004-01-15 2
#> 3 100004 Male C34 2006-06-15 3
#> 4 100004 Male C14 2018-06-15 4
#> 5 100039 Female C50 2003-08-15 1
#> 6 100039 Female C34 2011-04-15 2
#> 7 100039 Female C80 2018-01-15 3
#> 8 100073 Male C44 1993-11-15 1
#> 9 100073 Male C34 2003-12-15 2
#> 10 100143 Male C50 1992-03-15 1
#> # ... with 62,651 more rows
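What the renumbering achieves can be illustrated on a toy dataset in base R. This is only a sketch of the idea, assuming that events are ordered by their original sequence number within each case. It is not the implementation used by renumber_time_id(), and toy_long is made-up data.

```r
# Made-up long data: after filtering, the original SEQ_NUM has gaps
toy_long <- data.frame(
  fake_id    = c("A", "A", "A", "B", "B"),
  SEQ_NUM    = c(1L, 3L, 4L, 2L, 5L),
  t_site_icd = c("C50", "C34", "C18", "C34", "C61"),
  stringsAsFactors = FALSE
)

# Sort by case and original sequence number, then number rows within each case
toy_long <- toy_long[order(toy_long$fake_id, toy_long$SEQ_NUM), ]
toy_long$t_tumid <- ave(toy_long$SEQ_NUM, toy_long$fake_id, FUN = seq_along)

toy_long[, c("fake_id", "SEQ_NUM", "t_tumid")]
```

After this step the first remaining event of every case carries the index 1 and the next one the index 2, regardless of gaps left by filtering.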
usdata_wide <- renumbered_usdata %>%
  reshape_wide_tidyr(case_id_var = "fake_id", time_id_var = "t_tumid", timevar_max = 10)

#now the data is in the wide format as required by many package functions.
#This means, each case is a row and several tumors per case ID are
#added as new columns to the data using the time_id as column name suffix.
usdata_wide
#> # A tibble: 31,997 x 127
#> fake_id SEQ_NUM.1 registry.1 sex.1 race.1 datebirth.1 t_datediag.1
#> <chr> <int> <chr> <chr> <chr> <date> <date>
#> 1 100004 1 SEER Reg 20 - Detroi~ Male White 1926-01-01 1992-07-15
#> 2 100039 1 SEER Reg 02 - Connec~ Fema~ White 1946-01-01 2003-08-15
#> 3 100073 1 SEER Reg 01 - San Fr~ Male White 1960-01-01 1993-11-15
#> 4 100143 1 SEER Reg 02 - Connec~ Male White 1944-01-01 1992-03-15
#> 5 100182 1 SEER Reg 02 - Connec~ Male Other 1927-01-01 1991-09-15
#> 6 100197 1 SEER Reg 02 - Connec~ Fema~ White 1945-01-01 2012-06-15
#> 7 100208 1 SEER Reg 02 - Connec~ Male White 1970-01-01 2019-11-15
#> 8 100230 1 SEER Reg 01 - San Fr~ Male White 1947-01-01 1992-11-15
#> 9 100234 1 SEER Reg 01 - San Fr~ Male White 1988-01-01 2010-02-15
#> 10 100266 1 SEER Reg 01 - San Fr~ Fema~ White 1956-01-01 2010-07-15
#> # ... with 31,987 more rows, and 120 more variables: t_site_icd.1 <chr>,
#> # t_dco.1 <chr>, fc_age.1 <int>, datedeath.1 <date>, p_alive.1 <chr>,
#> # p_dodmin.1 <date>, fc_agegroup.1 <chr>, t_yeardiag.1 <chr>,
#> # SEQ_NUM.2 <int>, registry.2 <chr>, sex.2 <chr>, race.2 <chr>,
#> # datebirth.2 <date>, t_datediag.2 <date>, t_site_icd.2 <chr>, t_dco.2 <chr>,
#> # fc_age.2 <int>, datedeath.2 <date>, p_alive.2 <chr>, p_dodmin.2 <date>,
#> # fc_agegroup.2 <chr>, t_yeardiag.2 <chr>, SEQ_NUM.3 <int>, registry.3 <chr>,
#> # sex.3 <chr>, race.3 <chr>, datebirth.3 <date>, t_datediag.3 <date>,
#> # t_site_icd.3 <chr>, t_dco.3 <chr>, fc_age.3 <int>, datedeath.3 <date>,
#> # p_alive.3 <chr>, p_dodmin.3 <date>, fc_agegroup.3 <chr>,
#> # t_yeardiag.3 <chr>, SEQ_NUM.4 <int>, registry.4 <chr>, sex.4 <chr>,
#> # race.4 <chr>, datebirth.4 <date>, t_datediag.4 <date>, t_site_icd.4 <chr>,
#> # t_dco.4 <chr>, fc_age.4 <int>, datedeath.4 <date>, p_alive.4 <chr>,
#> # p_dodmin.4 <date>, fc_agegroup.4 <chr>, t_yeardiag.4 <chr>,
#> # SEQ_NUM.5 <int>, registry.5 <chr>, sex.5 <chr>, race.5 <chr>,
#> # datebirth.5 <date>, t_datediag.5 <date>, t_site_icd.5 <chr>, t_dco.5 <chr>,
#> # fc_age.5 <int>, datedeath.5 <date>, p_alive.5 <chr>, p_dodmin.5 <date>,
#> # fc_agegroup.5 <chr>, t_yeardiag.5 <chr>, SEQ_NUM.6 <int>, registry.6 <chr>,
#> # sex.6 <chr>, race.6 <chr>, datebirth.6 <date>, t_datediag.6 <date>,
#> # t_site_icd.6 <chr>, t_dco.6 <chr>, fc_age.6 <int>, datedeath.6 <date>,
#> # p_alive.6 <chr>, p_dodmin.6 <date>, fc_agegroup.6 <chr>,
#> # t_yeardiag.6 <chr>, SEQ_NUM.7 <int>, registry.7 <chr>, sex.7 <chr>,
#> # race.7 <chr>, datebirth.7 <date>, t_datediag.7 <date>, t_site_icd.7 <chr>,
#> # t_dco.7 <chr>, fc_age.7 <int>, datedeath.7 <date>, p_alive.7 <chr>,
#> # p_dodmin.7 <date>, fc_agegroup.7 <chr>, t_yeardiag.7 <chr>,
#> # SEQ_NUM.8 <int>, registry.8 <chr>, sex.8 <chr>, race.8 <chr>,
#> # datebirth.8 <date>, t_datediag.8 <date>, t_site_icd.8 <chr>, t_dco.8 <chr>,
#> # ...
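The long-to-wide transposition itself can be reproduced on toy data with base R's reshape(), which produces the same kind of column suffix (.1, .2, ...). This is a sketch for illustration only; toy_long is made-up data, and the package functions may handle details such as timevar_max differently.

```r
# Made-up long data: case A has two tumors, case B has one
toy_long <- data.frame(
  fake_id    = c("A", "A", "B"),
  t_tumid    = c(1L, 2L, 1L),
  t_site_icd = c("C50", "C34", "C34"),
  stringsAsFactors = FALSE
)

# One row per case; each tumor becomes a set of columns suffixed by its time_id
toy_wide <- reshape(toy_long, idvar = "fake_id", timevar = "t_tumid",
                    direction = "wide", sep = ".")
toy_wide
```

Case B has no second tumor, so its t_site_icd.2 is NA. This is the pattern that the p_spc flag is derived from.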
#set the p_spc flag
usdata_wide <- usdata_wide %>%
  dplyr::mutate(p_spc = dplyr::case_when(is.na(t_site_icd.2) ~ "No SPC",
                                         !is.na(t_site_icd.2) ~ "SPC developed",
                                         TRUE ~ NA_character_)) %>%
  #create the same information as numeric variable count_spc
  dplyr::mutate(count_spc = dplyr::case_when(is.na(t_site_icd.2) ~ 1,
                                             TRUE ~ 0))

usdata_wide %>%
  dplyr::select(fake_id, sex.1, p_spc, count_spc, t_site_icd.1,
                t_datediag.1, t_site_icd.2, t_datediag.2)
#> # A tibble: 31,997 x 8
#> fake_id sex.1 p_spc count_spc t_site_icd.1 t_datediag.1 t_site_icd.2
#> <chr> <chr> <chr> <dbl> <chr> <date> <chr>
#> 1 100004 Male SPC developed 0 C50 1992-07-15 C54
#> 2 100039 Female SPC developed 0 C50 2003-08-15 C34
#> 3 100073 Male SPC developed 0 C44 1993-11-15 C34
#> 4 100143 Male SPC developed 0 C50 1992-03-15 C34
#> 5 100182 Male SPC developed 0 C18 1991-09-15 C34
#> 6 100197 Female SPC developed 0 C34 2012-06-15 C50
#> 7 100208 Male No SPC 1 C34 2019-11-15 <NA>
#> 8 100230 Male SPC developed 0 C44 1992-11-15 C34
#> 9 100234 Male No SPC 1 C34 2010-02-15 <NA>
#> 10 100266 Female No SPC 1 C34 2010-07-15 <NA>
#> # ... with 31,987 more rows, and 1 more variable: t_datediag.2 <date>
usdata_wide <- usdata_wide %>%
  pat_status(., fu_end = "2017-12-31", dattype = "seer",
status_var = "p_status", life_var = "p_alive.1",
spc_var = "p_spc", birthdat_var = "datebirth.1",
lifedat_var = "datedeath.1", fcdat_var = "t_datediag.1",
spcdat_var = "t_datediag.2", life_stat_alive = "Alive",
life_stat_dead = "Dead", spc_stat_yes = "SPC developed",
spc_stat_no = "No SPC", lifedat_fu_end = "2019-12-31",
use_lifedatmin = FALSE, check = TRUE,
as_labelled_factor = TRUE)
#> # A tibble: 10 x 3
#> p_alive.1 p_status n
#> <chr> <fct> <int>
#> 1 Alive Patient alive after FC (with or without following SPC after ~ 5940
#> 2 Alive Patient alive after SPC 11316
#> 3 Alive NA - Patient not born before end of FU 4
#> 4 Alive NA - Patient did not develop cancer before end of FU 849
#> 5 Dead Patient alive after FC (with or without following SPC after ~ 863
#> 6 Dead Patient alive after SPC 1360
#> 7 Dead Patient dead after FC 6208
#> 8 Dead Patient dead after SPC 5325
#> 9 Dead NA - Patient did not develop cancer before end of FU 68
#> 10 Dead NA - Patient date of death is missing 64
#> # A tibble: 7 x 2
#> p_status n
#> <fct> <int>
#> 1 Patient alive after FC (with or without following SPC after end of FU) 6803
#> 2 Patient alive after SPC 12676
#> 3 Patient dead after FC 6208
#> 4 Patient dead after SPC 5325
#> 5 NA - Patient not born before end of FU 4
#> 6 NA - Patient did not develop cancer before end of FU 917
#> 7 NA - Patient date of death is missing 64
usdata_wide %>%
  dplyr::select(fake_id, p_status, p_alive.1, datedeath.1, t_site_icd.1, t_datediag.1,
                t_site_icd.2, t_datediag.2)
#> # A tibble: 31,997 x 8
#> fake_id p_status p_alive.1 datedeath.1 t_site_icd.1 t_datediag.1 t_site_icd.2
#> <chr> <fct> <chr> <date> <chr> <date> <chr>
#> 1 100004 Patient~ Alive NA C50 1992-07-15 C54
#> 2 100039 Patient~ Alive NA C50 2003-08-15 C34
#> 3 100073 Patient~ Dead 2005-06-01 C44 1993-11-15 C34
#> 4 100143 Patient~ Alive NA C50 1992-03-15 C34
#> 5 100182 Patient~ Dead 2007-05-01 C18 1991-09-15 C34
#> 6 100197 Patient~ Alive NA C34 2012-06-15 C50
#> 7 100208 NA - Pa~ Alive NA C34 2019-11-15 <NA>
#> 8 100230 Patient~ Dead 2008-05-01 C44 1992-11-15 C34
#> 9 100234 Patient~ Dead 2015-07-01 C34 2010-02-15 <NA>
#> 10 100266 Patient~ Alive NA C34 2010-07-15 <NA>
#> # ... with 31,987 more rows, and 1 more variable: t_datediag.2 <date>
#alternatively, you can impute the date of death using lifedatmin_var
usdata_wide %>%
  pat_status(., fu_end = "2017-12-31", dattype = "seer",
status_var = "p_status", life_var = "p_alive.1",
spc_var = "p_spc", birthdat_var = "datebirth.1",
lifedat_var = "datedeath.1", fcdat_var = "t_datediag.1",
spcdat_var = "t_datediag.2", life_stat_alive = "Alive",
life_stat_dead = "Dead", spc_stat_yes = "SPC developed",
spc_stat_no = "No SPC", lifedat_fu_end = "2019-12-31",
use_lifedatmin = TRUE, lifedatmin_var = "p_dodmin.1",
check = TRUE, as_labelled_factor = TRUE)
#> # A tibble: 9 x 3
#> p_alive.1 p_status n
#> <chr> <fct> <int>
#> 1 Alive Patient alive after FC (with or without following SPC after e~ 5940
#> 2 Alive Patient alive after SPC 11316
#> 3 Alive NA - Patient not born before end of FU 4
#> 4 Alive NA - Patient did not develop cancer before end of FU 849
#> 5 Dead Patient alive after FC (with or without following SPC after e~ 867
#> 6 Dead Patient alive after SPC 1361
#> 7 Dead Patient dead after FC 6230
#> 8 Dead Patient dead after SPC 5362
#> 9 Dead NA - Patient did not develop cancer before end of FU 68
#> # A tibble: 6 x 2
#> p_status n
#> <fct> <int>
#> 1 Patient alive after FC (with or without following SPC after end of FU) 6807
#> 2 Patient alive after SPC 12677
#> 3 Patient dead after FC 6230
#> 4 Patient dead after SPC 5362
#> 5 NA - Patient not born before end of FU 4
#> 6 NA - Patient did not develop cancer before end of FU 917
#> # A tibble: 31,997 x 130
#> fake_id SEQ_NUM.1 registry.1 sex.1 race.1 datebirth.1 t_datediag.1
#> <chr> <int> <chr> <chr> <chr> <date> <date>
#> 1 100004 1 SEER Reg 20 - Detroi~ Male White 1926-01-01 1992-07-15
#> 2 100039 1 SEER Reg 02 - Connec~ Fema~ White 1946-01-01 2003-08-15
#> 3 100073 1 SEER Reg 01 - San Fr~ Male White 1960-01-01 1993-11-15
#> 4 100143 1 SEER Reg 02 - Connec~ Male White 1944-01-01 1992-03-15
#> 5 100182 1 SEER Reg 02 - Connec~ Male Other 1927-01-01 1991-09-15
#> 6 100197 1 SEER Reg 02 - Connec~ Fema~ White 1945-01-01 2012-06-15
#> 7 100208 1 SEER Reg 02 - Connec~ Male White 1970-01-01 2019-11-15
#> 8 100230 1 SEER Reg 01 - San Fr~ Male White 1947-01-01 1992-11-15
#> 9 100234 1 SEER Reg 01 - San Fr~ Male White 1988-01-01 2010-02-15
#> 10 100266 1 SEER Reg 01 - San Fr~ Fema~ White 1956-01-01 2010-07-15
#> # ... with 31,987 more rows, and 123 more variables: t_site_icd.1 <chr>,
#> # t_dco.1 <chr>, fc_age.1 <int>, datedeath.1 <date>, p_alive.1 <chr>,
#> # p_dodmin.1 <date>, fc_agegroup.1 <chr>, t_yeardiag.1 <chr>,
#> # SEQ_NUM.2 <int>, registry.2 <chr>, sex.2 <chr>, race.2 <chr>,
#> # datebirth.2 <date>, t_datediag.2 <date>, t_site_icd.2 <chr>, t_dco.2 <chr>,
#> # fc_age.2 <int>, datedeath.2 <date>, p_alive.2 <chr>, p_dodmin.2 <date>,
#> # fc_agegroup.2 <chr>, t_yeardiag.2 <chr>, SEQ_NUM.3 <int>, registry.3 <chr>,
#> # sex.3 <chr>, race.3 <chr>, datebirth.3 <date>, t_datediag.3 <date>,
#> # t_site_icd.3 <chr>, t_dco.3 <chr>, fc_age.3 <int>, datedeath.3 <date>,
#> # p_alive.3 <chr>, p_dodmin.3 <date>, fc_agegroup.3 <chr>,
#> # t_yeardiag.3 <chr>, SEQ_NUM.4 <int>, registry.4 <chr>, sex.4 <chr>,
#> # race.4 <chr>, datebirth.4 <date>, t_datediag.4 <date>, t_site_icd.4 <chr>,
#> # t_dco.4 <chr>, fc_age.4 <int>, datedeath.4 <date>, p_alive.4 <chr>,
#> # p_dodmin.4 <date>, fc_agegroup.4 <chr>, t_yeardiag.4 <chr>,
#> # SEQ_NUM.5 <int>, registry.5 <chr>, sex.5 <chr>, race.5 <chr>,
#> # datebirth.5 <date>, t_datediag.5 <date>, t_site_icd.5 <chr>, t_dco.5 <chr>,
#> # fc_age.5 <int>, datedeath.5 <date>, p_alive.5 <chr>, p_dodmin.5 <date>,
#> # fc_agegroup.5 <chr>, t_yeardiag.5 <chr>, SEQ_NUM.6 <int>, registry.6 <chr>,
#> # sex.6 <chr>, race.6 <chr>, datebirth.6 <date>, t_datediag.6 <date>,
#> # t_site_icd.6 <chr>, t_dco.6 <chr>, fc_age.6 <int>, datedeath.6 <date>,
#> # p_alive.6 <chr>, p_dodmin.6 <date>, fc_agegroup.6 <chr>,
#> # t_yeardiag.6 <chr>, SEQ_NUM.7 <int>, registry.7 <chr>, sex.7 <chr>,
#> # race.7 <chr>, datebirth.7 <date>, t_datediag.7 <date>, t_site_icd.7 <chr>,
#> # t_dco.7 <chr>, fc_age.7 <int>, datedeath.7 <date>, p_alive.7 <chr>,
#> # p_dodmin.7 <date>, fc_agegroup.7 <chr>, t_yeardiag.7 <chr>,
#> # SEQ_NUM.8 <int>, registry.8 <chr>, sex.8 <chr>, race.8 <chr>,
#> # datebirth.8 <date>, t_datediag.8 <date>, t_site_icd.8 <chr>, t_dco.8 <chr>,
#> # ...
usdata_wide <- usdata_wide %>%
  dplyr::filter(!p_status %in% c("NA - Patient not born before end of FU",
                                 "NA - Patient did not develop cancer before end of FU",
                                 "NA - Patient date of death is missing"))

usdata_wide %>%
  dplyr::count(p_status)
#> # A tibble: 4 x 2
#> p_status n
#> <fct> <int>
#> 1 Patient alive after FC (with or without following SPC after end of FU) 6803
#> 2 Patient alive after SPC 12676
#> 3 Patient dead after FC 6208
#> 4 Patient dead after SPC 5325
usdata_wide <- usdata_wide %>%
  calc_futime(., futime_var_new = "p_futimeyrs", fu_end = "2017-12-31",
dattype = "seer", time_unit = "years",
lifedat_var = "datedeath.1",
fcdat_var = "t_datediag.1", spcdat_var = "t_datediag.2")
#> # A tibble: 4 x 5
#> p_status mean_futime min_futime max_futime median_futime
#> <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 Patient alive after FC (with ~ 9.58 0.0438 27.0 8.29
#> 2 Patient alive after SPC 8.69 0 26.9 7.50
#> 3 Patient dead after FC 8.54 0 25.8 7.47
#> 4 Patient dead after SPC 6.33 0 26.5 5.08
usdata_wide %>%
  dplyr::select(fake_id, p_status, p_futimeyrs, p_alive.1, datedeath.1, t_datediag.1, t_datediag.2)
#> # A tibble: 31,012 x 7
#> fake_id p_status p_futimeyrs p_alive.1 datedeath.1 t_datediag.1 t_datediag.2
#> <chr> <fct> <dbl> <chr> <date> <date> <date>
#> 1 100004 Patient ~ 11.5 Alive NA 1992-07-15 2004-01-15
#> 2 100039 Patient ~ 7.67 Alive NA 2003-08-15 2011-04-15
#> 3 100073 Patient ~ 10.1 Dead 2005-06-01 1993-11-15 2003-12-15
#> 4 100143 Patient ~ 3.33 Alive NA 1992-03-15 1995-07-15
#> 5 100182 Patient ~ 7.08 Dead 2007-05-01 1991-09-15 1998-10-15
#> 6 100197 Patient ~ 4.83 Alive NA 2012-06-15 2017-04-15
#> 7 100230 Patient ~ 11.0 Dead 2008-05-01 1992-11-15 2003-11-15
#> 8 100234 Patient ~ 5.37 Dead 2015-07-01 2010-02-15 NA
#> 9 100266 Patient ~ 7.46 Alive NA 2010-07-15 NA
#> 10 100274 Patient ~ 2.38 Dead 2006-06-01 2004-01-15 NA
#> # ... with 31,002 more rows
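The follow-up time shown above can be understood as the time from first cancer diagnosis to whichever comes first: second cancer diagnosis, death or fu_end. The base R sketch below recomputes it for three of the printed cases. Dividing the number of days by 365.25 is an assumption about the conversion to years, and the code is not the implementation of calc_futime().

```r
fu_end <- as.Date("2017-12-31")

# Dates copied from the printed rows for fake_id 100004, 100234 and 100266
toy_wide <- data.frame(
  fake_id      = c("100004", "100234", "100266"),
  t_datediag.1 = as.Date(c("1992-07-15", "2010-02-15", "2010-07-15")),
  t_datediag.2 = as.Date(c("2004-01-15", NA, NA)),
  datedeath.1  = as.Date(c(NA, "2015-07-01", NA)),
  stringsAsFactors = FALSE
)

# Earliest of SPC diagnosis, death and end of follow-up; missing dates are ignored
end_date <- pmin(toy_wide$t_datediag.2, toy_wide$datedeath.1, fu_end, na.rm = TRUE)

# Days between first cancer and end date, converted to years (assumed divisor 365.25)
toy_wide$p_futimeyrs <- as.numeric(end_date - toy_wide$t_datediag.1) / 365.25
round(toy_wide$p_futimeyrs, 2)
```

For these three cases the sketch agrees with the p_futimeyrs values printed above (11.5, 5.37 and 7.46).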
sircalc_results <- usdata_wide %>%
  sir_byfutime(
dattype = "seer",
ybreak_vars = c("race.1", "t_dco.1"),
xbreak_var = "none",
futime_breaks = c(0, 1/12, 2/12, 1, 5, 10, Inf),
count_var = "count_spc",
refrates_df = us_refrates_icd2,
calc_total_row = TRUE,
calc_total_fu = TRUE,
region_var = "registry.1",
age_var = "fc_agegroup.1",
sex_var = "sex.1",
year_var = "t_yeardiag.1",
site_var = "t_site_icd.1", #using grouping by second cancer incidence
futime_var = "p_futimeyrs",
alpha = 0.05)
#>
#> [INFO Cases 0 PYARs] There are conflicts where strata with 0 follow-up time have data in observed.
#> # A tidytable: 4 x 10
#> age sex region year race.1 t_site to1month i_observed i_pyar n_base
#> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 05 - ~ Male SEER Reg ~ 2015~ Other C34 1 1 0 1
#> 2 15 - ~ Male SEER Reg ~ 2005~ Black C34 1 1 0 1
#> 3 85 - ~ Female SEER Reg ~ 1995~ Black C34 1 2 0 2
#> 4 85 - ~ Male SEER Reg ~ 2010~ Other C34 1 1 0 1
#> Check attribute `problems_not_empty` of results to see what strata are affected.
#> This might be caused by cases where SPC occured at the same day as first cancer.
#> You can check this by excluding all cases from wide_df, where date of first diagnosis is equal.
#>
#> [INFO Cases 0 PYARs] There are conflicts where strata with 0 follow-up time have data in observed.
#> # A tidytable: 4 x 10
#> age sex region year race.1 t_site Total0toInfyears i_observed i_pyar
#> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 05 - ~ Male SEER Reg~ 2015~ Other C34 1 1 0
#> 2 15 - ~ Male SEER Reg~ 2005~ Black C34 1 1 0
#> 3 85 - ~ Female SEER Reg~ 1995~ Black C34 1 2 0
#> 4 85 - ~ Male SEER Reg~ 2010~ Other C34 1 1 0
#> # ... with 1 more variable: n_base <dbl>
#> Check attribute `problems_not_empty` of results to see what strata are affected.
#> This might be caused by cases where SPC occured at the same day as first cancer.
#> You can check this by excluding all cases from wide_df, where date of first diagnosis is equal.
#>
#> [INFO Cases 0 PYARs] There are conflicts where strata with 0 follow-up time have data in observed.
#> # A tidytable: 1 x 10
#> age sex region year t_dco.1 t_site to1month i_observed i_pyar n_base
#> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 80 - ~ Male SEER Reg ~ 2010~ DCO ca~ C34 1 1 0 1
#> Check attribute `problems_not_empty` of results to see what strata are affected.
#> This might be caused by cases where SPC occured at the same day as first cancer.
#> You can check this by excluding all cases from wide_df, where date of first diagnosis is equal.
#>
#> [INFO Cases 0 PYARs] There are conflicts where strata with 0 follow-up time have data in observed.
#> # A tidytable: 1 x 10
#> age sex region year t_dco.1 t_site Total0toInfyears i_observed i_pyar
#> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 80 - ~ Male SEER Reg~ 2010~ DCO ca~ C34 1 1 0
#> # ... with 1 more variable: n_base <dbl>
#> Check attribute `problems_not_empty` of results to see what strata are affected.
#> This might be caused by cases where SPC occured at the same day as first cancer.
#> You can check this by excluding all cases from wide_df, where date of first diagnosis is equal.
#>
#> There are observed cases in the results file that do not occur in the refrates_df.
#> A possible explanation can be:
#> - DCO cases
#> - diagnosis of second cancer occured in different time period than first cancer
#> The following strata are affected:
#> # A tidytable: 166 x 21
#> yvar_name yvar_label fu_time age sex region year t_site i_observed
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
#> 1 total_var Overall 1-5 yea~ 60 - ~ Male SEER Reg ~ 2010~ C34 17
#> 2 total_var Overall 1-5 yea~ 65 - ~ Male SEER Reg ~ 2015~ C34 18
#> 3 total_var Overall 5-10 ye~ 60 - ~ Male SEER Reg ~ 2010~ C34 20
#> 4 total_var Overall 5-10 ye~ 60 - ~ Male SEER Reg ~ 2010~ C34 22
#> 5 total_var Overall 5-10 ye~ 70 - ~ Male SEER Reg ~ 2005~ C34 12
#> 6 total_var Overall 5-10 ye~ 75 - ~ Fema~ SEER Reg ~ 2005~ C34 13
#> 7 total_var Overall 10+ yea~ 25 - ~ Fema~ SEER Reg ~ 1995~ C34 14
#> 8 total_var Overall 10+ yea~ 60 - ~ Fema~ SEER Reg ~ 1995~ C34 18
#> 9 total_var Overall 10+ yea~ 60 - ~ Fema~ SEER Reg ~ 1995~ C34 16
#> 10 total_var Overall 10+ yea~ 65 - ~ Male SEER Reg ~ 2000~ C34 29
#> # ... with 156 more rows, and 12 more variables: i_pyar <dbl>, n_base <dbl>,
#> # race <chr>, incidence_cases <int>, population_pyar <int>,
#> # incidence_crude_rate <dbl>, i_expected <dbl>, sir <dbl>, sir_lci <dbl>,
#> # sir_uci <dbl>, fu_time_sort <int>, yvar_sort <int>
#>
#> Check attribute `notes_refcases` of results to see what strata are affected.
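Stratifying by follow-up time means that each person's time at risk is split across the intervals defined by futime_breaks. A person contributes person-years to every interval they pass through, not only to the last one. The sketch below shows this splitting for a single follow-up time. The helper pyar_by_interval() is hypothetical, and whether sir_byfutime() computes i_pyar in exactly this way is an assumption.

```r
futime_breaks <- c(0, 1/12, 2/12, 1, 5, 10, Inf)

# Hypothetical helper: person-years contributed to each interval [lower, upper)
pyar_by_interval <- function(futime, breaks) {
  lower <- breaks[-length(breaks)]
  upper <- breaks[-1]
  # time spent in an interval is the overlap of [0, futime] with [lower, upper)
  pmax(0, pmin(futime, upper) - lower)
}

# A patient followed for 7.46 years
pyar <- pyar_by_interval(7.46, futime_breaks)
round(pyar, 3)
sum(pyar)
```

The contributions add up to the total follow-up time, and intervals that start after the end of follow-up (here 10+ years) receive zero person-years.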
# sircalc_results %>% print(n = 100) # uncomment after tidytable 0.5.6 release
#The summarize function is versatile. Here, for example, the results are
#summarized across region, age, year, race and t_site
sircalc_results %>%
summarize_sir_results(.,
summarize_groups = c("region", "age", "year", "race"),
summarize_site = TRUE,
output = "long", output_information = "minimal",
add_total_row = "only", add_total_fu = "no",
collapse_ci = FALSE, shorten_total_cols = TRUE,
fubreak_var_name = "fu_time", ybreak_var_name = "yvar_name",
xbreak_var_name = "none", site_var_name = "t_site",
alpha = 0.05
  ) %>%
  dplyr::select(-region, -age, -year, -race, -sex, -yvar_name)
#> Warning: The results file `sir_df` contains observed cases in i_observed that do not occur in the refrates_df (ref_inc_cases).
#> Therefore calculation of the variables n_base and ref_population_pyar is ambiguous.
#> We take the first value of each variable. Expect small inconsistencies in the calculation of n_base, ref_population_pyar and ref_inc_crude_rate across strata.
#> If you want to know more, please check the `warnings` column of `sir_df`.
#> # A tidytable: 7 x 8
#> yvar_label fu_time fu_time_sort t_site observed expected sir sir_ci
#> <chr> <chr> <int> <chr> <dbl> <dbl> <dbl> <chr>
#> 1 Overall to 1 month 1 Total 327 18.1 18.1 16.19 - ~
#> 2 Overall 0.0833-0.167~ 2 Total 80 17.9 4.46 3.54 - 5~
#> 3 Overall 0.167-1 years 3 Total 724 172. 4.2 3.9 - 4.~
#> 4 Overall 1-5 years 4 Total 2998 668. 4.49 4.33 - 4~
#> 5 Overall 5-10 years 5 Total 3089 531. 5.82 5.61 - 6~
#> 6 Overall 10+ years 6 Total 4241 438. 9.69 9.4 - 9.~
#> 7 Overall Total 0 to I~ 7 Total 11459 1845. 6.21 6.1 - 6.~
sessionInfo()
#> R version 4.0.5 (2021-03-31)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 17763)
#>
#> Matrix products: default
#>
#> locale:
#> [1] LC_COLLATE=C LC_CTYPE=German_Germany.1252
#> [3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C
#> [5] LC_TIME=German_Germany.1252
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] msSPChelpR_0.8.7 magrittr_2.0.1 dplyr_1.0.6
#>
#> loaded via a namespace (and not attached):
#> [1] Rcpp_1.0.6 pillar_1.6.1 bslib_0.2.5.1 compiler_4.0.5
#> [5] jquerylib_0.1.4 prettyunits_1.1.1 progress_1.2.2 forcats_0.5.1
#> [9] tools_4.0.5 digest_0.6.27 jsonlite_1.7.2 lubridate_1.7.10
#> [13] evaluate_0.14 lifecycle_1.0.0 tibble_3.1.2 pkgconfig_2.0.3
#> [17] rlang_0.4.11 DBI_1.1.1 cli_2.5.0 rstudioapi_0.13
#> [21] yaml_2.2.1 haven_2.4.1 xfun_0.23 stringr_1.4.0
#> [25] knitr_1.33 hms_1.1.0 generics_0.1.0 vctrs_0.3.8
#> [29] sass_0.4.0 sjlabelled_1.1.8 tidyselect_1.1.1 snakecase_0.11.0
#> [33] glue_1.4.2 data.table_1.14.0 R6_2.5.0 fansi_0.5.0
#> [37] rmarkdown_2.9 purrr_0.3.4 tidyr_1.1.3 ps_1.6.0
#> [41] ellipsis_0.3.2 htmltools_0.5.1.1 insight_0.14.1 assertthat_0.2.1
#> [45] tidytable_0.6.2 utf8_1.2.1 stringi_1.6.2 crayon_1.4.1