In many cases, the connected families that we have to integrate over may be large but sparsely connected. As an example, consider the following three generation family:

# the data set we use
dat <- data.frame(
  id = 1:48, 
  mom = c(NA, NA, 2L, 2L, 2L, NA, NA, 7L, 7L, 7L, 3L, 3L, 3L, 3L, NA, 15L, 15L, 43L, 18L, NA, NA, 21L, 21L, 9L, 9L, 9L, 9L, NA, NA, 29L, 29L, 29L, 30L, 30L, NA, NA, 36L, 36L, 36L, 38L, 38L, NA, NA, 43L, 43L, 43L, 32L, 32L), 
  dad = c(NA, NA, 1L, 1L, 1L, NA, NA, 6L, 6L, 6L, 8L, 8L, 8L, 8L, NA, 4L, 4L, 42L, 5L, NA, NA, 20L, 20L, 22L, 22L, 22L, 22L, NA, NA, 28L, 28L, 28L, 23L, 23L, NA, NA, 35L, 35L, 35L, 31L, 31L, NA, NA, 42L, 42L, 42L, 45L, 45L), 
  sex = c(1L, 2L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L))

# plot the pedigree
library(kinship2, quietly = TRUE)
ped <- with(dat, pedigree(id = id, dadid = dad, momid = mom, sex = sex))
par(mar = c(1, 1, 1, 1))
plot(ped)

In the plot, circles are females and squares are males. An individual may be repeated in which case duplicate entries are illustrated by a dashed line. If we observe outcomes for all 48 members of the family then we have to do 48 dimensional as everybody are marginally dependent if we have an additive genetic effect or similar effects. We assume that this is true in the rest of this vignette. Though it is possible also use the same method we describe here in other cases if one specifies a suitable graph, like the one we will get to soon, for the random effect structure.

To see that everybody are connected, notice that

The relationships can also be represented as graph. For this purposes, we assign the functions below to create a list of edges between parents and their children. You can likely skip the functions.

# returns the mothers id.
# 
# Args:
#   pedigree: the pedigree object.
get_momid <- function(peddat)
  with(peddat, 
       vapply(mindex, function(x) if(x > 0L) id[x] else NA_integer_, 1L))

# returns the fathers id.
# 
# Args:
#   pedigree: the pedigree object.
get_dadid <- function(peddat)
  with(peddat, 
       vapply(findex, function(x) if(x > 0L) id[x] else NA_integer_, 1L))

# creates an edge list to pass to igraph. An edge is included between children 
# and parents.
# 
# Args:
#   pedigree: the pedigree object.
create_igraph_input <- function(peddat){
  id <- peddat$id
  father <- get_dadid(peddat)
  mother <- get_momid(peddat)
  
  # TODO: this is O(n^2)
  stopifnot(anyDuplicated(id) < 1)
  out <- lapply(id, function(x){
    # find the children
    children_idx <- which(x == father | x == mother)
    children <- if(length(children_idx) > 0)
      id[children_idx] else NULL
    
    # get the correct order (smallest first) and return
    is_larger <- x > children
    
    cbind(
      ifelse(is_larger, children, x        ), 
      ifelse(is_larger, x       , children))
  })
  
  out <- do.call(rbind, out)
  as.data.frame(out[!duplicated(out), ])
}

The graph version of the family looks like this:

# For some reason, kinship2::pedigree requires that we provide both a father 
# and mother or none. Therefore, we create a mock object. You can skip this
get_pedigree_mock <- function(id, dadid, momid, sex){
  if(is.factor(sex))
    sex <- as.integer(sex)
  
  # checks
  n <- length(id)
  stopifnot(n > 0, length(dadid) == n, length(momid) == n, length(sex) == n, 
            all(is.finite(sex)), all(sex %in% 1:2), 
            all(is.na(dadid) | dadid %in% id), 
            all(is.na(momid) | momid %in% id), 
            all(is.finite(id)))
  
  # create objects to return
  findex <- match(dadid, id, nomatch = 0L)
  mindex <- match(momid, id, nomatch = 0L)
  
  structure(
    list(famid = rep(1L, n), id = id, findex = findex, mindex = mindex,
         sex = factor(sex, levels = 1:2, labels = c("male", "famle"))), 
    class = "pedigree")
}

# assign function to plot the pedigree as a graph
do_graph_plot <- function(dat){
  ped <- with(dat, get_pedigree_mock(
    id = id, dadid = dad, momid = mom, sex = sex))
  library(igraph, quietly = TRUE)
  g_dat <- create_igraph_input(ped)
  graph_fam <-  graph.data.frame(g_dat, directed = FALSE)
  par(mar = c(1, 1, 1, 1))
  plot(graph_fam, vertex.size = 12, vertex.color = "gray",
       vertex.label.cex = .75, layout = layout_with_kk)
}
do_graph_plot(dat)

A node/vertex in the graph represents an individual and an edge indicates a parent-child relation. The graph have the property that if we split (find a connected partition of) the above into two sub-families (subgraphs) by removing child-parent relations (removing edges) then we only have to do two smaller integrals rather than one large. Having said that, the 48 dimensional integral that we show here is not a problem but higher dimensional integrals may be.

Implemented Methods

The above suggests the following procedure for simplifying the computational problem:

  1. Find a split (connected partition) such that the two sub-families (subgraphs) are roughly the same size (an approximately balanced partition).
  2. Of the sets that satisfies 1., find the one that removes the least amount of relationships (cuts the least amount of edges).

We have implemented procedures to do just this. It also turns out that it is fine to not satisfy number 1. That is, to not get two connected sets in the partition. We will return to this later and to what we have implemented but first provide an example. We start by splitting the above family in way that only satisfies 1.:

# get the partition 
library(pedmod)
only_balanced <- max_balanced_partition_pedigree(
  id = dat$id, father.id = dat$dad, mother.id = dat$mom, trace = 2L)
#> Exited early. Balance criterion is 4
#> Exited early. Balance criterion is 7
#> Cannot improve balance criterion further
#> Found split. Balance criterion is 24
#> Found perfect balanced connected partition early. Balance criterion is 24
#> Found approximately balanced connected partition. Balance criterion is 24
only_balanced$balance_criterion # the smallest of the two sets
#> [1] 24
removed_edges <- only_balanced$removed_edges
removed_edges # the relationships we have removed
#>      from to
#> [1,]    6  8
#> [2,]    6 10
#> [3,]    7  9
#> [4,]   32 47
#> [5,]   32 48

# assign function to get the new data. You can likely skip this
get_reduced_data <- function(dat, removed_edges){
  for(i in 1:nrow(removed_edges)){
    . <- function(child, parent){
      idx   <- which(dat$id       == removed_edges[i, child])
      idx_m <- which(dat$mom[idx] == removed_edges[i, parent])
      if(length(idx_m > 0)){
        dat[idx[idx_m], "mom"] <<- NA_integer_
        return(TRUE)
      }
      idx_d <- which(dat$dad[idx] == removed_edges[i, parent])
      if(length(idx_d > 0)){
        dat[idx[idx_d], "dad"] <<- NA_integer_
        return(TRUE)
      }
      FALSE
    }
    if(.(1L, 2L))
      next
    .(2L, 1L)
  }
  
  dat
}
new_dat <- get_reduced_data(dat, removed_edges)

# redo the plot
do_graph_plot(new_dat)

The above shows the family after we remove the 5 relationships (edges) that our algorithm finds. To illustrate where the cut is in the original graph, we can color the vertices according to which set they are in:

# assign function to show the split in the original graph
show_split <- function(dat, partition_obj){
  ped <- with(dat, pedigree(id = id, dadid = dad, momid = mom, sex = sex))
  g_dat <- create_igraph_input(ped)
  graph_fam <-  graph.data.frame(g_dat, directed = FALSE)
  nam <- vertex_attr(graph_fam)$name
  V(graph_fam)$color[nam %in% partition_obj$set_1] <- "lightblue" 
  V(graph_fam)$color[nam %in% partition_obj$set_2] <- "lightgray" 
  par(mar = c(1, 1, 1, 1))
  plot(graph_fam, vertex.size = 12, vertex.label.cex = .75, 
       layout = layout_with_kk)
}
show_split(dat, only_balanced)

Reducing the Cut Cost

Next, we use the slack argument to reduce the number of relationships we remove (the number of edges we cut):

also_cut <- max_balanced_partition_pedigree(
  id = dat$id, father.id = dat$dad, mother.id = dat$mom, trace = 2L, 
  slack = .1)
#> Exited early. Balance criterion is 4
#> Exited early. Balance criterion is 7
#> Cannot improve balance criterion further
#> Found split. Balance criterion is 24
#> Found perfect balanced connected partition early. Balance criterion is 24
#> Found approximately balanced connected partition. Balance criterion is 24
#> Starting to reduce the cost of the cut in a block of size 38. The partition in the block consists of two sets of size 17 and 21. The cut cost is 5
#> Starting iteration 1
#> Found 6 vertices to move
#> Keept 2 moves with a gain of 1 and a balance criterion of 24
#> The cut cost is 4
#> Starting iteration 2
#> Found 6 vertices to move
#> Keept 0 moves
#> The cut cost is 4
also_cut$balance_criterion # the smallest of the two sets
#> [1] 24
removed_edges <- also_cut$removed_edges
removed_edges # the relationships we have removed
#>      from to
#> [1,]    6  9
#> [2,]    7  9
#> [3,]   32 48
#> [4,]   45 47

# redo the plot
show_split(dat, also_cut)

We only removed 4 relationships (edges) this time.

Using Individual Weights (Vertex Weights)

In many applications, we only observe some individuals. Therefore, we want the to split the family into equal sizes in terms of the individuals that we actually observe. We can do this by providing individual weights (vertex weights).

As an example, we will take the family we have been working with and assume that we only observe individuals in the final generation. Thus, we will set the weight of these individuals to a one and the remaining to some small positive number (say \(10^{-5}\)):

# add the weights
is_final <- c(16:17, 11:14, 24:27, 33:34, 40:41, 47:48, 19L)
dat$id_weight <- ifelse(dat$id %in% is_final, 1., 1e-5)

weighted_partition <- max_balanced_partition_pedigree(
  id = dat$id, father.id = dat$dad, mother.id = dat$mom, trace = 2L, 
  slack = .1, id_weight = dat$id_weight)
#> Exited early. Balance criterion is 4e-05
#> Exited early. Balance criterion is 2.00005
#> Cannot improve balance criterion further
#> Found split. Balance criterion is 8.00019
#> Exited early. Balance criterion is 2.00001
#> Found approximately balanced connected partition. Balance criterion is 8.00019
#> Starting to reduce the cost of the cut in a block of size 38. The partition in the block consists of two sets of size 20 and 18. The cut cost is 6
#> Starting iteration 1
#> Found 8 vertices to move
#> Keept 1 moves with a gain of 2 and a balance criterion of 8.00018
#> The cut cost is 4
#> Starting iteration 2
#> Found 7 vertices to move
#> Keept 0 moves
#> The cut cost is 4
weighted_partition$balance_criterion # the smallest of the two sets
#> [1] 8
removed_edges <- weighted_partition$removed_edges
removed_edges # the relationships we have removed
#>      from to
#> [1,]    6  8
#> [2,]    7  8
#> [3,]   32 47
#> [4,]   32 48

# plot the new graph
plot_weighted <- function(dat, removed_edges){
  new_dat <- get_reduced_data(dat, removed_edges)
  ped <- with(new_dat, get_pedigree_mock(
    id = id, dadid = dad, momid = mom, sex = sex))
  g_dat <- create_igraph_input(ped)
  graph_fam <- graph.data.frame(g_dat, directed = FALSE)
  V(graph_fam)$color <- ifelse(vertex_attr(graph_fam)$name %in% is_final, 
                               "gray", "white")
  par(mar = c(1, 1, 1, 1))
  plot(graph_fam, vertex.size = 12, vertex.label.cex = .75, 
       layout = layout_with_kk)}

plot_weighted(dat, removed_edges)

show_split(dat, weighted_partition)

We have colored vertices which we observe in the first graph to make it easy to see that there is a as close as possible to an equal number in each sub-family.

Using Relationship Weights (Edge Weights)

We can notice that the previous solution removes relations between the individuals we observe and their immediate ancestors. We can add a larger weight to these relationships to avoid or discourage this. To do so, we need to use the father_weight and mother_weight arguments as shown below:

# add the weights
is_final <- c(16:17, 11:14, 24:27, 33:34, 40:41, 47:48, 19L)
dat$id_weight <- ifelse(dat$id %in% is_final, 1., 1e-5)

# add the edge weights
dat$father_weight <- dat$mother_weight <- ifelse(dat$id %in% is_final, 10., 1.)

# find the partition
weighted_partition <- max_balanced_partition_pedigree(
  id = dat$id, father.id = dat$dad, mother.id = dat$mom, trace = 2L, 
  slack = .1, id_weight = dat$id_weight, father_weight = dat$father_weight, 
  mother_weight = dat$mother_weight)
#> Exited early. Balance criterion is 4e-05
#> Exited early. Balance criterion is 2.00005
#> Cannot improve balance criterion further
#> Found split. Balance criterion is 8.00019
#> Exited early. Balance criterion is 2.00001
#> Found approximately balanced connected partition. Balance criterion is 8.00019
#> Starting to reduce the cost of the cut in a block of size 38. The partition in the block consists of two sets of size 20 and 18. The cut cost is 60
#> Starting iteration 1
#> Found 8 vertices to move
#> Keept 2 moves with a gain of 56 and a balance criterion of 8.00017
#> The cut cost is 4
#> Starting iteration 2
#> Found 6 vertices to move
#> Keept 0 moves
#> The cut cost is 4
weighted_partition$balance_criterion # the smallest of the two sets
#> [1] 8
removed_edges <- weighted_partition$removed_edges
removed_edges # the relationships we have removed
#>      from to
#> [1,]    6  8
#> [2,]    7  8
#> [3,]   28 32
#> [4,]   29 32

# plot the new graph
plot_weighted(dat, removed_edges)

show_split(dat, weighted_partition)

We end up cutting the link to the grandparents instead.

Potentially Unconnected Partition

We may not require that the two sets in the partition is connected. The unconnected_partition_pedigree function does not impose this constraint. An example is given below:

# without weights
partition <- unconnected_partition_pedigree(
  id = dat$id, father.id = dat$dad, mother.id = dat$mom, trace = 2L, 
  slack = .1)
#> Starting from an initial paratition with a balance criterion of 0 with a total vertex weight of 48
#> The cut cost is 0
#> Starting to build the first partition that satisfy the balance criterion
#> Starting from an initial paratition with a balance criterion of 20 and a cut cost of 10
#> Starting iteration 1
#> Found 47 vertices to move
#> Keept 14 moves with a gain of 3 and a balance criterion of 24
#> The cut cost is 7
#> Starting iteration 2
#> Found 47 vertices to move
#> Keept 15 moves with a gain of 1 and a balance criterion of 21
#> The cut cost is 6
#> Starting iteration 3
#> Found 47 vertices to move
#> Keept 3 moves with a gain of 0 and a balance criterion of 24
#> The cut cost is 6
#> Starting iteration 4
#> Found 47 vertices to move
#> Keept 0 moves
#> The cut cost is 6
partition$removed_edges # the relationships we have removed
#>      from to
#> [1,]    1  3
#> [2,]    2  3
#> [3,]   28 31
#> [4,]   29 31
#> [5,]   32 47
#> [6,]   32 48

# show the partition
show_split(dat, partition)