R-Forge Logo

Welcome to Chain graph models in R project!

Software for fitting (block recursive) chain graphs in R.

This is the homepage of the project Chain Graph Models in R. On this page you can find links to software, papers, talks, data related to the package gRchain. Below you can also find will a tutorial on using the functions from gRchain.

Talks:

Talk Date Place
Psychoco 2013 (slides) 14.02.2013-15.02.2013 Zuerich, Switzerland

Software:

The project summary page and the software you can usually find here.

The most recent package builds for Linux and Windows are available here: gRchain Builds

People:

  • Thomas Rusch
  • Marcus Wurzer
  • Kathrin Gruber
  • How to use gRchain

    This is a short document intended to introduce the reader into the usage of the gRchain package. Currently, it supports block recursive chain graph models (Cox & Wermuth, 1996) by via the main function coxwer().

    Introduction

    In complex research problems that involve a large number of potentially important variables or feature a complicated dependence structure, the joint probability distribution of the involved random variables can only be unsatisfactorily modelled with classical statistical models. Often some structure in the joint distribution allows to factorize it into conditionally independent components which gave rise to a class of models known as Graphical Models. By representing the joint distribution as a graph with nodes and edges, Graphical Models can exploit possible conditional independence structures in the joint distribution and allow restoring the joint distribution from the components.

    A subclass of Graphical Models, Chain Graph Models, can be of particular interest for problems. A chain graph is a graph which may have both directed and undirected edges but is devoid of any directed cycles. In these models, a researcher can use substantive knowledge to categorize the variables as purely explanatory (predictor), purely dependent or target (response) or intermediate (response and predictor in turn). Each of these variables is assigned to a certain block, based on a partial ordering of the variables, meaning that the ordering is present between blocks but not within blocks. This approach leads to a chain of relationships between the different variables.

    The challenging task of fitting a full Chain Graph Model to the data is facilitated by the factorization property which allows maximizing the joint likelihood by reducing the problem to maximizing the likelihood for each factorized submodel. However, in case of different variable types in the same block (e.g., metric and categorical variables) ML estimation using the direct factorization strategy often does not converge or can computationally be very expensive. As a remedy, Cox & Wermuth (1996) propose the heuristic usage of a system of univariate models for each factorized component.

    In this vignette we introduce an R implementation of the Cox-Wermuth selection strategy that allows to fit a Chain Graph Model for metric and categorical random variables from exponential families and thus incorporates the class of Generalized Linear Models in the chain. For illustration we apply the procedure to a highly multivariate data set that features many different types of variables.

    Graphical Models

    Graphical models (GM) allow multivariate analysis of complex dependency structures. They are probability distributions over a multidimensional space encoded by graphs (as a set of vertices/variables, \(V\), and a set of edges/relationships between variables, \(E\)). There are different types:

    GM represent multivariate dependencies by conditional dependence and independence statements. Thus they can help in reducing overall complexity and allow model formulation, identification and selection.

    A simple graphical model (a Markov random field):

    In GM the Markov property of graphs allows to factorize the distribution \(F_V\) into a set of conditional distributions, e.g., for \(V=\{A,B,C,D\}\) by way of densities: \(f_V=f_{A|B} \times f_{B|C} \times f_{C|D} \times f_D\). Thus the problem of fitting graphical models effectively reduces to estimating a series of conditional distributions.

    Chain Graph Models

    Chain graph models (CGM) are a mixture of directed and undirected graphical models. They are particularly interesting for social and behavioral sciences (observational studies, complex multivariate dependencies, existing substantive knowledge). In CGM, all variables are assigned to boxes (disjoint variable subsets \(V_t, V=\bigcup_t V_t\) by theory or substantive knowledge. Between boxes exist directed edges, within boxes the edges are undirected,

    Two types of CGM:

    Factorization

    A joint response chain graph model:

    In CGM factorization happens at least recursively between blocks: \(f_{V}=f_{V_{T}|{V_{T-1},\dots, V_1}} \times f_{V_{T-1}|{V_{T-2},\dots, V_1}} \times \dots \times f_{V_1}\). Possibly additional conditional independence by missing edges, e.g., for the above graph \(f_{V}=f_{F| C,E,D,A,B} \times f_{C,E,D|A,B} \times f_{A,B}=f_{F|C,E} \times f_{C,D|A,B} \times f_{E|B} \times f_{A,B}\).

    Estimation

    For CGM there are no theoretical restrictions on the form of the conditional distributions (though usually conditional Gaussian distributions; Lauritzen & Wermuth, 1989). In particular variable types can be of mixed type within and between boxes (discrete and continuous components). General algorithms for computing estimates in every CGM under every possible variable type specification is challenging. One can attempt to fit the conditional distributions of the factorization with a series of multiple univariate conditional regressions (Wermuth & Cox, 2001). These are called traceable regressions.

    Cox & Wermuth (1996; see also Caputo et al., 1997) lay out ideas for a heuristic selection strategy to approximate the CGM by univariate conditional regressions which is implemented in coxwer.

    The coxwer Selection Algorithm:

    The coxwer Functionality

    We implemented an algorithm based on the ideas of the Cox-Wermuth heuristic in R for approximate fitting of JRCG and URRG models.

    Currently, there are the following functions intended for the user:

    Object Description
    cw-class S3 class for objects from a Cox-Wermuth fit
    coxwer Fit a JRCG or a URRG via Cox-Wermuth selection strategy
    summary, print, plot, predict S3 methods for class cw
    adjmatrix Extracts the adjacency matrix
    write_cw Writes and saves the graph in igraph format
    prep_coxwer Setup of variable frame, block membership and variable type (interactive)

    Using the Function

    There are two ways of using the coxwer function: either with a formula or with a var.frame. We will use the Contraceptive Method Choice (CMC) data set for illustration which is a subset of the 1987 National Indonesia Contraceptive Prevalence Survey (Lim et. al., 1999). Overall there are 1473 observations of married women on 10 variables:

    We assume the following block structure:

    Blocks must always be numbered like this: Increasing integers starting from 1. The purely target variables are always in block 1, the purely explanatory variables are always in Block K where K is the highest block number (here 5). The blocks in between are labeled in the reverse direction of dependencies. We will abreviate this with Bx which stands for block x (Vx stands for variable x; so B5V2 is Block 5 Variable 2). Variables in BK have no antecedents (but possibly undirected dependencies within the block), variables in BK-1 have all variables in BK as antecedents and variables in BK-1 as possible undirected dependencies, variables in BK-2 have all variables in BK and BK-1 as antecedents and variables in BK-2 as possible undirected dependencies and so on until variables in B1 which have all variables in higher numbered blocks as antecedents.

    The model is therefore of this type (we use ~ for directed edges, and + for undirected edges; the brackets are there for easier readability)

    (B1V1 + B1V2 + ...) ~ (B2V1 + B2V2 + ...) ~ (B3V1 + B3V2 + B3V3 + ...)

    For the example it is

    B1V1 + B1V2 ~ B2V1 ~ B3V1 ~ B4V1 + B4V2 + B4V3 + B4V4 + B4V5 ~ B5V1

    Fitting the model with a formula

    The formula interface to coxwer basically allows to specifiy the directed dependencies between block with the operator ~ and the undirected within the block by +. The structure is therefore:

    B1V1 + B1V2 + ... ~  B2V1 + B2V2 + ... ~  B3V1 + B3V2 + B3V3 + ...

    Or for the example

    contraceptive + nrChild ~ mediaExp ~ solIndex ~ wifeRel + wifeWork + husbOcc + wifeEdu + husbEdu ~ age

    One can the use the function with

    data(cmc)
    rescmc <- coxwer(contraceptive + nrChild ~              #Block 1
                     mediaExp ~                             #Block 2
                     solIndex ~                             #Block 3
                     wifeRel + wifeWork + husbOcc + wifeEdu + husbEdu ~ #Block 4 
                     age,                                   #Block 5
                     data=cmc)

    Note that when using the formula interface, it might be a good idea to specify the vartype argument (see below).

    Fitting the model with a var.frame

    Here the coxwer argument is a variable frame and an observations \(\times\) variables data frame. The variable frame defines the block and type of a variable. It must have the same row names as the data frame has column names. The var.frame must have the following structure: It needs two columns names type and block, which specify the variable type (and thus the model used; see below) and the block to which each variable belongs. The rownames of the var.frame must match the colnames of the data frame. type should be character (so use stringsAsfactors=FALSE when manually setting it up with data.frame()) and block must be whole numbers starting from 1 in increments of 1. Subsequently there is an example for the cmc data

    cmc_prep
    ##                type block
    ## age            cont     5
    ## wifeEdu         ord     4
    ## husbEdu         ord     4
    ## nrChild       count     1
    ## wifeRel         bin     4
    ## wifeWork        bin     4
    ## husbOcc       categ     4
    ## solIndex        ord     3
    ## mediaExp        bin     2
    ## contraceptive categ     1

    The prep_coxwer function allows to define the variable frame interactively.

    cmc_prep <- prep_coxwer(cmc)

    Once the var.frame is set up one can fit the model by

    data(cmc)
    rescmc <- coxwer(var.frame=cmc_prep, data=cmc)

    Further arguments to coxwer

    coxwer allows to specify further arguments:

    Depending on the type of variable, coxwer can use different univariate models

    Predictors we treat as metric or as ordered/unordered factors (dummy–treatment–coding by default).

    Using exponential families with canonical links assures that properties of conditional Gaussian graphs are approximately retained even when fitted with the CW procedure.

    Worked Example

    We now fit the model to the cmc data by using the formula interface.

    data(cmc)
    rescmc <- coxwer(contraceptive + nrChild ~ mediaExp ~ solIndex ~ 
                     wifeRel + wifeWork + husbOcc + wifeEdu + husbEdu ~ age, 
                     data=cmc)
    ## Some models for continuous/metric variables are not further specified. Ordinary least squares estimation is used by default. You can specify a different model type with the 'vartype' argument.
    ## TARGET: contraceptive (multinomial logit model)
    ## TARGET: nrChild (ordinary least squares model)
    ## TARGET: mediaExp (binomial logit model)
    ## TARGET: solIndex (proportional odds logit model)
    ## TARGET: wifeRel (binomial logit model)
    ## TARGET: wifeWork (binomial logit model)
    ## TARGET: husbOcc (multinomial logit model)
    ## TARGET: wifeEdu (proportional odds logit model)
    ## TARGET: husbEdu (proportional odds logit model)

    S3 methods are available. Print prints the adjacency matrix.

    print(rescmc) 
    ## Adjacency Matrix:
    ## 
    ##                   1  2  3  4  5  6  7  8  9 10
    ##  1 contraceptive  0  1  0  0  0  0  0  0  0  0
    ##  2 nrChild        1  0  0  0  0  0  0  0  0  0
    ##  3 mediaExp       0  0  0  0  0  0  0  0  0  0
    ##  4 solIndex       0  0  1  0  0  0  0  0  0  0
    ##  5 wifeRel        0  1  0  1  0  0  1  1  1  0
    ##  6 wifeWork       0  1  0  0  0  0  0  1  0  0
    ##  7 husbOcc        0  0  0  1  1  0  0  1  1  0
    ##  8 wifeEdu        1  1  1  1  1  0  1  0  1  0
    ##  9 husbEdu        0  0  0  1  0  0  1  1  0  0
    ## 10 age            1  1  1  1  1  0  1  1  1  0

    It has to be read the following way: rows encode whether a directed edge from a row variable to a column variable exists. Then the entry is 1, an edge from one variable to another; columns encode whether the variable has an edge coming towards it from a row. If it is 0 no edge exists. In the example, column ten (for age) has only 0 in the column as it is purely explanatory, so no edges lead to it. Conversely, it has edges that lead to all variables but wifeWork (seen in row 10). wifeWork is not explained by anything here (columns) but is associated with nrChild, and wifeEdu. Note that the adjacency matrix needs not be symmetric.

    One can summarize the object by extracting the model summaries for specific target variables by name or for all with target="all". If no target is given, nothing is returned.

    summary(rescmc,target=c("contraceptive","nrChild"))
    ## ---------- Summary for target variable: contraceptive ----------
    ## Call:
    ## nnet::multinom(formula = y ~ nrChild + wifeEdu + age + I(poly(nrChild, 
    ##     2)[, 2]), data = data, Hess = TRUE, model = TRUE, trace = FALSE, 
    ##     MaxNWts = dim(data)[1] + 1)
    ## 
    ## Coefficients:
    ##   (Intercept) nrChild wifeEdu2 wifeEdu3 wifeEdu4      age
    ## 2      -2.293  0.3578   0.8821   1.8373    3.096 -0.04836
    ## 3       1.745  0.3558   0.2366   0.6443    1.337 -0.11909
    ##   I(poly(nrChild, 2)[, 2])
    ## 2                   -25.60
    ## 3                   -26.44
    ## 
    ## Std. Errors:
    ##   (Intercept) nrChild wifeEdu2 wifeEdu3 wifeEdu4     age
    ## 2      0.5139 0.04444   0.4047   0.3870   0.3817 0.01212
    ## 3      0.3756 0.04058   0.2482   0.2453   0.2462 0.01137
    ##   I(poly(nrChild, 2)[, 2])
    ## 2                    3.570
    ## 3                    3.224
    ## 
    ## Value/SE (Wald statistics):
    ##   (Intercept) nrChild wifeEdu2 wifeEdu3 wifeEdu4     age
    ## 2      -4.462   8.051   2.1794    4.748    8.112  -3.991
    ## 3       4.646   8.768   0.9532    2.627    5.433 -10.476
    ##   I(poly(nrChild, 2)[, 2])
    ## 2                   -7.171
    ## 3                   -8.202
    ## 
    ## Residual Deviance: 2708 
    ## AIC: 2736 
    ## ---------- Summary for target variable: nrChild ----------
    ## 
    ## Call:
    ## stats::glm(formula = y ~ contraceptive + wifeRel + wifeWork + 
    ##     wifeEdu + I(poly(age, 2)), family = currfamily, data = data)
    ## 
    ## Deviance Residuals: 
    ##    Min      1Q  Median      3Q     Max  
    ## -5.801  -1.057  -0.167   0.929  10.402  
    ## 
    ## Coefficients:
    ##                  Estimate Std. Error t value Pr(>|t|)
    ## (Intercept)       2.10561    0.22420    9.39  < 2e-16
    ## contraceptive2    1.02986    0.13158    7.83  9.5e-15
    ## contraceptive3    0.99954    0.11312    8.84  < 2e-16
    ## wifeRel1          0.82276    0.14000    5.88  5.2e-09
    ## wifeWork1         0.52631    0.11164    4.71  2.7e-06
    ## wifeEdu2         -0.00114    0.18427   -0.01  0.99508
    ## wifeEdu3         -0.29816    0.18113   -1.65  0.09996
    ## wifeEdu4         -1.11024    0.17797   -6.24  5.8e-10
    ## I(poly(age, 2))1 53.31834    1.94461   27.42  < 2e-16
    ## I(poly(age, 2))2 -6.64633    1.88877   -3.52  0.00045
    ## 
    ## (Dispersion parameter for gaussian family taken to be 3.349)
    ## 
    ##     Null deviance: 8188.4  on 1472  degrees of freedom
    ## Residual deviance: 4899.6  on 1463  degrees of freedom
    ## AIC: 5973
    ## 
    ## Number of Fisher Scoring iterations: 2

    We see that the model for nrChild is an OLS model (due to autodetection). But since nrChild is a count it might be better to tell the function that.

    data(cmc)
    rescmc <- coxwer(contraceptive + nrChild ~ mediaExp ~ solIndex ~ 
                     wifeRel + wifeWork + husbOcc + wifeEdu + husbEdu ~ age, 
                     vartype=c("cate","count","bin","ord","bin","bin",
                               "ord","ord","ord","metric"), 
                     data=cmc)
    ## Some models for continuous/metric variables are not further specified. Ordinary least squares estimation is used by default. You can specify a different model type with the 'vartype' argument.
    ## TARGET: contraceptive (multinomial logit model)
    ## TARGET: nrChild (poisson loglinear model)
    ## TARGET: mediaExp (binomial logit model)
    ## TARGET: solIndex (proportional odds logit model)
    ## TARGET: wifeRel (binomial logit model)
    ## TARGET: wifeWork (binomial logit model)
    ## TARGET: husbOcc (proportional odds logit model)
    ## TARGET: wifeEdu (proportional odds logit model)
    ## TARGET: husbEdu (proportional odds logit model)
    summary(rescmc,target=c("contraceptive","nrChild"))
    ## ---------- Summary for target variable: contraceptive ----------
    ## Call:
    ## nnet::multinom(formula = y ~ nrChild + wifeEdu + age + I(poly(nrChild, 
    ##     2)[, 2]), data = data, Hess = TRUE, model = TRUE, trace = FALSE, 
    ##     MaxNWts = dim(data)[1] + 1)
    ## 
    ## Coefficients:
    ##   (Intercept) nrChild wifeEdu2 wifeEdu3 wifeEdu4      age
    ## 2      -2.293  0.3578   0.8821   1.8373    3.096 -0.04836
    ## 3       1.745  0.3558   0.2366   0.6443    1.337 -0.11909
    ##   I(poly(nrChild, 2)[, 2])
    ## 2                   -25.60
    ## 3                   -26.44
    ## 
    ## Std. Errors:
    ##   (Intercept) nrChild wifeEdu2 wifeEdu3 wifeEdu4     age
    ## 2      0.5139 0.04444   0.4047   0.3870   0.3817 0.01212
    ## 3      0.3756 0.04058   0.2482   0.2453   0.2462 0.01137
    ##   I(poly(nrChild, 2)[, 2])
    ## 2                    3.570
    ## 3                    3.224
    ## 
    ## Value/SE (Wald statistics):
    ##   (Intercept) nrChild wifeEdu2 wifeEdu3 wifeEdu4     age
    ## 2      -4.462   8.051   2.1794    4.748    8.112  -3.991
    ## 3       4.646   8.768   0.9532    2.627    5.433 -10.476
    ##   I(poly(nrChild, 2)[, 2])
    ## 2                   -7.171
    ## 3                   -8.202
    ## 
    ## Residual Deviance: 2708 
    ## AIC: 2736 
    ## ---------- Summary for target variable: nrChild ----------
    ## 
    ## Call:
    ## stats::glm(formula = y ~ contraceptive + wifeRel + wifeWork + 
    ##     wifeEdu + age + I(poly(age, 2)[, 2]), family = currfamily, 
    ##     data = data)
    ## 
    ## Deviance Residuals: 
    ##    Min      1Q  Median      3Q     Max  
    ## -3.362  -0.648  -0.103   0.534   3.591  
    ## 
    ## Coefficients:
    ##                      Estimate Std. Error z value Pr(>|z|)
    ## (Intercept)          -1.22834    0.11021  -11.15  < 2e-16
    ## contraceptive2        0.33405    0.03952    8.45  < 2e-16
    ## contraceptive3        0.34824    0.03575    9.74  < 2e-16
    ## wifeRel1              0.26392    0.04437    5.95  2.7e-09
    ## wifeWork1             0.17109    0.03505    4.88  1.1e-06
    ## wifeEdu2              0.01222    0.05007    0.24     0.81
    ## wifeEdu3             -0.07574    0.04964   -1.53     0.13
    ## wifeEdu4             -0.35135    0.04961   -7.08  1.4e-12
    ## age                   0.05817    0.00212   27.48  < 2e-16
    ## I(poly(age, 2)[, 2]) -5.16323    0.62203   -8.30  < 2e-16
    ## 
    ## (Dispersion parameter for poisson family taken to be 1)
    ## 
    ##     Null deviance: 2529.0  on 1472  degrees of freedom
    ## Residual deviance: 1452.1  on 1463  degrees of freedom
    ## AIC: 5530
    ## 
    ## Number of Fisher Scoring iterations: 5

    We can use the igraph facilities to plot the model. After some tweeking and ordering the vertices according to the block structure we have

    plot(rescmc)

    With the predict method we can predict the contraceptive use of three fictious women:

    newd<-data.frame(nrChild=c(4,0,1),age=c(50,23,35),husbOcc=factor(c(1,4,3),ordered=TRUE),mediaExp=factor(c(0,1,1)),solIndex=factor(c(2,4,1),ordered=TRUE),wifeRel=factor(c(1,0,1)),wifeWork=factor(c(0,1,0)),wifeEdu=factor(c(1,4,3),ordered=TRUE),husbEdu=factor(c(1,4,3),ordered=TRUE))
    predict(rescmc,target="contraceptive",newdata=newd,type="prob")
    ## ---------- Predicting target variable: contraceptive ----------
    ## $contraceptive
    ##           1         2         3
    ## 1 9.994e-01 2.482e-04 3.450e-04
    ## 2 1.000e+00 2.107e-07 2.471e-07
    ## 3 2.993e-09 2.639e-01 7.361e-01

    The first two women while being very different are both likely not to use contraceptives, whereas woman 3 is likely using short-term contraceptive.

    References