Indirect Causes, Dependencies and Causality in Bayesian Networks

Research, Results and Digital Supplements
by Alexander Motzek

Purpose of this website

This website presents research results, interests and digital supplements to existing works on the research areas of Alexander Motzek, PhD candidate at the University of Lübeck, Institute of Information Systems. Research results and interests include the areas of probabilistic graphical models, artificial intelligence, cyber security, mission impact assessments and computer vision

This website further features details and supplements on my PhD thesis entitled ``Indirect Causes, Dependencies and Causality in Bayesian Networks''.

For my personal website, please go to www.motzek.org.

Indirect Causes, Dependencies and Causality in Bayesian Networks

Modeling causal dependencies in complex or time-dependent domains often demands cyclic dependencies. Such cycles arise from local points of views on dependencies where no singular causality is identifiable, i.e., roles of causes and effects are not universally identifiable. Modeling causation instead of correlation is of utmost importance, which is why Bayesian networks are frequently used to reason under uncertainty. Bayesian networks are probabilistic graphical models and allow for a causal modeling approach with locally specifiable and interpretable parameters, but are not defined for cyclic graph structures. If Bayesian networks are to be used for modeling uncertainties, cycles are eliminated with dynamic Bayesian networks, eliminating cycles over time. However, we show that eliminating cycles over time eliminates an anticipation of indirect influences as well, and enforces an infinitesimal resolution of time. Without a ``causal design,'' i.e., without representing direct and indirect causes appropriately, such networks return spurious results.

In particular, the main novel contributions of this thesis can be summarized as follows. By considering specific properties of local conditional probability distributions, we show that a novel form of probabilistic graphical models rapidly adapts itself to a specific context at every timestep and, by that, correctly anticipates indirect influences under an unrestricted time granularity, even if cyclic dependencies arise.

PhD Thesis: Digital Editions

If you prefer reading on the screen, want to search something or view all figures on more detail, you can download the following PDFs, which represent my thesis at the state of final publishment.

In the printerfriendly version, figures containing large amounts of transparent layers have been flattened and rendered at 600dpi. Moreover, a slightly brighter portrait has been used in the curicuum vitae.

I defended my thesis successfully on the 19th of October, 2016. Here are my slides.

Publications

To provide ease of access to my publications that were created as part of my thesis, I provide all PDFs of published and accepted papers here.

The following journal articles and papers have been submitted and are currently under review. As these are yet unpublished I provide these PDFs here for reference. Note that these articles are not referenced in the bibliography of my thesis.

History

  • November 11th, 2016: My Journal article "Context- and Bias-Free Probabilistic Mission Impact Assessment" was accepted for publication at CoSe!
  • October 19th, 2016: 10:00 Room Kaminsky, University of Lübeck: Defense of my PhD.
  • October 18th, 2016: I revised my article submitted to JAIR.
  • October 10th, 2016: I revised my article submitted to Computers and Security.
  • October 1st, 2016: I received great feedback by Computers and Security.
  • September 11th, 2016: I got accepted at the NATO IST-148 Symposium! Slides and CRV added above.
  • September 9-19th, 2016: I am on vacation on La Gomera, Canary Islands, Spain :)
  • August 29th, 2016: I presented my paper at the CBI 2016. Slides added above!
  • August 19th, 2016: I submitted my manuscript on learning ADBNs to the Journal of Bayesian Analysis.
  • August 16th, 2016: We have been accepted to NordSec 2016!
  • August 14th, 2016: I wrote and submitted a novel research paper on Probabilistic Mission Defense and Assurance to the NATO IST-148 Symposium on Cyber Defence Situation Awareness. The aim is to assure mission success without sacrificing missions for security.
  • August 3rd, 2016: I received great feedback from Computers & Security providing helpful comments for the revision of "Context- and Bias-Free Probabilistic Mission Impact Assessment." The link above has been updated with the revision.
  • July 5, 2016: I coauthored another paper on "Towards an Automated and Dynamic Risk Management Response System." Link has been added above.
  • July 11, 2016: My doctoral examination procedure is now official!
  • June 22, 2016: Dissertation submitted.

Applications for Exact and Approximate Inference

To validate the theoretical complexity analysis of exact and approximate inference, derived in Chapter 2, all algorithms have been implemented in plain C to perform inference in the taintedness domain. The application generates random instances of the taintedness domain for N employees, and measures the computation time for solving filtering and smoothing problems using exact inference and compares them to the inference accuracy and computation time of approximate inference. As the taintedness domain is represented by a dense intra-timeslice ADBN, this application can be used for general inference using dense intra-timeslice ADBNs.

In the following, I provide all applications as source code and compiled binaries to validate my obtained results. Further, I provide all generated data from which the figures were generated. For every binary I provide a sample output(s) of the application for adhoc reference.

As exact inference is computationally expensive and has large memory requirements for storing results, i.e., joint probability distributions, inference is implemented in plain C, without object-orientation, and is based on various bit-maps and bit-flags involving various bit-shifting and bit-masking operation. In order to understand the implementation it is crucial to understand the employed bit-encoding:

In order to encode observations, joint probability distribution, and conditional probability distributions, an encoding of an instantiation of all random variables in a timeslice is required. An instantiation of a timeslice is encoded as an unsigned long for N<6, and as an unsigned long long for N>=6. On the employed windows machine, these are 32Bit for N<6 and 64Bit for N>=6. Every bit represent one boolean random variable in the following ordering:

instantiation = [XXXXXX][A23A13A03][A32A12A02][A31A21A01][A30A20A10]
                [allX][AsOfX3][AsOfX2][AsOfX1][AsOfX0]

Likewise, every conditional probability distribution P(X|\vec{Z}) is encoded by an enumerate random variable identifier for identifying X, and a bitvector representing \vec{Z}, where \vec{Z} is encoded similiarly as

dep = [otherX][activatorsOfX][Xpreviously]

Similarly, observations of a timeslice are encoded as bit masking vectors. A "true"-observation of the i-th random variable (according to the enumeration of an instantiation) is encoded in a bitmasking vector "evidenceTrue" by a logical 1 on the i-th bit of evidenceTrue. Likewise, a "false"-observation of the i-th random variable is encoded in a bitmasking vector "evidenceFalse" by a logical 0 on the i-th bit. Therefore, every instantiation inst is conforms with the made observations evidenceTrue and evidenceFalse, if

(inst == inst|evidenceTrue) && (inst == inst&evidenceFalse) .

Note: Exact inference in dense intra-timeslice ADBNs is intractable for N>5 and approximate inference techniques are required. To perform approximate inference in dense ADBNs with more than 64 random variables, the type definition used to represent an instantiation must be increased to contain and appropriate amount of bits.

Different compilation flags and parameters are used to generate evaluations shown in Figures 2.3-2.8. The following sections describe these situations in order to recreate and validate the obtained results. I provide compiled binaries for x64 windows machines, compiled using Visual Studio 2015 and the complete Visual Studio Solution:

Figures 2.3-2.6 (Exact Inference, N=4)

//#define ONLYPARTICLE 1
#define N 4
Using argument 0, this application evaluates exact inference as in Figures 2.3-2.5. It creates 50 experiments, where online filtering and offline smoothing problems are solved over 40 timeslices in the taintedness domain with four employees (16 random variables). Each experiment creates one tab-separated file containing various measurements of the inference process.
Using argument 1, this application evaluates approximate inference against exact inference in terms of accuracy and computation time. Every execution creates 10 experiments, evaluates over 250 timeslices. It evaluates SIS and SIR and scales from 1000 to 1 000 000 samples. Four employees evaluated in the taintedness domain. Each experiment creates a tab-separated file for each number of samples, for each approximation techniques. Each file contains various measurements of the exact and approximate inference process.

Figure 2.8 (Exact Inference, N=5)

//#define ONLYPARTICLE 1
#define N 5
Only argument 1 (approximate inference) is supported. It evaluates approximate inference using SIR with 1 000 000 and 10 000 000 samples against exact inference (filtering). Performs 10 experiments. Warning: Every experiments takes about 2 hours. Creates the same file structure as above.

Figure 2.7 (Approximate Inference, 9 to 49 random variables)

#define ONLYPARTICLE 1
//define desired number of employees
#define N 3 //or 4,5,6,7
Only argument 1 (approximate inference) is supported. This evaluation is targeted to test the computational complexity in terms of number of random variables using SIR. The accuracy is not compared to exact inference, as exact inference would take about 6 days for 36 random variables. Every execution generates 10 experiments. In every experiment, 50 online filtering problems (50 timeslices) are solved. Creates the same file structure as above, but measurements on exact inference are blank.

Applications for Learning ADBNs

To validate derived algorithms for learning ADBNs in Chapter 3, the derived learning algorithm has been implemented in plain C. The application creates instances of the taintedness domain with N employees (as in the previous application) and generates datasets over long periods of time. Consecutively, instantiations from these datasets are removed to create incomplete datasets with hidden variables. Following, various forms of ADBNs are learned from these datasets. Every learned model is compared to the original models based on the difference of parameters and their ability to achieve the same inference results.

The implementation follows the same bitvector and bitmaskings encodings as previously. Various different compilation settings are required to evaluate different situations of incomplete data and various ADBN models. The following sections discuss these settings to obtain results as shown in Figure 3.1-3.3. Again, I provide compiled x64 windows machine binaries obtained from the following Visual Studio Solution:

Figure 3.1 (Learning from complete datasets)

//General settings
#ifdef EXPERIMENT_FULLY_OBSERVED
#define N 4
#define defMaxLearnRounds 1
#define OBSERVE_TRUE_ACTIVATOR_FOR_LEARNING 1
//#define ACTIVATORMAP 0.5
#define tMaxLearnData 50000
#endif
In this setting, 50 000 datapoints (timeslices) are generated for learning. The complete dataset is used for learning different forms of ADBNs, discussed in the following settings. Four employees are simulated in the taintedness domain. Every learned model is evaluated against the original model over 200 timeslices. One needs to expect that a learned model comes to the same conclusions, i.e., similar filtering and smoothing distributions, as the original model. This difference/similarity is measured by the Hellinger distance, where a Hellinger distance of 0 exists between two probability distributions that account exactly the same probabilities to all instantiations. The maximum Hellinger distance of 1 exists between two distributions where one distributions neglects the probability of all instantiation which are said to be possible by the other distribution.
Each experiment creates four tab-separated files: The convergence of learned parameters of each model towards the original parameters (two files), evaluation of filtering results over multiple timeslices for each model (combined in one file), and evaluation of smoothing results multiple timeslices for each model (combined in one file). Note that the convergence results only contain one line of measurements (multiple learn-rounds in complete datasets are not required, as no missing values must be restored).
#define ENFORCE_ACTIVATOR_FOR_DIAGONAL 1
//#define USE_TREE_STRUCTURED_DIAGONAL 1
#define EXPERIMENT_FULLY_OBSERVED 1
Learns a cyclic intra-timeslice ADBN and a diagonal inter-timeslice ADBN from complete datasets.
#define ENFORCE_ACTIVATOR_FOR_DIAGONAL 1
#define USE_TREE_STRUCTURED_DIAGONAL 1
#define EXPERIMENT_FULLY_OBSERVED 1
Learns a cyclic intra-timeslice ADBN and a tree-structured ADBN from complete datasets. The tree-structured ADBN is implemented by a modification to the method that handles conditional probability distributions for every random variable.
//#define ENFORCE_ACTIVATOR_FOR_DIAGONAL 1
//#define USE_TREE_STRUCTURED_DIAGONAL 1
#define EXPERIMENT_FULLY_OBSERVED 1
Learns a cyclic intra-timeslice ADBN and a diagonal inter-timeslice ADBN from complete datasets, but the diagonal model is not bound to activator constraints. As discussed in thesis, no significant differences exist between an activator-bound and activator-unbound diagonal model, as indirect influences are completely ignored.
#define ENFORCE_ACTIVATOR_FOR_DIAGONAL 1
//#define USE_TREE_STRUCTURED_DIAGONAL 1
#define EXPERIMENT_FULLY_OBSERVED 1
#define defMaxLearnRounds 0
For reference, results of completely random CPDs are generated by this setting. CPDs are randomly instantiated at first and are consecutively replaced by the learned CPD parameters. Here, no learning is performed and CPDs are left in their initial random instantiation.

Figures 3.2 & 3.3 (Learning from incomplete datasets)

//GENERAL SETTINGS!
#define ENFORCE_ACTIVATOR_FOR_DIAGONAL 1
//#define USE_TREE_STRUCTURED_DIAGONAL 1
Figures 3.2 and 3.3 show evaluations of the learning procedure from incomplete datasets. To generate these evaluations, the abovementioned global flags are required. In every experiment, each generated dataset is manipulated and certain instantiations are excluded. The following settings evaluate different forms of exclusions, as discussed in the thesis. For these experiments, 10 000 datapoints are used, obtained from multiple instances of the taintedness domain with three employees.
When dealing with incomplete datasets, multiple learn-rounds are required, where learned parameters converge to the original parameters. As previously, every learned model is evaluated against the original model over 200 timeslices. One needs to expect that a learned model comes to the same conclusions, i.e., similar filtering and smoothing distributions, as the original model. This difference/similarity is measured by the Hellinger distance, where a Hellinger distance of 0 exists between two probability distributions that account exactly the same probabilities to all instantiations. The maximum Hellinger distance of 1 exists between two distributions where one distributions neglects the probability of all instantiation which are said to be possible by the other distribution. For Figure 3.2, the plots of convergences with fewer learnrounds than 100, have been continued by their latest value, i.e., might have converged lower, but formed a significant plateau. For Figure 3.3, EXPERIMENT_HIDDEN_XA is performed on 40 learn rounds to achieve a similar amount of experiments in the same computation time compared to the other experiments.
#define EXPERIMENT_FULLY_OBSERVED 1
#ifdef EXPERIMENT_FULLY_OBSERVED
#define N 3
#define defMaxLearnRounds 1
#define OBSERVE_TRUE_ACTIVATOR_FOR_LEARNING 1
//#define ACTIVATORMAP 0.5
#define tMaxLearnData 10000
#endif
For a lower-bound reference, this setting evaluates learning from a complete dataset in this setting. Note that only one learning round is required. Each execution performs 50 experiments. Note that N and tMaxLearnData changed from the previous experiments on complete data!
#define EXPERIMENT_FULLY_OBSERVED 1
#define DO_NOT_LEARN 1
For an upper-bound reference, this setting generates results from completely random CPDs reference plot. This means that the difference of multiple randomly generated CPDs towards the original CPDs is measured, and inference results based on these random CPDs are compared to the original models. Each execution performs 50 experiments.
#define EXPERIMENT_HIDDEN_X 1
In this setting, state variables (X) are randomly excluded from the learning set. 25 learn-rounds are performed in each experiments. Each execution performs 25 experiments.
#define EXPERIMENT_HIDDEN_A 1
#define defMaxLearnRounds 25
In this setting, activator random variables are excluded randomly from the learning set. This means that the structure is partially unknown and must be restored. Each execution generates 10 experiments and uses 25 learn-rounds.
#define EXPERIMENT_HIDDEN_A 1
#define defMaxLearnRounds 50
Same as above, but with 50 learn-rounds. No significant difference is evident.
#define EXPERIMENT_HIDDEN_XA 1
#define defMaxLearnRounds 100
This setting is a combination of the three above. State variables (X) and activator random variables are randomly executed from each instantiation, and only regularity assuring instantiations remain. Note that tMaxLearnData was not increased, i.e., still 10000 timeslices are used for learning, from which instantiations are excluded. Effectively, significantly less individual datapoints remain for learning; hence, the lower accuracy evident from Figure 3.3. In this setting, 100 learn-rounds are used, whose results are used in Figure 3.2. Eacg execution generates 10 experiments.
#define EXPERIMENT_HIDDEN_XA 1
#define defMaxLearnRounds 40
Same as above, but for 40 learning rounds. Used for inference accuracy in Figure 3.3, to achieve similar computation time required for performing multiple experiments as in the other experiments.