Probabilistic Graphical Models 1 - Representation
About the Course
This course is the first in a sequence of three. It describes the two basic PGM representations: Bayesian networks, which rely on a directed graph; and Markov networks, which use an undirected graph. The course discusses both the theoretical properties of these representations and their use in practice. The (highly recommended) honors track contains several hands-on assignments on how to represent some real-world problems. The course also presents some important extensions beyond the basic PGM representation, which allow more complex models to be encoded compactly.
# Week 1
# Introduction and Overview
# Welcome
- Required Background
- Basic Probability Theory
- Some programming
- Some algorithms and data structures
- Recommended
- Machine Learning
- Simple optimization
- MATLAB or Octave
# Overview and Motivation
- PGM is a framework for building declarative, probabilistic models of complex domains.
- A model is a declarative representation of our understanding of the world.
- It is a representation, within the computer, of what the variables are and how they interact with one another.
- Because these are probabilistic models, they deal with Uncertainty.
- Uncertainty comes from
- Partial knowledge of the state of the world.
- Noisy observations.
- Phenomena not covered by our model.
- Inherent stochasticity.
- We can view the world as inherently stochastic, e.g. Quantum Theory.
- Probability Theory is a framework that allows us to deal with Uncertainty in a principled way and brings important, valuable tools to bear. It provides us with
- Declarative representation with clear semantics.
- Powerful reasoning patterns.
- Established learning methods.
- The word graphical is used here in the computer-science sense: probabilistic graphical models are a synthesis of ideas from probability theory and statistics with ideas from computer science.
- The idea here is to use graphs to represent complicated systems that involve a large number of variables.
- Graphical Models
- Bayesian network
- It uses a directed graph as its intrinsic representation.
- The random variables are represented by nodes in the graph.
- The edges represent the probabilistic connections between those random variables.
- Markov network
- It uses an undirected graph.
- The graphical representation gives us
- An intuitive and compact data structure.
- Efficient reasoning using general-purpose algorithms.
- Sparse parameterization, which allows us to
- elicit the parameters feasibly, by hand, from a domain expert.
- learn the parameters automatically from data.
- This is a very general framework, applied to numerous domains and problems.
# Distributions
- Joint Distribution
- Independent parameters are those parameters whose values are not completely determined by the values of other parameters; e.g., a joint distribution over $n$ assignments has $n - 1$ independent parameters, because the entries must sum to 1.
- We can condition a probability distribution on a particular observation.
- Marginalization is an operation that takes a probability distribution over a set of variables and produces a probability distribution over a subset of them, by summing out the remaining variables.
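A minimal numerical sketch of these ideas, assuming an arbitrary made-up joint distribution over three variables A, B, C (the names and numbers are not from the course); it shows counting independent parameters, conditioning on an observation, and marginalizing.

```python
import numpy as np

# Hypothetical joint distribution P(A, B, C): A and B binary, C with three
# values. The 2*2*3 = 12 entries must sum to 1, so the table has
# 12 - 1 = 11 independent parameters.
rng = np.random.default_rng(0)
joint = rng.random((2, 2, 3))
joint /= joint.sum()                          # normalize so the entries sum to 1

# Conditioning on an observation, e.g. C = 0: keep only the consistent
# entries and renormalize.
cond = joint[:, :, 0] / joint[:, :, 0].sum()  # P(A, B | C = 0)

# Marginalization: sum out variables to obtain a distribution over a subset.
p_a = joint.sum(axis=(1, 2))                  # P(A), summing out B and C
print(p_a, p_a.sum())                         # a valid distribution: sums to 1
```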
# Factors
- A factor is a function (or table) that takes a set of arguments (random variables) and assigns a value to every joint assignment of those arguments.
- The scope of a factor is the set of arguments it takes.
- Examples of factors:
- Joint Distribution
- Unnormalized measure
- Conditional Probability Distribution (CPD)
- Operation on Factors
- Factor Product
- Takes two factors, $\phi_1$ and $\phi_2$, and multiplies them together; the scope of the resulting factor is the union of the two scopes.
- Factor Marginalization
- To marginalize out a variable from a factor, sum together all the rows that agree on the remaining variables.
- Factor Reduction
- Restricts the factor to the rows consistent with a particular value of one of its variables, e.g. $C = c^1$.
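A minimal code sketch of these three operations, assuming binary variables and representing a factor as a pair `(scope, table)`; the factors `phi1`, `phi2` and all numeric values are made up for illustration.

```python
from itertools import product

def factor_product(f1, f2):
    """phi1 * phi2: the scope of the result is the union of the two scopes."""
    (s1, t1), (s2, t2) = f1, f2
    scope = s1 + tuple(v for v in s2 if v not in s1)
    table = {}
    for assign in product((0, 1), repeat=len(scope)):          # binary variables
        a = dict(zip(scope, assign))
        table[assign] = t1[tuple(a[v] for v in s1)] * t2[tuple(a[v] for v in s2)]
    return scope, table

def factor_marginalize(f, var):
    """Sum out 'var': add together all rows that agree on the other variables."""
    scope, table = f
    new_scope = tuple(v for v in scope if v != var)
    out = {}
    for assign, val in table.items():
        key = tuple(x for v, x in zip(scope, assign) if v != var)
        out[key] = out.get(key, 0.0) + val
    return new_scope, out

def factor_reduce(f, var, value):
    """Keep only the rows where 'var' takes the given value; drop 'var'."""
    scope, table = f
    new_scope = tuple(v for v in scope if v != var)
    out = {}
    for assign, val in table.items():
        a = dict(zip(scope, assign))
        if a[var] == value:
            out[tuple(a[v] for v in new_scope)] = val
    return new_scope, out

# Example: phi1(A, B) * phi2(B, C), then sum out B, then reduce on C = 1.
phi1 = (("A", "B"), {(0, 0): 0.5, (0, 1): 0.8, (1, 0): 0.1, (1, 1): 0.0})
phi2 = (("B", "C"), {(0, 0): 0.5, (0, 1): 0.7, (1, 0): 0.1, (1, 1): 0.2})
prod = factor_product(phi1, phi2)           # scope (A, B, C)
print(factor_marginalize(prod, "B"))        # scope (A, C)
print(factor_reduce(prod, "C", 1))          # rows with C = 1, scope (A, B)
```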
- Why factors?
- Fundamental building block for defining distributions in high-dimensional spaces.
- Set of basic operations for manipulating these probability distributions.
# Bayesian Network Fundamentals
# Semantics & Factorizations
- A Bayesian network is a directed acyclic graph (DAG) $G$ whose nodes represent the random variables $X_1, \dots, X_n$. For each node $X_i$, there is a CPD that denotes the dependence of $X_i$ on its parents in the graph, i.e., $P(X_i | \text{Par}_G(X_i))$.
- It represents a Joint Probability Distribution via the Chain Rule for Bayesian networks: $$P(X_1, \dots, X_n) = \prod_i P(X_i | \text{Par}_G(X_i)).$$
- A distribution $P$ factorizes over $G$, where $G$ is a graph over $X_1, \dots, X_n$, if $$P(X_1, \dots, X_n) = \prod_i P(X_i | \text{Par}_G(X_i)).$$
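A minimal sketch of the chain rule on a hypothetical three-node chain $X_1 \rightarrow X_2 \rightarrow X_3$ with binary variables; the CPD numbers are made up, and the only point is that multiplying each node's CPD by its parents' yields a legal joint distribution.

```python
# Hypothetical chain X1 -> X2 -> X3; all CPD numbers are made up.
p_x1 = {0: 0.6, 1: 0.4}                              # P(X1), no parents
p_x2 = {(0, 0): 0.7, (1, 0): 0.3,                    # P(X2 = x2 | X1 = 0)
        (0, 1): 0.2, (1, 1): 0.8}                    # P(X2 = x2 | X1 = 1)
p_x3 = {(0, 0): 0.9, (1, 0): 0.1,                    # P(X3 = x3 | X2 = 0)
        (0, 1): 0.4, (1, 1): 0.6}                    # P(X3 = x3 | X2 = 1)

def joint(x1, x2, x3):
    """Chain rule: P(X1, X2, X3) = P(X1) * P(X2 | X1) * P(X3 | X2)."""
    return p_x1[x1] * p_x2[(x2, x1)] * p_x3[(x3, x2)]

# Sanity check: non-negative and sums to 1 over all assignments.
total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
print(total)   # 1.0
```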
# Reasoning Patterns
- Causal Reasoning
- The reasoning goes in the causal direction, i.e. from causes (top) down to effects (bottom).
- Evidential Reasoning
- The reasoning goes from the bottom up, i.e. from effects back to causes.
- Given evidence about a child, we ask what happens to the probabilities of its parents.
- Intercausal Reasoning
- It is the flow of information between two causes of a single effect.
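A small numerical illustration of intercausal reasoning, often called explaining away, on a hypothetical v-structure $X_1 \rightarrow X_3 \leftarrow X_2$ with made-up binary CPDs: observing the effect $X_3 = 1$ raises $P(X_1 = 1)$ (evidential reasoning), and additionally observing the other cause $X_2 = 1$ pushes it back down.

```python
from itertools import product

# Hypothetical v-structure X1 -> X3 <- X2; all CPD numbers are made up.
p_x1 = {0: 0.6, 1: 0.4}                                            # P(X1)
p_x2 = {0: 0.7, 1: 0.3}                                            # P(X2)
p_x3_is_1 = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.6, (1, 1): 0.9}   # P(X3=1 | X1, X2)

def joint(x1, x2, x3):
    """Chain rule: P(X1, X2, X3) = P(X1) * P(X2) * P(X3 | X1, X2)."""
    p3 = p_x3_is_1[(x1, x2)]
    return p_x1[x1] * p_x2[x2] * (p3 if x3 == 1 else 1 - p3)

def prob_x1_is_1(evidence):
    """P(X1 = 1 | evidence) by summing the joint and renormalizing."""
    num = den = 0.0
    for x1, x2, x3 in product((0, 1), repeat=3):
        assign = {"X2": x2, "X3": x3}
        if any(assign[v] != val for v, val in evidence.items()):
            continue
        p = joint(x1, x2, x3)
        den += p
        if x1 == 1:
            num += p
    return num / den

print(prob_x1_is_1({}))                      # prior P(X1 = 1) = 0.4
print(prob_x1_is_1({"X3": 1}))               # evidential reasoning: goes up (~0.68)
print(prob_x1_is_1({"X3": 1, "X2": 1}))      # explaining away: comes back down (~0.55)
```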
# Flow of Probabilistic Influence
- Probabilistic influence is symmetric: if $X$ can influence $Y$, then $Y$ can influence $X$.
- When can $X$ influence $Y$?
- $X \rightarrow Y$ (X is a parent of Y).
- $X \leftarrow Y$ (X is a child of Y).
- $X \rightarrow W \rightarrow Y$.
- $X \leftarrow W \leftarrow Y$.
- $X \leftarrow W \rightarrow Y$ (common cause $W$ which has two effects $X$ and $Y$).
- A v-structure is one in which two causes have a joint effect, i.e. $X \rightarrow W \leftarrow Y$.
- In a v-structure, $X$ cannot influence $Y$ as long as neither $W$ nor any of its descendants is observed.
- A trail is a sequence of nodes that are connected to each other by single edges in the graph.
- An Active Trail (in the absence of evidence) is a trail that has no v-structures, i.e. a trail $X_1 - \dots - X_k$ is active if it contains no v-structure $X_{i-1} \rightarrow X_i \leftarrow X_{i+1}$.
- When can $X$ influence $Y$ given evidence $Z$?
- A trail $X_1 - \dots - X_k$ is active given $Z$ if:
- for any v-structure $X_{i-1} \rightarrow X_i \leftarrow X_{i+1}$ we have that $X_i$ or one of its descendants $\in Z$.
- no other $X_i$ is in $Z$.
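A minimal sketch of this active-trail test. It assumes the graph is given as a dict mapping each node to its list of parents, and that the input sequence really is a trail in the graph; the example network and node names (D, I, G, L) are hypothetical.

```python
def descendants(graph, node):
    """All nodes reachable from 'node' by following child edges (excluding node)."""
    children = {v: [u for u, ps in graph.items() if v in ps] for v in graph}
    seen, stack = set(), list(children[node])
    while stack:
        u = stack.pop()
        if u not in seen:
            seen.add(u)
            stack.extend(children[u])
    return seen

def is_active_trail(graph, trail, Z):
    """Check whether the trail X_1 - ... - X_k is active given evidence set Z."""
    Z = set(Z)
    for i in range(1, len(trail) - 1):
        prev, cur, nxt = trail[i - 1], trail[i], trail[i + 1]
        v_structure = prev in graph[cur] and nxt in graph[cur]   # prev -> cur <- nxt
        if v_structure:
            # Active only if cur or one of its descendants is observed.
            if cur not in Z and not (descendants(graph, cur) & Z):
                return False
        elif cur in Z:
            # Any non-v-structure node on the trail must be unobserved.
            return False
    return True

# Hypothetical graph D -> G <- I, G -> L (each node maps to its parents).
graph = {"D": [], "I": [], "G": ["D", "I"], "L": ["G"]}
print(is_active_trail(graph, ["D", "G", "I"], Z=set()))   # False: v-structure blocks
print(is_active_trail(graph, ["D", "G", "I"], Z={"L"}))   # True: descendant of G observed
print(is_active_trail(graph, ["D", "G", "L"], Z={"G"}))   # False: G observed mid-trail
```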
# Bayesian Networks: Independencies
# Conditional Independence
- What is Independence?
- For events $\alpha, \beta$, $P \models \alpha \bot \beta$ if:
- $P(\alpha, \beta) = P(\alpha)P(\beta)$
- $P(\alpha | \beta) = P(\alpha)$
- $P(\beta | \alpha) = P(\beta)$
- For random variables $X, Y$, $P \models X \bot Y$ if:
- $P(X, Y) = P(X)P(Y)$
- $P(X | Y) = P(X)$
- $P(Y | X) = P(Y)$
- What is Conditional Independence?
- For (sets of) random variables $X, Y, Z$, $P \models (X \bot Y | Z)$ if:
- $P(X, Y | Z) = P(X | Z)P(Y | Z)$
- $P(X | Y, Z) = P(X | Z)$
- $P(Y | X, Z) = P(Y | Z)$
- $P(X, Y, Z) \propto \phi_1 (X, Z) \phi_2 (Y, Z)$
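A small numerical check of the last characterization, assuming binary variables and arbitrary made-up factors: build a joint proportional to $\phi_1(X, Z)\,\phi_2(Y, Z)$ and verify that $P(X, Y | Z) = P(X | Z)\,P(Y | Z)$ for every value of $Z$.

```python
import numpy as np

# Arbitrary non-negative factors over binary variables (values are made up).
rng = np.random.default_rng(1)
phi1 = rng.random((2, 2))                 # phi1(X, Z)
phi2 = rng.random((2, 2))                 # phi2(Y, Z)

# P(X=x, Y=y, Z=z) proportional to phi1[x, z] * phi2[y, z].
joint = np.einsum("xz,yz->xyz", phi1, phi2)
joint /= joint.sum()

for z in (0, 1):
    p_xy_given_z = joint[:, :, z] / joint[:, :, z].sum()     # P(X, Y | Z = z)
    p_x_given_z = p_xy_given_z.sum(axis=1)                   # P(X | Z = z)
    p_y_given_z = p_xy_given_z.sum(axis=0)                   # P(Y | Z = z)
    assert np.allclose(p_xy_given_z, np.outer(p_x_given_z, p_y_given_z))
print("X and Y are conditionally independent given Z for this joint")
```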