I am currently reading "Statistical and Inductive Inference by Minimum Message Length" by C.S. Wallace. In this, Wallace gives a fairly informal account of Bayesian Inference which, in the case everything is discrete, is basically as follows:

1. We start with a *prior* probability distribution over a space of models of interest $\Theta$. For each $\theta \in \Theta$, we also know a priori the likelihood $\mathbb{P}(x|\theta)$ of observing data $x$ if the model $\theta$ is "true".

2. We observe some data $x$.

3. We update our prior distribution to the *posterior* one using Bayes' rule; that is, we set $$\mathbb{P}(\theta|x) = \frac{\mathbb{P}(x|\theta) \mathbb{P}(\theta)}{\mathbb{P}(x)}$$ for each $\theta \in \Theta$.
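In the discrete case this update is just a finite computation. Here is a minimal sketch; the two-model coin setup and all the numbers are hypothetical, chosen only for illustration:

```python
# Two candidate models for a coin, with a prior over them.
prior = {"fair": 0.5, "biased": 0.5}       # P(theta)
likelihood = {"fair": 0.5, "biased": 0.9}  # P(x = heads | theta)

# Evidence: P(x) = sum over theta of P(x | theta) * P(theta)
evidence = sum(likelihood[t] * prior[t] for t in prior)

# Bayes' rule: P(theta | x) = P(x | theta) * P(theta) / P(x)
posterior = {t: likelihood[t] * prior[t] / evidence for t in prior}

print(posterior)  # the biased model gains probability after seeing heads
```

Note that in this setting the evidence $\mathbb{P}(x)$ is itself computed by summing over $\Theta$, which already hints at the joint-space construction asked about below.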

In the case that $\Theta$ is continuous, the process is essentially the same, with probability density functions replacing $\mathbb{P}$ as appropriate.
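Concretely, writing $f$ for the relevant densities, the continuous-case update reads $$f(\theta|x) = \frac{f(x|\theta)\, f(\theta)}{\int_\Theta f(x|\theta')\, f(\theta')\, d\theta'},$$ where the integral in the denominator plays the role of $\mathbb{P}(x)$ above.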

Now, Bayes' rule requires that $\theta$ and $x$ belong to the same probability space. However, our prior distribution describes only models, not data points. As such, it seems technically necessary to construct a new space which allows us to talk simultaneously about the probability of models and of measuring certain data. My question is: how do we do this in general (or how is it usually done in practice)?

This question may be somewhat vague, so I have come up with a more concrete example of the sort of thing I want to do, which seems quite general and useful in practice. Suppose we have:

- A probability space $$(\Theta, \mathcal{F}, \mathbb{P})$$ of models of interest (which the $\theta$'s reside in)
- A measurable space $$(\Omega, \mathcal{G})$$ of data points (which the $x$'s reside in)
- A mapping $\mathbb{P}(\cdot|\cdot) : \mathcal{G} \times \Theta \to [0, 1]$ such that $\mathbb{P}( \cdot | \theta)$ is a probability measure on $(\Omega, \mathcal{G})$ for each $\theta \in \Theta$. (This is our likelihood function.)

We want to use this to somehow come up with a probability $\mathbb{P}'$ on $$(\Theta \times \Omega, \mathcal{F} \times \mathcal{G})$$ (where $\mathcal{F} \times \mathcal{G}$ denotes the product $\sigma$-algebra) in a way that preserves the original behaviour of $\mathbb{P}$. I want to say that we should define $$\mathbb{P}'(A \times B) = \int_A \mathbb{P}(B | \theta) \,d\mathbb{P}(\theta),$$ where $A \in \mathcal{F}$ and $B \in \mathcal{G}$ (which adds the requirement that $\mathbb{P}(B | \cdot)$ be $\mathcal{F}$-measurable for each fixed $B \in \mathcal{G}$), but I see a problem in that not all events in $\mathcal{F} \times \mathcal{G}$ have the form $A \times B$.
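When $\Theta$ and $\Omega$ are finite, the proposed definition reduces to a sum and is easy to sanity-check. A toy example (all names and numbers are invented for illustration):

```python
# Discrete toy version of P'(A x B) = integral over A of P(B | theta) dP(theta).
Theta = ["t1", "t2"]
Omega = ["x1", "x2"]

prior = {"t1": 0.3, "t2": 0.7}   # P on Theta
likelihood = {                    # P(x | theta), a probability measure for each theta
    "t1": {"x1": 0.8, "x2": 0.2},
    "t2": {"x1": 0.4, "x2": 0.6},
}

def joint(A, B):
    """P'(A x B) for A a subset of Theta and B a subset of Omega."""
    return sum(prior[t] * sum(likelihood[t][x] for x in B) for t in A)

# Marginalising out the data recovers the original prior:
assert abs(joint({"t1"}, set(Omega)) - prior["t1"]) < 1e-12
```

In this finite setting every subset of $\Theta \times \Omega$ is a finite disjoint union of singleton rectangles $\{\theta\} \times \{x\}$, so additivity over rectangles pins down $\mathbb{P}'$ on the whole product $\sigma$-algebra; the question above is whether the analogous extension goes through in general.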

Can this be done? Or is there a completely separate way to approach Bayesian inference which avoids all these difficulties?
