The Original Information Bottleneck Paper!


If you (the reader) are wondering about the exclamation mark in the title, it's because we're now reviewing a paper that is pivotal in the field of sensory decoding -- the information bottleneck paper by Tishby et al. A lot of problems in this field are related to efficient information transmission, so it should not come as a surprise that the information bottleneck problem arises from optimizing a compression process. Consider the following dilemma: when compressing, we want to use the smallest number of letters (from a code book) to represent a signal, but we also want the distortion after compression to be minimized. The trade-off between these two desirable features is unavoidable, however, so the question is typically framed in this manner: "how can I minimize the number of letters in the code book given that the distortion must be less than some number $D$?". This leads to the famous rate-distortion function $R(D)$. However, rate-distortion theory requires a pre-assigned distortion function, which assumes that we already know what information is relevant for the given stimulus. This is often not the case. Therefore, in this paper, Tishby et al. proposed adding an additional variable that determines the relevant information we want to keep. Now, let us take a look at the beautiful math they've worked out! I'm assuming some basic knowledge of information theory from the reader; if that is not the case, you can skip over the math (or better yet, go get acquainted with information theory! It is quarantine time, after all.).

Let's first find the number of letters we need. Let $X$ be the signal we want to compress, and $\widetilde{X}$ the compressed signal. The average volume of $X$ is $2^{H(X)}$, where $H(X)$ is the entropy of $X$. When compressing, the average number of elements of $X$ that map to the same letter is $2^{H(X\mid \widetilde{X})}$, where $H(X\mid \widetilde{X})$ is the conditional entropy of $X$ given $\widetilde{X}$. This means that the number of partitions we need is $2^{H(X)}/2^{H(X\mid \widetilde{X})}=2^{I(X;\widetilde{X})}$, where $I(X;\widetilde{X})$ is the mutual information of $X$ and $\widetilde{X}$. $I(X;\widetilde{X})$ is thus a lower bound on the information rate, i.e., the number of bits needed to specify a letter in the code book.
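To make this concrete, here is a minimal numerical sketch (with a made-up joint distribution $p(x,\widetilde{x})$; the numbers are purely illustrative) verifying that the partition count $2^{H(X)}/2^{H(X\mid \widetilde{X})}$ indeed equals $2^{I(X;\widetilde{X})}$:

```python
# A toy check that 2^H(X) / 2^H(X|X~) = 2^I(X; X~),
# i.e. the number of code book partitions from the argument above.
import numpy as np

p_joint = np.array([[0.30, 0.10],   # rows: x, columns: x~ (compressed)
                    [0.10, 0.30],
                    [0.05, 0.15]])
p_x = p_joint.sum(axis=1)           # marginal p(x)
p_xt = p_joint.sum(axis=0)          # marginal p(x~)

H_x = -np.sum(p_x * np.log2(p_x))                          # H(X)
H_x_given_xt = -np.sum(p_joint * np.log2(p_joint / p_xt))  # H(X | X~)
I = H_x - H_x_given_xt                                     # I(X; X~)

print(2**H_x / 2**H_x_given_xt, 2**I)  # the two values agree
```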

Now let us calculate the distortion. Given some distortion function $d:X\times \widetilde{X}\rightarrow \mathbb{R}^{+}$, we can calculate its expected value:
$$\left \langle d(x,\widetilde{x}) \right \rangle=\sum_{x\in X}^{}\sum_{\widetilde{x}\in \widetilde{X}}^{}p(x,\widetilde{x})d(x,\widetilde{x})\; \; \; \; \; \; \; \; \; \; (1)$$
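As a quick illustration of Eq. (1), here is a small sketch that evaluates the expected distortion for a toy joint distribution under a Hamming distortion; both the distribution and the distortion matrix are made up for the example:

```python
# Expected distortion <d(x, x~)> as in Eq. (1), with Hamming distortion
# d(x, x~) = 0 if x == x~, else 1. All numbers are illustrative.
import numpy as np

p_joint = np.array([[0.4, 0.1],   # p(x, x~), rows indexed by x
                    [0.2, 0.3]])
d = np.array([[0.0, 1.0],         # d(x, x~)
              [1.0, 0.0]])

expected_d = np.sum(p_joint * d)  # sum over x and x~ of p(x, x~) d(x, x~)
print(expected_d)                 # 0.1 + 0.2 = 0.3
```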
With these two quantities in mind, we can move on to determine the functional we want to minimize. What we want is the rate-distortion function: $R(D)=\min_{p(\widetilde{x}\mid x):\left \langle d(x,\widetilde{x}) \right \rangle\leqslant D}I(X;\widetilde{X})$, which can be solved by introducing a Lagrange multiplier $\beta$ on the constraint term $\left \langle d(x,\widetilde{x}) \right \rangle$. This gives us:
$$F[p(\widetilde{x}\mid x)]=I(X;\widetilde{X})+\beta \left \langle d(x,\widetilde{x}) \right \rangle_{p(x,\widetilde{x})}\; \; \; \; \; \; \; \; \; \; (2)$$
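Minimizing this functional is classically done with the Blahut-Arimoto alternating updates, where $p(\widetilde{x}\mid x)\propto p(\widetilde{x})\exp[-\beta d(x,\widetilde{x})]$ and the marginal $p(\widetilde{x})$ are refined in turn. The sketch below is my own illustrative implementation of that scheme, not code from the paper; the source distribution, distortion matrix, and $\beta$ are arbitrary choices:

```python
# A hedged sketch of the Blahut-Arimoto iteration for the functional
# in Eq. (2). All inputs below are illustrative.
import numpy as np

def blahut_arimoto(p_x, d, beta, n_iter=200):
    """Alternate the two self-consistent updates for p(x~|x) and p(x~)."""
    n_xt = d.shape[1]
    p_xt = np.full(n_xt, 1.0 / n_xt)          # initial marginal over x~
    for _ in range(n_iter):
        # p(x~|x) proportional to p(x~) exp(-beta d(x, x~))
        p_xt_given_x = p_xt * np.exp(-beta * d)
        p_xt_given_x /= p_xt_given_x.sum(axis=1, keepdims=True)
        # p(x~) = sum_x p(x) p(x~|x)
        p_xt = p_x @ p_xt_given_x
    return p_xt_given_x, p_xt

p_x = np.array([0.5, 0.5])
d = np.array([[0.0, 1.0], [1.0, 0.0]])        # Hamming distortion
enc, marg = blahut_arimoto(p_x, d, beta=3.0)
print(enc)  # soft assignments; larger beta -> lower distortion, higher rate
```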
The theorems and lemmas involved in solving this functional are worth checking out, but we will not discuss them here. Note that so far the relevance of the information has not been addressed. Let us then introduce $Y$ as the variable that carries the relevant information. The rest of the logic is similar to the above, except that instead of a fixed distortion, what we now want is a fixed amount of relevant information $I(\widetilde{X};Y)$. The functional then becomes:
$$L[p(\widetilde{x}\mid x)]=I(X;\widetilde{X})-\beta I(\widetilde{X};Y)\; \; \; \; \; \; \; \; \; \; (3)$$
There is an exact formal solution for this:
$$p(\widetilde{x}\mid x)=\frac{p(\widetilde{x})}{Z(x,\beta )}\exp\left[-\beta \sum_{y}p(y\mid x)\log\frac{p(y\mid x)}{p(y\mid \widetilde{x})}\right]\; \; \; \; \; \; \; \; \; \; (4)$$
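Eq. (4) must hold self-consistently together with $p(\widetilde{x})=\sum_{x}p(x)p(\widetilde{x}\mid x)$ and $p(y\mid \widetilde{x})=\sum_{x}p(y\mid x)p(x\mid \widetilde{x})$, which suggests alternating between the three updates (the iterative algorithm in the paper). Below is a rough sketch of that iteration; the joint $p(x,y)$, the number of clusters, and $\beta$ are all illustrative choices on my part:

```python
# A minimal sketch of the self-consistent iterations behind Eq. (4).
import numpy as np

def information_bottleneck(p_xy, n_clusters, beta, n_iter=300, seed=0):
    """Iterate the three IB updates: p(x~|x), p(x~), p(y|x~)."""
    rng = np.random.default_rng(seed)
    p_x = p_xy.sum(axis=1)
    p_y_given_x = p_xy / p_x[:, None]
    # random soft initialization of the encoder p(x~|x)
    q = rng.random((p_xy.shape[0], n_clusters))
    q /= q.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        p_xt = p_x @ q                                    # p(x~)
        p_y_given_xt = (q * p_x[:, None]).T @ p_y_given_x
        p_y_given_xt /= p_xt[:, None]                     # p(y|x~)
        # KL(p(y|x) || p(y|x~)) for every (x, x~) pair
        kl = np.array([[np.sum(p_y_given_x[i] *
                               np.log(p_y_given_x[i] / p_y_given_xt[t]))
                        for t in range(n_clusters)]
                       for i in range(p_xy.shape[0])])
        q = p_xt * np.exp(-beta * kl)                     # Eq. (4), up to Z
        q /= q.sum(axis=1, keepdims=True)
    return q

p_xy = np.array([[0.25, 0.05],   # toy joint p(x, y)
                 [0.20, 0.10],
                 [0.05, 0.35]])
print(information_bottleneck(p_xy, n_clusters=2, beta=5.0))
```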
Again, there are multiple theorems involved in solving this in Tishby et al.'s original paper, as linked below. The main point of the paper is that it addresses the problem of having to pre-determine the distortion function in rate-distortion theory, and instead asks for the maximum amount of compression given that certain meaningful information must be preserved. This is interesting for sensory encoding and decoding, since it is essentially the problem sensory systems face when filtering a large amount of information down to the tiny portion that can be sent to the brain for further computation.

Original paper: Tishby, Naftali, Fernando C. Pereira, and William Bialek. "The information bottleneck method." arXiv preprint physics/0004057 (2000).
