IT is a beautiful sub-field of CS with applications across the gamut of scientific fields: coding theory and communications (under unreliable channels), cryptography, physics, biomedical engineering, computer graphics, machine learning, statistics, and even gambling and stock trading. It is truly a marvelous through process.

#### Framework:

- "Bit" as the unit of communication. Arbitrary. Ok as long as all analysis is consistent.
- Two parties: A and B. An event occurs at one end, A, who must "communicate" the outcome to B.
- What are the minimum "units" required to communicate?
- The Key insight from Shannon: it depends on apriori belief distribution about the event
- That is, "uncertainty" with respect to the event. If an outcome is absolute or impossibly, why communicate at all?

- For instance, a sequence of biased coin flips contains less information (less degree of uncertainty) than a sequence of unbiased coin flips, so it should take fewer bits to transmit
- Keep in mind that if the likelihood of n outcomes is not equally likely, we can use variable length encoding to use less bits to transmit more likely outcome.

#### Information Content / Degree of Uncertainty / "Entropy"

Entropy of R.V. X is denoted by H(X)

= E[log( 1/p(X) )]

Intuition behind the formula:

- Lets say all n events are equally likely and P(x) = p for all x.
- Then, -log(p) = log(1/p), which is log(n).

- But since the actual distribution may not be uniform, take the weighted average of information content of each possible outcome - weighted by the probability of the outcome occurring.
- Thus, more frequent values can be encoded by smaller length messages.

#### Basic Results to Remember

- Joint Entropy:
- if X, Y are independent events, joint entropy is H(X) + H(Y)
- Why? Need to communicate both, Duh!

- Conditional Entropy:
- How much extra information you still need to supply on avg to communicate Y, given that the outcome of X is known

- How much extra information you still need to supply on avg to communicate Y, given that the outcome of X is known
- Chain Rule:
- H(X, Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)
- H(X1 .., Xn) = .... just expand

- Mutual Information: I(X;Y) = H(X) - H(X|Y)
- "Reduction" in uncertainty about one Random Variable by knowing the outcome of other
- How far a joint distribution is from independence
- The amount of information one R.V. contains about another
- Note: I(X;X) = H(X) - H(X|X) = H(X)
- Conditional Mutual Information and the chain rule
- I(X; (Y|Z))
- = I( (X; Y)|Z)
- = H(X|Z) - H(X|(Y, Z))
- Kullback-Lieibler Divergence (Relative Entropy)
- For two p.m.f. p(x) and q(x),
- Measures how different two probability dists are over the same event space.
- Avg number of bits that are wasted by encoding events from dist p with code based on dist q.
- D(p||q)=0 iff p=q
- Often seen as "distance" but not quite right because i) not symmetric in p an q, and ii) does not satisfy: d(x,y) <= d(x,z) + d(z, y)
- Cross Entropy
- Let p be the true dist, q be the model dist, then Cross entropy is defined as:
- ,
- Avg number of bits needed to communicate events when coding scheme is based on model dist q rather than true dist p
- Cross entropy minimization is typically used when a model dist q is to be designed (often without knowing true dist p)

Not so basic results and some applications of IT in Part 2 ..