Welcome to my Machine Learning notes

I am Ikram Ali, I specialize in building ML engineering and data science teams from the ground up. My current passion is NLP and MLOps

I approach my career by purposefully building domain knowledge in all the cross-functional disciplines required to deliver successful Data Science projects. This includes Research, Data Engineering, Machine Learning Engineering, Management, as well as dabbling in Agile Program Management and Product Management within other roles. This allows me to lead cross-functional teams and know the pain points of bringing a model from ideation to production. For further information regarding my credentials, please refer to my LinkedIn and Github profiles

I would like to offer concise definitions and comprehensible explanations of Machine Learning and Deep Learning.

Contents

ML Notation / Equations

Notation

Symbol

Formula

Explained

\(\mu\)

\(\sum_{x} k P(X=x) = \int_{-\infty}^{\infty} x f(x) d x\)

πŸ”—

\(V(X)\) or \(\sigma^2\)

\(E[(X - E[X])^2] = E[(X - \mu)^2] = E[X^2] - E[X]^2\)

πŸ”—

\(\sigma\)

\(\sqrt{V(X)}\)

Standard deviation

\(Cov(X,Y)\)

Covariance of X and Y

Covariance of X and Y

\(\bar{X}\)

The sample

The sample mean is an average value

\(\delta\)

\(\delta(v)\)

Activation fucntions, sigmoid, relu, etc.

Equations

Cosine Similarity

Cosine similarity is a metric used to measure the similarity between two vectors in a multi-dimensional space. Cosine similarity measures the cosine of the angle between two non-zero vectors in an n-dimensional space.

Formula = dot product / normalized sum of squares

\[ \text{cos}(x,y) = \frac{x \cdot y}{\sqrt{x^2} \cdot \sqrt{y^2}} = \frac{\sum_{i=1}^n A_i B_i}{\sqrt{\sum_{i=1}^n A_i^2} \sqrt{\sum_{i=1}^n B_i^2}} \]
Properties
  • Scale Invariance Cosine similarity is scale-invariant, meaning it is not affected by the magnitude of the vectors, only by their orientations.

  • One hot and multi hot vectors easily.

import torch
from torch.nn import functional as F
v1 = torch.tensor([0, 0, 1], dtype=torch.float32)
v2 = torch.tensor([0, 1, 1],dtype=torch.float32)

print(F.cosine_similarity(v1, v2 , dim=0))

print(F.normalize(v1, dim=0) @ F.normalize(v2, dim=0))

print(torch.norm(v1) / torch.norm(v2))

print( torch.matmul(v1, v2.T) / ( torch.sqrt( torch.sum(v1 ** 2)) * torch.sqrt( torch.sum(v2 ** 2))) )
tensor(0.7071)
tensor(0.7071)
tensor(0.7071)
tensor(0.7071)
/tmp/ipykernel_1144/3833807425.py:10: UserWarning: The use of `x.T` on tensors of dimension other than 2 to reverse their shape is deprecated and it will throw an error in a future release. Consider `x.mT` to transpose batches of matrices or `x.permute(*torch.arange(x.ndim - 1, -1, -1))` to reverse the dimensions of a tensor. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:3637.)
  print( torch.matmul(v1, v2.T) / ( torch.sqrt( torch.sum(v1 ** 2)) * torch.sqrt( torch.sum(v2 ** 2))) )

What is Probability

Definition

  • Probability is the branch of mathematics that deals with the occurrence of a random event.

  • Probability is the measure of the likelihood of an event to happen.

Probability is the study of randomness and uncertainty. Probability theory is widely used in the area of studies such as statistics, finance, gambling, artificial intelligence, machine learning, computer science, game theory, and philosophy.

Sample space
Applications of probability

Some of the applications of probability are predicting results of the following events:

Minor

  • that a customer will buy milk if they are also buying bread.

  • Of getting at least 2 heads in 5 coin flips.

  • Getting 3 and 5 on throwing a die.

  • Pulling a green candy from a bag of red candies.

  • Winning a lottery 1 in many millions.

  • # of customers arriving at a bank in a week

Major

  • It is used for risk assessment and modelling in various industries

  • Weather forecasting or prediction of weather changes

  • Probability of a team winning in a sport based on players and strength of team

  • In the share market, chances of getting the hike of share prices

Probability Terminology

The first thing we do when we start thinking about the probability list a number of things that could possibly happen.
Some of the important probability terms are discussed here.

Sample Space

Sample space of an experiment, denoted \(S\), is the set of all possible outcomes of an experiment or trial.

  • Suppose that we toss a die. Six numbers, from 1 to 6, can appear face up, but we do not yet know which one of them will appear. The sample space is S = {1,2,3,4,5,6}.

  • For tossing is a fair coin, the sample space is S = {H, T}

Sample space
Sample space

Image from byjus.com & medium.com

Experiment or Trial

Experiment is any action or process that generates observations or outcomes.\ E.g. The tossing of a coin, selecting a card from a deck of cards, throwing a dice etc.

Outcome or Sample Point

An outcome is a possible result of an experiment or trial.
E.g. The outcome of tossing a coin is a head or a tail.
Roll a die, the outcome is a number between 1 and 6. Each of the six numbers is a sample point

Event

Event is any possible outcome, or combination of outcomes, of an experiment.
E.g. Getting a Head while tossing a coin is an event.

Cardinality

Cardinality of a sample space or an event, is the number of outcomes it contains. \(|S|\) represents the cardinality of the sample space.
Tossing a coin, \(|S|\) = 2, Rolling a die, \(|S|\) = 6
Flip a coin twice, S = {00,01,11,10} \(|S|\) = 4
Flip a coin until you get a tail. \(S=\{1,01,001,0001, \ldots\}\) \(|S| = \infty\)

Population

Those individuals or objects from which we want to acquire information or draw a conclusion.
E.g. All valves produced by a specific manufacturing plant.
All adult females in the United States.
All smokers

Sample

Most of the time, the population is so large, we can only collect data on a subset of it. We will call this our sample.

Sets and Subsets

A set is defined as a group of objects (i.e., sets can be made up of letters, numbers, names, etc.).
A subset is defined as a set within a set. set A is a subset of set B if and only if every element of A is also in B.

Empty Set

The set that contains nothing, denoted \(\emptyset\).

Complement

\(A^c\) = A complement. This is a shorthand way of saying when A does not occur. This set is made up of everything not in A.

Parameter

Parameters are the unknown values of an entire population, such as the mean and standard deviation. Samples can estimate population parameters but their exact values are usually unknowable.

Interview Question

Q: What is the sample space of rolling Two Dice?
Ans: The total number of joint outcomes (a,b) is 6 times 6 which is 36.

Axioms of Probability
Axiom 1:

For any event, β€˜A’ the probability of possible outcomes is either 0 or 1, where 0 is the event which never occurs, and 1 is the event will certainly occur. For any event \(A, 0 \leq P(A) \leq 1\).

Axiom 2:

The sum of probabilities of all possible outcomes is 1.Probability of the sample space S is \(P(S)=1\).

Axiom 3:

If \(A_{n}\) mutually exclusive events (intersection of any two is the empty set) then \(P\left(\bigcup_{i = 1}^k A_n\right) = \sum_{k=1}^{n} P\left(A_{k}\right)\)

Axiom 4:

The complement of any event A is the event that consists of all the outcomes that are not in A.

Axiom 5:

If both A and B are independent, then the conditional probability that event B occurs given that event A has already occurred. P ( A and B) = P (A) P (B | A). This is called the General rule of multiplication.

Counting

Despite the trivial name of this topic, be assured that learning to count is not as easy as it sounds.

Naive Probability

The probability of an event occurring, if the likelihood of each outcome is equal, is:

\[ P(\text { Event })=\frac{\text { number of favorable outcomes }}{\text { number of outcomes }} \]

When we are working with probabilities, our notation will be P(A). this means the Probability that event A occurred. So, if A is the event of flipping heads in one flip of a fair coin, then P(A) = .5

This Naive Definition is a reasonable place to start, because it’s likely how you have calculated probabilities up to this point. Of course, this is not always the correct approach for real world probabilities (hence the name naive).

Multiplication Rule

To understand the Multiplication Rule, visualize a process that has multiple steps, where each step has multiple choices. For example, say that you are ordering a pizza.

  1. Size (small, medium, or large)

  2. Topping (pepperoni, meatball, sausage, extra cheese)

  3. Order Type (delivery or pickup)

Using the multiplication rule, we can easily count the number of distinct pizzas that you could possibly order. Since there are 3 choices for size, 4 choices for toppings, and 2 choices for pickup.

we simply have 3 β‹… 4 β‹… 2 = 24 different pizza options.

Now that we have counted the total of number of possible pizzas, it is easy to solve various probability problems.

Interview Question

Q: What are the outcomes of flipping a fair coin and simultaneously rolling a fair die?
Ans: 6 x 2 = 12 outcomes.

Q: How many possible license plates could be stamped if each license plate were required to have exactly 3 letters and 4 numbers?
Ans: 26 x 26 x 26 x 10 x 10 x 10 x 10 = 175,650,000

Factorial

You may have used the factorial for simple arithmetic calculations.

\[\begin{split} \begin{gather} n! = n \times n-1 \times n-2 \times \ldots \times 1 \\ 5! = 5 \times 4 \times 3 \times 2 \times1 \\ \large n! = \prod_{i=1}^{n} i \end{gather} \end{split}\]

Another use for the factorial function is to count how many ways you can choose things from a collection of things or find how many ways things can be arranged.

Example

Counting the the number of ways to order the letters A, B, and C. We will define a specific arrangement or order as a permutation. You could likely figure this out by just writing out all of the permutations:

{ABC,ACB,BAC,BCA,CAB,CBA}

It’s clear that there are 6 permutations. what if you had to do the same for all 26 letters in the alphabet? if you didn’t feel like writing out the 26 letters over and over and over, you could use the factorial for a more elegant solution.

the number of permutations when ordering A,B and C is 3!

3 β‹… 2 β‹… 1 = 6

Another example, In how many ways can 7 different books be arranged on a shelf?

We could use the Multiplication Principle to solve this problem. We have seven positions that we can fill with seven books. There are 7 possible books for the first position, 6 possible books for the second position, five possible books for the third position, and so on. The Multiplication Principle tells us therefore that the books can be arranged in:

\[ 7 β‹… 6 β‹… 5 β‹… 4 β‹… 3 β‹… 2 β‹… 1 = 5040 \]

Alternatively, we can use the simple rule for counting permutations. P = 7! = 5040

Python Solution
from math import factorial

print(factorial(3))
print(factorial(6))
6
720
Binomial Coefficient

The binomial coefficient is a mathematical formula that counts the number of ways to choose k items from a collection of n items. This is perhaps the most useful counting tool. which in english is pronounced n choose x = \(\tbinom{n}{k}\).

\[ \large \tbinom{n}{k} = \frac{n!}{k!(n-k)!} \]

With replacement

means the same item can be chosen more than once.

Without replacement

means the same item cannot be selected more than once.

Permutation

Permutation relates to the act of arranging all the members of a set into some sequence or order.

Any ordered sequence of k objects taken from a set of n distinct objects is called a permutation of size k.

\[\begin{split} \begin{gather} \textbf{All possible ways of doing something } \\ {P}_{n,k} = \frac{n!}{(n-k)!} \end{gather} \end{split}\]

When selecting more than one item without replacement and order does matter.

Example
Suppose an organization has 60 members. One person is selected at random to be the president, another
person is selected as the vice-president, and a third is selected as the treasurer.
How many ways can this be done? (This would be the cardinality of the sample space.)
\[ P_{3,60} = 60.59.58 = \frac{60!}{57!} = 205,320 \]
Combination

When selecting more than one item without replacement and order does not matter.

Given n distinct objects, any unordered subset of size k of the objects is called a combination.

\[ {C}_{n,k} = \binom nk = {n \choose k, n-k} = \frac{n!}{k!(n-k)!} \]
Example
Suppose we have 60 people and want to choose a 3 person team (order is not important). How many combinations are possible?
Combination
Suppose we have the same 60 people, 35 are female and 25 are male. We need to select a committee of 11 people.
How many ways can such a committee be formed?
\[ {C}_{60,11} = \frac{60!}{11!(60-11)!} = |S| \]
What is the probability that a randomly selected committee will contain at least 5 men and at least 5
women? (Assume each committee is equally likely.)
\[ \begin{align}\begin{aligned} \textbf{P(at least 5M and at least 5W on committee)}\\\begin{split}\begin{aligned} &=P(5 m+6 w)+p(6 m+5 w) \\ &=\frac{\left(\begin{array}{c} 25 \\ 5 \end{array}\right)\left(\begin{array}{c} 35 \\ 6 \end{array}\right)}{\left(\begin{array}{c} 60 \\ 11 \end{array}\right)}+\frac{\left(\begin{array}{c} 25 \\ 6 \end{array}\right)\left(\begin{array}{c} 35 \\ 5 \end{array}\right)}{\left(\begin{array}{c} 60 \\ 11 \end{array}\right)} \end{aligned} \end{split}\end{aligned}\end{align} \]
What is the probability of drawing the ace of spades twice in a row? (Assume that any card drawn on the first draw will
be put back in the deck before the second draw.)
\[ P(\text{ace of spades}) \times P(\text{ace of spades}) = \left(\frac{1}{52}\right)^2 = \frac{1}{2704} = 0.00037 = 0.037\% \]
You draw a card from a deck of cards. After replacing the drawn card back in the deck and shuffling thoroughly,
what is the probability of drawing the same card again? 
\[ P(\text{any card}) = \frac{52}{52} = 1 \]
\[ P(\text{same card as first draw}) = \frac{1}{52} \approx 0.019 \]
\[ P(\text{any card})P(\text{same card as first draw}) = (1)(\frac{1}{52}) = \frac{1}{52} \approx 0.019\]
Use $n \choose k$ to calculate the probability of throwing three heads in five coin tosses.
\[ {n \choose k} = {5 \choose 3} = \frac{5!}{3!(5 - 3)!} = \frac{5!}{(3!)(2!)} = \frac{5 \times 4 \times 3 \times 2 \times 1}{(3 \times 2 \times 1)(2 \times 1)} = \frac{120}{(6)(2)} = \frac{120}{12} = 10 \]
Twelve (12) patients are available for use in a research study. Only seven (7) should be assigned to receive the study
treatment. How many different subsets of seven patients can be selected?
\[ {n \choose k} = {12 \choose 7} = \frac{12!}{7!(12 - 7)!} = 792 \]
Torch combinations
import torch

a = torch.tensor([1, 2, 3])
print(torch.combinations(a))
print(torch.combinations(a, r=3))
torch.combinations(a, with_replacement=True)
tensor([[1, 2],
        [1, 3],
        [2, 3]])
tensor([[1, 2, 3]])
tensor([[1, 1],
        [1, 2],
        [1, 3],
        [2, 2],
        [2, 3],
        [3, 3]])
Difference Between Permutation and Combination

Permutation

Combination

Order matters

Order doesn’t matter

Number of ways to arrange the elements of a set.

Number of ways to choose k elements from a set of n elements.

Arranging people, digits, numbers, alphabets, letters, and colours.

Selection of menu, food, clothes, subjects, the team.

Picking a President, VP and Waterboy from a group of 10.

Picking a team of 3 people from a group of 10.

Listing your 3 favorite desserts, in order, from a menu of 10. P(10,3) = 720.

Choosing 3 desserts from a menu of 10. C(10,3) = 120.

Sampling Table

Order Matters

Order Doesn’t Matter

With Replacement

\(n^k\)

\({n+k-1 \choose k}\)

Without Replacement

\(\frac{n!}{k!(n-k)!}\)

\(\binom nk\)

Interview Questions
  • There are 25 students in a class. Find the number of ways in which a committee of 3 students is to be formed?
    25 choose 3 2300

  • In a meeting between two countries, each country has 12 delegates. All the delegates of one country shake hands with all delegates of the other country. Find the number of handshakes possible?
    Total number of handshakes = 12 x 12 = 144

  • How many groups of 6 persons can be formed from 8 men and 7 women?
    Total number of person = 8 men + 7 women = 15
    15 choose 6 = 5005

Bayes Theorem

Definition

Bayes theorem is also known as the formula for the probability of causes.

Theorem states that the conditional probability of an event, based on the occurrence of another event, is equal to the likelihood of the second event given the first event multiplied by the probability of the first event.

Conditional Probability

Two events A and B from the same sample space S. Calculate the probability of event A knowing that event B has occurred. B is the β€œconditioning event”. \(P(A|B)\)

Conditional Probability is \(P(A \mid B)=\frac{P(A \cap B)}{P(B)}, \quad P(B)>0\)

Multiplication Rule

This leads to the multiplication rule \(P(A \cap B) = P(B) P(A \mid B) = P(A) P(B \mid A)\)

Bayes Theorem

Bayes Theorem \(P(A \mid B) = \frac{P(B \mid A)P(A)} {P(B)}\)

Example

Bayes Theorem can detect spam e-mails.

  • Assume that the word offer occurs in 80% of the spam messages.

  • Also assume offer occurs in 10% of my desired e-mails.

Question

If 30% of the received e-mails are considered as a scam, and I will receive a new message which contains β€˜offer’, what is the probability that it is spam?

I received 100 e-mails.

  • I have 30 spam e-mails

  • 70 desired e-mails.

The percentage of the word β€˜offer’ that occurs in spam e-mails is 80%. It means 80% of 30 e-mail and it makes 24. Now, I know that 30 e-mails of 100 are spam and 24 of them contain β€˜offer’ where 6 of them not contains β€˜offer’.

The percentage of the word β€˜offer’ that occurs in the desired e-mails is 10%. It means 7 of them (10% of 70 desired e-mails) contain the word β€˜offer’ and 63 of them not.

Sample space

The question was what is the probability of spam where the mail contains the word β€˜offer’:

  1. We need to find the total number of mails which contains β€˜offer’ ; 24 +7 = 31 mail contain the word β€˜offer’

  2. Find the probability of spam if the mail contains β€˜offer’ ;

In 31 mails 24 contains β€˜offer’ means 77.4% = 0.774 (probability)

NOTE: In this example, I choose the percentages which give integers after calculation. As a general approach, you can think that we have 100 units at the beginning so if the results are not an integer, it will not create a problem. Such that, we cannot say 15.3 e-mails but we can say 15.3 units.

Solution with Bayes’ Equation:

A = Spam

B = Contains the word β€˜offer’

\[ P(\text { spam } \mid \text { contains offer })=\frac{P(\text { contains offer } \mid \text { spam }) * P(\text { spam })}{P(\text { contains offer })} \]

P( contains offer|spam) = 0.8 (given in the question)

P(spam) = 0.3 (given in the question)

Now we will find the probability of e-mail with the word β€˜offer’. We can compute that by adding β€˜offer’ in spam and desired e-mails. Such that;

P(contains offer) = 0.30.8 + 0.70.1 = 0.31

\[ P(\text { spam } \mid \text { contains offer })=\frac{0.8 * 0.3}{0.31}=0.774 \]

As it is seen in both ways the results are the same. In the first part, I solved the same question with a simple chart and for the second part, I solved the same question with Bayes’ theorem.

Law of Total Probability

\(B=(B \cap A) \cup\left(B \cap A^{c}\right)\)

\(P(B)=P(B \cap A)+P\left(B \cap A^{c}\right)=P(B \mid A) P(A)+P\left(B \mid A^{c}\right) P\left(A^{c}\right)\)

Independence and Mutually Exclusive Events

Two events are independent if knowing the outcome of one event does not change the probability of the other.

  • Flip a two-sided coin repeatedly. Knowing the outcome of one flip does not change the probability of the next.

Two events, A and B, are independent if \(P(A|B) = P(A)\), or equivalently \(P(B|A) = P(B)\).

Recall: \(P(A \mid B)=\frac{P(A \cap B)}{P(B)}\)

then, if A and B are independent, we get the multiplication rule for independent events:

\(P(A \cap B)=P(A) P(B)\)

Example

Suppose your company has developed a new test for a disease. Let event A be the event that a randomly selected individual has the disease and, from other data, you know that 1 in 1000 people has the disease. Thus, P(A) = .001. Let B be the event that a positive test result is received for the randomly selected individual. Your company collects data on their new test and finds the following:

  • \(P(B|A)\) = .99

  • \(P(B^c |A)\) = .01

  • \(P(B|A^c )\) = .02

Calculate the probability that the person has the disease, given a positive test result. That is,

find \(P(A|B)\).

\[\begin{split} \begin{aligned} P(A \mid B) &=\frac{P(A \cap B)}{P(B)} \\ &=\frac{P(B \mid A) P(A)}{P(B)} \leftarrow \text { Bayes theorem } \\ &=\frac{ P(B\mid A) P(A)}{P(B \mid A) P(A)+P\left(B \mid A^{c}\right) P\left(A^{c}\right)} \leftarrow \text { Law of Total Probability } \\ &=\frac{(.99)(.001)}{(.99)(.001)+(.02)(.999)} \\ &=.0472 \end{aligned} \\ P(A) =.001 \leftarrow \text { prior prob of } A \\ P(A \mid B) =.0472 \leftarrow \text { posterior prob of } A \end{split}\]

Random Variables

The first step to understand random variable is to do a fun experiment. Go outside in front of your house with a pen and paper. Take note of every person you pass and their hair color & height in centimeters. Spend about 10 minutes doing this.

Congratulations! You have conducted your first experiment! Now you will be able to answer some questions such as:

  • How many people walked past you?

  • Did many people who walked past you have blue hair?

  • How tall were the people who walked past you on average?

You pass 10 people in this experiment, 3 of whom have blue hair, and their average height may be 165.32 cm. In each of these questions, there was a number; a measurable quantity was attached.

Definition

A random variable rv is a real-valued function, whose domain is the entire sample space of an experiment. Think of the domain as the set of all possible values that can go into a function. A function takes the domain/input, processes it, and renders an output/range. This set of real values obtained from the random variable is called its range.

A random variable (rv) is a function that maps events (from the sample space S) to the real numbers. It’s a function which performs the mapping of the outcomes of a random process to a numeric value.

The domain of a random variable is a sample space, which is represented as the collection of possible outcomes of a random event. For instance, when a coin is tossed, only two possible outcomes are acknowledged such as heads or tails.

Denoted by

Random variables Denote by a capital letters near the end of the alphabet (e.g. X, Y ).

Note

Why is it called a random variable?
Because we think of it as a variable that take random value intuitively. Formally they are function.

Probability Distribution

A Probability Distribution is a graph, table, or function that gives the probability for each value of the random variable.

Requirments
  1. The sum of the probabilities is 1. \(\sum f(x)=1 \).

  2. Every probability \(p_i\) is a number between 0 and 1. \( 0 \leq f(x) \leq 1\)

Difference between random variables and probability distributions

A random variable is a numerical description of the outcome of a statistical experiment. The probability distribution for a random variable describes how the probabilities are distributed over the values of the random variable.

Types of Random Variables

Types of Random Variables
Discrete random variable

A discrete random variable is a type of random variable that has a countable number of distinct values that can be assigned to it, such as in a coin toss.

Continuous random variable

A continuous random variable stands for any amount within a specific range or set of points and can reflect an infinite number of potential values, such as the average rainfall in a region.

Big Picture In statistics, we will model populations using random variables (e.g. mean, variance) of these random variables will tell us about the population we are studying.

Probability mass function (P.M.F)

The probability that a discrete random variable \(X\) takes on a particular value \(x\) that is \(P(X=x)\) is denoted by

\[ \text{p.m.f } \large = f(x) = f_x(x) = f_y(y) \]
Properties

The probability mass function, \(P(X=x)=f(x)\), of a discrete random variable \(X\) is a function that satisfies the following properties:

  1. All of the probabilities must be positive. \(P(X=x)=f(x)>0\), if \(x \in\) the support \(S\)

  2. Sum of all probabilities of same sample space equals to 1. \(\sum_{x \in S} f(x)=1\)

  3. \(P(X \in A)=\sum_{x \in A} f(x)\)

\(\text{Random variable}=X= \begin{cases} 1, & \text { if "Heads" } \\ 0, & \text { if "Tails" } \end{cases} = \begin{cases} P(X=1), & \text { if "Heads" } \\ P(X=0), & \text { if "Tails" } \end{cases}\)

\(PMF=f(x)=f_x(x)=P(X=x)= \begin{cases}1 / 2, & \text { if } x=0 \\ 1 / 2, & \text { if } x=1 \\ 0, & \text { otherwise }\end{cases}\)

\[ p(x)=P(X=x)=P(\text { all } x \in S \mid X(s)=x) \]

Interview Question

Q: Let \(f(x)=c x^{2}\) for \(x=1,2,3\). Determine the constant \(c\) so that the function \(f(x)\) satisfies the conditions of being a probability mass function?

Answer: Using property no 2

\[\begin{split} \begin{aligned} \sum_{x=1}^{3} f(x) &=\sum_{x=1}^{3} c x^{2}=c \sum_{x=1}^{3} x^{2} \\ &=c\left[1^{2}+2^{2}+3^{2}\right]=c[1+4+9] \\ &=c(14) \stackrel{\text { set }}{=} 1 = c=1/14 \\ f(x) &=\frac{1}{14} x^{2} \text { for } x=1,2,3 \end{aligned} \end{split}\]

Cumulative distribution function (CDF)

The cumulative distribution function (CDF or cdf) of the random variable X has the following definition:

\[ F_X(t) = P(X \leq t) = \sum_{x \leq y} P(X=t) = \int_{-\infty}^{t} f(t)dt \]
Properties

The cdf of random variable X has the following properties:

  1. The cdf, \(F_{X}(t)\), ranges from 0 to 1 . This makes sense since \(F_{X}(t)\) is a probability.

  2. If \(X\) is a discrete random variable whose minimum value is \(a\), then \(F_{X}(a)=P(X \leq a)=P(X=a)=f_{X}(a)\). If \(c\) is less than \(a\), then \(F_{X}(c)=0\).

  3. If the maximum value of \(X\) is \(b\), then \(F_{X}(b)=1\).

  4. Also called the distribution function.

Example

Suppose X is a discrete random variable. Let the pmf of X be equal to

\[ f(x)=\frac{5-x}{10}, \quad x=1,2,3,4 \]

Suppose we want to find the cdf of \(X\). The cdf is \(F_{X}(t)=P(X \leq t)\).

  • For \(t=1, P(X \leq 1)=P(X=1)=f(1)=\frac{5-1}{10}=\frac{4}{10}\).

  • For \(t=2, P(X \leq 2)=P(X=1\) or \(X=2)=P(X=1)+P(X=2)=\frac{5-1}{10}+\frac{5-2}{10}=\frac{4+3}{10}=\frac{7}{10}\)

  • For \(t=3, P(X \leq 3)=\frac{5-1}{10}+\frac{5-2}{10}+\frac{5-3}{10}=\frac{4+3+1}{10}=\frac{9}{10}\).

  • For \(t=4, P(X \leq 4)=\frac{5-1}{10}+\frac{5-2}{10}+\frac{5-3}{10}+\frac{5-4}{10}=\frac{10}{10}=1\).

Probability density function (PDF)

X = f(x) is the probability density function of the continues random variable X.

\[ P(a \leq X \leq b)=\int_{a}^{b} f(x) d x \]

f(x) = Curve under which area represent the probability \(P(a \leq X \leq b)=\int_{a}^{b} f(x) d x\)

Expected Value (Mean or Average)

The concept was first devised in the 17th century to analyze gambling games and answer questions such as:

  • How much do I gain - or lose - on average, if I repeatedly play a given gambling game?

  • How much can I expect to gain - or lose - by making a certain bet?

For example, if you play a game where you gain 2$ with probability 1/2 and you lose 1$ with probability 1/2, then the expected value of the game is half a dollar

\[ 2$ \times \frac{1}{2} + (-1$) \times \frac{1}{2} = \frac{1}{2} $ = 0.5$ \]

it means that if you play this game many times, and the number of times each of the two possible outcomes occurs is proportional to its probability, then on average you gain 1/2$ each time you play the game.

Definition

The expected value or mean of a random variable is a weighted average of all possible outcomes. In the case of a continuum of possible outcomes, the expectation is defined by integration.

Denoted by \(\mu_x\) or \(E(X)\).

\[ \large \mu=\mu_x=E(X)=\sum_{x} k P(X=x) = \int_{-\infty}^{\infty} x f(x) d x \]
Example

5 exams result : 70 +80 + 80 + 90 + 90

\(Avg = \frac{70+80+80+90+90}{5} = \frac{1}{5}(70)+\frac{2}{5}(80)+\frac{2}{5}(90) = 82.5 \)


Let X represent the outcome of a roll of a fair six-sided die. The possible values for X are 1, 2, 3, 4, 5, and 6, all of which are equally likely with a probability of \(1/6\) The Expected Value of X is

\(E[X] = 1\cdot\frac16 + 2\cdot\frac16 + 3\cdot\frac16 + 4\cdot\frac16 + 5\cdot\frac16 + 6\cdot\frac16 = (1+2+3+4+5+6) / 6= 3.5\)


x

1

2

3

P(X=x)

1/4

1/4

1/2

\(E[X] =(1)(1 / 4)+(2)(1 / 4)+(3)(1 / 2) = 9/4 = 2.25 = \sum_{x} x P(X=x)\)


Imagine a game in which, on any play, a player has a 20% chance of winning \(3 and an 80% chance of losing \)1. The probability mass function of the random variable , the amount won or lost on a single play is:

\[\begin{split} x \quad $3 \quad -$1 \\ f(x) \quad 0.2 \quad 0.8 \end{split}\]

so the average amount won (actually lost, since it is negative)

\[ E(X)=(\$ 3)(0.2)+(-\$ 1)(0.8)=\$-0.20 \]

In the long run you guaranteed to lose no more than 20 cents.

Pytorch implementation
import torch

# Create a tensor
T = torch.Tensor([2.453, 4.432, 0.754, -6.554])
print("T:", T)

# Compute the mean and standard deviation
mean = torch.mean(T)
print("mean:", mean)
T: tensor([ 2.4530,  4.4320,  0.7540, -6.5540])
mean: tensor(0.2713)
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme(style="darkgrid")
data = torch.randn(25)

print(data)

print("Mean :", torch.mean(data))

_ =sns.displot(data,kde=True, )
plt.axvline(torch.mean(data), color='green')
plt.show()
tensor([ 0.8597, -0.6286, -1.7821, -0.4426,  1.1676, -0.2283, -1.2379, -1.2242,
        -0.7144,  1.3822,  0.9386,  0.4657, -1.2674,  0.6671,  0.3613, -1.8003,
        -0.3120, -1.0845, -1.9948, -0.5006,  0.2419, -1.2459,  0.3558,  1.7665,
        -0.3110])
Mean : tensor(-0.2627)
_images/8b30035888f935e33d26ccafa8bd2e7dbfebda283224146238994e9d682d4818.png
Properties

Expectation is a linear operator, which means for our purposes it has a couple of nice properties.

Expected value of a constant

A perhaps obvious property is that the expected value of a constant is equal to the constant itself.

\[ \large E[c] = c \]
\[\begin{split} \begin{gather} \large \text{Proof}\\ E[2] = \sum_{x} k P(X=x) = \sum_{x} 2 P(X=x) \\ = 2 \sum_{x} P(X=x) = 2 \times 1 \\ = 2 \end{gather} \end{split}\]
Scalar multiplication of a random variable

If X is a random variable and a is a constant, then

\[ E[aX] = aE[X] \]
\[\begin{split} \begin{gather} \large \text{Proof}\\ E[2X] = \sum_{x} 2 k P(X=x) = 2 \sum_{x} k P(X=x) \\ = 2 E[X] \\ \end{gather} \end{split}\]
Expectation of a product of random variables

Let X and Y be two random variables. In general, there is no easy rule or formula for computing the expected value of their product. However, if X and Y are statistically independent, then

\[ E[XY] = E[X]E[Y] \]
Expectation of a sum of random variables
\[ E(X+Y)=E(X)+E(Y) \]
If random variables is function
\[\begin{split} E(g(X))=\left\{\begin{array}{l} \sum_{k} g(k) P(X=k), X \text { is discrete } \\ \int_{-\infty}^{\infty} g(x) f(x) d x, X \text { is continuous. } \end{array}\right. \end{split}\]

\(E(a X+b)=\sum_{k}(a X+b) P(X=k)\)

\(E(a X+b)= a \sum_{k} k P(X=k)+b \sum_{k} P(X=k)\)

\(E(a X+b)= a E(x) + b * 1 = a E(x) + b\)

Law of the Unconscious Statistician

IF X with pdf \(f_x(x)\) and g is a function Find 𝖀[𝗀(𝖷)]

Let Y=g(X). The pdf for Y is:

\(f_{Y}(y)=f_{X}\left(g^{-1}(y)\right) \cdot\left|\frac{d}{d y} g^{-1}(y)\right| = \text { So, } E[g(X)]=E[Y]=\int_{-\infty}^{\infty} y \cdot f_{Y}(y) d y\)

\(=\int_{-\infty}^{\infty} y \cdot f_{x}\left(g^{-1}(y)\right) \cdot\left|\frac{d}{d y} g^{-1}(y)\right| d y\)

\(\text { Let } x=g^{-1}(y) \text {. Then } d x=\frac{d}{d y} g^{-1}(y) d y\)

\(E[g(X)]=\int_{-\infty}^{\infty} g(x) f_{X}(x)) d x\)

Variance

  • Measures how far we expect our random variable to be from the mean.

  • Measures of spread of a distribution.

  • Variance is a measure of dispersion.

Denoted by
\[ \large \sigma^2 \text{ or } V(X). \]
\[ V(X) = E[(X - E[X])^2] = E[(X - \mu)^2] = E[X^2] - E[X]^2 \]

To better understand the definition of variance, we can break up its calculation in several steps:

  1. Compute the expected value of \(X\), denoted by \(\mathrm{E}[X]\).

  2. Construct a new random variable \(Y=X-\mathrm{E}[X]\) equal to the deviation of \(X\) from its expected value.

  3. Take the square \( Y^{2}=(X-\mathrm{E}[X])^{2} \) which is a measure of distance of \(X\) from its expected value (the further \(X\) is from \(\mathrm{E}[X]\), the larger \(\left.Y^{2}\right)\)

  4. Finally, compute the expectation of \(Y^{2}\) to know the average distance:

\[ \mathrm{E}\left[Y^{2}\right]=\mathrm{E}\left[(X-\mathrm{E}[X])^{2}\right]=\operatorname{Var}[X] \]
From these steps we can easily see that
  • variance is always positive because it is the expected value of a squared number.

  • the variance of a constant variable \(X\) (i.e., a variable that always takes on the same value) is zero; in this case, we have that \(X=\mathrm{E}[X], Y^{2}=0\) and \(\mathrm{E}\left[Y^{2}\right]=0\)

  • the larger the distance \(Y^{2}\) is on average, the higher the variance.

\[\begin{split} \begin{gather} \large \text{Proof} \\ V(X) = E[(X - \mu)^2] \\ V(X) = E[X^2 - 2\mu X + \mu^2] \\ V(X) = E[X^2 - 2\mu E[X] + \mu^2] \\ V(X) = E[X^2 - 2\mu^2 + \mu^2] \\ V(X) = E[X^2 - \mu^2] \\ V(X) = E[X^2] - E[X]^2 \end{gather}\end{split}\]
For continuous rv

If X is a continuous random variable, the variance is defined by the integral of the probability density function. \(V(X)=\int_{-\infty}^{\infty} (x - \mu_x)^2 f(x) d x\)

\(V(X)=\int_{-\infty}^{\infty} (x - \mu_x)^2 f(x) d x\)

\(= \int_{-\infty}^{\infty}\left(x^{2}-2 \mu_{x} x+\mu_{x}^{2}\right) f(x) d x\)

\(= \int_{-\infty}^{\infty}x^{2} f(x) d x - 2 \mu_{x} \int_{-\infty}^{\infty}x f(x) d x + \mu_{x}^{2} \int_{-\infty}^{\infty}f(x) d x\)

\(V(X) = E(X^2)-E(X)^2\)

Properties
Addition to a constant

Let \(a \in \mathbb{R}\) be a constant and let \(X\) be a random variable.

\[ Var[a+X]=Var[X] \]

Thanks to the fact that \(\mathrm{E}[a+X]=a+\mathrm{E}[X]\) (by linearity of the expected value), we have

\[\begin{split} \begin{aligned} \operatorname{Var}[a+X] &=\mathrm{E}\left[(a+X-\mathrm{E}[a+X])^{2}\right] \\ &=\mathrm{E}\left[(a+X-a-\mathrm{E}[X])^{2}\right] \\ &=\mathrm{E}\left[(X-\mathrm{E}[X])^{2}\right] \\ &=\operatorname{Var}[X] \end{aligned} \end{split}\]
Multiplication by a constant

Let \(a \in \mathbb{R}\) be a constant and let \(x\) be a random variable.

\[ Var[a X]=a^{2} Var[X] \]

Thanks to the fact that \(E[a X]=a E[X]\) (by linearity of the expected value), we obtain

\[\begin{split} \begin{aligned} Var[a X] &=E\left[(a X-E[a X])^{2}\right] \\ &=E\left[(a X-a E[X])^{2}\right] \\ &=\mathrm{E}\left[a^{2}(X-\mathrm{E}[X])^{2}\right] \\ &=a^{2} \mathrm{E}\left[(X-\mathrm{E}[X])^{2}\right] \\ &=a^{2} \operatorname{Var}[X] \end{aligned} \end{split}\]

Find Var[aX] = ?

Let Y = aX. Then, \(\mu_y = E[Y] = E[aX] = E[a\mu_x] = aE[\mu_x] = aE[X]\)

==> \(Var[aX] = Var[Y] = Var[(Y - \mu_y)^2] = a^2 Var[(X - \mu_x)^2] = a^2 V(X)\)

For Function

\(V(g(X))= \begin{cases}\sum_{k}(g(k)-E(g(X)))^{2} P(X=k), & X \text { discrete } \\ \int_{-\infty}^{\infty}(g(x)-E(g(X)))^{2} f(x) d x, & X \text { continuc }\end{cases}\)

Find V(a X+b)

\(V(a X+b)=E[(a X+b-E(a X+b))^2]\)

\(= E[(a x+ \not{b} -a E(x)- \not{b})^2]\)

\(= E[(a^2 (x - E(x))^2]\)

\(= a^2 E[(x - E(x)^2] = a^2 V(x)\)

Variance measure the spread the data B shift the data but doest not affect the spread.

Find Var[aX]

Let Y=aX. Then

\[\begin{split} \mu_{Y}=E[Y]=E[a X]=a E[X]=a \mu_{X} \\ Var[aX]=Var[Y]=E\left[\left(Y-\mu_{Y}\right)^{2}\right] \\ =a^{2} E\left[\left(X-\mu_{X}\right)^{2}\right] \\ =a^2 Var[X] \end{split}\]
Find Var[X + Y]
\[\begin{split} Var[X+Y]=Var[X]+Var[Y] \\ \end{split}\]
  • We will see that this is true if X and Y are independent.

  • Need concept of β€œcovariance”.

Standard Deviation

The standard deviation is the square root of the variance. \(\sigma_x = \sqrt{V(X)}\)

Indicator function

The indicator function of an event is a random variable that takes

  • value 1 when the event happens;

  • value 0 when the event does not happen.

Let A = Set of real numbers

\[\begin{split} I_{A}(x)= \begin{cases}1, & \text { if } x \in A \\ 0, & \text { if } x \notin A\end{cases} \end{split}\]

Other definition

The indicator function of a subset A of a set X is a function.

\(\text{Indicator function}_{A}(X) = \mathbf{1}_A(x) =\begin{cases} 1, & \text { if } A \cap X \neq \emptyset \\ 0, & \text { otherwise }\end{cases}\)

Notation= \(\mathbb{1} _{A}(x)\)

Random Sample

A collection of random variables is independent and identically distributed if each random variable has the same probability distribution as the others and all are mutually independent.

\[ \large \text{Random Sample} = $X_1, X_2, X_3, ..., X_n \]

Suppose that \(X_1, X_2, X_3, ..., X_n\) is a random sample from the Normal distribution with parameters \(\mu\) and \(sigma^2\). Mu and sigma are same for all random variables

\[ X_{1},X_{2}, \ldots, X_{n} \stackrel{\mathrm{iid}}{\sim} N(\mu, \sigma^{2}) \]

Suppose that \(X_1, X_2, X_3, ..., X_n\) is a random sample from the gamma distribution with parameters \(alpha\) and \(\beta\).

\[ X_{1},X_{2}, \ldots, X_{n} \stackrel{\mathrm{iid}}{\sim} \Gamma(\alpha, \beta) \]
Example

A good example is a succession of throws of a fair coin: The coin has no memory, so all the throws are independent. And every throw is 50:50 (heads:tails), so the coin is and stays fair - the distribution from which every throw is drawn, so to speak, is and stays the same: identically distributed.

Independent and identically distributed random variables (IID)

Random Sample == IID

Discrete Distributions

A discrete distribution is a distribution of data in statistics that has discrete values. Discrete values are countable, finite, non-negative integers, such as 1, 10, 15, etc.

Bernoulli Distribution

The Bernoulli distribution is a univariate discrete distribution used to model random experiments that have binary outcomes.

Bernoulli Random Variable

A Bernoulli RV \(X \sim Bern(p)\) is a random variable that is either 0 or 1 with probability \(p\) or \(1-p\) respectively. Suppose that you perform an experiment with two possible outcomes: either success or failure.

Let X be a discrete random variable. \( x \in {0,1} \)

\[\begin{split} f_x(x)=P(X=x)=\begin{cases} 1-p, & \text{ if x = 0 } \\ p, & \text{if x = 1 } \\ 0, & \text{otherwise} \end{cases} \end{split}\]
P.M.F
\[\begin{split} \begin{align} \large P(X=1) &=p \\ \large P(X=0)&=1-p \end{align} \end{split}\]
Bernoulli Distribution

Using the indicator function notation

\(I_{A}(x)= \begin{cases}1, & \text { if } x \in A \\ 0, & \text { if } x \notin A\end{cases}\)

\[ P(X=x)=p^{x}(1-p)^{1-x} \cdot I_{\{0,1\}}(x) \]
Mean (Expected Value)

The expected value of a Bernoulli random variable X is

\[ \large E[X]=p \]

Proof

\[\begin{split} \begin{align} \large \\ E[X] &=\sum_{x} k P(X=x) \\ &= 0 * P(x=0) + 1 * P(x=1) \\ &= 0 * (1-p) + 1 * (p) \\ &= p \end{align} \end{split}\]
Variance

The variance of a Bernoulli random variable X is

\[ Var[X] = p(1-p) \]
Proof

\(E(X^2) =\sum_{k} k^2 P(X=k) = 1^2 * p = p\)

\[\begin{split} V(X) &= {E}[X^2] - {E}[X]^2 \\ &= p - p^2 \\ &= p(1-p) \end{split}\]

Geometric Distribution

The geometric distribution is a discrete probability distribution that calculates the probability of the first success occurring during a specific trial.

The geometric distribution is the probability distribution of the number of failures we get by repeating a Bernoulli experiment until we obtain the first success.

Geometric Random Variable

Definition 1

A geometric rv \(X \sim Geom(p)\) consists of

  • independent Bernoulli trials,

  • each with the same probability of success p or Failure (1-p),

  • repeated until the first success is obtained.

Definition 2

The geometric rv is the distribution of the number of trials needed to get the first success in repeated independent Bernoulli trials.

Example If we toss a coin until we obtain head, the number of tails before the first head has a geometric distribution.

Attention

It can also define as number of failures before first success.

Parameter

The geometric distribution has one parameter, p = the probability of success for each trial. You denote the distribution as G(p), which indicates a geometric distribution with a success probability of p.

Uses
  • Six in a series of die rolls?

  • Person to support a law during a repeated sampling for an interview?

  • Product to have a defect in a random sample from an assembly line?

  • Successful attempt for a project or task?

Properties
  1. Each trial is identical, and can result in a success or failure.

  2. The probability of success, p, is constant from one trial to the next.

  3. The trials are independent, so the outcome on any particular trial does not influence the outcome of any other trial.

  4. Trials are repeated until the first success.

P.M.F

The sample space of geometric random variable is

\[ S = \{1,01,001,0001,00001,000001,\dots\} \]

Bernoulli trail success = 1 = P
Bernoulli trail failure = 0 = 1 - P

\(P(X=1)=p\)
\(P(X=2)=(1-p)p\)
\(P(X=3)=(1-p)(1-p)p\)
\(P(X=4)=(1-p)(1-p)(1-p)p\) = failure, failure, failure, Success
\(P(X=5)=(1-p)^{4}p\)
\(P(X=x)=(1-p)^{x-1}p\)

\[\begin{split} f(x) = P(X=x)&=(1-p)^{x-1}p \quad for \enspace x = {1,2,3,4,5,\dots} \\ f(x) =P(X=x)&=(1-p)^{x-1} \cdot p \cdot I_{\{1,2,3, \ldots\}}(x) \end{split}\]
Mean (Expected Value)

The expected value of a geometric random variable X is

\[\begin{split} E[X] &= \sum_{k=1}^{\infty} k P(X=k) \\ &= \sum_{k=1}^{\infty} k (1-p)^{k-1}p \\ &= \frac{1} p \end{split}\]
Variance

The expected value of a geometric random variable X is

\[\begin{split} V(X) &= E[X^2] - E[X]^2 \\ &= \frac{1-p}{p^{2}} \end{split}\]

Interview Question

Q: On each day we play a lottery in which the probability of winning is \(1%\).
What is the expected value of the number of days that will elapse before we win for the first time?

Answer: Each time we play the lottery, the outcome is a Bernoulli random variable (equal to 1 if we win), with parameter \(p=0.01\). Therefore, the number of days before winning is a geometric random variable with parameter \(p=0.01\). Its expected value is

\[ E[X] = \frac{1} p = \frac{1} {0.01} = 100 \]

Binomial Distribution

The binomial distribution is a discrete probability distribution that calculates the probability an event will occur a specific number of times in a set number of opportunities.

Binomial Random Variable

Definition 1

A binomial rv \(X \sim Bin(n,p)\) is a random variable that is the number of successes in n independent Bernoulli trials, each with probability p. The probability of success is p. The probability of failure is 1-p. The number of trials is n.

Definition 2

The binomial distribution is the distribution of the number of successes = X in a fixed number = n of independent Bernoulli trials.

Parameters

The binomial distribution has two parameters, n and p.

  • n: the number of trials.

  • p: the event or success probability.

Uses

Use the binomial distribution when your outcome is binary. Binary outcomes have only two possible values that are mutually exclusive.

  • Six heads when you toss the coin ten times?

  • 12 women in a sample size of 20?

  • Three defective items in a batch of 100?

  • Two flu infections over 20 years?

Properties
  1. Experiment is n trials (n is fixed in advance)

  2. Trials are identical and result in a success or a failure (i.e. Bernoulli trials) with P(success) = p and P(failure) = 1 - p.

  3. Trials are independent (outcome of one trial does not influence any other)

P.M.F

The sample space of binomial random variable is

\[\begin{split} S = \left\{\left(x_{1}, x_{2}, \ldots, x_{n}\right) \mid x_{i}\right. =\left\{\begin{array}{l} 1 \text { if } \text { success } \\ 0 \text { if failure }\end{array}\right. \end{split}\]

\(f(x)=P(X=0)=P(\{00 \cdots 0\})=(1-p)^{n}\)
\(f(x)=P(X=1)=P(\{10 \cdots 0,0100 \ldots,0 \cdots 01\}) = n*p*(1-p)^{n-1}\)
\(f(x)=P(X=2)=P(\{11 \cdots 0,0110 \ldots,00 \cdots 11\}) = \binom{n}{2}p^2(1-p)^{n-2}\)

Explanation P(X=2): Among n number of fixed trials, we have 2 bernoulli trials successes with probability P and rest are failures bernoulli trails with probability (1-p). So, we need to choose 2 from n to get the exact probability of success.

\[ f(x)=P(X=x)= \binom{n}{x}p^x(1-p)^{n-x} \cdot I_{\{1,2,3, \ldots\}}(x) \]

Where k = 1 (success) and n-k = 0 (failure).

Suppose n = 4

\(\mathrm{P}(X=3)=\mathrm{P}(\mathrm{SSSF} \text { or } \mathrm{SSFS} \text { or SFSS or FSSS })\)

Binomial Theorem

\(\sum_{k=0}^n {n \choose k}p^{k}(1-p)^{n-k} = 1\)

Mean (Expected Value)

The expected value of a binomial random variable X is

\[\begin{split} E[X] &=\sum_{k} k P(X=k) \\ &=\sum_{k=0}^n k {n \choose k}p^{k}(1-p)^{n-k} \\ &= n * p \end{split}\]

Proof

\[\begin{split} \begin{array}{ll} =\mathrm{E}\left[\sum_{i=1}^{n} Y_{i}\right] \quad \text { (representation as a sum of } n \text { independent Bernoulli r.v.) } \\ =\sum_{i=1}^{n} \mathrm{E}\left[Y_{i}\right] \quad \text { (linearity of the expected value) } \\ =\sum_{i=1}^{n} p \quad \quad \text { (expected value of a Bernoulli r.v.) } \\ =n p \end{array} \end{split}\]

RECALL: Bern(p) has expected value p. x1, x2 … xn are independent bern p. so \(sum_{k=1}^n X_n = sum_{k=1}^n E[X_n] = n * p\)

Variance

The variance of a binomial random variable X is

\(V(X)= E(X^2) - E(X)^2 = n * p * (1-p)\)

Recall: Bern(p) has variance p * (1-p).

Negative Binomial Distribution

The negative binomial distribution is almost the same as a binomial distribution with one difference

  • Binomial distribution has a fixed number of trials.

Repeat independent Bernoulli trials until a total of r successes is obtained. The negative binomial random variable X counts the number of failures before the rth success.

Negative Binomial Random Variable

The negative binomial rv \(X \sim NB(r,p)\) is the distribution of the number of trials = X needed to get a fixed number of successes = r.

Properties
  1. The number of successes r is fixed in advance.

  2. Trials are identical and result in a success or a failure (Bernoulli trials with P(success) = p and P(failure) = 1-p.

  3. Trials are independent (outcome of one trial does not influence any other)

PMF

\(S = \left\{\left(x_{1}, x_{2}, \ldots, x_{n}\right) \mid x_{i}\right. =\left\{\begin{array}{l} 1 \text { if } \text { success on ith trail } \\ 0 \text { if failure ith trail }\end{array}\right. and \sum_{i=1} = r\)

\(P(y=0)=P(\{11111\})=(p)^{5}\)

\(P(Y=1)=P(\{011111,101111,110111,111011,111101\}) = \binom{5}{4}p^5(1-p)^{5-4}\)

\(P(Y=2) = \binom{6}{4}p^5(1-p)^{5-4}\)

\(P(X = k) = \binom{k+r-1}{r-1} (1-p)^kp^r\)

Mean (Expected Value)

\(E(X)=\sum_{k} k P(X=k)\)

\(E(X)= \frac{r(1-p)}{p}\)

Variance

\(V(X)= \frac{r(1-p)}{p^2}\)

Relationship between Geometric and Negative Binomial rv

\(X \sim Geom(p)\)

= Repeated, independent, identical, Bernoulli trails util first successes.

\(Y \sim NB(1,p)\)

= Count the number of failure until first success util first successes. =

\(\underbrace{}_{Failure} \underbrace{}_{Failure} success\)

Note: Y = X - 1. then E(Y) = E(X) - 1 = 1/p - 1 = \(\frac{1-p}{p}\)

\(NB(r,p)\) = \(\underbrace{}_{Failure} \underbrace{}_{Failure} success \underbrace{}_{Failure} \underbrace{}_{Failure} success \underbrace{}_{Failure} \underbrace{}_{Failure} rth success\)

means we have stack geometric rv in a row rth time. that’s why we multiply by r in expected value and variance in NB rv.

Poisson Distribution

The Poisson distribution is a discrete probability distribution that describes probabilities for counts of events that occur in a specified observation space. It is named after SimΓ©on Denis Poisson.

Suppose that an event can occur several times within a given unit of time. When the total number of occurrences of the event is unknown, we can think of it as a random variable.

Poisson Random Variable

A Poisson rv \(X \sim Poisson(\lambda)\) is a discrete rv that describes the total number of events that happen in a certain time period.

Parameter

The Poisson distribution is defined by a single parameter, lambda (Ξ»),

which is the mean number of occurrences during an observation unit. A rate of occurrence is simply the mean count per standard observation period. For example, a call center might receive an average of 32 calls per hour.

Uses
  1. # of vehicles crossing a bridge in one day

  2. # of gamma rays hitting a satellite per hour

  3. # of cookies sold at a bake sale in one hour

  4. # of customers arriving at a bank in a week

PMF

A discrete random variable X has Poisson distribution with parameter (\(\lambda\) > 0) if the probability mass function of X is

\[\begin{split} f(x)=P(X=x)= \begin{cases}\frac{e^{-\lambda} \lambda^{x}}{x !} & , x=0,1,2, \ldots \\ 0 & , \text { otherwise }\end{cases} \end{split}\]

which may also be written as

\[ f(x)=\frac{e^{-\lambda} \lambda^{x}}{x !} I_{\{0,1,2, \ldots\}}(x) \]

where

  • k is the number of occurrences (\(k = 0,1,2\dots\)) It could be zero because nothing happened in that time period.

  • e} is (e = 2.71828..)

While this pmf might appear to be highly structured, it really is the epitome of randomness. Imagine taking a 20 acre plot of land and dividing it into 1 square foot sections. (There are 871,200 sections!) Suppose you were able to scatter 5 trillion grass seeds on this land in a completely random way that does not favor one section over another. One can show that the number of seeds that fall into any one section follows a Poisson distribution with some parameter Ξ». More specifically, one can show that the Poisson distribution is a limiting case of the binomial distribution when n gets really large and p get really small. β€œSuccess” here is the event that any given seed falls into one particular section. We then want to count the number of successes in 5 trillion trials.

In general, the Poisson distribution is often used to describe the distribution of rare events in a large population.

All probabilities sum to 1

\(\sum_{k=0}^{\infty} P(X=k)=\sum_{k=0}^{\infty} \frac{\lambda^{k}}{k !} e^{-\lambda}=e^{-\lambda} \sum_{k=0}^{\infty} \frac{\lambda^{k}}{k!} = e^{-\lambda} * e^{\lambda} = 1\)

Mean (Expected Value)

\(E(X)=\sum_{k=0}^{\infty} k P(X=k)=\sum_{k=0}^{\infty} k \frac{\lambda^{k}}{k !} e^{-\lambda}=\lambda \sum_{k=1}^{\infty} \frac{\lambda^{k-1}}{(k-1) !} e^{-\lambda} = \lambda\)

\(E\left(X^{2}\right)=\sum_{k=0}^{\infty} k^{2} P(X=k)=\sum_{k=0}^{\infty} k^{2} \frac{\lambda^{k}}{k !} e^{-\lambda}=\lambda(\lambda+1)^{e}\)

Variance

\(V(X)=E\left(X^{2}\right)-(E(X))^{2}=\lambda(\lambda+1)-\lambda^{2}=\lambda\)

Continuous Distributions

Definition

A random variable is continuous if possible values comprise either a single interval on the number line or a union of disjoint intervals. X = f(x) is the probability density function of the continues random variable X.

We model a continuous random variable with a curve f(x), called a probability density function (pdf).

_images/PDF_intro.jpg https://cdn.mathpix.com/snip/images/EhhUI3_AD2OLU1c1khtVJecNQhq_KaTJbQnAQF5oKFk.original.fullsize.png
Applications
  • In the study of the ecology of a lake, a rv X could be the depth measurements at randomly chosen locations.

  • In a study of a chemical reaction, Y could be the concentration level of a particular chemical in solution.

  • In a study of customer service, W could be the time a customer waits for service.

  • f(x) represents the height of the curve at point x.

  • For continuous random variables probabilities are areas under the curve.

Attention

We can’t model continuous random variable using discrete rv method.

\[ P(a \leq X \leq b)=\int_{a}^{b} f(x) d x \]
Properties
  1. The probability density function \(f:(-\infty, \infty) \rightarrow[0, \infty) \text{ so } f (x) \geq 0\).

  2. \(P(-\infty<X<\infty)=\int_{-\infty}^{\infty} f(x) d x=1=P(S)\)

  3. \(P(a \leq X \leq b)=\int_{a}^{b} f(x) d x\)

Note

\(P(X=a)=\int_{a}^{a} f(x) d x=0 \text { for all real numbers } a\)

Uniform rv

Random variable \(X \sim U[a,b]\) has the uniform distribution on the interval [a, b] if its density function is

import torch
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import uniform

sns.set_theme(style="darkgrid")

# random numbers from uniform distribution
n = 10000
start = 10
width = 20
data_uniform = uniform.rvs(size=n, loc = start, scale=width)
ax = sns.displot(data_uniform,
                  bins=100,
                  kde=True)
ax.set(xlabel='Uniform Distribution ', ylabel='Frequency')
plt.show()
_images/5dae0583f257edf8268fcc0763d52362d4a6d87de3f551d18539467f803db900.png
\[\begin{split} f(x)=\begin{cases} \frac{1}{b - a} & \mathrm{for}\ a \le x \le b, \\[8pt] 0 & \mathrm{for}\ x<a\ \mathrm{or}\ x>b \end{cases} = \frac{1}{b - a} \cdot I_{(a,b)}(x) \end{split}\]
CDF
\[ \begin{align}\begin{aligned} F(x)=P(X \leq x)=\int_{-\infty}^{x} f(t) dt\\= \int_{a}^{x} \frac{1}{b-a} dt \end{aligned}\end{align} \]
\[\begin{split} F(x)= \begin{cases} 0 & \text{for }x < a \\[8pt] \frac{x-a}{b-a} & \text{for }a \le x \le b \\[8pt] 1 & \text{for }x > b \end{cases} \end{split}\]
Expected Value and Variance
\[\begin{split} f(x)=\begin{cases} \frac{1}{b - a} & \mathrm{for}\ a \le x \le b, \\[8pt] 0 & \mathrm{for}\ x<a\ \mathrm{or}\ x>b \end{cases} \end{split}\]
\[\begin{split} \begin{aligned} E(X) &=\int_{a}^{b} x \cdot \frac{1}{b-a} d x=\left.\frac{1}{b-a} \frac{x^{2}}{2}\right|_{a} ^{b}=\frac{b^{2}-a^{2}}{2(b-a)}=\frac{b+a}{2} \\ E\left(X^{2}\right) &=\int_{a}^{b} x^{2} \frac{1}{b-a} d x=\left.\frac{1}{b-a} \frac{x^{3}}{3}\right|_{a} ^{b}=\frac{b^{3}-a^{3}}{3(b-a)}=\frac{b^{2}+a b+a^{2}}{3} \\ V(X) &=E\left(X^{2}\right)-(E(X))^{2} \\ &=\frac{b^{2}+a b+a^{2}}{3}-\left(\frac{b+a}{2}\right)^{2}=\frac{(b-a)^{2}}{12} \end{aligned} \end{split}\]
Example

For random variable \(X \sim U(0,23)\). Find P(2 < X < 18)

\(P(2 < X < 18) = (18-2)\cdot \frac 1 {23-0} = \frac {16}{23}\)

Exponential Distribution

The exponential distribution is a continuous probability distribution that often concerns the amount of time until some specific event happens. It is a process in which events happen continuously and independently at a constant average rate. The exponential distribution has the key property of being memoryless.

Applications

The family of exponential distributions provides probability models that are widely used in engineering and science disciplines to describe time-to-event data.

  • Time until birth

  • Time until a light bulb fails

  • Waiting time in a queue

  • Length of service time

  • Time between customer arrivals

  • the amount of money spent by the customer

  • Calculating the time until the radioactive particle decays

PDF

The continuous random variable, say X is said to have an exponential distribution, if it has the following probability density function:

\[\begin{split} \large f(x;\lambda) = \begin{cases} \lambda e^{ - \lambda x} & x \ge 0, \\ 0 & x < 0. \end{cases} =\lambda e^{-\lambda x} I_{(0, \infty)}(x) \end{split}\]

Ξ» is called the distribution rate.

import torch
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import expon

sns.set_theme(style="darkgrid")

data_expon = expon.rvs(scale=1,loc=0,size=1000)
ax = sns.displot(data_expon,
                  kde=True,
                  bins=100)
ax.set(xlabel='Exponential Distribution', ylabel='Frequency')
plt.show()
_images/2e7d409fe349cc27c283cc2eb21e3be57a93146a0fc5c1f20f9c0c2d637e207a.png
Expected Value

The mean of the exponential distribution is calculated using the integration by parts.

\[ \begin{align}\begin{aligned}\begin{split} \begin{aligned} &E[X]=\int_{0}^{\infty} x f(x) d x=\int_0^{\infty} x \lambda e^{-\lambda x} d x \\ &=\lambda\left[\left|\frac{-x e^{-\lambda x}}{\lambda}\right|_0^{\infty}+\frac{1}{\lambda} \int_0^{\infty} e^{-\lambda x} d x\right] \\ &=\lambda\left[0+\frac{1}{\lambda} \frac{-e^{-\lambda x}}{\lambda}\right]_0^{\infty} \\ &=\lambda \frac{1}{\lambda^2} \\ &=\frac{1}{\lambda} \end{aligned}\end{split}\\\begin{split}E[X^2]&= \int_{0}^{\infty} x^2 f(x) d x \\ &= \int_{0}^{\infty} x^2 \lambda e^{ - \lambda x} d x \\ &= \frac{2}{\lambda^2} \end{split}\end{aligned}\end{align} \]
Variance

To find the variance of the exponential distribution, we need to find the second moment of the exponential distribution

\[\begin{split} V(X) &= E(X^2) - E(X)^2 \\ &= \frac{2}{\lambda^2} - (\frac{1}{\lambda})^2 \\ &= \frac{1}{\lambda^2} \end{split}\]
Properties

The most important property of the exponential distribution is the memoryless property. This property is also applicable to the geometric distribution.

Normal (Gaussian) Distribution

It is often called Gaussian distribution, in honor of Carl Friedrich Gauss (1777-1855), an eminent German mathematician who gave important contributions towards a better understanding of the normal distribution.

Normal Random Variable

A continuous random variable \(X \sim N(\mu,\sigma^2)\) has the normal distribution with parameters \(\mu\) and \(\sigma^2\) if its density is given by

\[\large f(x)=\frac{1}{\sqrt{2 \pi} \sigma} e^{-(x-\mu)^2 / 2 \sigma^2} \text { for }-\infty<x<\infty\]

A normal distribution is a distribution that is solely dependent on two parameters of the data set: mean and the standard deviation of the sample.

Normal distribution

Attention

This characteristic of the distribution makes it extremely simple for statisticians and hence any variable that exhibits normal distribution is feasible to be forecasted with higher accuracy. Essentially, it can help in simplying the model.

Parameters
  • Mu is a location parameter. If you change the value of Mu, the entire bell curve is going to slide around.

  • If you increase Sigma squared, it’s going to get fatter and therefore shorter because the total area is one, So if it gets fatter, it has to come down. If Sigma squared gets smaller, it’s going to get really tall and thin.

Properties

Normal distribution is simple to explain: Why?

The reasons are:

  1. The mean, mode, and median of the distribution are equal.

  2. We only need to use the mean and standard deviation to explain the entire distribution.

  • f(x) is symmetric around \(x=\mu\) as a consequence, deviations from the mean having the same magnitude.

  • f(x) > 0 for all \(x\) and \(\int_{-\infty}^{\infty} f(x) dx = 1\).

  • \(\mu + \sigma\) and \(\mu - \sigma\) are inflection points on f(x).

  • Mean and median are equal; both are located at the center of the distribution.

Why is it so important

The normal distribution is extremely important because:

  • many real-world phenomena involve random quantities that are approximately normal

  • it plays a crucial role in the Central Limit Theorem, one of the fundamental results in statistics;

  • its great analytical tractability makes it very popular in statistical modelling.

The following variables are close to normally distributed variables:

  1. Height of a population

  2. Blood pressure of adult human

  3. Position of a particle that experiences diffusion

  4. Measurement errors

  5. Residuals in regression

  6. Shoe size of a population

  7. Amount of time it takes for employees to reach home

  8. A large number of educational measures

But how are so many variables approximately normally distributed?

Let’s consider that the height of a population is a random variable. We can take a sample of heights, plot its distribution and calculate the sample mean. When we repeat this experiment whilst we increase the number of samples then the mean of the samples will end up being very close to normality.

This is known as the Central Limit Theorem.

Probability Density Function

If we plot the normal distribution density function, it’s curve has the following characteristics:

Normal distribution
  • Mean is the center of the curve. This is the highest point of the curve as most of the points are at the mean.

  • There is an equal number of points on each side of the curve. The center of the curve has the most number of points.

  • The total area under the curve is the total probability of all of the values that the variable can take.

  • The total curve area is therefore 100%

Normal distribution
  • Approximately 68.2% of all of the points are within the range -1 to 1 standard deviation.

  • About 95.5% of all of the points are within the range -2 to 2 standard deviations.

  • About 99.7% of all of the points are within the range -3 to 3 standard deviations.

This allows us to easily estimate how volatile a variable is and given a confidence level, what its likely value is going to be. As an instance, in the grey bell-shaped curve above, there is a 68.2% chance that the value of the variable will be within 101–99.

Normal Probability Distribution Function
\[ \large f(x)=\frac{1}{\sqrt{2 \pi} \sigma} e^{-(x-\mu)^{2} / 2 \sigma^{2}} = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2} \text { for }-\infty<x<\infty \]
Normal distribution
import torch
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import norm


sns.set_theme(style="darkgrid")
sample = torch.normal(mean = 8, std = 16, size=(1,1000))

sns.displot(sample[0], kde=True, stat = 'density',)
plt.axvline(torch.mean(sample[0]), color='red', label='mean')

plt.show()
_images/f2fbc9fb5f4367d1e913923a508be33cfb0183baaeb88559aaa0798f37f9c33a.png

norm.pdf returns a PDF value. The following is the PDF value when π‘₯=1, πœ‡=0, 𝜎=1.

We graph a PDF of the normal distribution using scipy, numpy and matplotlib. We use the domain of βˆ’4<π‘₯<4, the range of 0<𝑓(π‘₯)<0.45, the default values πœ‡=0 and 𝜎=1. plot(x-values,y-values) produces the graph.

print(norm.pdf(x=1.0, loc=0, scale=1))

x = torch.arange(-4,4,0.001)
fig, ax = plt.subplots()

ax.set_title('N(0,$1^2$)')
ax.set_xlabel('x')
ax.set_ylabel('f(x)')
ax.plot(x, norm.pdf(x))
ax.set_ylim(0,0.45)
plt.show()
0.24197072451914337
_images/a5e736014dcc5216d62123cb187c1e8e897dc60887cecea9504e629bcb4b69e8.png

A normal curve is smooth bell-shaped. It is symmetrical about the π‘₯=πœ‡ and has a maximum point at π‘₯=πœ‡.

Normal distribution PDF with different standard deviations

Let’s plot the probability distribution functions of a normal distribution where the mean has different standard deviations.

scipy.norm.pdf has keywords, loc and scale. The location (loc) keyword specifies the mean and the scale (scale) keyword specifies the standard deviation.

fig, ax = plt.subplots()
x = torch.arange(-4,4,0.001)

stdvs = [1.0, 2.0, 3.0, 4.0]
for s in stdvs:
    ax.plot(x, norm.pdf(x,scale=s), label='stdv=%.1f' % s)
    
ax.set_xlabel('x')
ax.set_ylabel('pdf(x)')
ax.set_title('Normal Distribution')
ax.legend(loc='best', frameon=True)
ax.set_ylim(0,0.45)
ax.grid(True)
_images/993f348ae9edc091f2dbada07151b0f5e6bcfe59e0118d59f02cbfbff6c49145.png
Normal distribution PDF with different means

Let’s plot the probability distribution functions of a normal distribution where the mean has different values.

fig, ax = plt.subplots()
x = torch.linspace(-10,10,100)

means = [0.0, 1.0, 2.0, 5.0]
for mean in means:
    ax.plot(x, norm.pdf(x,loc=mean), label='mean=%.1f' % mean)
    
ax.set_xlabel('x')
ax.set_ylabel('pdf(x)')
ax.set_title('Normal Distribution')
ax.legend(loc='best', frameon=True)
ax.set_ylim(0,0.45)
ax.grid(True)
_images/de1afdcc8d4ae2e44e9ae71f3f231ff39e94f78a97e3e412a6c50a722627825c.png

The mean of the distribution determines the location of the center of the graph. As you can see in the above graph, the shape of the graph does not change by changing the mean, but the graph is translated horizontally.

A cumulative normal distribution function

The cumulative distribution function of a random variable X, evaluated at x, is the probability that X will take a value less than or equal to x. Since the normal distribution is a continuous distribution, the shaded area of the curve represents the probability that X is less or equal than x.

fig, ax = plt.subplots()
# for distribution curve
x= torch.arange(-4,4,0.001)

ax.plot(x, norm.pdf(x))
ax.set_title("Cumulative normal distribution")
ax.set_xlabel('x')
ax.set_ylabel('pdf(x)')
ax.grid(True)
# for fill_between
px=torch.arange(-4,1,0.01)
ax.set_ylim(0,0.5)
ax.fill_between(px,norm.pdf(px),alpha=0.5, color='g')
# for text
ax.text(-1,0.1,"cdf(x)", fontsize=20)
plt.show()
_images/4a515b328c1374fc32e4873a48a40cca73569cf978a5cb1f028faa6b191f43cc.png
Expected Value and Variance

\(E(X) = \mu\)

\(V(X) = \sigma^2\)

Problems With Normality

Assuming normality has its own flaws. As an instance, we cannot assume that the stock price follows normal distribution as the price cannot be negative. Therefore the stock price potentially follows a log of the normal distribution to ensure it is never below zero.

We know that the daily returns can be negative, therefore the returns can at times follow a normal distribution. It is not wise to assume that the variable follows a normal distribution without any analysis.

A variable can follow Poisson, Student-t, or Binomial distribution as an instance and falsely assuming that a variable follows normal distribution can lead to inaccurate results.

Identify the Normal RV?

Many methods exist for testing whether a variable has a normal distribution

1. Histogram

The histogram is a data visualization that shows the distribution of a variable. It is a bar graph that shows the frequency of each value in the variable. The histogram is a graphical representation of the distribution of a variable.

sample = torch.normal(mean = 8, std = 16, size=(1,1000))
sample2 = torch.distributions.uniform.Uniform(2,3).sample([1,1000])

sns.displot(sample[0], kde=True,).set(title='Normal Distribution')
plt.axvline(torch.mean(sample[0]), color='red', label='mean')

sns.displot(sample2[0], kde=True,).set(title='Uniform Distribution')
plt.show()
_images/f6ac69ad24d414a1dd4b241ca3310f22f16966edbcd20562141bdf837b9f992b.png _images/fced0d38c648638b39beb78a41ff2cb6792b038f8eb73986e7d2b329f2e71256.png
2. Box Plot

The Box Plot is another visualization technique that can be used for detecting non-normal samples.

The box plot is a graphical representation of the distribution of a variable. It is a graphical representation of the five-number summary of a variable. The five-number summary is the minimum, first quartile, median, third quartile, and maximum of a variable.

3. QQ Plot

QQ Plot stands for Quantile vs Quantile Plot, which is exactly what it does: plotting theoretical quantiles against the actual quantiles of our variable.

The QQ Plot allows us to see deviation of a normal distribution much better than in a Histogram or Box Plot.

Standard Normal Distribution

The normal distribution with parameter values \(\mu\) = 0 and \(\sigma^2\) = 1 is called the standard normal distribution.

A rv with the standard normal distribution is denoted by \(Z \sim N(0, 1)\)

If \(X \sim N\left(\mu, \sigma^2\right)\) then

\[ f_X(x)=\frac{1}{\sqrt{2 \pi} \sigma} e^{-(x-\mu)^2 / 2 \sigma^2} \text { for }-\infty<x<\infty \]

If \(Z \sim N(0,1)\) then

\[ f_Z(x)=\frac{1}{\sqrt{2 \pi}} e^{-x^2 / 2} \text { for }-\infty<x<\infty \]
PDF

\(f_{Z}(x)=\frac{1}{\sqrt{2 \pi}} e^{-x^{2} / 2} \text { for }-\infty<x<\infty\)

Cumulative distribution function

We use special notation to denote the cdf of the standard normal curve

\(F(z)=\Phi(z)=P(Z \leq z)=\int_{-\infty}^{z} \frac{1}{\sqrt{2 \pi}} e^{-x^{2} / 2} d x\)

Cumulative distribution function for Normal distribution
Properties
  1. The standard normal density function is symmetric about the y axis.

  2. The standard normal distribution rarely occurs naturally.

  3. Instead, it is a reference distribution from which information about other normal distributions can be obtained via a simple formula.

  4. The cdf of the standard normal, \(\Phi(z)\) can be found in tables and it can also be computed with a single command in R.

  5. As we’ll see, sums of standard normal random variables play a large role in statistical analysis.

Proposition

\(\text { If } X \sim N\left(\mu, \sigma^{2}\right), \text { then } \frac{X-\mu}{\sigma} \sim N(0,1)\)

\(\frac{X-\mu}{\sigma}\) Shifted by \(\mu\) or (Centered at zero) and scaled by \(\frac{1}{\sigma}\) that will give us variance of 1.

\(\mathrm{Z} \sim \mathrm{N}(0,1) \Rightarrow \sigma \mathrm{Z}+\mu \sim \mathrm{N}\left(\mu, \sigma^2\right)\)

Example

Let \(X \sim N(2,3)\)

\[\begin{split} \begin{aligned} P ( X \leq 4.1) &= P \left(\frac{ X -\mu}{\sigma} \leq \frac{4.1-2}{\sqrt{3}}\right) \\ &= P ( Z \leq 1.21) \end{aligned} \end{split}\]
Proving this proposition

For any continuous random variable. Suppose we have Y rv, with Desnity function \(f_{Y}(y)\)

We know

\(P(y \leqslant a)=\int_{-\infty}^{a} f_{y}(y) d y\)

What if

\(P(2y \leqslant a)\)

= Can’t really use the density function until we isolate y =

\(P\left(y \leq \frac{a}{2}\right) = \int_{-\infty}^{a / 2} f_{y}(y) d y\)

This true for all transformation of Y.

With \(P\left(\frac{x-\mu}{\sigma} \leq a\right)=P(x \leq a \sigma+\mu) = \int_{x}^{a \sigma+\mu} \frac{1}{\sqrt{2 \pi} \sigma} e^{-(x-\mu)^{2} / 2 \sigma^{2}}\)

U subsitution

Let

\(u=\frac{x-\mu}{\sigma}\)

\(d u=\frac{1}{\sigma} d x\)

SO \(= \int_{-\infty}^{a} \frac{1}{\sqrt{2 \pi}} e^{-u^{2} / 2} d u\) This is density function for N(0,1).

Examples
Find P(X<2) when N(3, 2)?

In norm.cdf, the location (loc) keyword specifies the mean and the scale (scale) keyword specifies the standard deviation.

from scipy.stats import norm

val = norm.cdf(x=2, loc=3, scale=2)

print(f"P(X<2) = {val}")


fig, ax = plt.subplots()

x= torch.arange(-4,10,0.001)
ax.plot(x, norm.pdf(x,loc=3,scale=2))
ax.set_title("N(3,$2^2$)")
ax.set_xlabel('x')
ax.set_ylabel('pdf(x)')
ax.grid(True)
# for fill_between
px=torch.arange(-4,2,0.01)
ax.set_ylim(0,0.25)
ax.fill_between(px,norm.pdf(px,loc=3,scale=2),alpha=0.5, color='g')
# for text
ax.text(-0.5,0.02,round(val,2), fontsize=20)
plt.show()
P(X<2) = 0.3085375387259869
_images/a01bfb02df54d8e7d64744692463be1e054bcf999e81cd792cff85351ddf27dd.png
\[\begin{split} \begin{aligned} \mathrm{P}(\mathrm{X} \leq 2) &=\mathrm{P}\left(\frac{\mathrm{X}-\mu}{\sigma} \leq \frac{2-3}{\sqrt{2}}\right) \\ &=\mathrm{P}(\mathrm{Z} \leq 1.21) \\ & \approx 0.30 \end{aligned} \end{split}\]

R code: pnorm(1.2)

Find P(X<4.1) when N(2, 3)?

Let \(X \sim N(2,3)\). Then

\[\begin{split} \begin{aligned} P ( X \leq 4.1) &= P \left(\frac{ X -\mu}{\sigma} \leq \frac{4.1-2}{\sqrt{3}}\right) \\ &= P (Z \leq 1.21) \\ & \approx 0.8868 \end{aligned} \end{split}\]

R Code: pnorm(1.21)

z_score <- (4.1 - 2) / sqrt(3)
pnorm(z_score)
\[\begin{split} \begin{aligned} & X _1, X _2, \ldots, X _{10} \stackrel{ id }{\sim} N (2,3) \\ &\overline{ X } \sim N \left(\mu, \sigma^2 / n \right)= N (2,3 / 10) \\ & P (\overline{ X } \leq 2.3)= P \left(\frac{\overline{ X }-\mu_{\overline{ X }}}{\sigma_{\overline{ X }}} \leq \frac{2.3-2}{\sqrt{3 / 10}}\right) \\ &\frac{\overline{ X -\mu}}{\sigma / \sqrt{ n }}=\begin{aligned} &= P ( Z \leq 0.5477) \\ & \approx 0.7081 \end{aligned} \end{aligned} \end{split}\]
Interval between variables

To find the probability of an interval between certain variables, you need to subtract cdf from another cdf.

Let’s find 𝑃(0.5<𝑋<2) with a mean of 1 and a standard deviation of 2.

print(norm(1, 2).cdf(2) - norm(1,2).cdf(0.5))
0.2901687869569368
\[ X \sim N\left(1,2^2\right), P(0.5<X<2) \]
fig, ax = plt.subplots()

# for distribution curve
x= torch.arange(-6,8,0.001)
ax.plot(x, norm.pdf(x,loc=1,scale=2))
ax.set_title("N(1,$2^2$)")
ax.set_xlabel('x')
ax.set_ylabel('pdf(x)')
ax.grid(True)
px=torch.arange(0.5,2,0.01)
ax.set_ylim(0,0.25)
ax.fill_between(px,norm.pdf(px,loc=1,scale=2),alpha=0.5, color='g')
pro=norm(1, 2).cdf(2) - norm(1,2).cdf(0.5)
ax.text(0.2,0.02,round(pro,2), fontsize=20)
plt.show()
_images/a6f77d27d2d5208885f3d3491a9c19bc985da2943fae9dbd14c4c752f3f36308.png
P(Z > 1.25) ?

Let’s plot a graph.

fig, ax = plt.subplots()
x= torch.arange(-4,4,0.01)
gr4sf=norm.sf(x=1.25, loc=0, scale=1)
print(gr4sf)

ax.plot(x, norm.pdf(x,loc=0,scale=1))
ax.set_title("N(0,$1^2$)")

ax.set_xlabel('x')
ax.set_ylabel('pdf(x)')

ax.grid(True)
px=torch.arange(1.25,4,0.01)
#ax.set_ylim(0,0.15)
ax.fill_between(px,norm.pdf(px,loc=0,scale=1),alpha=0.5, color='g')
ax.text(1.25,0.02,"sf(x) %.2f" %(gr4sf), fontsize=20)
plt.show()
0.10564977366685535
_images/6f71f4ae813e32d1faebaa0efcfc80e9e0b48457da566f9d280354d0e93e4ef1.png

we can use sf which is called the survival function and it returns 1-cdf.

If X = N(1, 4), find P(0 < X < 3.2)?
\[\begin{split} P(0 \leq X \leq 3.2)&=\int_{0}^{3.2} f_{X}(x) d x \\ &=P\left(\frac{0-1}{2} \leqslant \frac{x-1}{2} \leqslant \frac{3.2-1}{2}\right) \\ &=P\left(-\frac{1}{2} \leq Z \leq 1.1\right)\\ &=P(z \leq 1.1)-P\left(z<-\frac{1}{2}\right)\\ &=\Phi(1.1)-\Phi\left(-\frac{1}{2}\right)\\ &= .558 \end{split}\]
print(norm(1, 2).cdf(3.2) - norm(1,2).cdf(0))
0.5557964003276304
fig, ax = plt.subplots()

# for distribution curve
x= torch.arange(-6,8,0.001)
ax.plot(x, norm.pdf(x,loc=1,scale=2))
ax.set_title("N(1,$2^2$)")
ax.set_xlabel('x')
ax.set_ylabel('pdf(x)')
ax.grid(True)
px=torch.arange(0.5,2,0.01)
ax.set_ylim(0,0.25)
ax.fill_between(px,norm.pdf(px,loc=1,scale=2),alpha=0.5, color='g')
pro=norm(1, 2).cdf(3.5) - norm(1,2).cdf(0)
ax.text(0.2,0.02,round(pro,2), fontsize=20)
plt.show()
_images/98ede88534ca15d3652eef2d1a9add81a51a2727b2a509ba07778903d851e907.png

Gamma Distribution

The gamma distribution term is mostly used as a distribution which is defined as two parameters – shape parameter and inverse scale parameter, having continuous probability distributions. Its importance is largely due to its relation to exponential and normal distributions.

Gamma distributions have two free parameters, named as alpha (Ξ±) and beta (Ξ²), where;

  • Ξ± = Shape parameter

  • Ξ² = Rate parameter (the reciprocal of the scale parameter)

The scale parameter Ξ² is used only to scale the distribution. This can be understood by remarking that wherever the random variable x appears in the probability density, then it is divided by Ξ². Since the scale parameter provides the dimensional data, it is seldom useful to work with the β€œstandard” gamma distribution, i.e., with Ξ² = 1.

Gamma function:

The gamma function \([10]\), shown by \(\Gamma( x )\), is an extension of the factorial function to real (and complex) numbers. Specifically, if \(n \in\{1,2,3, \ldots\}\), then

\[ \Gamma( n )=( n -1) ! \]
Probability Density Function
\[ f(x)=\frac{1}{\Gamma(\alpha)} \beta^\alpha x^{\alpha-1} e^{-\beta x } \]
Mean
\[\begin{split} \begin{aligned} \mu &= E [ X ]=\int_{-\infty}^{\infty} xf ( x ) dx \\ &=\int_0^{\infty} x \frac{1}{\Gamma(\alpha)} \beta^\alpha x ^{\alpha-1} e ^{-\beta x } dx \\ &=\frac{\alpha}{\beta} \end{aligned} \end{split}\]
Variance
\[\begin{split} \begin{aligned} \sigma^2=\operatorname{Var}[ X ] &= E \left[( X -\mu)^2\right] \\ &= E \left[ X ^2\right]-( E [ X ])^2 \\ &=\cdots=\frac{\alpha}{\beta^2} \end{aligned} \end{split}\]

Chi-squared Distribution

One Parameter:

  • degrees of freedom: \(n \geq 1\) ( \(n\) is an integer) \(X \sim \chi^2(n)\) is defined as \(\Gamma\left(\frac{n}{2}, \frac{1}{2}\right)\)

mean
\[ \mu= E [ X ]= n \]
variance
\[ \sigma^2=\operatorname{Var}[ X ]=2 n \]

T-distribution

Let \(Z \sim N(0,1)\) and \(W \sim \chi^2(n)\) be independent random variables. Define

\[ T =\frac{ Z }{\sqrt{ W / n }} \]

the t-distribution Write \(X \sim t ( n )\) One Parameter: degrees of freedom: \(n \geq 1\) ( \(n\) is an integer) The pdf:

\[\begin{split} \begin{array}{r} f(x)=\frac{\Gamma\left(\frac{n+1}{2}\right)}{\sqrt{\pi n} \Gamma\left(\frac{n}{2}\right)}\left(1+\frac{x^2}{n}\right)^{-(n+1) / 2} \\ -\infty<x<\infty \end{array} \end{split}\]

Joint Distributions

Discrete Definition

Given two discrete random variables, X and Y , p(x, y) = P(X = x, Y = y) is the joint probability mass function for X and Y .

Important property X and Y are independent random variables if P(X = x, Y = y) = P(X = x)P(Y = y) for all possible values of x and y.

\(f(x,y) = P(X=x \, and \, Y=y) = P(X=x,Y=y)\)

https://cdn.mathpix.com/snip/images/a6mE2Yo5ORHTWtZDMnpVzTvJvL69xTELSymliMdALjs.original.fullsize.png
  • Sum of all marginal probabilities is equal to 1. ( P(y=0) + P(y=100) = P(y = 200) = 1 )

  • Sum of all joint probabilities is equal to 1.

https://cdn.mathpix.com/snip/images/8fycDrKQhcRZcJw-tMLlD0LZ-f-a_atVn0kRFQAo1FA.original.fullsize.png
Marginal Probabilities
https://cdn.mathpix.com/snip/images/o-d2bytuq6UKjG3CI4QAAYoNqe6oeWkiyt8T4hO_aYY.original.fullsize.png https://cdn.mathpix.com/snip/images/EGxK3sr8ldSWbOTfQKzCrG79rFd7Rmb3Mg9cnFL4w0M.original.fullsize.png
Example

An insurance agency services customers who have both a homeowner’s policy and an automobile policy. For each type of policy, a deductible amount must be specified. For an automobile policy, the choices are $100 or $250 and for the homeowner’s policy, the choices are $0, $100, or $200.

https://cdn.mathpix.com/snip/images/hkcqjPju1fkujlSsd8eaXcHp14U2gOUaEGveABxceQ0.original.fullsize.png

Recall Two events are independent if P(A and B) = P(A)P(B) for all possible values of A and B.

P(x=100,y=100) = .1
P(x=100) p(y=200) = (.5)(.25) =.125

X and y are not independent.

Continuous Definition

Definition: If X and Y are continuous random variables, then f(x, y) is the joint probability density function for X and Y if \(P(a \leq X \leq b, c \leq Y \leq d)=\int_{a}^{b} \int_{c}^{d} f(x, y) d x d y\) for all possible $a, b, c$, and $d$ Important property: $X$ and $Y$ are independent random variables if $f(x, y)=f(x) f(y)$ for all possible values of $x$ and $y$.

Example

Example: Suppose a room is lit with two light bulbs. Let \(X_{1}\) be the lifetime of the first bulb and \(X_{2}\) be the lifetime of the second bulb. Suppose \(X_{1} \sim {Exp}\left(\lambda_{1}=1 / 2000\right)\) and \(X_{2} \sim {Exp}\left(\lambda_{2}=1 / 3000\right)\). If we assume the lifetimes of the light bulbs are independent of each other, find the probability that the room is dark after 4000 hours.

\(E\left(X_{1}\right)=\frac{1}{\lambda_{1}}=2000 \mathrm{hrs}, E\left(X_{2}\right)=\frac{1}{\lambda_{2}}=3000 \mathrm{hrs}\).

Light bulbs function independently,so

\[ \begin{align}\begin{aligned}P\left(X_{1} \leq 4000, X_{2} \leq 4000\right)=P\left(X_{1} \leq 4000\right) P\left(X_{2} \leq 4000\right)\\=\left(\int_{0}^{4000} \lambda_{1} e^{-\lambda_{1} x_{1}} d x_{1}\right)\left(\int_{0}^{4000} \lambda_{2} e^{-\lambda_{2} x_{2}} d x_{2}\right)\\=\left.\left.\left(-e^{-\lambda_{1} x_{1}}\right)\right|_{0} ^{4000}\left(-e^{-\lambda_{2} x_{2}}\right)\right|_{0} ^{4000}\\\begin{split}\begin{aligned}=\left(1-e^{-4000 / 2000}\right)\left(1-e^{-4000 / 3000}\right) &=\left(1-e^{-2}\right)\left(1-e^{-4 / 3}\right) \\ & \simeq .6368 \end{aligned}\end{split}\end{aligned}\end{align} \]

Covariance and Correlation

Covariance

The covariance between two rv’s, X and Y, is defined as

\(\operatorname{Cov}(X, Y)=E[(X-E(X))(Y-E(Y))] = E[(X- \mu_x))(Y- \mu_y)]\)

\[\begin{split}\operatorname{Cov}(X, Y)=\left\{\begin{array}{c} \sum_{x} \sum_{y}\left(x-\mu_{X}\right)\left(y-\mu_{Y}\right) P(X=x, Y=y) \\ \int_{-\infty}^{\infty} \int_{-\infty}^{\infty}\left(x-\mu_{X}\right)\left(y-\mu_{Y}\right) f(x, y) d x d y \end{array}\right.\end{split}\]

The covariance depends on both the set of possible pairs and the probabilities for those pairs.

  • If both variables tend to deviate in the same direction (both go above their means or below their means at the same time), then the covariance will be positive.

  • If the opposite is true, the covariance will be negative.

  • If X and Y are not strongly (linearly) related, the covariance will be near 0.

https://cdn.mathpix.com/snip/images/9KZ-5o_ZqiQ0LW25nUj58r_2RU40AbNPD4iqZy3NR9E.original.fullsize.png
Computational formula for Covariance

\(\operatorname{Cov}(X, Y)=E[XY] -E[X]E[Y]\)

Correlation Coefficient

The correlation Coefficient of X and Y , denoted by Cor(X, Y ) Represented by the Greek letter β€˜β€™Οβ€™β€™ (rho)

\(Cor(X, Y) = \rho_{X,Y}= \frac{\operatorname{cov}(X,Y)}{\sigma_X \sigma_Y}\)

It represents a β€œscaled” covariance. The correlation is always between -1 and 1.

Transformations of Distributions

Discrete Distributions

Suppose that 𝖷 ∼ 𝖻𝗂𝗇(𝗇, 𝗉) What is the distribution of Y = n-X?

\(f(x)=P(X=x)= \binom{n}{x}p^x(1-p)^{n-x} \cdot I_{\{1,2,3, \ldots\}}(x)\)

Just do it:

\(P(Y=y)=P(n-X=y)=P(X=n-y)\)
\(= \binom{n}{n-y}p^x(1-p)^{n-(n-y)} \cdot I_{\{0,1,2,3, \ldots\}}(n-y)\)
\(= \binom{n}{y}p^n-y(1-p)^{y} \cdot I_{\{0,1,2,3, \ldots\}}(y) = 𝖸 ∼ 𝖻𝗂𝗇 (𝗇, 𝟣 βˆ’ 𝗉)\)
Continuous Distributions
Invertible functions

In the most general sense, are functions that β€œreverse” each other. For example, if f takes a to b, then the inverse, \(f^{-1}\) must take b to a. a function is invertible only if each input has a unique output. That is, each output is paired with exactly one input. That way, when the mapping is reversed, it will still be a function!

https://cdn.mathpix.com/snip/images/5XjLATEE1cUABbzPrffVRvF3B267cw-bYb8fpihmp1M.original.fullsize.png

For X discrete or continuous, the cumulative distribution function (cdf) Is denoted by F(x) and is defined by

\(F(X)= P(X < x)\)

https://cdn.mathpix.com/snip/images/0koe85iCdU9TJzUBMxXDNWtyn-Nd7T1yxoG0fY7gr-4.original.fullsize.png https://cdn.mathpix.com/snip/images/8FRSH7K9xdXqi68kbZcjUX6YQGv3MFsn1wmCvzSJu7E.original.fullsize.png https://cdn.mathpix.com/snip/images/DIljDw1WrQM_rQQ2vR8-kTs4vXwzHJbE94BrRIZRed4.original.fullsize.png

Estimators and Sampling Distributions

We have learned many different distributions for random variables and all of those distributions had parameters: the numbers that you provide as input when you define a random variable.

What if we don’t know the values of the parameters. What if instead of knowing the random variables, we have a lot of examples of data generated with the same underlying distribution? In this chapter we are going to learn formal ways of estimating parameters from data.

These ideas are critical for artificial intelligence. Almost all modern machine learning algorithms work like this

  1. specify a probabilistic model that has parameters.

  2. Learn the value of those parameters from data.

Both of these schools of thought assume that your data are independent and identically distributed (IID) samples.

Random Sample

A collection of random variables is independent and identically distributed if each random variable has the same probability distribution as the others and all are mutually independent.

Random Sample = \(X_1, X_2, X_3, ..., X_n\)

Suppose that \(X_1, X_2, X_3, ..., X_n\) is a random sample from the gamma distribution with parameters \(alpha\) and \(\beta\).

\[X_{1},X_{2}, \ldots, X_{n} \stackrel{\mathrm{iid}}{\sim} \Gamma(\alpha, \beta)\]

E.g

A good example is a succession of throws of a fair coin: The coin has no memory, so all the throws are β€œindependent”. And every throw is 50:50 (heads:tails), so the coin is and stays fair - the distribution from which every throw is drawn, so to speak, is and stays the same: β€œidentically distributed”.

Independent and identically distributed random variables (IID)

Random Sample == IID

Note

What are biased and unbiased estimators? A biased estimator is one that deviates from the true population value. An unbiased estimator is one that does not deviate from the true population parameter.

Parameters

Before we dive into parameter estimation, first let’s revisit the concept of parameters. Given a model, the parameters are the numbers that yield the actual distribution.

  • In the case of a Bernoulli random variable, the single parameter was the value p.

  • In the case of a Uniform random variable, the parameters are the a and b values that define the min and max value.

we are going to use the notation \(\theta\) to be a vector of all the parameters.

Distribution

Parameters

Bernoulli(p)

\(\theta = p\)

Poisson(Ξ»)

\(\theta = \lambda\)

Uniform(a,b)

\(\theta = (a,b)\)

Normal

\(\theta = (\mu,\sigma)\)

\(Y = wX + b\)

\(\theta = (w,b)\)

Sampling Distributions

\(\theta\) will denote a generic parameter.

E.g

\(\theta = \mu , \theta = p , \theta = \lambda , \theta = (\alpha, \beta)\)

Estimator

\(\hat{\theta}\) = a Random variable,

\(\hat{\theta}=\bar{X}\)

Estimate

\(\hat{\theta}\) = a observed number

\(\hat{\theta}=\bar{x} = 42.5\)

https://cdn.mathpix.com/snip/images/FHayA6rumuEuRTs3FBcp4TSwOAhRTpLl_3HSJXTSovo.original.fullsize.png
  • We want our estimator of to be correct β€œon average.

  • \(\bar{X}\) is a random variable with its owo distribution and its own mean or expected value.

We would like sample mean \(𝖀[\bar{𝖷}] = ΞΌ\) to be close to the true mean or population mean \(ΞΌ\).

Important

  • If this is true, we say that \(\bar{𝖷}\) is an unbiased estimator of \(\mu\).

  • In general, \(\bar{\theta}\) is an unbiased estimator of \(\theta\). if \(E[\bar{\theta}] = \theta\).

That’s is really good thing.

Mean

Let X1, X2, …, Xn be random sample from any distribution with mean \(\mu\).

That is \(E[X_i] = \mu\) for i = 1,2,3,…, n. Then

\[ \begin{align}\begin{aligned}E[\bar{X}]=E\left[\frac{1}{n} \sum_{i=1}^{n} X_{i}\right] =\frac{1}{n} \sum_{i=1}^{n} E\left[X_{i}\right]\\=\frac{1}{n} \sum_{\mathrm{i}=1}^{\mathrm{n}} \mu=\frac{1}{\mathrm{n}}(\mu+\mu+\cdots+\mu)=\frac{1}{\mathrm{n}} \mathrm{n} \mu=\mu\end{aligned}\end{align} \]

We have shown that, no matter what distribution we are working with, if the mean is \(\mu\) , \(\bar{X}\) is an unbiased estimator for \(\mu\).

Attention

We have shown that, no matter what distribution we are working with, if the mean \(\mu\) is , \(\bar{X}\) is an unbiased estimator for \(\mu\) .

Let X1, X2, …, Xn be random sample from any 𝖾𝗑𝗉(rate = \(\lambda\))

Let \(\bar{X}=\frac{1}{n} \sum_{i=1}^{n} X_{i}\) is the sample mean. We know, for the exponential distribution, that \(E[X_i]=\frac{1}{\lambda}\).

Then \(E[\bar{X}] = \frac{1}{\lambda}\)

Variance

Let X1, X2, …, Xn be random sample from any distribution with mean \(\mu\) and variance \(\sigma^2\).

  • We already know that \(\bar{X}\) is an unbiased estimator for \(\mu\) .

  • What can we say about the variance of \(\bar{X}\)?

\(Var[\bar{X}]=Var\left[\frac{1}{n} \sum_{i=1}^{n} X_{i}\right]= =\frac{1}{n^{2}} Var\left[\sum_{i=1}^{n} X_{i}\right] = =\frac{1}{n^{2}} \sum_{i=1}^{n} Var\left[X_{i}\right]\)

\(=\frac{1}{n^{2}} \sum_{i=1}^{n} \sigma^{2} = \frac{1}{n^{2}} n \sigma^{2} =\frac{\sigma^{2}}{\mathrm{n}}\)

Moments Generating Functions

We are still, believe it or not, trying to estimate things from a larger population based on a sample. For example, sample mean, or maybe the sum of the values in the sample etc. Any function of your data is known as a statistic. And we’re going to use them to estimate other things. And in order to figure out how well we’re doing, we’re going to need to know often the distributions of some of these statistics.

Distributions of sums

A lot of them depend on sums, so we’re going to start out by talking about the distribution of sums of random variables.

Suppose That,

\[\begin{split}X_{1}, X_{2}, \ldots, X_{n} \stackrel{\text { iid }}{\sim} Bernoulli(p) \\ \text { What is the distribution of } Y=\sum_{i=1}^{n} X_{i} ? \\ \text { Sum of Bernoulli rv is equal to bin(n,p) } \\ Y=\sum_{i=1}^{n} X_{i} \sim bin(n, p)\end{split}\]

Each X_i take value success (P) and failure (1-P). So summing all X_i is equal to sum of all success gives the value of Y. Which is binomial distribution.

Caution

Not all random variables are so easily interpreted by methods of Distributions of sums. So we need a tool.

Moment generating functions

The moments generating functions are the functions that generate the moments of a random variable. The expected values \(E(X), E\left(X^{2}\right), E\left(X^{3}\right), \ldots E\left(X^{r}\right)\) are called moments.

  • Mean \(\mu=E(X)\)

  • Variance \(\sigma^{2}=Var(X)=E\left(X^{2}\right)-\mu^{2}\)

which are functions of moments. moment-generating functions can sometimes make finding the mean and variance of a random variable simpler.

Let X be a random variable. It’s moment generating function (mgf) is denoted and defined as

Continuous Random Variables:

\(M_{X}(t)=E\left[e^{t X}\right]=\int_{-\infty}^{\infty} e^{t x} f_{X}(x) d x\)

Discrete Random Variables:

\(M_{X}(t)=E\left[e^{t X}\right]=\sum_{x} e^{t x} f_{x}(x)\)

where \(f_{X}(x)\) is the distribution of X.

Properties
  • Moment generating functions also uniquely identify distributions.

MGT of Famous Distributions

Bernoulli(p)
\[ \begin{align}\begin{aligned}M_{X}(t)=E\left[e^{t X}\right]=\sum_{x} e^{t x} f_{X}(x)=\sum_{x} e^{t x} P(X=x)\\=e^{t \cdot 0} P(X=0)+e^{t \cdot 1} P(X=1)\\=1 \cdot(1-p)+e^{t} \cdot p\\=1-p+p e^{t}\end{aligned}\end{align} \]
Binomial(n,p)

\(X \sim bin(n, p)\)

\[\begin{split}M_{x}(t)=\sum_{x=0}^{n}e^{tx}\binom{n}{x}p^x(1-p)^{n-x} \\ M_{x}(t)=\sum_{x=0}^{n}e^{tx}\binom{n}{x}(pe^t)^x(1-p)^{n-x}\end{split}\]
Binomial Theorem:

\((a + b)^n =\sum_{k=0}^{n}\binom{n}{k}a^k b^{n-k}\)

\[M_{X}(t)=(1-p+p e^{t})^n\]

Finding Distributions

A moment-generating function uniquely determines the probability distribution of a random variable. if two random variables have the same moment-generating function, then they must have the same probability distribution.

https://cdn.mathpix.com/snip/images/oLROi0YuJYc_kDzSYRACNdujNGLM3Qx_TPXKcbVE-qA.original.fullsize.png
Some distribution with \(X_{1}, X_{2}, \ldots, X_{n} \text { iid }\) and \(Y=\sum_{i=1}^{n} X_{i}\) .
\(M_{Y}(t)=\left[M_{X_{1}}(t)\right]^{n}\)

We have just seen that the moment generating function of the sum. Is the moment generating function of one of them raised to the nth power.

Key points
  • sum of n iid Bernoulli(p) random variables is bin(n, p)

  • sum of n iid exp(rate =lambda) random variables is Gamma(n, lambda)

  • sum of m iid bin(n,p) is bin(nm,p)

  • sum of n iid Gamma(alpha, beta) is Gamma(n alpha, beta)

  • sum of n iid \(N\left(\mu, \sigma^{2}\right) is N\left(n \mu, n \sigma^{2}\right)\).

  • sum of $n$ independent normal random variable with \(\mathrm{X}_{\mathrm{i}} \sim \mathrm{N}\left(\mu_{\mathrm{i}}, \sigma_{\mathrm{i}}^{2}\right)$ is $\mathrm{N}\left(\sum_{\mathrm{i}=1}^{\mathrm{n}} \mu_{\mathrm{i}}, \sum_{\mathrm{i}=1}^{\mathrm{n}} \sigma_{\mathrm{i}}^{2}\right)\)

https://cdn.mathpix.com/snip/images/RMOVPa5JmcmotyZpHqJltBKGFWCR9TppB3mE-qv5lrE.original.fullsize.png https://cdn.mathpix.com/snip/images/MzckM7PfGQ5KY2kijc-uhpk9fcnSA4wNB7f6ID40Ilg.original.fullsize.png https://cdn.mathpix.com/snip/images/csHt3NpVjl_KQpz7Tss5DiZ-0bjOP_XlzKrvRDDECVg.original.fullsize.png

Method of Moments Estimators(MMEs)

Method of moments means you set sample moments equal to population/theoretical moments.

It totally makes sense if you’re trying to estimate the mean or average out there in the entire population. That you should use the sample mean or sample average of the values in the sample, but what about parameters with not such an obvious interpretation?

Idea: Equate population and sample moments and solve for the unknown parameters.

Suppose that \(X_{1}, X_{2}, \ldots, X_{n} \stackrel{\text { iid }}{\sim} \Gamma(\alpha, \beta)\)
How can we estimate Ξ± ?
We could estimate the true mean \(\alpha / \beta\) with the sample mean \(\bar{X}\) , but we still can’t get at Ξ± if we don’t know Ξ².

Attention

Recall that the β€œmoments” of a distribution are defined as 𝖀[𝖷], 𝖀[π–·πŸ€], 𝖀[𝖷πŸ₯ ], … These are distribution or β€œpopulation” moments

  • \(\mu=E[X]\) is a probability weighted average of the values in the population.

  • \(\bar{X}\) is the average of the values in the sample.

It was natural for us to think about estimating $mu$ with the average in our sample.

  • \(\mathrm{E}\left[\mathrm{X}^{2}\right]\) is a probability weighted average of the squares of the values in the population.

It is intuitively nice to estimate it with the average of the squared values in the sample:

\[ \begin{align}\begin{aligned}\frac{1}{n} \sum_{i=1}^{n} X_{i}^{2}\\\text{The kth population moments:}\\\mu_{\mathrm{k}}=\mathrm{E}\left[\mathrm{X}^{\mathrm{k}}\right] \quad \mathrm{k}=1,2,3, \ldots\\\text{The kth population moments:}\\\mu_{\mathrm{k}}=\mathrm{E}\left[X^{\mathrm{k}}\right] \quad \mathrm{k}=1,2,3, \ldots\\\text{The kth sample moments:}\\M_{k}=\frac{1}{n} \sum_{i=1}^{n} X_{i}^{k} \quad k=1,2,3, \ldots\end{aligned}\end{align} \]
Eg
\[ \begin{align}\begin{aligned}X_{1}, X_{2}, \ldots, X_{n} \stackrel{\text { iid }}{\sim} \exp (\text { rate }=\lambda)\\\text{First population moment:}\\\mu_{1}=\mu=\mathrm{E}[\mathrm{X}]=\frac{1}{\lambda}\\\text{First sample moment:}\\M_{1}=\frac{1}{n} \sum_{i=1}^{n} X_{i}=\bar{X}\\\text{Equate:} \frac{1}{\lambda}=\bar{x}\\\text{Solve for the unknown parameter...} \lambda=\frac{1}{\bar{x}}\\\text{The MME is } \hat{\lambda}=\frac{1}{\bar{x}}\end{aligned}\end{align} \]

Maximum Likelihood Estimation

Idea

Choose the value in the parameter space that makes the observed data β€œmost likely”.

Suppose that we flip a biased coin which has the probability of getting β€œHeads” as either 0.2, 0.3, or 0.8. Suppose that we flip the coin 20 times and see the results:
Sample Space: H, H, T, H, H, H, H, T, H, H, H, H, H, T, H, H, H, H, H , H
Which of 0.2, 0.3, or 0.8 seems β€œmost likely”?

What if we only flip the coin twice? For i=1,2, let \(X_{i}=\begin{cases}1 & \text { if we get "Heads" on the ith flip } \\ 0, & \text { if we get "Tails" on the ith flip }\end{cases}\)

Let p=P(β€œHeads” on any one flip) Then X1, X2 ∼ Bernoulli(P) iid where 𝗉 ∈ {𝟒 . 𝟀, 𝟒 . πŸ₯, 𝟒 . πŸͺ}

Joint pmf Due to independence of the variables, we can write the joint pmf as

\[ \begin{align}\begin{aligned}\begin{split}\begin{aligned} f\left(x_{1}, x_{2}\right) &=P\left(X_{1}=x_{1}, X_{2}=x_{2}\right) \\ &=P\left(X_{1}=x_{1}\right) \cdot P\left(X_{2}=x_{2}\right) \end{aligned}\end{split}\\=p^{x_{1}}(1-p)^{1-x_{1}} \mathrm{I}_{\{0,1\}}\left(\mathrm{x}_{1}\right) \cdot \mathrm{p}^{\mathrm{x}_{2}}(1-p)^{1-\mathrm{x}_{2}} \mathrm{I}_{\{0,1\}}\left(\mathrm{x}_{2}\right)\\\begin{split}\text{if p=0.2 and (0,0)} = 0.2^0 \times (1 - 0.2)^0 \times 0.2^0 \times (1 - 0.2)^0 = 0.64 \\ \text{if p=0.8 and (0,1)} = 0.8^0 \times (1 - 0.8)^0 \times 0.8^1 \times (1 - 0.8)^1 = 0.16\end{split}\end{aligned}\end{align} \]

Tabulated values of the joint pmf

https://cdn.mathpix.com/snip/images/njax3wQTyJtFK4ii3YdkkwlqdeJ1iHNU9_CwYsGr21Y.original.fullsize.png
  • When we observe the data to be (0,0) i.e. (β€œTails”, β€œTails”), the value of p that gives the highest joint probability (0.64) is 0.2.

  • When we observe the data to be (0,1) or (1,0) i.e. (β€œTails”, β€œHeads”) or (β€œHeads”, β€œTails”), the value of p that gives the highest joint probability (0.21) is 0.3.

  • When we observe the data to be (1,1) i.e. (β€œHeads”, β€œHeads”), the value of p that gives the highest joint probability (0.64) is 0.8.

The maximum likelihood estimator for p is:

\[\begin{split}\widehat{p}= \begin{cases}0.2 & \text {, if }\left(x_{1}, x_{2}\right)=(0,0) \\ 0.3 & , \text { if }\left(x_{1}, x_{2}\right)=(0,1) \text { or }(1,0) \\ 0.8 & \text {, if }\left(x_{1}, x_{2}\right)=(1,1)\end{cases}\end{split}\]

Introduction

Given data \(X_1, X_2 ... X_n\), a random sample (iid) from a distribution with unknown parameter ΞΈ, we want to find the value of ΞΈ in the parameter space that maximizes our probability of observing that data.

For Discrete…

If \(X_1, X_2 ... X_n\) are discrete, we can look at \(P\left(X_{1}=x_{1}, X_{2}=x_{2}, \ldots, X_{n}=x_{n}\right)\) as a function of ΞΈ, and find the ΞΈ that maximizes it. This is the joint pmf for \(X_1, X_2 ... X_n\).

For Continuous…

If \(X_1, X_2 ... X_n\) are continuous is to maximize the joint pdf with respect to ΞΈ.

For Discrete…

If \(X_1, X_2 ... X_n\) are discrete, we can look at \(P\left(X_{1}=x_{1}, X_{2}=x_{2}, \ldots, X_{n}=x_{n}\right)\) as a function of ΞΈ, and find the ΞΈ that maximizes it. This is the joint pmf for \(X_1, X_2 ... X_n\).

For Continuous…

If \(X_1, X_2 ... X_n\) are continuous is to maximize the joint pdf with respect to ΞΈ.

The pmf/pdf for any one of is denoted by f(x). We will emphasize the dependence of f on a parameter ΞΈ by writing it as

\[ \begin{align}\begin{aligned}f(x) = f(x; \theta)\\\text{The joint pmf/pdf for all n of them is}\\f\left(x_{1}, x_{2}, \ldots, x_{n} ; \theta\right) = \prod_{i=1}^{n} f\left(x_{i} ; \boldsymbol{\theta}\right)\\f(\vec{x} ; \boldsymbol{\theta})=\prod_{i=1}^{n} f\left(x_{i} ; \boldsymbol{\theta}\right)\end{aligned}\end{align} \]
  • The data (the x’s) are fixed.

  • Think of the x’s as fixed and the joint pdf as a function of ΞΈ.

Given the joint PDF, the data, the Xs are fixed, and we think of it as a function of theta and we want to find the value of theta that maximizes the joint probability density function or probability mass function.

Likelihood function

If we think of this as a function of theta, and the x’s as fixed, we’re going to rename the joint PDF. We’re going to call it a likelihood function and write it as a capital L of theta L(ΞΈ).

https://cdn.mathpix.com/snip/images/6x2RRobffl10lBr_hRqa04kDpnXBSS9DGIBc7gc-o_4.original.fullsize.png

Attention

Because I can multiply or divide my likelihood by a constant and not change where the maximum occurs, then we can actually define the likelihood to be anything proportional to the joint pdf. So we can throw out multiplicative constants, including multiplicative constants that involve Xs.

Bernoulli distribution

\(X_{1}, X_{2}, \ldots, X_{n} \stackrel{\text { iid }}{\sim} \text { Bernoulli }(p)\)

The pmf for one of them is \(f(x ; p)= p^{x}(1-p)^{1-x} I_{\{0,1\}}(x)\)
The joint pmf for all of them is
\[f(\vec{x} ; p) = \prod_{i=1}^{n} f\left(x_{i} ; p\right) = \prod_{i=1}^{n} p^{x_{i}}(1-p)^{1-x_{i}} I_{\{0,1\}}\left(x_{i}\right)\]

The joint probability mass function we’ll get by multiplying the individual ones together, because these guys are IID independent and identically distributed. Now, fix the Xs. Those are stuck, fixed, not moving, and think of this as a function of p. The values of p that are allowed, the parameter space for this model, are all values of p between 0 and 1.

For example I have p^X_1 times p^X_2 times p^X_3 and that’s going to be p to the sum of the Xs, and I’ve got 1 minus p^1 minus X_1, 1 minus p^1 minus X_2. If I add up those exponents, I’m going to get an exponent of n minus the sum of the Xs, and I do have a product of indicators.

\[=p^{\sum_{i=1}^{n} x_{i}}(1-p)^{n-\sum_{i=1}^{n} x_{i}} \prod_{i=1}^{n} I_{\{0,1\}}\left(x_{i}\right)\]

Drop the indicator stuff, so that is a multiplicative constant which is constant with respect to p. I think I’m going to drop it. Why not make it simpler?

\[\text{A likelihood is } L(p)=p^{\sum_{i=1}^{n} x_{i}}(1-p)^{n-\sum_{i=1}^{n} x_{i}}\]
Log-likelihood

It is almost always easier to maximize the log-likelihood function due to properties of Logarithms.

\(ln(uv) = ln(u) + ln(v) \text{ and } ln(n)^V = v \times ln(n)\)

Important

The log function is an increasing function. So the log of the likelihood is going to have different values than the likelihood, but because log is increasing, this is not going to mess up the location of the maximum.

\[\begin{split}L(p)=\log\left(\prod_{i=1}^{n} p^{x_{i}}(1-p)^{1-x_{i}} I_{\{0,1\}}\left(x_{i}\right)\right) \\ \ell(p)=\sum_{i=1}^{n} x_{i} \ln p+\left(n-\sum_{i=1}^{n} x_{i}\right) \ln (1-p)\end{split}\]

I want to maximize it with respect to p, so I'm going to take a derivative with respect to p and set it equal to 0.

\[\begin{split}\frac{\partial}{\partial p} l(p)=\frac{\sum_{i=1}^{n} x_{i}}{p}-\frac{n-\sum_{i=1}^{n} x_{i}}{1-p} \stackrel{\text { set }}{=} 0 \\ p(1-p)\left[\frac{\sum_{i=1}^{n} x_{i}}{p}-\frac{n-\sum_{i=1}^{n} x_{i}}{1-p}\right]=p(1-p) \cdot 0 \\ (1-p) \sum_{i=1}^{n} x_{i}-p\left(n-\sum_{i=1}^{n} x_{i}\right)=0\end{split}\]
https://cdn.mathpix.com/snip/images/rkijCeDy35aP_A_nRceFCUJFEL90Igt1L7UhesnYDSs.original.fullsize.png
\[\hat{p}=\frac{\sum_{i=1}^{n} x_{i}}{n}\]

This is our coin example again. But we have n flips, and we have the Bernoulli’s ones and zeros for heads and tails, and the value of p is unknown, it’s somewhere between 0 and 1. We’re no longer restricted to 0.2, 0.3, and 0.8. The maximum likelihood estimator, is the sample mean of the ones and zeros. If you add up the ones and zeros, and divide by n, you’re really computing the proportion of ones in your sample. You’re really computing the proportion of times you see heads in your sample. This maximum likelihood estimator, at least, in this case, makes a lot of sense.

\[\hat{p}=\frac{\sum_{i=1}^{n} X_{i}}{n}=\bar{X}\]
Q/A
Is maximum likelihood estimator Bernoulli unbiased?

the maximum likelihood estimator of is a biased estimator. Recall that if \(X_i\) is a Bernoulli random variable with parameter P, then \(E[X_i] = p\).

\[\begin{split}E(\hat{p})=E\left(\frac{1}{n} \sum_{i=1}^{n} X_{i}\right)=\frac{1}{n} \sum_{i=1}^{n} E\left(X_{i}\right)=\frac{1}{n} \sum_{i=1}^{n} p=\frac{1}{n}(n p)=p \\\end{split}\]

Exponential distribution

\(X_{1}, X_{2}, \ldots, X_{n} \stackrel{\text { iid }}{\sim} \text { Exponential }(rate =\lambda)\)

The pmf for one of them is \(f(x ; p)= \lambda e^{-\lambda x} I_{(0, \infty)}(x)\)
The joint pmf for all of them is
\[\begin{split}f(\vec{x} ; \lambda)=\prod_{i=1}^{n} f\left(x_{i} ; \lambda\right) =\prod_{i=1}^{n} \lambda e^{-\lambda x_{i}} I_{(0, \infty)}\left(x_{i}\right) \\ f(\vec{x} ; p)=\lambda^{n} e^{-\lambda \sum_{i=1}^{n} x_{i}} \prod_{i=1}^{n} I_{(0, \infty)}\left(x_{i}\right)\end{split}\]
The parameter space, the Lambdas that are allowed are everything from 0 to infinity.
At this point, I can drop constants of proportionality. Again, I’m going to drop that indicator.
\[\begin{split}\text{A likelihood is} = L(\lambda)=\lambda^{n} e^{-\lambda \sum_{i=1}^{n} x_{i}} \\ \text{The log-likelihood is} = \ell(\lambda)=n \ln \lambda-\lambda \sum_{i=1}^{n} x_{i}\end{split}\]

Our goal is to maximize this as a function of Lambda.

\[\begin{split}\frac{\partial}{\partial \lambda} \ell(\lambda)=\frac{n}{\lambda}-\sum_{i=1}^{n} x_{i} \stackrel{\text { set }}{=} 0 \\ \lambda=\frac{\mathrm{n}}{\sum_{\mathrm{i}=1}^{\mathrm{n}} \mathrm{x}_{\mathrm{i}}}\end{split}\]

I want to make everything capital, and throw a hat on it. Here is our first continuous maximum likelihood estimator for Theta or Lambda.

The maximum likelihood estimator for \(\lambda\) is

\[\hat{\lambda}=\frac{n}{\sum_{i=1}^{n} X_{i}}=\frac{1}{\bar{X}}\]

Warning

Same as method of moments. Biased!

This is exactly what we got with method of moments. Because if Lambda is the rate of this distribution, the true distribution mean is 1 over Lambda. If you equate that to the sample mean x bar and solve for Lambda, in the method of moments case, we got 1 over x bar. We weren’t that happy about it because it was a biased estimator. I’m trying to convince you that MLEs are everything. But they’re not unbiased.

Normal distribution

MLEs for Multiple and Support Parameters

We’re going to to consider two cases

  • One is when theta is higher dimensional, so theta might be the vector of mu and sigma squared.

  • Other cases when the parameter is in the indicator.

\(X_{1}, X_{2}, \ldots, X_{n} \stackrel{\text { iid }}{\sim} N(\mu, \sigma^2)\)

The pdf for one of them is \(\mathrm{f}\left(\mathrm{x} ; \mu, \sigma^{2}\right)=\frac{1}{\sqrt{2 \pi \sigma^{2}}} \mathrm{e}^{-\frac{1}{2 \sigma^{2}}(\mathrm{x}-\mu)^{2}}\)
The joint pdf for all of them is
\[f(\vec{x} ; \mu, \sigma^{2})=\prod_{i=1}^{n} f\left(x_{i} ; \mu, \sigma^{2}\right) = \left(2 \pi \sigma^{2}\right)^{-\mathrm{n} / 2} \mathrm{e}^{-\frac{1}{2 \sigma^{2}} \sum_{\mathrm{i}=1}^{\mathrm{n}}\left(\mathrm{x}_{\mathrm{i}}-\mu\right)^{2}}\]

The parameter space : \(-\infty<\mu<\infty, \quad \sigma^{2}>0\)

\[\begin{split}\text{A likelihood is } \mathrm{L}\left(\mu, \sigma^{2}\right)=\left(2 \pi \sigma^{2}\right)^{-\mathrm{n} / 2} \mathrm{e}^{-\frac{1}{2 \sigma^{2}} \sum_{\mathrm{i}=1}^{\mathrm{n}}\left(\mathrm{x}_{\mathrm{i}}-\mu\right)^{2}} \\ \text{ The log-likelihood is } \ell\left(\mu, \sigma^{2}\right)=-\frac{\mathrm{n}}{2} \ln \left(2 \pi \sigma^{2}\right)-\frac{1}{2 \sigma^{2}} \sum_{i=1}^{n}\left(\mathrm{x}_{\mathrm{i}}-\mu\right)^{2} \\ \ell\left(\mu, \sigma^{2}\right)=-\frac{\mathrm{n}}{2} \ln \left(2 \pi \sigma^{2}\right)-\frac{1}{2 \sigma^{2}} \sum_{\mathrm{i}=1}^{\mathrm{n}}\left(\mathrm{x}_{\mathrm{i}}-\mu\right)^{2} \\ \frac{\partial}{\partial \mu} \ell\left(\mu, \sigma^{2}\right) \stackrel{\text { set }}{=} 0 \\ \frac{\partial}{\partial \sigma^{2}} \ell\left(\mu, \sigma^{2}\right) \stackrel{\text { set }}{=} 0\end{split}\]
Solve for ΞΌ and Οƒ simultaneously
https://cdn.mathpix.com/snip/images/vNkeYyOT1UmgFNCA2sgFKAfNXR5IMAPfXh5GmBqIgwc.original.fullsize.png https://cdn.mathpix.com/snip/images/UMjzkiqAodLdttR_5myNWdEQ-HVAKsYKEJaS1ZH1lkM.original.fullsize.png https://cdn.mathpix.com/snip/images/McLGaebTrvxQ71PE5jIkBWXHiP7uoZpPqKafcSi8K2U.original.fullsize.png https://cdn.mathpix.com/snip/images/YHRDjDtDGA28tUpQZovCDOui_42Fx4plVy2bfjWCTNM.original.fullsize.png https://cdn.mathpix.com/snip/images/PWANXAiviLgD1ZBLjBdsMxLrThZn7UDX4olqvNkDmY0.original.fullsize.png

The Invariance Property

Evaluation

Comparing the quality of different estimators

https://cdn.mathpix.com/snip/images/l1hmVbsC3u0wIx-GYZf_d9OeV9ZiODfWQkbw5ULVyc4.original.fullsize.png
Variance, MSE, and Bias

Note

If \(\hat{\theta}\) is an unbiased estimator of \(\theta\), its mean squared error is simply the variance of \(\theta\)

\[\begin{split}MSE(\widehat{\theta})=E\left[(\widehat{\theta}-\theta)^{2}\right] \\ =E\left[(\widehat{\theta}-E[\hat{\theta}]+E[\hat{\theta}]-\theta)^{2}\right] \\ =E\left[((\hat{\theta}-E[\hat{\theta}])+B[\hat{\theta}])^{2}\right] \\ \\ MSE(\hat{\theta})=Var[\hat{\theta}]+(B[\hat{\theta}])^{2}\end{split}\]
Practise

Let \(X_{1}, X_{2}, \ldots, X_{n}\) be a random sample from the Poisson distribution with parameter \(\lambda>0\). Let \(\hat{\theta}\) be the MLE for \(\lambda\). What is the mean-squared error of \(\widehat{\lambda}\) as an estimator of \(\lambda\) ?

\[\begin{split}x \sim \text { poisson ( } \lambda \text { ) and MLE of } \lambda=\bar{X} \\ \text{ Proof } \\ E(\bar{X}) = E[\sum_{i=1}^{n} X_{i}] = \frac{1}{n} E\left[\sum_{n=1}^{n} x_{i}\right] \\ E(\bar{X}) = \frac{n}{n} E\left[X_{i}\right] = \lambda \\ \\ MSE(\hat{x})=Var(\hat{\lambda})+(Bias(\lambda))^{2} \\ = Var(\hat{\lambda})+ 0 \\ = Var (\frac{1}{n} \sum_{i=1}^{n} x_{i}) \\ = \frac{n}{n^{2}} \times \lambda = \frac{\lambda}{n}\end{split}\]

MLE Properties

Confidence Interval

A 95% confidence for the mean ΞΌ is given by (-2.14,3.07).

There’s no probability, where does the probability come in?

it comes in from random sampling. The fact that if you select a random sample and compute your estimator and if you got to do it again or if someone else did it, they’re going to get a different estimator based on the randomness of the sample.

  • Collect your sample.

  • Estimate the parameter.

  • Return a confidence interval.

If you did this again, you would not get the same results! Different samples and different confidence intervals.

https://cdn.mathpix.com/snip/images/FisnG71KAXfz6WNf4z8lfNR14egqOwTwicHsEP1aHEs.original.fullsize.png

Multiple samples give multiple confidence intervals. 95% of them will correctly capture ΞΌ and 5% miss it.

Hypothesis Testing

Statistical inference is the process of learning about characteristics of a population based on what is observed in a relatively small sample from that population. A sample will never give us the entire picture though, and we are bound to make incorrect decisions from time to time.

We will learn how to derive and interpret appropriate tests to manage this error and how to evaluate when one test is better than another. we will learn how to construct and perform principled hypothesis tests for a wide range of problems and applications when they are not.

What is Hypothesis

  • Hypothesis testing is an act in statistics whereby an analyst tests an assumption regarding a population parameter.

  • Hypothesis testing is a formal procedure for investigating our ideas about the world using statistics. It is most often used by scientists to test specific predictions, called hypotheses, that arise from theories.

Note

Due to random samples and randomness in the problem, we can different errors in our hypothesis testing. These errors are called Type I and Type II errors.

Type of hypothesis testing

Let \(X_1, X_2, \ldots, X_n\) be a random sample from the normal distribution with mean \(\mu\) and variance \(\sigma^2\)

import torch
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import norm


sns.set_theme(style="darkgrid")
sample = torch.normal(mean = 0, std = 1, size=(1,1000))

sns.displot(sample[0], kde=True, stat = 'density',)
plt.axvline(torch.mean(sample[0]), color='red', label='mean')
plt.show()
_images/0f8c2dfcaace7cedb154cfc7b9f47ce6ccdf2ee5a8499bc1997dca1becc7a915.png

Example of random sample after it is observed:

\[ 2.73,1.14,3.98,2.15,5,85,1.97,2.54,2.03 \]

Based on what you are seeing, do you believe that the true population mean \(\mu\) is

\[\begin{split} \mu<=3 \text{ or } \mu>3 \\ \text { The sample mean is } \overline{\mathrm{x}}=2.799 \end{split}\]

This is below 3 , but can we say that \(\mu<3\) ?

This seems awfully dependent on the random sample we happened to get! Let’s try to work with the most generic random sample of size 8:

\[ X_1, X_2, X_3, X_4, X_5, X_6, X_7, X_8 \]

Let \(\mathrm{X}_1, \mathrm{X}_2, \ldots, \mathrm{X}_{\mathrm{n}}\) be a random sample of size \(\mathrm{n}\) from the \(\mathrm{N}\left(\mu, \sigma^2\right)\) distribution.

\[ \mathrm{X}_1, \mathrm{X}_2, \ldots, \mathrm{X}_{\mathrm{n}} \stackrel{\text { iid }}{\sim} \mathrm{N}\left(\mu, \sigma^2\right) \]

The Sample mean is

\[ \bar{x}=\frac{1}{n} \sum_{i=11}^n X_i \]
  • We’re going to tend to think that \(\mu<3\) when \(\bar{X}\) is β€œsignificantly” smaller than 3.

  • We’re going to tend to think that \(\mu>3\) when \(\bar{X}\) is β€œsignificantly” larger than 3.

  • We’re never going to observe \(\bar{X}=3\), but we may be able to be convinced that \(\mu=3\) if \(\bar{X}\) is not too far away.

How do we formalize this stuff, We use hypothesis testing

Notation

\(\mathrm{H}_0: \mu \leq 3\) <- Null hypothesis
\(\mathrm{H}_1: \mu>3 \quad\) Alternate hypothesis

Null hypothesis

The null hypothesis is a hypothesis that is assumed to be true. We denote it with an \(H_0\).

Alternate hypothesis

The alternate hypothesis is what we are out to show. The alternative hypothesis is a hypothesis that we are looking for evidence for or out to show. We denote it with an \(H_1\).

Note

Some people use the notation \(H_a\) here

Conclusion is either:
Reject \(\mathrm{H}_0 \quad\) OR \(\quad\) Fail to Reject \(\mathrm{H}_0\)

simple hypothesis

A simple hypothesis is one that completely specifies the distribution. Do you know the exact distribution.

composite hypothesis

You don’t know the exact distribution.
Means you know the distribution is normal but you don’t know the mean and variance.

Critical values

Critical values for distributions are numbers that cut off specified areas under pdfs. For the N(0, 1) distribution, we will use the notation \(z_\alpha\) to denote the value that cuts off area \(\alpha\) to the right as depicted here.

Critical values in Hypothesis Testing Critical values example

Errors in Hypothesis Testing

Let \(X_1, X_2, \ldots, X_n\) be a random sample from the normal distribution with mean \(\mu\) and variance \(\sigma^2=2\)

\[ H _0: \mu \leq 3 \quad H _1: \mu>3 \]

Idea: Look at \(\bar{X}\) and reject \(H_0\) in favor of \(H _1\) if \(\overline{ X }\) is β€œlarge”.
i.e. Look at \(\bar{X}\) and reject \(H_0\) in favor of \(H _1\) if \(\overline{ X }> c\) for some value \(c\).

Errors in Hypothesis Testing Errors in Hypothesis Testing

You are a potato chip manufacturer and you want to ensure that the mean amount in 15 ounce bags is at least 15 ounces. \(\mathrm{H}_0: \mu \leq 15 \quad \mathrm{H}_1: \mu>15\)

Type I Error

The true mean is \(\leq 15\) but you concluded i was \(>15\). You are going to save some money because you won’t be adding chips but you are risking a lawsuit!

Type II Error

The true mean is \(> 15\) but you concluded it was \(\leq 15\) . You are going to be spending money increasing the amount of chips when you didn’t have to.

Developing a Test

Let \(X_1, X_2, \ldots, X_n\) be a random sample from the normal distribution with mean \(\mu\) and known variance \(\sigma^2\).

Consider testing the simple versus simple hypotheses

\[\begin{split} \begin{aligned} & H _0: \mu=5 \\ & H _1: \mu=3 \end{aligned} \end{split}\]
level of significance

Let \(\alpha= P\) (Type I Error)
\(= P \left(\right.\) Reject \(H _0\) when it’s true \()\)
\(= P \left(\right.\) Reject \(H _0\) when \(\left.\mu=5\right)\)

\(\alpha\) is called the level of significance of the test. It is also sometimes referred to as the size of the test.

\[\begin{split} \begin{aligned} \alpha &=\max P (\text { Type I Error }) \\ &=\max _{\mu \in H _0} P \left(\text { Reject } H _0 ; \mu\right) \\ \beta &=\max P (\text { Type II Error }) \\ &=\max _{\mu \in H _1} P \left(\text { Fail to Reject } H _0 ; \mu\right) \end{aligned} \end{split}\]
Power of the test

\(1-\beta\) is known as the power of the test

\[\begin{split} \begin{gathered} 1-\beta=1-\max _{\mu \in H _1} P \left(\text { Fail to Reject } H _0 ; \mu\right) \\ =\min _{\mu \in H _1}\left(1- P \left(\text { Fail to Reject } H _0 ; \mu\right)\right) \\ =\min _{\mu \in H _1} P \left(\text { Reject } H _0 ; \mu\right) \quad \begin{array}{c} \text { High power } \\ \text { is good! } \end{array} \end{gathered} \end{split}\]
Step One

Choose an estimator for ΞΌ.

\[ \widehat{\mu}=\bar{X} \]
Step Two

Choose a test statistic or Give the β€œform” of the test.

  • We are looking for evidence that \(H _1\) is true.

  • The \(N \left(3, \sigma^2\right)\) distribution takes on values from \(-\infty\) to \(\infty\).

  • \(\overline{ X } \sim N \left(\mu, \sigma^2 / n \right) \Rightarrow \overline{ X }\) also takes on values from \(-\infty\) to \(\infty\).

  • It is entirely possible that \(\bar{X}\) is very large even if the mean of its distribution is 3.

  • However, if \(\bar{X}\) is very large, it will start to seem more likely that \(\mu\) is larger than 3.

  • Eventually, a population mean of 5 will seem more likely than a population mean of 3.

Reject \(H _0\), in favor of \(H _1\), if \(\overline{ X }< c\) for some c to be determined.

Step Three

Find c.

  • If \(c\) is too large, we are making it difficult to reject \(H _0\). We are more likely to fail to reject when it should be rejected.

  • If \(c\) is too small, we are making it to easy to reject \(H _0\). We are more likely reject when it should not be rejected.

This is where \(\alpha\) comes in.

\[\begin{split} \alpha&= P(Type I Error) \\ &=P( \text{Reject } H_0 \text{ when true}) \\ &=P (\overline{ X }< c \text{ when } \mu=3) \end{split}\]
Step Four

Give a conclusion!

\(0.05= P (\) Type I Error)
\(= P \left(\right.\) Reject \(H _0\) when true \()\)
\(= P (\overline{ X }< \text{ c when } \mu=5)\)

\( = P \left(\frac{\overline{ X }-\mu_0}{\sigma / \sqrt{ n }}<\frac{ c -5}{2 / \sqrt{10}}\right.\) when \(\left.\mu=5\right)\)

Errors in Hypothesis Testing Errors in Hypothesis Testing Errors in Hypothesis Testing
Formula

Let \(X_1, X_2, \ldots, X_n\) be a random sample from the normal distribution with mean \(\mu\) and known variance \(\sigma^2\).

Consider testing the simple versus simple hypotheses

\[ H _0: \mu=\mu_0 \quad H _1: \mu=\mu_1 \]

where \(\mu_0\) and \(\mu_1\) are fixed and known.

\[ \begin{align}\begin{aligned}\begin{split} H_0: \mu=\mu_0 \\ H _1: \mu=\mu_1 \\ \mu_0<\mu_1 \\ \text{ Reject H0, in favor of H1 if } \\\end{split}\\\large \overline{ X }>\mu_0+ z _\alpha \frac{\sigma}{\sqrt{ n }} \end{aligned}\end{align} \]
\[ \begin{align}\begin{aligned}\begin{split} H_0: \mu=\mu_0 \\ H _1: \mu=\mu_1 \\ \mu_0>\mu_1 \\ \text{ Reject H0, in favor of H1 if } \\\end{split}\\\large \overline{ X }<\mu_0+ z_{1-\alpha} \frac{\sigma}{\sqrt{ n }} \end{aligned}\end{align} \]
Type II Error
\[\begin{split} H_0: \mu=\mu_0 \\ H _1: \mu=\mu_1 \\ \mu_0<\mu_1 \end{split}\]
\[\begin{split} \begin{aligned} & \beta= P (\text { Type II Error }) \\ =& P \left(\text { Fail to Reject } H _0 \text { when false }\right) \\ =& P \left(\overline{ X } \leq \mu_0+ z _\alpha \frac{\sigma}{\sqrt{ n }} \text { when } \mu=\mu_1\right) \\ =& P \left(\overline{ X } \leq \mu_0+ z _\alpha \frac{\sigma}{\sqrt{ n }} ; \mu_1\right) \end{aligned}\end{split}\]
\[\begin{split} \begin{aligned} \beta &= P \left(\left(\frac{\overline{X} -\mu_1}{\sigma / \sqrt{ n }}\right) \leq \frac{\mu_0+ z _\alpha \frac{\sigma}{\sqrt{ n }}-\mu_1}{\sigma / \sqrt{ n }} ; \mu_1\right) \\ &= P \left( Z \leq \frac{\mu_0+ z _\alpha \frac{\sigma}{\sqrt{ n }}-\mu_1}{\sigma / \sqrt{ n }}\right) \end{aligned} \end{split}\]

Composite vs Composite Hypothesis

\[\begin{split} \begin{aligned} & X _1, X _2, \ldots, X _{ n } \sim N \left(\mu, \sigma^2\right), \sigma^2 \text { known } \\ & H _0: \mu \leq \mu_0 \quad \text { vs } \quad H _1: \mu>\mu_0 \end{aligned} \end{split}\]
  • Step One Choose an estimator for ΞΌ

  • Step Two Choose a test statistic: Reject \(H_0\) , in favor of \(H_1\) if \(\bar{𝖷}\) > c, where c is to be determined.

  • Step Three Find c.

One-Tailed Tests

Let \(X_1, X_2, \ldots, X_n\) be a random sample from the normal distribution with mean \(\mu\) and known variance \(\sigma^2\). Consider testing the hypotheses

\[ H _0: \mu \geq \mu_0 \quad H _1: \mu<\mu_0 \]

where \(\mu_0\) is fixed and known.

Step One

Choose an estimator for ΞΌ.

\[ \widehat{\mu}=\bar{X} \]
Step Two

Choose a test statistic or Give the β€œform” of the test.

Reject \(H _0\), in favor of \(H _1\), if \(\overline{ X }< c\) for some c to be determined.

Step Three

Find c.

\[\begin{split} \begin{aligned} \alpha &=\max _{\mu \geq \mu_0} P (\text { Type I Error }) \\ &=\max _{\mu \geq \mu_0} P \left(\text { Reject } H _0 ; \mu\right) \\ &=\max _{\mu \geq \mu_0} P (\overline{ X }< c ; \mu) \end{aligned} \end{split}\]
\[\begin{split} \begin{aligned} \alpha &=\max _{\mu \geq \mu_0} P (\overline{ X }< c ; \mu) \\ &=\max _{\mu \geq \mu_0} P \left( Z <\frac{ c -\mu}{\sigma / \sqrt{ n }}\right) \\ &=\max _{\mu \geq \mu_0} \Phi\left(\frac{ c -\mu}{\sigma / \sqrt{ n }}\right) \end{aligned} \end{split}\]
\[\begin{split} \begin{aligned} \alpha &=\max _{\mu \geq \mu_0} P (\overline{ X }< c ; \mu) \\ &=\max _{\mu \geq \mu_0} P \left( Z <\frac{ c -\mu}{\sigma / \sqrt{ n }}\right) \\ &=\max _{\mu \geq \mu_0} \Phi\left(\frac{ c -\mu}{\sigma / \sqrt{ n }}\right) \\ \text { decreasing in } \mu \end{aligned} \end{split}\]
Step four

Reject \(H _0\), in favor of \(H _1\), if $\( \overline{ X }<\mu_0+ z _{1-\alpha} \frac{\sigma}{\sqrt{ n }} \)$

Example

In 2019, the average health care annual premium for a family of 4 in the United States, was reported to be \(\$ 6,015\).

In a more recent survey, 100 randomly sampled families of 4 reported an average annual health care premium of \(\$ 6,537\). Can we say that the true average is currently greater than \(\$ 6,015\) for all families of 4?

Assume that annual health care premiums are normally distributed with a standard deviation of \(\$ 814\). Let \(\mu\) be the true average for all families of 4.

Step Zero

Set up the hypotheses.

\[ H _0: \mu=6015 \quad H _1: \mu>6015 \]

Decide on a level of significance. \( \alpha=0.10\)

Step One

Choose an estimator for \(\mu\).

\[ \hat{\mu}=\bar{X} \]
Step Two

Give the form of the test. Reject \(H _0\), in favor of \(H _1\), if

\[ \bar{X}>c \]

for some \(c\) to be determined.

Step Three

Find c.

\[\begin{split} \begin{aligned} \alpha &=\max _{\mu=\mu_0} P (\text { Type I Error; } \mu) \\ &= P \left(\text { Type I Error; } \mu_0\right) \end{aligned} \end{split}\]
\[\begin{split} \begin{aligned} &\alpha= P \left(\text { Reject } H _0 ; \mu_0\right) \text { when }\\ &= P \left(\overline{ X }> c ; \mu_0\right) \quad \text { it true!, }\\ &= P \left(\frac{\overline{ X }-\mu_0}{\sigma / \sqrt{ n }}>\frac{ c -6015}{814 / \sqrt{100}} ; \mu_0\right)\\ &=P\left(Z>\frac{c-6015}{814 / \sqrt{100}}\right) \end{aligned} \end{split}\]
\[ \frac{c-6015}{814 / \sqrt{100}}=1.28 \]
Step Four

Conclusion. Reject \(H _0\), in favor of \(H _1\), if

\[ \bar{X}>6119.19 \]

From the data, where \(\bar{x}=6537\), we reject \(H _0\) in favor of \(H _1\).
The data suggests that the true mean annual health care premium is greater than \(\$ 6015\).

Hypothesis Testing with P-Values

Recall that p-values are defined as the following: A p-value is the probability that we observe a test statistic at least as extreme as the one we calculated, assuming the null hypothesis is true. It isn’t immediately obvious what that definition means, so let’s look at some examples to really get an idea of what p-values are, and how they work.

Let’s start very simple and say we have 5 data points: x = <1, 2, 3, 4, 5>. Let’s also assume the data were generated from some normal distribution with a known variance \(\sigma\) but an unknown mean \(\mu_0\). What would be a good guess for the true mean? We know that this data could come from any normal distribution, so let’s make two wild guesses:

  1. The true mean is 100.

  2. The true mean is 3.

Intuitively, we know that 3 is the better guess. But how do we actually determine which of these guesses is more likely? By looking at the data and asking β€œhow likely was the data to occur, assuming the guess is true?”

  1. What is the probability that we observed x=<1,2,3,4,5> assuming the mean is 100? Probabiliy pretty low. And because the p-value is low, we β€œreject the null hypothesis” that \(\mu_0 = 100\).

  2. What is the probability that we observed x=<1,2,3,4,5> assuming the mean is 3? Seems reasonable. However, something to be careful of is that p-values do not prove anything. Just because it is probable for the true mean to be 3, does not mean we know the true mean is 3. If we have a high p-value, we β€œfail to reject the null hypothesis” that \(\mu_0 = 3\).

What do β€œlow” and β€œhigh” mean? That is where your significance level \(\alpha\) comes back into play. We consider a p-value low if the p-value is less than \(\alpha\), and high if it is greater than \(\alpha\).

Example

From the above example.

Errors in Hypothesis Testing Errors in Hypothesis Testing
  • This is the \(N\left(6015,814^2 / 100\right)\) pdf.

  • The red area is \(P (\overline{ X }>6537)\).

\[\begin{split} \begin{aligned} & P (\overline{ X }>6537) \\ &\quad= P \left(\frac{\overline{ X }-\mu_0}{\sigma / \sqrt{ n }}>\frac{6537-6015}{814 / \sqrt{100}}\right) \\ &= P ( Z >6.4127) \\ &\approx 0.00000001 \quad \begin{array}{l} \text { Super small } \\ \text { and way out } \\ \text { "in the tail". } \end{array} \end{aligned} \end{split}\]
  • The P-Value is the area to the right (in this case) of the test statistic \(\bar{X}\).

  • The P-value being less than \(0.10\) puts \(\bar{X}\) in the rejection region.

  • The P-value is also less than \(0.05\) and \(0.01\).

  • It looks like we will reject \(H _0\) for the most typical values of \(\alpha\).

Power Functions

Let \(X_1, X_2, \ldots, X_n\) be a random sample from any distribution with unknown parameter \(\theta\) which takes values in a parameter space \(\Theta\)

We ultimately want to test

\[\begin{split} \begin{aligned} & H _0: \theta \in \Theta_0 \\ & H _1: \theta \in \Theta \backslash \Theta_0 \end{aligned} \end{split}\]

where \(\Theta_0\) is some subset of \(\Theta\).

So in other words, if the null hypothesis was for you to test for an exponential distribution, whether lambda was between 0 and 2, the complement of that is not the rest of the real number line because the space is only non-negative values. So the complement of the interval from 0 to 2 in that space is 2 to infinity.

\(\gamma(\theta)= P \left(\right.\) Reject \(H _0\) when the parameter is \(\left.\theta\right)\)

\[ \gamma(\theta)= P \left(\text { Reject } H _0 ; \theta\right) \]

\(\theta\) is an argument that can be anywhere in the parameter space \(\Theta\). it could be a \(\theta\) from \(H _0\) it could be a \(\theta\) from \(H _1\)

\[\begin{split} \begin{aligned} &\alpha=\max P \left(\text { Reject } H _0 \text { when true }\right) \\ &=\max _{\theta \in \Theta_0} P \left(\text { Reject } H _0 ; \theta\right) \\ &=\max _{\theta \in \Theta_0} \gamma(\theta) \longleftrightarrow \begin{array}{l} \text { Other notation } \\ \text { is } \max _{\theta \in H _0} \\ \hline \end{array} \\ & \end{aligned} \end{split}\]

Two Tailed Tests

Let \(X_1, X_2, \ldots, X_n\) be a random sample from the normal distribution with mean \(\mu\) and known variance \(\sigma^2\).

Derive a hypothesis test of size \(\alpha\) for testing

\[\begin{split} \begin{aligned} & H _0: \mu=\mu_0 \\ & H _1: \mu \neq \mu_0 \end{aligned} \end{split}\]

We will look at the sample mean \(\bar{X} \ldots\) \(\ldots\) and reject if it is either too high or too low.

Step One

Choose an estimator for ΞΌ.

\[ \widehat{\mu}=\bar{X} \]
Step Two

Choose a test statistic or Give the β€œform” of the test.

Reject \(H _0\), in favor of \(H _1\) if either \(\overline{ X }< c\) or \(\bar{X}>d\) for some \(c\) and \(d\) to be determined.

Easier to make it symmetric! Reject \(H _0\), in favor of \(H _1\) if either

\[\begin{split} \begin{aligned} &\overline{ X }>\mu_0+ c \\ &\overline{ X }<\mu_0- c \end{aligned} \end{split}\]

for some \(c\) to be determined.

Step Three

Find c.

\[\begin{split} \begin{aligned} \alpha &=\max _{\mu=\mu_0} P (\text { Type I Error }) \\ &=\max _{\mu=\mu_0} P \left(\text { Reject } H _0 ; \mu\right) \\ &= P \left(\text { Reject } H _0 ; \mu_0\right) \end{aligned} \end{split}\]
\[\begin{split} \begin{aligned} &\alpha= P \left(\overline{ X }<\mu_0- c \text { or } \overline{ X }>\mu_0+ c ; \mu_0\right) \\ &=1- P \left(\mu_0- c \leq \overline{ X } \leq \mu_0+ c ; \mu_0\right) \end{aligned} \end{split}\]
\[\begin{split} \begin{gathered} \alpha=1- P \left(\frac{- c }{\sigma / \sqrt{ n }} \leq Z \leq \frac{ c }{\sigma / \sqrt{ n }}\right) \\ 1-\alpha= P \left(\frac{- c }{\sigma / \sqrt{ n }} \leq Z \leq \frac{ c }{\sigma / \sqrt{ n }}\right) \end{gathered} \end{split}\]
\[ \begin{align}\begin{aligned} \frac{c}{\sigma / \sqrt{n}}=z_{\alpha / 2}\\ c=z_{\alpha / 2} \frac{\sigma}{\sqrt{n}} \end{aligned}\end{align} \]
Errors in Hypothesis Testing
Step Four

Conclusion

Reject \(H _0\), in favor of \(H _1\), if

\[\begin{split} \begin{aligned} &\overline{ X }>\mu_0+ z _{\alpha / 2} \frac{\sigma}{\sqrt{n}} \\ &\overline{ X }<\mu_0- z _{\alpha / 2} \frac{\sigma}{\sqrt{ n }} \end{aligned} \end{split}\]
Example

In 2019, the average health care annual premium for a family of 4 in the United States, was reported to be \(\$ 6,015\).

In a more recent survey, 100 randomly sampled families of 4 reported an average annual health care premium of \(\$ 6,177\). Can we say that the true average, for all families of 4 , is currently different than the sample mean from 2019? $\( \sigma=814 \quad \text { Use } \alpha=0.05 \)$

Assume that annual health care premiums are normally distributed with a standard deviation of \(\$ 814\). Let \(\mu\) be the true average for all families of 4. Hypotheses:

\[\begin{split} \begin{aligned} & H _0: \mu=6015 \\ & H _1: \mu \neq 6015 \end{aligned} \end{split}\]
\[\begin{split} \begin{aligned} &\bar{x}=6177 \quad \sigma=814 \quad n=100 \\ &z_{\alpha / 2}=z_{0.025}=1.96 \\ &\text { In R: qnorm(0.975) } \\ &6015+1.96 \frac{814}{\sqrt{100}}=6174.5 \\ &6015-1.96 \frac{814}{\sqrt{100}}=5855.5 \end{aligned} \end{split}\]

We reject \(H _0\), in favor of \(H _1\). The data suggests that the true current average, for all families of 4 , is different than it was in 2019.

Errors in Hypothesis Testing

Hypothesis Tests for Proportions

A random sample of 500 people in a certain country which is about to have a national election were asked whether they preferred β€œCandidate A” or β€œCandidate B”. From this sample, 320 people responded that they preferred Candidate A.

Let \(p\) be the true proportion of the people in the country who prefer Candidate A.

Test the hypotheses \(H _0: p \leq 0.65\) versus \(H _1: p>0.65\) Use level of significance \(0.10\). We have an estimate

\[ \hat{p}=\frac{320}{500}=\frac{16}{25} \]
The Model

Take a random sample of size \(n\). Record \(X_1, X_2, \ldots, X_n\) where \(X_i= \begin{cases}1 & \text { person i likes Candidate A } \\ 0 & \text { person i likes Candidate B }\end{cases}\) Then \(X_1, X_2, \ldots, X_n\) is a random sample from the Bernoulli distribution with parameter \(p\).

Note that, with these 1’s and 0’s, $\( \begin{aligned} \hat{p} &=\frac{\# \text { in the sample who like A }}{\# \text { in the sample }} \\ &=\frac{\sum_{ i =1}^{ n } X _{ i }}{ n }=\overline{ X } \end{aligned} \)\( By the Central Limit Theorem, \)\hat{p}=\overline{ X }$ has, for large samples, an approximately normal distribution.

\[\begin{split} \begin{aligned} E[\hat{p}] &=E\left[X_1\right]=p \\ \operatorname{Var}[\hat{p}] &=\frac{\operatorname{Var}\left[X_1\right]}{n}=\frac{p(1-p)}{n} \end{aligned} \end{split}\]

So, \(\quad \hat{p} \stackrel{\text { approx }}{\sim} N\left(p, \frac{p(1-p)}{n}\right)\)

\[ \hat{p} \stackrel{\text { approx }}{\sim} N\left(p, \frac{p(1-p)}{n}\right) \]

In particular, $\( \frac{\hat{p}-p}{\sqrt{\frac{p(1-p)}{n}}} \)\( behaves roughly like a \)N(0,1)\( as \)n$ gets large.

\(n >30\) is a rule of thumb to apply to all distributions, but we can (and should!) do better with specific distributions.

  • \(\hat{p}\) lives between 0 and 1.

  • The normal distribution lives between \(-\infty\) and \(\infty\).

  • However, \(99.7 \%\) of the area under a \(N(0,1)\) curve lies between \(-3\) and 3 ,

\[\begin{split} \begin{aligned} &\hat{p} \stackrel{\text { approx }}{\sim} N\left(p, \frac{p(1-p)}{n}\right) \\ &\Rightarrow \sigma_{\hat{p}}=\sqrt{\frac{p(1-p)}{n}} \end{aligned} \end{split}\]

Go forward using normality if the interval $\( \left(\hat{p}-3 \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}, \hat{p}+3 \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\right) \)\( is completely contained within \)[0,1]$.

Step One

Choose a statistic. \(\widehat{p}=\) sample proportion for Candidate \(A\)

Step Two

Form of the test. Reject \(H _0\), in favor of \(H _1\), if \(\hat{ p }> c\).

Step Three

Use \(\alpha\) to find \(c\) Assume normality of \(\hat{p}\) ? It is a sample mean and \(n>30\).

  • The interval $\( \left(\hat{p}-3 \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}, \hat{p}+3 \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\right) \)\( is \)(0.5756,0.7044)$

\[\begin{split} \begin{aligned} \alpha &=\max _{p \in H_0} P (\text { Type I Error }) \\ &=\max _{p \leq 0.65} P \left(\text { Reject } H _0 ; p \right) \\ &=\max _{ p \leq 0.65} P (\hat{ p }> c ; p ) \end{aligned} \end{split}\]
\[\begin{split} \begin{aligned} \alpha &=\max _{p \leq 0.65} P\left(\frac{\hat{p}-p}{\sqrt{\frac{p(1-p)}{n}}}>\frac{c-p}{\sqrt{\frac{p(1-p)}{n}}} ; p\right) \\ & \approx \max _{p \leq 0.65} P\left(Z>\frac{c-p}{\sqrt{\frac{p(1-p)}{n}}}\right) \end{aligned} \end{split}\]
\[\begin{split} \begin{aligned} 0.10 &=\max _{p \leq 0.65} P \left(Z>\frac{c-p}{\sqrt{\frac{p(1-p)}{n}}}\right) \\ &=P\left(Z>\frac{c-0.65}{\sqrt{\frac{0.65(1-0.65)}{n}}}\right) \\ & \Rightarrow \frac{c-0.65}{\sqrt{\frac{0.65(1-0.65)}{n}}}=z_{0.10} \end{aligned} \end{split}\]

Reject \(H _0\) if

\[ \hat{p}>0.65+z_{0.10} \sqrt{\frac{0.65(1-0.65)}{n}} \]

Formula

\[ \hat{p}> p +z_{0.10} \sqrt{\frac{p(1-p)}{n}} \]

T-Tests

What is a t-test, and when do we use it? A t-test is used to compare the means of one or two samples, when the underlying population parameters of those samples (mean and standard deviation) are unknown. Like a z-test, the t-test assumes that the sample follows a normal distribution. In particular, this test is useful for when we have a small sample size, as we can not use the Central Limit Theorem to use a z-test.

There are two kinds of t-tests:

  1. One Sample t-tests

  2. Two Sample t-tests

Let \(X_1, X_2, \ldots, X_n\) be a random sample from the normal distribution with mean \(\mu\) and unknown variance \(\sigma^2\).

Consider testing the simple versus simple hypotheses $\( H _0: \mu=\mu_0 \quad H _1: \mu<\mu_0 \)\( where \)\mu_0$ is fixed and known.

Reject \(H _0\), in favor of \(H _1\), if

\[ \overline{ X }<\mu_0+ z _{1-\alpha} \frac{\sigma}{\sqrt{ n }} \]

unknown!This is a useless test!

It was based on the fact that

\[ \begin{align}\begin{aligned} \overline{ X } \sim N \left(\mu, \sigma^2 / n \right)\\\begin{split}\\\end{split}\\\frac{\overline{ X }-\mu}{\sigma / \sqrt{ n }} \sim N (0,1) \end{aligned}\end{align} \]

What is we use the sample standard deviation \(S =\sqrt{ S ^2}\) in place of \(\sigma\) ?

\[\begin{split} \begin{aligned} \frac{\overline{ X }-\mu}{ S / \sqrt{ n }} &=\frac{\overline{ X }-\mu}{\sigma / \sqrt{ n }} \cdot \frac{\sigma}{ S }=\frac{\frac{\overline{ X }-\mu}{\sigma / \sqrt{ n }}}{\frac{ S }{\sigma}} \\ &=\frac{\overline{ X }-\mu}{\sigma / \sqrt{ n }} / \sqrt{\frac{ S ^2}{\sigma^2}} \end{aligned} \end{split}\]
\[\begin{split} \begin{aligned} &\frac{\overline{ X }-\mu}{ S / \sqrt{ n }}=\frac{\overline{ X }-\mu}{\sigma / \sqrt{ n }} / \sqrt{\frac{ S ^2}{\sigma^2}} \\ &=\left(\frac{ X -\mu}{\sigma / \sqrt{ n }}\right) / \sqrt{\frac{\left(\frac{( n -1) S ^2}{\sigma^2}\right.}{ n -1}} \chi^2( n -1) \\ & N (0,1) \\ & \end{aligned} \end{split}\]
\[\begin{split} \begin{aligned} &\frac{\overline{ X }-\mu}{ S / \sqrt{ n }}=\frac{\overline{ X }-\mu}{\sigma / \sqrt{ n }} / \sqrt{\frac{ S ^2}{\sigma^2}} \\ &=\left(\frac{ X -\mu}{\sigma / \sqrt{ n }}\right) / \sqrt{\frac{\left(\frac{( n -1) S ^2}{\sigma^2}\right.}{ n -1}} \chi^2( n -1) \\ & N (0,1) \\ & \end{aligned} \end{split}\]

Thus,

\[ \frac{\bar{X}-\mu}{S / \sqrt{n}} \sim t(n-1) \]
Step four

Conclusion! Reject \(H _0\), in favor of \(H _1\), if

\[ \overline{ X }<\mu_0+ t _{1-\alpha, n -1} \frac{ S }{\sqrt{ n }} \]
Example

In 2019, the average health care annual premium for a family of 4 in the United States, was reported to be \(\$ 6,015\).

In a more recent survey, 15 randomly sampled families of 4 reported an average annual health care premium of \(\$ 6,033\) and a sample variance of \(\$ 825\).

Can we say that the true average is currently greater than \(\$ 6,015\) for all families of 4 ?

Use \(\alpha=0.10\)

Assume that annual health care premiums are normally distributed. Let \(\mu\) be the true average for all families of 4.

Step Zero

Set up the hypotheses.

\[ H _0: \mu=6015 \quad H _1: \mu>6015 \]
Step One

Choose a test statistic

\[ \bar{X} \]
Step Two

Give the form of the test. Reject 𝖧0 , in favor of h1, if 𝟒 π–§πŸ£ 𝖷 > 𝖼 where c is to be determined.

Step Three

Find c

\[\begin{split} \begin{aligned} \alpha &=\max _{\mu=\mu_0} P (\text { Type I Error }) \\ &=\max _{\mu=6015} P \left(\text { Reject } H _0 ; \mu\right) \\ &= P \left(\text { Reject } H _0 ; \mu=6015\right) \\ &= P (\overline{ X }> c ; \mu=6015) \end{aligned} \end{split}\]
\[\begin{split} \begin{gathered} \alpha= P (\overline{ X }> c ; \mu=6015) \\ = P \left(\frac{\overline{ X }-\mu_0}{ S / \sqrt{ n }}>\frac{ c -6015}{\sqrt{825} / \sqrt{15}} ; \mu=6015\right) \\ = P \left( T >\frac{ c -6015}{\sqrt{825} / \sqrt{15}}\right) \end{gathered} \end{split}\]
T test
\[\begin{split} \begin{aligned} &\Rightarrow \frac{c-6015}{\sqrt{825} / \sqrt{15}}=1.345 \\ &\Rightarrow c=6024.98 \end{aligned} \end{split}\]
Step Four

Conclusion. Rejection Rule: Reject \(H _0\), in favor of \(H _1\) if

\[ \bar{X}>6024.98 \]

We had \(\bar{x}=6033\) so we reject \(H_0\).

There is sufficient evidence (at level \(0.10\) ) in the data to suggest that the true mean annual healthcare premium cost for a family of 4 is greater than \(\$ 6,015\).

P value
\[\begin{split} \begin{aligned} &\text { P-Value }= P (\overline{ X }>6033 ; \mu=6015) \\ &= P \left(\frac{\overline{ X }-\mu}{ S / \sqrt{ n }}>\frac{6033-6015}{\sqrt{825} / \sqrt{15}} ; \mu=6015\right) \\ &= P ( T >2.43) \approx 0.015 \\ &\quad \text { where } T \sim t (14) \\ &(\operatorname{In~R}: 1- pt (2.43,14) \end{aligned} \end{split}\]

Two Sample Tests for Means

Fifth grade students from two neighboring counties took a placement exam.

Group 1, from County 1, consisted of 57 students. The sample mean score for these students was \(7 7 . 2\) and the true variance is known to be 15.3. Group 2, from County 2, consisted of 63 students and had a sample mean score of \(75.3\) and the true variance is known to be 19.7.

From previous years of data, it is believed that the scores for both counties are normally distributed.

Derive a test to determine whether or not the two population means are the same.

\[\begin{split} \begin{aligned} & H _0: \mu_1=\mu_2 \\ & H _1: \mu_1 \neq \mu_2 \end{aligned} \end{split}\]

Suppose that \(X _{1,1}, X _{1,2}, \ldots, X _{1, n _1}\) is a random sample of size \(n_1\) from the normal distribution with mean \(\mu_1\) and variance \(\sigma_1^2\). Suppose that \(X_{2,1}, X_{2,2}, \ldots, X_{2, n_2}\) is a random sample of size \(n_2\) from the normal distribution with mean \(\mu_2\) and variance \(\sigma_2^2\).

  • Suppose that \(\sigma_1^2\) and \(\sigma_2^2\) are known and that the samples are independent.

\[\begin{split} \begin{aligned} & H _0: \mu_1=\mu_2 \quad H _1: \mu_1 \neq \mu_2 \\ & H _0: \mu_1-\mu_2=0 \\ & H _1: \mu_1-\mu_2 \neq 0 \\ & \end{aligned} \end{split}\]

Think of this as $\( \begin{gathered} \theta=0 \text { versus } \theta \neq 0 \\ \text { for } \\ \theta=\mu_1-\mu_2 \end{gathered} \)$

Step one

Choose an estimator for \(\theta=\mu_1-\mu_2\)

\[ \hat{\theta}=\bar{X}_1-\bar{X}_2 \]
Step Two

Give the β€œform” of the test. Reject \(H _0\), in favor of \(H _1\) if either \(\hat{\theta}>c\) or \(\hat{\theta}<-c\) for some c to be determined.

Step Three

Find \(c\) using \(\alpha\) Will be working with the random variable

\[ \bar{X}_1-\bar{x}_2 \]

We need to know its distribution…

\[ \bar{X}_1-\bar{x}_2 \text{ is normally distributed.} \]
Step Three

Find c using \(\alpha\).

\(\bar{X}_1-\bar{X}_2\) is normally distributed

\[\begin{split} \begin{aligned} &\overline{ X }_1-\overline{ X }_2 \sim N \left(\mu_1-\mu_2, \frac{\sigma_1^2}{ n _1}+\frac{\sigma_2^2}{ n _1}\right) \\ & Z =\frac{\overline{ X }_1-\overline{ X }_2-\left(\mu_1-\mu_2\right)}{\sqrt{\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}}} \sim N (0,1) \end{aligned} \end{split}\]
\[\begin{split} \begin{aligned} &\alpha= P (\text { Type I Error }) \\ &\quad= P \left(\text { Reject } H _0 ; \theta=0\right) \\ &= P \left(\overline{ X }_1-\overline{ X }_2> c \text { or } \overline{ X }_1-\overline{ X }_2<- c ; \theta=0\right) \\ &=1- P \left(- c \leq \overline{ X }_1-\overline{ X }_2 \leq c ; \theta=0\right) \end{aligned} \end{split}\]
\[ =1- P \left(- c \leq \overline{ X }_1-\overline{ X }_2 \leq c ; \theta=0\right) \]
\[\begin{split} \begin{aligned} &\alpha=1- P \left(\frac{- c }{\sqrt{\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}}} \leq Z \leq \frac{ c }{\sqrt{\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}}}\right) \\ &1-\alpha= P \left(\frac{- c }{\sqrt{\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}}} \leq Z \leq \frac{ c }{\sqrt{\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}}}\right) \end{aligned} \end{split}\]
T test
Step Four

Conclusion

Reject \(H _0\), in favor of \(H _1\), if

\[ \overline{ X }_1-\overline{ X }_2> z _{\alpha / 2} \sqrt{\frac{\sigma_1^2}{ n _1}+\frac{\sigma_2^2}{ n _2}} \]

or

\[ \overline{ X }_1-\overline{ X }_2<- z _{\alpha / 2} \sqrt{\frac{\sigma_1^2}{ n _1}+\frac{\sigma_2^2}{ n _2}} \]
Example
\[\begin{split} \begin{array}{ll} n _1=57 & n _2=63 \\ \overline{ x }_1=77.2 & \overline{ x }_2=75.3 \\ \sigma_1^2=15.3 & \sigma_2^2=19.7 \end{array} \end{split}\]

Suppose that \(\alpha=0.05\). $\( \begin{aligned} & z _{\alpha / 2}= z _{0.025}=1.96 \\ & z _{\alpha / 2} \sqrt{\frac{\sigma_1^2}{ n _1}+\frac{\sigma_2^2}{ n _2}}=1.49 \end{aligned} \)$

\[\begin{split} \begin{aligned} & z _{\alpha / 2} \sqrt{\frac{\sigma_1^2}{ n _1}+\frac{\sigma_2^2}{ n _2}}=1.49 \\ &\overline{ x }_1-\overline{ x }_2=77.2-75.3=1.9 \end{aligned} \end{split}\]

So,

\[ \bar{x}_1-\bar{x}_2>z_{\alpha / 2} \sqrt{\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}} \]

and we reject \(H _0\). The data suggests that the true mean scores for the counties are different!

Two Sample t-Tests for a Difference of Means

Fifth grade students from two neighboring counties took a placement exam.

  • Group 1, from County A, consisted of 8 students. The sample mean score for these students was \(77.2\) and the sample variance is \(15.3\).

  • Group 2, from County B, consisted of 10 students and had a sample mean score of \(75.3\) and the sample variance is 19.7.

Pooled Variance
\[ S_p^2=\frac{\left(n_1-1\right) S_1^2+\left(n_2-1\right) S_2^2}{n_1+n_2-2} \]
Step Four

Reject \(H _0\), in favor of \(H _1\), if

\[ \bar{X}_1-\bar{X}_2>t_{\alpha / 2, n_1+n_2-2} \sqrt{\left(\frac{1}{n_1}+\frac{1}{n_2}\right) S_P^2} \]

or

\[ \bar{X}_1-\bar{X}_2<-t_{\alpha / 2, n_1+n_2-2} \sqrt{\left(\frac{1}{n_1}+\frac{1}{n_2}\right) s_P^2} \]
\[\begin{split} \begin{array}{rlr} n _1=8 & n _1=10 \\ \overline{ x }_1=77.2 & \overline{ x }_1=75.3 \\ s _1^2=15.3 & s _2^2=19.7 \\ \alpha=0.01 & t _{0.005,16}=2.92 \\ s _{ p }^2 & =\frac{\left( n _1-1\right) S _1^2+\left( n _2-1\right) S _2^2}{ n _1+ n _2-2} \\ & =17.775 \end{array} \end{split}\]
\[\begin{split} \begin{aligned} &\overline{ x }_1-\overline{ x }_2=77.2-75.3=1.9 \\ & t _{\alpha / 2, n _1+ n _2-2} \sqrt{\left(\frac{1}{ n _1}+\frac{1}{ n _2}\right) S _{ P }^2} \\ &\quad=2.92 \sqrt{\left(\frac{1}{8}+\frac{1}{10}\right)(17.775)} \\ &=5.840 \end{aligned} \end{split}\]

Since \(\bar{x}_1-\bar{x}_2=1.9\) is not above \(5.840\), or below \(-5.840\) we fail to reject \(H _0\), in favor of \(H _1\) at \(0.01\) level of significance.

The data do not indicate that there is a significant difference between the true mean scores for counties \(A\) and \(B\).

Welch’s Test and Paired Data

Two Populations: Test

\[\begin{split} \begin{aligned} & H _0: \mu_1=\mu_2 \\ & H _1: \mu_1 \neq \mu_2 \end{aligned} \end{split}\]
  • Suppose that \(X_{1,1}, X_{1,2}, \ldots, X_{1, n_1}\) is a random sample of size \(n_1\) from the normal
    distribution with mean \(\mu_1\) and variance \(\sigma_1^2\).

  • Suppose that \(X_{2,1}, X_{2,2}, \ldots, X_{2, n}\) is a random sample of size \(n_2\) from the normal distribution with mean \(\mu_2\) and variance \(\sigma_2^2\).

  • Suppose that \(\sigma_1^2\) and \(\sigma_2^2\) are unknown and that the samples are independent. Don’t assume that \(\sigma_1^2\) and \(\sigma_2^2\) are equal!

Welch says that:

\[ \frac{\bar{X}_1-\bar{X}_2-\left(\mu_1-\mu_2\right)}{\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}} \]

has an approximate t-distribution with \(r\) degrees of freedom where

\[ r=\frac{S_1^2 / n_1+S_2^2 / n_2}{\frac{\left(S_1^2 / n_1\right)^2}{n_1-1}+\frac{\left(S_2^2 / n_2\right)^2}{n_2-1}} \]

rounded down.

Critical values in Hypothesis Testing
Example

A random sample of 6 students’ grades were recorded for Midterm 1 and Midterm 2. Assuming exam scores are normally distributed, test whether the true (total population of students) average grade on Midterm 2 is greater than Midterm 1. Ξ± = 0.05

Student

Midterm 1 Grade

Midterm 2 Grade

1

72

81

2

93

89

3

85

87

4

77

84

5

91

100

6

84

82

Student

Midterm 1 Grade

Midterm 2 Grade

Differences: minus 2 Midterm 1

1

72

81

9

2

93

89

-4

3

85

87

2

4

77

84

7

5

91

100.

9

6

84

82

-2

The Hypotheses: Let \(\mu\) be the true average difference for all students.

\[\begin{split} \begin{aligned} & H _0: \mu=0 \\ & H _1: \mu>0 \end{aligned} \end{split}\]

This is simply a one sample t-test on the differences.

Data:

\[ 9,-4,2,7,9,-2 \]
\[ \sum x_i=23 \quad \sum x_i^2=267 \quad n=6 \]

This is simply a one sample t-test on the differences.

This is simply a one sample t-test on the differences.

\[\begin{split} \begin{aligned} &\bar{x}=3.5 \\ &s^2=\frac{\sum x_i^2-\left(\sum x_i\right)^2 / n}{n-1}=32.3 \end{aligned} \end{split}\]
\[ t _{\alpha, n -1}= t _{0.05,5}=2.01 \]

Reject \(H _0\), in favor of \(H _1\), if

\[ \overline{ X }>\mu_0+ t _{\alpha, n -1} \frac{ S }{\sqrt{ n }} \]

3.5 > 4.6

Conclusion: We fail to reject h0 , in favor of h1 , at 0.05 level of significance.

These data do not indicate that Midterm 2 scores are higher than Midterm 1 scores

Comparing Two Population Proportions

A random sample of 500 people in a certain county which is about to have a national election were asked whether they preferred β€œCandidate A” or β€œCandidate B”. From this sample, 320 people responded that they preferred Candidate A.

A random sample of 400 people in a second county which is about to have a national election were asked whether they preferred β€œCandidate A” or β€œCandidate B”.

From this second county sample, 268 people responded that they preferred Candidate \(A\).

\[\begin{split} \begin{aligned} &\hat{p}_1=\frac{320}{500}=0.64 \\ &\hat{p}_2=\frac{268}{400}=0.67 \end{aligned} \end{split}\]

Test

\[ H _0: p _1= p _2 \quad H _1: p _1 \neq p _2 \]

Change to:

\[\begin{split} \begin{aligned} & H _0: p _1- p _2=0 \\ & H _1: p _1- p _2 \neq 0 \end{aligned} \end{split}\]

Estimate \(p_1-p_2\) with \(\hat{p}_1-\hat{p}_2\) For large enough samples,

\[ \widehat{p}_1^{\text {approx }} N \left( p _1, \frac{ p _1\left(1- p _1\right)}{ n _1}\right) \]

and

\[ \hat{ p }_2^{\text {approx }} N \left( p _2, \frac{ p _2\left(1- p _2\right)}{ n _1}\right) \]
\[\begin{split} \begin{gathered} \hat{p}_1-\hat{p}_2 \sim N(?, ?) \\ E \left[\hat{p}_1-\hat{p}_2\right]=E\left[\hat{p}_1\right]- E \left[\hat{p}_2\right]=p_1-p_2 \\ \operatorname{Var}\left[\hat{p}_1-\hat{p}_2\right] \stackrel{\text { indep }}{=} \operatorname{Var}\left[\hat{p}_1\right]+\operatorname{Var}\left[\hat{p}_2\right] \\ =\frac{p_1\left(1-p_1\right)}{n_1}+\frac{p_2\left(1-p_2\right)}{n_2} \end{gathered} \end{split}\]

Use estimators for p1 and p2 assuming they are the same.

  • Call the common value p.

  • Estimate by putting both groups together.

\[ \hat{p}_1=\frac{320}{500}=0.64 \quad \hat{p}_2=\frac{268}{400}=0.67 \]

we have

\[\begin{split} \begin{aligned} \hat{p}=\frac{320+268}{500+400}=& \frac{588}{900}=\frac{49}{75} \\ & \approx 0.6533 \end{aligned} \end{split}\]
\[\begin{split} \begin{aligned} Z &:=\frac{\hat{p}_1-\hat{p}_2-\left(p_1-p_2\right)}{\sqrt{\frac{\hat{p}(1-\hat{p})}{n_1}+\frac{\hat{p}(1-\hat{p})}{n_2}} \sim N(0,1)} \\ =& \frac{\hat{p}_1-\hat{p}_2-\left(p_1-p_2\right)}{\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_1}+\frac{1}{n_2}\right)}} \end{aligned} \end{split}\]

Two-tailed test with z-critical values…

\[\begin{split} \begin{aligned} &\hat{p}=\frac{320+268}{500+400}=\frac{588}{900}=\frac{49}{75} \\ &Z=\frac{0.64-0.67-0}{\sqrt{0.6533(1-0.6533)\left(\frac{1}{500}+\frac{1}{400}\right)}} \end{aligned} \end{split}\]

= 0.9397

\[ z _{0.025}=1.96 \]

qnorm(1-0.05/2)

\(Z=-0.9397\) does not fall in the rejection region!

Hypothesis Tests for the Exponential

Suppose that \(X_1, X_2, \ldots, X_n\) is a random sample from the exponential distribution with rate \(\lambda>0\). Derive a hypothesis test of size \(\alpha\) for

\[ H _0: \lambda=\lambda_0 \text { vs. } H _1: \lambda>\lambda_0 \]

What statistic should we use?

Test 1: Using the Sample Mean
Step One

Choose a statistic.

\[ \bar{x} \]
Step Two

Give the form of the test Reject 𝖧0 , in favor of h1 , if 𝖷_bar < 𝖼

for some c to be determined.

Critical values in Hypothesis Testing
Step Three
\[\begin{split} \begin{aligned} \alpha &= P (\text { Type I Error }) \\ &= P \left(\text { Reject } H _0 ; \lambda_0\right) \\ &= P \left(\overline{ X }< c ; \lambda_0\right) \end{aligned} \end{split}\]
\[\begin{split} \begin{aligned} &= P \left(2 n \lambda_0 \overline{ x }<2 n \lambda_0 c ; \lambda_0\right) \\ &= P \left( W <2 n \lambda_0 c ; \lambda_0\right) \\ \text { where } W \sim \chi^2(2 n ) \end{aligned} \end{split}\]
Critical values in Hypothesis Testing
Step Four

Reject \(H _0\), in favor of \(H _1\), if

\[ \bar{x}<\frac{\chi_{1-\alpha, 2 n}^2}{2 n \lambda_0} \]

\(\chi_{\alpha, n }^2\) In R, get \(\chi_{0.10,6}^2\)

by typing qchisq(0.90,6)

Best Test

UMP Tests

Suppose that \(X_1, X_2, \ldots, X_n\) is a random sample from the exponential distribution with rate \(\lambda>0\).

Derive a uniformly most powerful hypothesis test of size \(\alpha\) for

\[\begin{split} \begin{array}{r} H _0: \lambda=\lambda_0 \quad \text { vs. } \quad H _1: \lambda>\lambda_0 \\ \left(\text { Was } H _1: \lambda=\lambda_1 \text { for } \lambda_1>\lambda_0\right) \end{array} \end{split}\]
Step One

Consider the simple versus simple hypotheses

\[ H _0: \lambda=\lambda_0 \quad \text { vs. } H _1: \lambda=\lambda_1 \]

for some fixed \(\lambda_1>\lambda_0\).

###Steps Two, Three, and Four

Find the best test of size \(\alpha\) for

\[ H _0: \lambda=\lambda_0 \text { vs. } H _1: \lambda=\lambda_1 \]

for some fixed \(\lambda_1>\lambda_0\). This test is to reject \(H _0\), in favor of \(H _1\) if

\[ \overline{ x }<\frac{\chi_{1-\alpha, 2 n }^2}{2 n \lambda_0} \]

Note that this test does not depend on the particular value of \(\lambda_1\). -It does, however, depend on the fact that \(\lambda_1>\lambda_0\)

The β€œUMP” test for

\[ H _0: \lambda=\lambda_0 \text { vs. } H _1: \lambda>\lambda_0 \]

is to reject \(H_0\), in favor of \(H_1\) if

\[ \overline{ x }<\frac{\chi_{1-\alpha, 2 n }^2}{2 n \lambda_0} \]

The β€œUMP” test for

\[ H _0: \lambda=\lambda_0 \text { vs. } H _1: \lambda<\lambda_0 \]

is to reject \(H_0\), in favor of \(H_1\) if

\[ \overline{ x }>\frac{\chi_{, 2 n }^2}{2 n \lambda_0} \]

Test for the Variance of the Normal Distribution

Suppose that \(X_1, X_2, \ldots, X_n\) is a random sample from the normal distribution with mean \(\mu\) and variance \(\sigma^2\). Derive a test of size/level \(\alpha\) for

\[ H _0: \sigma^2 \geq \sigma_0^2 \quad \text { vs. } H_1: \sigma^2<\sigma_0^2 \]
step 1

Choose a statistic/estimator for \(\sigma^2\)

\[ s^2=\frac{\sum_{i=1}^n\left(X_i-\bar{X}\right)^2}{n-1} \]
step 2

Give the form of the test. Reject \(H_0\), in favor of \(H_1\), if

\[ S ^2< C \]

for some \(c\) to be determined.

step 3

find c using alpha

\[\begin{split} \begin{aligned} \alpha &=\max P (\text { Type I Error }) \\ &=\max _{\sigma^2 \geq \sigma_0^2} P \left(\text { Reject } H _0 ; \sigma^2\right) \\ &=\max _{\sigma^2 \geq \sigma_0^2} P \left( S ^2< c ; \sigma^2\right) \end{aligned} \end{split}\]
\[\begin{split} \begin{aligned} &= P \left(\left(\frac{( n -1) S ^2}{\sigma^2}\right) \frac{( n -1) c }{\sigma^2} ; \sigma^2\right) \\ &= P \left( W <\frac{( n -1) c }{\sigma^2}\right) \\ &\text { where } W \sim \chi^2( n -1) \end{aligned} \end{split}\]
Critical values in Hypothesis Testing
Step 4

Reject \(H _0\), in favor of \(H _1\), if

\[ S^2<\frac{\sigma_0^2 \chi_{1-\alpha, n-1}^2}{n-1} \]
Example

A lawn care company has developed and wants to patent a new herbicide applicator spray nozzle. Example: For safety reasons, they need to ensure that the application is consistent and not highly variable. The company selected a random sample of 10 nozzles and measured the application rate of the herbicide in gallons per acre

The measurements were recorded as

\(0.213,0.185,0.207,0.163,0.179\)
\(0.161,0.208,0.210,0.188,0.195\)

Assuming that the application rates are normally distributed, test the following hypotheses at level \(0.04\).

\[ H _0: \sigma^2=0.01 \quad H _1: \sigma^2>0.01 \]

Get sample variance in \(R\).

\[\begin{split} \begin{array}{r} x<-c(0.213,0.185,0.207,0.163,0.179 \\ 0.161,0.208,0.210,0.188,0.195) \end{array} \end{split}\]

or

\[ x<-\operatorname{scan} 0 \]

Hit and then input numbers, one by one, hitting in between and <Enter \(>\) at the end.

Compute variance by typing

\[ \operatorname{var}( x ) \]

or \(\left(\left(\operatorname{sum}\left(x^{\wedge} 2\right)-\left(\operatorname{sum}(x)^{\wedge} 2\right) / 10\right) / 9\right.\) Result: \(0.000364\)

Reject \(H_0\), in favor of \(H_1\), if \(S^2>c\).

\[\begin{split} \begin{aligned} &\alpha= P \left( S ^2> c ; \sigma^2=0.01\right) \\ &= P \left(\frac{( n -1) S ^2}{\sigma^2}>\frac{9 c }{0.01} ; \sigma^2=0.01\right) \\ &= P \left( W >\frac{9 c }{0.01}\right) \end{aligned} \end{split}\]

Reject \(H _0\), in favor of \(H _1\), if \(S ^2> c\)

\[\begin{split} \begin{gathered} 0.04= P \left( W >\frac{9 c}{0.01}\right) \\ \frac{9 c}{0.01}=\chi_{0.04,9}^2=17.61 \\ \text { qchisq(1-0.04,9) } \end{gathered} \end{split}\]

Reject \(H_0\), in favor of \(H_1\), if \(S^2>c\)

\[\begin{split} \begin{aligned} &c=(17.61)(0.01) / 9 \approx 0.0196 \\ &s^2=0.000364 \end{aligned} \end{split}\]

Fail to reject \(H _0\), in favor of \(H _1\), at level 0.04. There is not sufficient evidence in the data to suggest that \(\sigma^2>0.01\).

Introduction

Calculus is a branch of mathematics that gives tools to study rate of change of functions trough two main areas: derivatives and integrals.

In the context of machine learning and data science, you can for instance use derivatives to optimize the parameters of a model with gradient descent. You might use integrals to calculate area under the curve.

Functions

A function is a rule that takes one or more inputs and produces a single output. For example, the function \(f(x) = x + 1\) takes a single input \(x\), adds one to it, and produces a single output. In algebra, functions are written using symbols and formulas. For example, the function \(f(x) = x + 1\) can be written as \(f:x \rightarrow x + 1\). The input to a function is called the argument or input variable. The output is called the value or output variable.

Functions are often written using the following notation:

\[y = f(x)\]

The notation above is read as β€œ\(y\) equals \(f\) of \(x\)” or β€œ\(y\) is a function of \(x\)”. The notation above is useful because it allows us to define a function without specifying its name. For example, we can define a function \(f\) as follows:

\[f(x) = x^2\]

We can then use the function \(f\) to compute the square of any number. For example, \(f(2) = 2^2 = 4\) and \(f(3) = 3^2 = 9\).

\[\begin{split} \mathrm{f}(x) = \sqrt{x + {6}} \\ \mathrm{f}(6) = \sqrt{10 + {6}} \\ \mathrm{f}(6) = 4.0 \end{split}\]
\[\begin{split} \begin{gathered} f(x)=\frac{x-3}{x+2} \\ f(3)=\frac{3-3}{3+2}=\frac{0}{5}=0 \end{gathered} \end{split}\]
Domain and Range of a Function

The domain of a function is the set of all possible inputs to the function. The range of a function is the set of all possible outputs of the function. For example, the function \(f(x) = x^2\) has a domain of all real numbers and a range of all non-negative real numbers. The domain of a function is often written as \(D(f)\) and the range is often written as \(R(f)\).

\[\begin{split} y = f(x) \\ y = x^2 \end{split}\]
import seaborn as sb

func = lambda x: x ** 2

x = [-1,-2,-3, -4, 1, 2, 3, 4]
y = [func(i) for i in x]

sb.lineplot(x=x, y=y)
<Axes: >
_images/d483e2339ec96791d84e38395238fbc86ef7f51f83b933db0f63b3a024047134.png
Piecewise Functions

A piecewise function is a function that is defined by multiple sub-functions, each sub-function applying to a different interval of the main function’s domain. For example, the function \(f(x) = |x|\) is defined by two sub-functions:

\[\begin{split} f(x) = \begin{cases} x & \text{if } x \geq 0 \\ -x & \text{if } x < 0 \end{cases} \end{split}\]

Expoents

An exponent is a number that indicates how many times a base number is multiplied by itself. For example, \(2^3\) is the same as \(2 \times 2 \times 2\) and \(2^4\) is the same as \(2 \times 2 \times 2 \times 2\). The number \(2\) is called the base and the number \(3\) is called the exponent. Exponents are often written using the following notation:

\[2^3 = 2 \times 2 \times 2 = 8\]

The notation above is read as β€œtwo to the power of three” or β€œtwo cubed”.

Negative Exponents

A negative exponent indicates that the base number should be divided by itself a certain number of times. For example, \(2^{-3}\) is the same as \(\frac{1}{2^3}\) and \(2^{-4}\) is the same as \(\frac{1}{2^4}\). The number \(2\) is called the base and the number \(-3\) is called the exponent. Negative exponents are often written using the following notation:

\[2^{-3} = \frac{1}{2^3} = \frac{1}{8}\]

The notation above is read as β€œtwo to the power of negative three” or β€œtwo to the power of minus three”.

Fractional Exponents

A fractional exponent indicates that the base number should be multiplied by itself a certain number of times. For example, \(2^{\frac{1}{2}}\) is the same as \(\sqrt{2}\) and \(2^{\frac{1}{3}}\) is the same as \(\sqrt[3]{2}\). The number \(2\) is called the base and the number \(\frac{1}{2}\) is called the exponent. Fractional exponents are often written using the following notation:

\[2^{\frac{1}{2}} = \sqrt{2} = 1.414213562373095\]

The notation above is read as β€œtwo to the power of one half” or β€œtwo to the power of one over two”.

Logarithms

A logarithm is the inverse of an exponent. For example, the logarithm of \(2^3\) is \(3\). The logarithm of a number \(x\) to the base \(b\) is written as \(\log_b(x)\). For example, \(\log_2(8) = 3\) because \(2^3 = 8\).

Common Logarithms

The common logarithm of a number \(x\) is the logarithm of \(x\) to the base \(10\). The common logarithm of \(x\) is written as \(\log(x)\). For example, \(\log(100) = 2\) because \(10^2 = 100\).

Natural Logarithms

The natural logarithm of a number \(x\) is the logarithm of \(x\) to the base \(e\). The natural logarithm of \(x\) is written as \(\ln(x)\). For example, \(\ln(100) = 4.60517\) because \(e^{4.60517} = 100\).

Polynomials

A polynomial is an expression consisting of variables and coefficients, that involves only the operations of addition, subtraction, multiplication, and non-negative integer exponents.

For example, \(x^2 + 2x + 1\) is a polynomial because it consists of the variables \(x\) and the coefficients \(1\) and \(2\).

The degree of a polynomial is the highest degree of its terms. For example, the polynomial \(x^2 + 2x + 1\) has a degree of \(2\) because its highest degree term is \(x^2\).

Derivatives and Partial Derivatives

Everything around us is changing, the universe is expanding, planets are moving, people are aging, even atoms don’t stay in the same state, they are always moving or changing. Everything is changing with time. So how do we measure it?

How things change?

Suppose we are going on a car trip with our family. The speed of the car is constantly changing. Similarly the temperature at any given point on a day is changing. The overall temperature of Earth is changing. So we need a way to measure that change. Let’s take the example of a family trip. Suppose the overall journey of our trip looks like this:

import seaborn as sns
import matplotlib.pyplot as plt

sns.set_style('darkgrid')

params = {'legend.fontsize': 'medium',
          'figure.figsize': (10, 8),
          'figure.dpi': 100,
         'axes.labelsize': 'medium',
         'axes.titlesize':'medium',
         'xtick.labelsize':'medium',
         'ytick.labelsize':'medium'}
plt.rcParams.update(params)

# Sample data (replace this with your own data)
time = [1, 2, 3, 4, 5]
distance = [10, 20, 25, 35, 40]

# Create a scatter plot
sns.scatterplot(x=time, y=distance, label='Distance vs. Time')

# Create a line plot on top of the scatter plot
sns.lineplot(x=time, y=distance, color='red', label='Distance Line')

# Add labels and title
plt.xlabel('Time')
plt.ylabel('Distance')
plt.title('Distance vs. Time with Distance Line')

# Show legend
plt.legend()

# Show the plot
plt.show()
_images/75a92b45d110bb61152a7c7970aa28c26ca0fe16445c2a6243b0666d95672f9f.png

Average vs Instantaneous rate of change

If we look at the graph, we can see that the car covered 40 miles in 5 hours. Now the average speed or the average rate of change will be the total distance divided by total time of the whole journey which in case is

\[ \frac{\text { Total Distance }}{\text { Total time taken }}=\frac{40}{5}=8 \text { miles per hour } \]

But what if we want to find the rate of change at any given time, this is where the instantaneous rate of change comes into play. The instantaneous rate of change is given by the change at any given time or point. Take the example of a speedometer which gives you the change in speed every moment. We can also calculate the instantaneous change using a graph using the concept of a slope.

Slope of a line

Slope of a line is simply defined as the rate of change in the vertical direction due to rate of change in the horizontal direction or simply

\[ \text { Slope }=\frac{\text { Rate of change in } y}{\text { Rate of change in } x}=\frac{\Delta y}{\Delta x}=\frac{y 2-y 1}{x 2-x 1} \]

But what about the instantaneous rate of change? Well, if you look at a curved line, the slope will be different for different points.

This instantaneous rate of change at any point is called the derivative at that point and is defined as:

\[ \frac{\text { Rate of change in } y}{\text { Very small change in } x}=\frac{d y}{d x}=\frac{d}{d x}(f(x)) \]

Derivative Explained

The derivative of a function is related to its rate of change. The rate of change tells you how much the output of the function changes when a change is done to the input. It is calculated as the ratio between a change in the output and the corresponding change in the input.

Graphically, it is the slope of the tangent at a given point of the function.

Let’s take an example. Suppose we have a function like this:

\[ y=f(x)=x^2 \]

The graph of this function looks like this

# seaborn grah of x squared function
x = [-5, -4, -3, -2, -1 , 0 , 1, 2, 3, 4, 5]
func = lambda i: i**2

y = [func(i) for i in x]

sns.lineplot(x=x, y=y)
plt.show()
_images/88f00ab9c32a5393e4974753fb932ec569049fdaa7524c488c3258680e9b439d.png

Now let’s say we want to find the slope or instantaneous change in y due to x. Let’s say x change from x to x+h then:

\[ y=f(x)=x^2 \]

Lets say y changes due change in x so

\[ x_2 = x +h \]

then

\[ y_2 = f(x + h) \]
\[\begin{split}m = \frac{y_2-y_1}{x_2-x_1} \\ = \frac{f(x+h)-f(x)}{(x+h)-x} \\ = \frac{(x+h)^2-x^2}{(x+h)-x} \\ = \frac{x^2 + h^2 + 2xh -x^2}{x-x -h} \\ = \frac{h (2x + h)}{h} \\ = 2x\end{split}\]

since h -> 0 so h will be zero.

Partial Derivatives

Now considering the derivative of a function with a single input, a partial derivative of a function is just the derivative of a function with multiple inputs with respect to a single variable i.e. the change in that function caused by the change in a single input. Let’s suppose a function

\[ f(x, y)=x^2 y+\sin y \]

Now we cannot find the derivative of this function directly since it depends on two inputs. So what we do is we find the derivative of this function assuming that one of the inputs is constant. Or simply that what change in the function is caused by the slight change in that single input. Let’s find the partial derivative of this function with respect to both inputs one by 1

To calculate the partial derivatives of the given function \((f(x, y) = x^2y + \sin(y)) \) with respect to x and y, we will find the derivative of each term separately and then combine them using the rules of partial differentiation.

\[ \frac{\partial(f(x, y))}{\partial x}=\lim _{h \rightarrow 0} \frac{f(x+h, y)-f(x, y)}{h} \]

Partial derivative with respect to x

denoted as \(\frac{\partial f}{\partial x}\):

We treat \(y\) as a constant when taking the derivative with respect to \(x\). Therefore, we differentiate \(x^2y\) with respect to \(x\) while keeping \(y\) constant:

\[ \frac{\partial}{\partial x}(x^2y) = 2xy \]

The derivative of \(\sin(y)\) with respect to \(x\) is 0 because \(\sin(y)\) does not depend on \(x\). So, \(\frac{\partial}{\partial x}(\sin(y)) = 0\).

Now, we can combine these partial derivatives:

\[ \frac{\partial f}{\partial x} = 2xy + 0 = 2xy \]

So, the partial derivative of \(f(x, y)\) with respect to \(x is \)2xy$.

\[ \frac{\partial f}{\partial x}=2 x y \]

Partial derivative with respect to y

denoted as \(\frac{\partial f}{\partial y}\)

Now, we treat x as a constant when taking the derivative with respect to y. Therefore, we differentiate \((x^2y\) with respect to \(y\) while keeping \(x\) constant:

\[ \frac{\partial}{\partial y}(x^2y) = x^2 \]

The derivative of \(\sin(y)\) with respect to y is \(\cos(y)\), so \(\frac{\partial}{\partial y}(\sin(y)) = \cos(y)\).

Now, we can combine these partial derivatives:

\[ \frac{\partial f}{\partial y} = x^2 + \cos(y) \]

So, the partial derivative of \(f(x, y)\) with respect to y is \(x^2 + \cos(y)\).

In summary, the partial derivatives of the given function \(f(x, y) = x^2y + \sin(y)\) are:

\[ \frac{\partial f}{\partial x} = 2xy \]

and

\[ \frac{\partial f}{\partial y} = x^2 + \cos(y) \]

Now if we want to find the partial derivative of the function at the point (-1, 2). We can just chug in values in both partial equations and find the change as:

\[ \frac{\partial f}{\partial x}(-1,2)=2 x y=2 \cdot(-1) \cdot(2)=-4 \]

Similarly

\[ \frac{\partial f}{\partial y}(-1,2)=x^2+\cos y=(-1)^2+\cos (2)=0.5838 \]

So we say that the change in the function with respect to input x is -4 times and with respect to y is +0.5838.

This means that the function is more sensitive to x than to y.

E.g https://www.youtube.com/watch?v=dfvnCHqzK54

https://www.youtube.com/watch?v=wqPt3qjB6uA&ab_channel=Dr.DataScience

https://www.youtube.com/watch?v=sIX_9n-1UbM

Derivative rules

Constant Rule

If you have a number (like 5 or 10) all by itself, its derivative is always 0. This means it doesn’t change when you take the derivative.

A constant represents a horizontal line on the graph, which has no slope (i.e., it’s perfectly flat). Therefore, the rate of change (derivative) is zero.

Proof: Let \(f(x) = c\), where c is a constant. Then, by definition, the derivative of \(f(x)\) is

\[ \frac{df}{dx} = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h} = \lim_{h \to 0} \frac{c - c}{h} = \lim_{h \to 0} \frac{0}{h} = 0 \]
Power Rule

If you have a number with an exponent (like \(x^2\) or \(x^3\)), you can bring the exponent down and subtract 1 from it. For example, if you have \(x^2\), the derivative is 2x because 2 times \(x^1\) is 2x.

The derivative of \(x^n\) with respect to x is \(nx^{n-1}\), where n is a constant. This rule is derived using the limit definition of the derivative and the binomial theorem.

Proof: Start with the limit definition of the derivative

\[ \frac{d}{dx}(x^n) = \lim_{h \to 0} \frac{(x+h)^n - x^n}{h} \]

Use the binomial theorem to expand \((x+h)^n\) $\( (x+h)^n = x^n + nx^{n-1}h + \text{higher order terms in } h \)$ Substitute this into the limit definition

\[ \frac{d}{dx}(x^n) = \lim_{h \to 0} \frac{(x^n + nx^{n-1}h + \text{higher order terms in } h) - x^n}{h} \]

Cancel the \(x^n\) terms and divide by \(h\)

\[ \frac{d}{dx}(x^n) = \lim_{h \to 0} \frac{nx^{n-1}h + \text{higher order terms in } h}{h} \]

Simplify and take the limit as h approaches 0: $\( \frac{d}{dx}(x^n) = nx^{n-1} \)$

Sum Rule

If you’re adding or subtracting two things, like f(x) + g(x) or f(x) - g(x), you can take the derivative of each thing separately and keep them separate. For example, if you have \(3x^2 + 4x\), you can find the derivative of \(3x^2\) (which is 6x) and the derivative of 4x (which is 4), and then you keep them together as 6x + 4.

Chain Rule

Sometimes, you have functions inside of functions. Imagine you have \(f(g(x))\). To find the derivative of that, you first find the derivative of the outer function (f) and then the derivative of the inner function (g). You multiply them together. It’s like doing things step by step.

The chain rule is a fundamental rule in calculus that allows us to find the derivative of a composite function. In other words, it tells us how to differentiate a function that is composed of two or more functions. The chain rule is often stated as follows:

If f(u) and g(x) are differentiable functions, then the derivative of their composition f(g(x)) is given by:

\[ \frac{d}{dx} [f(g(x))] = f'(g(x)) \cdot g'(x) \]

Here’s an explanation of the chain rule with an example and a proof:

Explanation:

The chain rule essentially states that to find the derivative of a composite function, you first take the derivative of the outer function with respect to its inner function and then multiply it by the derivative of the inner function with respect to the variable of interest (in this case, x).

Example

Let’s use an example to illustrate the chain rule. Consider the function \(y = f(u) = u^2\) and \(u = g(x) = x^3\). We want to find the derivative of \(y\) with respect to \(x\) which is \(\frac{dy}{dx}\).

  1. Find \(\frac{dy}{du}\): This is the derivative of the outer function \(f(u)\) with respect to its inner function \(u\), which is \(2u\).

  2. Find \(\frac{du}{dx}\): This is the derivative of the inner function \(g(x)\) with respect to \(x\), which is \(3x^2\).

  3. Apply the chain rule:

\[ \frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx} = (2u) \cdot (3x^2) = 2 \cdot 3x^2u = 6x^2u \]

Now, substitute back \(u = x^3\):

\[ \frac{dy}{dx} = 6x^2 \cdot (x^3) = 6x^5 \]

So, the derivative of \(y = x^6\) with respect to \(x\) is \(6x^5\).

Proof:

The proof of the chain rule relies on the definition of the derivative and the limit concept. Let’s prove it step by step:

  1. Start with the definition of the derivative of a function:

\[ \frac{d}{dx} [f(u)] = \lim_{{h \to 0}} \frac{f(u + h) - f(u)}{h} \]
  1. Now, we want to find the derivative of the composition \(f(g(x))\). Let \(v = g(x)\). So, we can write:

\[ \frac{d}{dx} [f(g(x))] = \frac{d}{dx} [f(v)] \]
  1. Using the definition of the derivative for \(f(v)\), we have:

\[ \frac{d}{dx} [f(v)] = \lim_{{h \to 0}} \frac{f(v + h) - f(v)}{h} \]
  1. Rewrite \(v + h\) as \(g(x + h)\) because \(v = g(x)\):

\[ \frac{d}{dx} [f(v)] = \lim_{{h \to 0}} \frac{f(g(x + h)) - f(g(x))}{h} \]
  1. Now, we can see that this is precisely the definition of the derivative of \(f(g(x))\). So, we’ve shown that:

\[ \frac{d}{dx} [f(g(x))] = \lim_{{h \to 0}} \frac{f(g(x + h)) - f(g(x))}{h} \]
  1. And this can be simplified to:

\[ \frac{d}{dx} [f(g(x))] = f'(g(x)) \cdot g'(x) \]

So, the chain rule is proven. It tells us how to find the derivative of a composite function by considering the derivatives of its components.

Backpropagation Chain rule

The chain rule is a crucial concept in neural network backpropagation, which is the algorithm used to train neural networks. It allows us to efficiently calculate the gradients of the loss function with respect to the network’s parameters (weights and biases) by decomposing the overall gradient into smaller gradients associated with each layer of the network. Let’s explain how the chain rule is used in neural network backpropagation with an example.

Suppose we have a simple feedforward neural network with one hidden layer. Here’s a simplified network architecture:

  • Input layer with \(n\) neurons

  • Hidden layer with \(m\) neurons

  • Output layer with \(k\) neurons

The network has weights \(W^{(1)}\) for the connections between the input and hidden layers and weights \(W^{(2)}\) for the connections between the hidden and output layers.

The forward pass of the network involves the following steps:

Compute the weighted sum and apply an activation function to the hidden layer:

\[ z^{(1)} = XW^{(1)} + b^{(1)} \]
\[ a^{(1)} = \sigma(z^{(1)}) \]

where \(X\) is the input, \(W^{(1)}\) are the weights of the first layer, \(b^{(1)}\) are the biases of the first layer, \(\sigma(\cdot)\) is the activation function (e.g., sigmoid or ReLU), and \(a^{(1)}\) is the output of the hidden layer.

Compute the weighted sum and apply an activation function to the output layer:

\[ z^{(2)} = a^{(1)}W^{(2)} + b^{(2)} \]
\[ a^{(2)} = \sigma(z^{(2)}) \]

where \(W^{(2)}\) are the weights of the second (output) layer, \(b^{(2)}\) are the biases of the second layer, and \(a^{(2)}\) is the final output of the network.

Now, let’s assume we have a loss function \(L\) that measures the error between the predicted output \(a^{(2)}\) and the true target values \(Y\). The goal of backpropagation is to update the network’s weights and biases to minimize this loss.

To do this, we need to compute the gradients of the loss with respect to the network’s parameters. The chain rule comes into play during this step. We calculate the gradients layer by layer, propagating the gradient backward through the network:

  1. Compute the gradient of the loss with respect to the output layer’s activations: $\( \frac{\partial L}{\partial a^{(2)}} \)$

  2. Use the chain rule to calculate the gradient of the loss with respect to the output layer’s weighted sum (\(z^{(2)}\)): $\( \frac{\partial L}{\partial z^{(2)}} = \frac{\partial L}{\partial a^{(2)}} \cdot \frac{\partial a^{(2)}}{\partial z^{(2)}} \)$

  3. Compute the gradient of the loss with respect to the second layer’s weights and biases (\(W^{(2)}\) and \(b^{(2)}\)): $\( \frac{\partial L}{\partial W^{(2)}} = \frac{\partial L}{\partial z^{(2)}} \cdot \frac{\partial z^{(2)}}{\partial W^{(2)}} \)\( \)\( \frac{\partial L}{\partial b^{(2)}} = \frac{\partial L}{\partial z^{(2)}} \cdot \frac{\partial z^{(2)}}{\partial b^{(2)}} \)$

  4. Use the chain rule again to calculate the gradient of the loss with respect to the hidden layer’s activations (\(a^{(1)}\)): $\( \frac{\partial L}{\partial a^{(1)}} = \frac{\partial L}{\partial z^{(2)}} \cdot \frac{\partial z^{(2)}}{\partial a^{(1)}} \)$

  5. Compute the gradient of the loss with respect to the hidden layer’s weighted sum (\(z^{(1)}\)): $\( \frac{\partial L}{\partial z^{(1)}} = \frac{\partial L}{\partial a^{(1)}} \cdot \frac{\partial a^{(1)}}{\partial z^{(1)}} \)$

  6. Finally, calculate the gradient of the loss with respect to the first layer’s weights and biases (\(W^{(1)}\) and \(b^{(1)}\)): $\( \frac{\partial L}{\partial W^{(1)}} = \frac{\partial L}{\partial z^{(1)}} \cdot \frac{\partial z^{(1)}}{\partial W^{(1)}} \)\( \)\( \frac{\partial L}{\partial b^{(1)}} = \frac{\partial L}{\partial z^{(1)}} \cdot \frac{\partial z^{(1)}}{\partial b^{(1)}} \)$

The chain rule allows us to compute these gradients efficiently by breaking down the overall gradient into smaller gradients associated with each layer. Once we have these gradients, we can use them to update the network’s weights and biases using optimization algorithms like gradient descent. This iterative process of forward and backward passes, driven by the chain rule, is how neural networks are trained to learn from data.

# Importing Required Libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import pandas as pd
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt

%matplotlib inline
#Load the dataset
data = load_iris()
data
{'data': array([[5.1, 3.5, 1.4, 0.2],
        [4.9, 3. , 1.4, 0.2],
        [4.7, 3.2, 1.3, 0.2],
        [4.6, 3.1, 1.5, 0.2],
        [5. , 3.6, 1.4, 0.2],
        [5.4, 3.9, 1.7, 0.4],
        [4.6, 3.4, 1.4, 0.3],
        [5. , 3.4, 1.5, 0.2],
        [4.4, 2.9, 1.4, 0.2],
        [4.9, 3.1, 1.5, 0.1],
        [5.4, 3.7, 1.5, 0.2],
        [4.8, 3.4, 1.6, 0.2],
        [4.8, 3. , 1.4, 0.1],
        [4.3, 3. , 1.1, 0.1],
        [5.8, 4. , 1.2, 0.2],
        [5.7, 4.4, 1.5, 0.4],
        [5.4, 3.9, 1.3, 0.4],
        [5.1, 3.5, 1.4, 0.3],
        [5.7, 3.8, 1.7, 0.3],
        [5.1, 3.8, 1.5, 0.3],
        [5.4, 3.4, 1.7, 0.2],
        [5.1, 3.7, 1.5, 0.4],
        [4.6, 3.6, 1. , 0.2],
        [5.1, 3.3, 1.7, 0.5],
        [4.8, 3.4, 1.9, 0.2],
        [5. , 3. , 1.6, 0.2],
        [5. , 3.4, 1.6, 0.4],
        [5.2, 3.5, 1.5, 0.2],
        [5.2, 3.4, 1.4, 0.2],
        [4.7, 3.2, 1.6, 0.2],
        [4.8, 3.1, 1.6, 0.2],
        [5.4, 3.4, 1.5, 0.4],
        [5.2, 4.1, 1.5, 0.1],
        [5.5, 4.2, 1.4, 0.2],
        [4.9, 3.1, 1.5, 0.2],
        [5. , 3.2, 1.2, 0.2],
        [5.5, 3.5, 1.3, 0.2],
        [4.9, 3.6, 1.4, 0.1],
        [4.4, 3. , 1.3, 0.2],
        [5.1, 3.4, 1.5, 0.2],
        [5. , 3.5, 1.3, 0.3],
        [4.5, 2.3, 1.3, 0.3],
        [4.4, 3.2, 1.3, 0.2],
        [5. , 3.5, 1.6, 0.6],
        [5.1, 3.8, 1.9, 0.4],
        [4.8, 3. , 1.4, 0.3],
        [5.1, 3.8, 1.6, 0.2],
        [4.6, 3.2, 1.4, 0.2],
        [5.3, 3.7, 1.5, 0.2],
        [5. , 3.3, 1.4, 0.2],
        [7. , 3.2, 4.7, 1.4],
        [6.4, 3.2, 4.5, 1.5],
        [6.9, 3.1, 4.9, 1.5],
        [5.5, 2.3, 4. , 1.3],
        [6.5, 2.8, 4.6, 1.5],
        [5.7, 2.8, 4.5, 1.3],
        [6.3, 3.3, 4.7, 1.6],
        [4.9, 2.4, 3.3, 1. ],
        [6.6, 2.9, 4.6, 1.3],
        [5.2, 2.7, 3.9, 1.4],
        [5. , 2. , 3.5, 1. ],
        [5.9, 3. , 4.2, 1.5],
        [6. , 2.2, 4. , 1. ],
        [6.1, 2.9, 4.7, 1.4],
        [5.6, 2.9, 3.6, 1.3],
        [6.7, 3.1, 4.4, 1.4],
        [5.6, 3. , 4.5, 1.5],
        [5.8, 2.7, 4.1, 1. ],
        [6.2, 2.2, 4.5, 1.5],
        [5.6, 2.5, 3.9, 1.1],
        [5.9, 3.2, 4.8, 1.8],
        [6.1, 2.8, 4. , 1.3],
        [6.3, 2.5, 4.9, 1.5],
        [6.1, 2.8, 4.7, 1.2],
        [6.4, 2.9, 4.3, 1.3],
        [6.6, 3. , 4.4, 1.4],
        [6.8, 2.8, 4.8, 1.4],
        [6.7, 3. , 5. , 1.7],
        [6. , 2.9, 4.5, 1.5],
        [5.7, 2.6, 3.5, 1. ],
        [5.5, 2.4, 3.8, 1.1],
        [5.5, 2.4, 3.7, 1. ],
        [5.8, 2.7, 3.9, 1.2],
        [6. , 2.7, 5.1, 1.6],
        [5.4, 3. , 4.5, 1.5],
        [6. , 3.4, 4.5, 1.6],
        [6.7, 3.1, 4.7, 1.5],
        [6.3, 2.3, 4.4, 1.3],
        [5.6, 3. , 4.1, 1.3],
        [5.5, 2.5, 4. , 1.3],
        [5.5, 2.6, 4.4, 1.2],
        [6.1, 3. , 4.6, 1.4],
        [5.8, 2.6, 4. , 1.2],
        [5. , 2.3, 3.3, 1. ],
        [5.6, 2.7, 4.2, 1.3],
        [5.7, 3. , 4.2, 1.2],
        [5.7, 2.9, 4.2, 1.3],
        [6.2, 2.9, 4.3, 1.3],
        [5.1, 2.5, 3. , 1.1],
        [5.7, 2.8, 4.1, 1.3],
        [6.3, 3.3, 6. , 2.5],
        [5.8, 2.7, 5.1, 1.9],
        [7.1, 3. , 5.9, 2.1],
        [6.3, 2.9, 5.6, 1.8],
        [6.5, 3. , 5.8, 2.2],
        [7.6, 3. , 6.6, 2.1],
        [4.9, 2.5, 4.5, 1.7],
        [7.3, 2.9, 6.3, 1.8],
        [6.7, 2.5, 5.8, 1.8],
        [7.2, 3.6, 6.1, 2.5],
        [6.5, 3.2, 5.1, 2. ],
        [6.4, 2.7, 5.3, 1.9],
        [6.8, 3. , 5.5, 2.1],
        [5.7, 2.5, 5. , 2. ],
        [5.8, 2.8, 5.1, 2.4],
        [6.4, 3.2, 5.3, 2.3],
        [6.5, 3. , 5.5, 1.8],
        [7.7, 3.8, 6.7, 2.2],
        [7.7, 2.6, 6.9, 2.3],
        [6. , 2.2, 5. , 1.5],
        [6.9, 3.2, 5.7, 2.3],
        [5.6, 2.8, 4.9, 2. ],
        [7.7, 2.8, 6.7, 2. ],
        [6.3, 2.7, 4.9, 1.8],
        [6.7, 3.3, 5.7, 2.1],
        [7.2, 3.2, 6. , 1.8],
        [6.2, 2.8, 4.8, 1.8],
        [6.1, 3. , 4.9, 1.8],
        [6.4, 2.8, 5.6, 2.1],
        [7.2, 3. , 5.8, 1.6],
        [7.4, 2.8, 6.1, 1.9],
        [7.9, 3.8, 6.4, 2. ],
        [6.4, 2.8, 5.6, 2.2],
        [6.3, 2.8, 5.1, 1.5],
        [6.1, 2.6, 5.6, 1.4],
        [7.7, 3. , 6.1, 2.3],
        [6.3, 3.4, 5.6, 2.4],
        [6.4, 3.1, 5.5, 1.8],
        [6. , 3. , 4.8, 1.8],
        [6.9, 3.1, 5.4, 2.1],
        [6.7, 3.1, 5.6, 2.4],
        [6.9, 3.1, 5.1, 2.3],
        [5.8, 2.7, 5.1, 1.9],
        [6.8, 3.2, 5.9, 2.3],
        [6.7, 3.3, 5.7, 2.5],
        [6.7, 3. , 5.2, 2.3],
        [6.3, 2.5, 5. , 1.9],
        [6.5, 3. , 5.2, 2. ],
        [6.2, 3.4, 5.4, 2.3],
        [5.9, 3. , 5.1, 1.8]]),
 'target': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
        2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
        2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]),
 'frame': None,
 'target_names': array(['setosa', 'versicolor', 'virginica'], dtype='<U10'),
 'DESCR': '.. _iris_dataset:\n\nIris plants dataset\n--------------------\n\n**Data Set Characteristics:**\n\n:Number of Instances: 150 (50 in each of three classes)\n:Number of Attributes: 4 numeric, predictive attributes and the class\n:Attribute Information:\n    - sepal length in cm\n    - sepal width in cm\n    - petal length in cm\n    - petal width in cm\n    - class:\n            - Iris-Setosa\n            - Iris-Versicolour\n            - Iris-Virginica\n\n:Summary Statistics:\n\n============== ==== ==== ======= ===== ====================\n                Min  Max   Mean    SD   Class Correlation\n============== ==== ==== ======= ===== ====================\nsepal length:   4.3  7.9   5.84   0.83    0.7826\nsepal width:    2.0  4.4   3.05   0.43   -0.4194\npetal length:   1.0  6.9   3.76   1.76    0.9490  (high!)\npetal width:    0.1  2.5   1.20   0.76    0.9565  (high!)\n============== ==== ==== ======= ===== ====================\n\n:Missing Attribute Values: None\n:Class Distribution: 33.3% for each of 3 classes.\n:Creator: R.A. Fisher\n:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)\n:Date: July, 1988\n\nThe famous Iris database, first used by Sir R.A. Fisher. The dataset is taken\nfrom Fisher\'s paper. Note that it\'s the same as in R, but not as in the UCI\nMachine Learning Repository, which has two wrong data points.\n\nThis is perhaps the best known database to be found in the\npattern recognition literature.  Fisher\'s paper is a classic in the field and\nis referenced frequently to this day.  (See Duda & Hart, for example.)  The\ndata set contains 3 classes of 50 instances each, where each class refers to a\ntype of iris plant.  One class is linearly separable from the other 2; the\nlatter are NOT linearly separable from each other.\n\n|details-start|\n**References**\n|details-split|\n\n- Fisher, R.A. "The use of multiple measurements in taxonomic problems"\n  Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to\n  Mathematical Statistics" (John Wiley, NY, 1950).\n- Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.\n  (Q327.D83) John Wiley & Sons.  ISBN 0-471-22361-1.  See page 218.\n- Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System\n  Structure and Classification Rule for Recognition in Partially Exposed\n  Environments".  IEEE Transactions on Pattern Analysis and Machine\n  Intelligence, Vol. PAMI-2, No. 1, 67-71.\n- Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule".  IEEE Transactions\n  on Information Theory, May 1972, 431-433.\n- See also: 1988 MLC Proceedings, 54-64.  Cheeseman et al"s AUTOCLASS II\n  conceptual clustering system finds 3 classes in the data.\n- Many, many more ...\n\n|details-end|\n',
 'feature_names': ['sepal length (cm)',
  'sepal width (cm)',
  'petal length (cm)',
  'petal width (cm)'],
 'filename': 'iris.csv',
 'data_module': 'sklearn.datasets.data'}
# Making it a 2 class problem
X_data = data["data"][:100]
Y_data = data["target"][:100]
len(X_data)
100
# Splitting the data into training and testing with 80 and 20%
X_train, X_test, Y_train, Y_test = train_test_split(X_data, Y_data, test_size = 0.2)
print(X_train.shape, X_test.shape)
(80, 4) (20, 4)
# Making their tensors
X_train_tensor = torch.tensor(X_train, dtype=torch.float32).t()
Y_train_tensor = torch.tensor(Y_train, dtype=torch.float32)
X_test_tensor = torch.tensor(X_test, dtype=torch.float32).t()
Y_test_tensor = torch.tensor(Y_test, dtype=torch.float32)
X_train_tensor.shape[1]
80
def init_params(n_x, n_h, n_y):
  '''
    This function is used to initializw the weights for the NN.
    n_x : input units
    n_h : hidden units
    n_y : output units
    It returns a dictionary which contains all the parameters
  '''
  W1 = torch.rand(n_x, n_h)
  b1 = torch.rand(n_h, 1)
  W2 = torch.rand(n_y, n_h)
  b2 = torch.rand(n_y, 1)

  params = {
      'W1': W1,
      'b1': b1,
      'W2': W2,
      'b2': b2
  }
  return params

def compute_cost(y_pred, y_actual):
  '''
    Uses the binary cross entropy loss to compute the cost
  '''
  return -(y_pred.log()* y_actual + (1-y_actual)*(1-y_pred).log()).mean()
  return -1/len(y_pred) * (y_actual * torch.log(y_pred) + (1 - y_actual) * torch.log(1 - y_pred)).sum()

def forward_propagation(params, x_input):
  '''
    Performs the forward propagation step. Uses the parameters to predict the output A2
  '''
  #Extractng the parameters
  W1 = params['W1']
  b1 = params['b1']
  W2 = params['W2']
  b2 = params['b2']

  # Computing the first layer
  Z1 = torch.mm(W1.t(), x_input) + b1
  A1 = torch.sigmoid(Z1)

  # Computing the second layer
  Z2 = torch.mm(W2, A1) + b2
  A2 = torch.sigmoid(Z2)

  # Returning the data
  data = {
      'Z1' : Z1,
      'A1' : A1,
      'Z2' : Z2,
      'A2' : A2
  }

  return A2, data

def back_propagation(params, data, x_input, y_input, learning_rate):
  '''
    Performs the back propagation step. Computes the gradients and updates the parameters
  '''
  m = x_input.shape[1]

  # Extracting the parameters
  W1 = params['W1']
  W2 = params['W2']
  b1 = params['b1']
  b2 = params['b2']

  # Extrcting the required predictions of first and second layers
  A1 = data['A1']
  A2 = data['A2']

  # Calculating the Gradients
  dZ2 = A2 - y_input
  dW2 = 1/m*(torch.mm(dZ2, A1.t()))
  db2 = 1/m*(torch.sum(dZ2, keepdims=True, axis=1))
  dZ1 = torch.mm(W2.t(), dZ2)*(1-torch.pow(A1,2))
  dW1 = 1/m*(torch.mm(dZ1, x_input.t()))
  db1 = 1/m*(torch.sum(dZ1, keepdims=True, axis=1))

  # Updating the parameters
  W1 -= learning_rate*dW1
  W2 -= learning_rate*dW2
  b1 -= learning_rate*db1
  b2 -= learning_rate*db2

  parameters = {"W1": W1,
                  "b1": b1,
                  "W2": W2,
                  "b2": b2}

  return parameters
def model(x_input, y_input, learning_rate = 0.01, no_iterations = 20000):
  '''
    Putting everything together and making the model
  '''
  n_x = x_input.shape[0]
  n_h = 4
  n_y = 1
  parameters = init_params(n_x, n_h, n_y)
  costs, iterations = [], []
  for i in range(no_iterations):
    A2, data = forward_propagation(parameters, x_input)
    
    cost = compute_cost(A2, y_input)
    cost_torch = F.binary_cross_entropy(A2, y_input.view(1, 80))

    parameters = back_propagation(parameters, data, x_input, y_input, learning_rate)

    if i%(no_iterations/20) == 0:
      print(f'Cost at iteration {i} is {cost}')
      print(f'Cost_torch at iteration {i} is {cost_torch.item()}')
      
    costs.append(cost)
    iterations.append(i)
  return parameters, costs, iterations
parameters, costs, iterations = model(X_train_tensor, Y_train_tensor, 1e-4, 20000)
plt.plot(iterations, costs)
Cost at iteration 0 is 1.0869792699813843
Cost_torch at iteration 0 is 1.0869792699813843
Cost at iteration 1000 is 1.0277596712112427
Cost_torch at iteration 1000 is 1.0277596712112427
Cost at iteration 2000 is 0.9747225642204285
Cost_torch at iteration 2000 is 0.9747225642204285
Cost at iteration 3000 is 0.9278743863105774
Cost_torch at iteration 3000 is 0.9278743863105774
Cost at iteration 4000 is 0.8870576620101929
Cost_torch at iteration 4000 is 0.8870576620101929
Cost at iteration 5000 is 0.8519670367240906
Cost_torch at iteration 5000 is 0.8519670367240906
Cost at iteration 6000 is 0.8221769332885742
Cost_torch at iteration 6000 is 0.8221769332885742
Cost at iteration 7000 is 0.7971805334091187
Cost_torch at iteration 7000 is 0.7971805334091187
Cost at iteration 8000 is 0.7764268517494202
Cost_torch at iteration 8000 is 0.7764268517494202
Cost at iteration 9000 is 0.7593569755554199
Cost_torch at iteration 9000 is 0.7593569755554199
Cost at iteration 10000 is 0.7454321980476379
Cost_torch at iteration 10000 is 0.7454321980476379
Cost at iteration 11000 is 0.7341533899307251
Cost_torch at iteration 11000 is 0.7341533899307251
Cost at iteration 12000 is 0.7250728011131287
Cost_torch at iteration 12000 is 0.7250728011131287
Cost at iteration 13000 is 0.717799723148346
Cost_torch at iteration 13000 is 0.717799723148346
Cost at iteration 14000 is 0.7119990587234497
Cost_torch at iteration 14000 is 0.7119990587234497
Cost at iteration 15000 is 0.7073894143104553
Cost_torch at iteration 15000 is 0.7073894143104553
Cost at iteration 16000 is 0.703737199306488
Cost_torch at iteration 16000 is 0.703737199306488
Cost at iteration 17000 is 0.700850784778595
Cost_torch at iteration 17000 is 0.700850784778595
Cost at iteration 18000 is 0.6985741853713989
Cost_torch at iteration 18000 is 0.6985741853713989
Cost at iteration 19000 is 0.6967815160751343
Cost_torch at iteration 19000 is 0.6967815160751343
[<matplotlib.lines.Line2D at 0x7fac1775d290>]
_images/042ead6e3ce6ba62dda961452536731bace6eea27b29ac37b9047659cd0bee62.png

Algebra Introduction

This section introduces the basic concepts of algebra, including variables, constants, and functions

Functions

A function is a rule that takes one or more inputs and produces a single output. For example, the function \(f(x) = x + 1\) takes a single input \(x\), adds one to it, and produces a single output. In algebra, functions are written using symbols and formulas. For example, the function \(f(x) = x + 1\) can be written as \(f:x \rightarrow x + 1\). The input to a function is called the argument or input variable. The output is called the value or output variable.

Functions are often written using the following notation:

\[y = f(x)\]

The notation above is read as β€œ\(y\) equals \(f\) of \(x\)” or β€œ\(y\) is a function of \(x\)”. The notation above is useful because it allows us to define a function without specifying its name. For example, we can define a function \(f\) as follows:

\[f(x) = x^2\]

We can then use the function \(f\) to compute the square of any number. For example, \(f(2) = 2^2 = 4\) and \(f(3) = 3^2 = 9\).

\[\begin{split} \mathrm{f}(x) = \sqrt{x + {6}} \\ \mathrm{f}(6) = \sqrt{10 + {6}} \\ \mathrm{f}(6) = 4.0 \end{split}\]
\[\begin{split} \begin{gathered} f(x)=\frac{x-3}{x+2} \\ f(3)=\frac{3-3}{3+2}=\frac{0}{5}=0 \end{gathered} \end{split}\]
Domain and Range of a Function

The domain of a function is the set of all possible inputs to the function. The range of a function is the set of all possible outputs of the function. For example, the function \(f(x) = x^2\) has a domain of all real numbers and a range of all non-negative real numbers. The domain of a function is often written as \(D(f)\) and the range is often written as \(R(f)\).

\[\begin{split} y = f(x) \\ y = x^2 \end{split}\]
import seaborn as sb

func = lambda x: x ** 2

x = [-1,-2,-3, -4, 1, 2, 3, 4]
y = [func(i) for i in x]

sb.lineplot(x=x, y=y)
<Axes: >
_images/75545fb92c47178fa1ea8b40b7deeb97660eec807db642b9a5102537e801cf63.png
Piecewise Functions

A piecewise function is a function that is defined by multiple sub-functions, each sub-function applying to a different interval of the main function’s domain. For example, the function \(f(x) = |x|\) is defined by two sub-functions:

\[\begin{split} f(x) = \begin{cases} x & \text{if } x \geq 0 \\ -x & \text{if } x < 0 \end{cases}\end{split}\]

Expoents

An exponent is a number that indicates how many times a base number is multiplied by itself. For example, \(2^3\) is the same as \(2 \times 2 \times 2\) and \(2^4\) is the same as \(2 \times 2 \times 2 \times 2\). The number \(2\) is called the base and the number \(3\) is called the exponent. Exponents are often written using the following notation:

\[2^3 = 2 \times 2 \times 2 = 8\]

The notation above is read as β€œtwo to the power of three” or β€œtwo cubed”.

Negative Exponents

A negative exponent indicates that the base number should be divided by itself a certain number of times. For example, \(2^{-3}\) is the same as \(\frac{1}{2^3}\) and \(2^{-4}\) is the same as \(\frac{1}{2^4}\). The number \(2\) is called the base and the number \(-3\) is called the exponent. Negative exponents are often written using the following notation:

\[2^{-3} = \frac{1}{2^3} = \frac{1}{8}\]

The notation above is read as β€œtwo to the power of negative three” or β€œtwo to the power of minus three”.

Fractional Exponents

A fractional exponent indicates that the base number should be multiplied by itself a certain number of times. For example, \(2^{\frac{1}{2}}\) is the same as \(\sqrt{2}\) and \(2^{\frac{1}{3}}\) is the same as \(\sqrt[3]{2}\). The number \(2\) is called the base and the number \(\frac{1}{2}\) is called the exponent. Fractional exponents are often written using the following notation:

\[2^{\frac{1}{2}} = \sqrt{2} = 1.414213562373095\]

The notation above is read as β€œtwo to the power of one half” or β€œtwo to the power of one over two”.

Logarithms

A logarithm is the inverse of an exponent. For example, the logarithm of \(2^3\) is \(3\). The logarithm of a number \(x\) to the base \(b\) is written as \(\log_b(x)\). For example, \(\log_2(8) = 3\) because \(2^3 = 8\).

Common Logarithms

The common logarithm of a number \(x\) is the logarithm of \(x\) to the base \(10\). The common logarithm of \(x\) is written as \(\log(x)\). For example, \(\log(100) = 2\) because \(10^2 = 100\).

Natural Logarithms

The natural logarithm of a number \(x\) is the logarithm of \(x\) to the base \(e\). The natural logarithm of \(x\) is written as \(\ln(x)\). For example, \(\ln(100) = 4.60517\) because \(e^{4.60517} = 100\).

Polynomials

A polynomial is an expression consisting of variables and coefficients, that involves only the operations of addition, subtraction, multiplication, and non-negative integer exponents.

For example, \(x^2 + 2x + 1\) is a polynomial because it consists of the variables \(x\) and the coefficients \(1\) and \(2\).

The degree of a polynomial is the highest degree of its terms. For example, the polynomial \(x^2 + 2x + 1\) has a degree of \(2\) because its highest degree term is \(x^2\).

Proof by Induction

A proof by induction consists of two cases. The first, the base case, proves the statement for n = 0 without assuming any knowledge of other cases. The second case, the induction step, proves that if the statement holds for any given case n = k, then it must also hold for the next case n = k + 1. These two steps establish that the statement holds for every natural number n.

What is Machine Learning?

What is Deep Learning

What is machine learning and deep learning

Overview

Deep Learning

a type of machine learning based on artificial neural networks in which multiple layers of processing are used to extract progressively higher level features from data.

Machine Learning

development of computer systems that can learn to more accurately predict the outcomes without following explicit instructions, by using algorithms and statistical models to draw inferences from patterns in data.

Differences between Deep Learning and Machine Learning
Machine Learning
  • uses algorithms to parse data, learn from that data, and make informed decisions based on what it has learned.

  • needs a human to identify and hand-code the applied features based on the data type. | tries to learn features extraction and representation as well.

  • tend to parse data in parts, then combined those into a result (e.g. first number plate localization and then recognition).

  • requires relatively less data and training time

Deep learning
  • structures algorithms in layers to create an β€œartificial neural network” that can learn and make intelligent decisions on its own.

  • tries to learn features extraction and representation as well.

  • Deep learning systems look at an entire problem and generate the final result in one go (e.g. outputs the coordinates and the class of object together).

  • requires a lot more data and training time

Applications Of Machine Learning/Deep Learning

  • Email spam detection

  • Fingerprint / face detection & matching (e.g., phones)

  • Web search (e.g., DuckDuckGo, Bing, Google)

  • Sports predictions

  • ATMs (e.g., reading checks)

  • Credit card fraud

  • Stock predictions

Broad categories of Deep learning

Combination

Perceptron

Definition

Simplest artificial neuron that takes binary inputs and based on their weighted sum reaching a threshold, generates a binary output.

Artificial neurons
  • Takes in multiple inputs and learns what should be the appropriate output

  • Essentially a mathematical function where the weights multiplied with the inputs are learnable

  • Acts like a logic gate but the operation performed adjusts according to the data

Combination
  • connect them in a network to create an artificial brain(let)

History of the Perceptron
  • Invented in 1957 by Frank Rosenblatt to binary classify an input data.

  • An attempt to replicate the process and ability of human nervous system.

A Biological Neuron
Combination
McCulloch & Pitts Neuron Model
Combination
Computational Model of a Biological Neuron
Combination
Terminology
  • Net input \(=\) weighted inputs, \(z\)

  • Activations = activation function(net input); \(a=\sigma(z)\)

  • Label output \(=\) threshold(activations of last layer); \(\hat{y}=f(a)\)

Special cases:

  • In perceptron: activation function = threshold function

  • In linear regression: activation \(=\) net input \(=\) output

\[ \sigma\left(\sum_{i=1}^m x_i w_i+b\right)=\sigma\left(\mathbf{x}^T \mathbf{w}+b\right)=\hat{y} \]

Often more convenient notation: define bias unit as \(w_0\) and prepend a 1 to each input vector as an additional feature value

\[ \sigma\left(\sum_{i=0}^m x_i w_i\right)=\sigma\left(\mathbf{x}^{\top} \mathbf{w}\right)=\hat{y} \]
Perceptron Learning Algorithm

Let \(\mathcal{D}=\left(\left\langle\mathbf{x}^{[1]}, y^{[1]}\right\rangle,\left\langle\mathbf{x}^{[2]}, y^{[2]}\right\rangle, \ldots,\left\langle\mathbf{x}^{[n]}, y^{[n]}\right\rangle\right) \in\left(\mathbb{R}^m \times\{0,1\}\right)^n\)

  1. Initialize \(\mathbf{w}:=0^m \quad\) (assume notation where weight incl. bias)

  2. For every training epoch:

    • For every \(\left\langle\mathbf{x}^{[i]}, y^{[i]}\right\rangle \in \mathcal{D}\) :

      1. \(\hat{y}^{[i]}:=\sigma\left(\mathbf{x}^{[i] \top} \mathbf{w}\right)\)

      2. err \(:=\left(y^{[i]}-\hat{y}^{[i]}\right)\)

      3. \(\mathbf{w}:=\mathbf{w}+e r r \times \mathbf{x}^{[i]}\)

Vectorization in Python

Running Computations is a Big Part of Deep Learning!

import torch

def forloop(x, w):
    z = 0.
    for i in range(len(x)):
        z += x[i] * w[i]
    return z


def listcomprehension(x, w):
    return sum(x_i*w_i for x_i, w_i in zip(x, w))


def vectorized(x, w):
    return x.dot(w)


x, w = torch.rand(1000), torch.rand(1000)

%timeit -r 10 -n 10 forloop(x, w)

%timeit -r 10 -n 10 listcomprehension(x, w)

%timeit -r 10 -n 10 vectorized(x, w)
8.34 ms Β± 70.5 Β΅s per loop (mean Β± std. dev. of 10 runs, 10 loops each)
6.84 ms Β± 92.6 Β΅s per loop (mean Β± std. dev. of 10 runs, 10 loops each)
The slowest run took 12.19 times longer than the fastest. This could mean that an intermediate result is being cached.
6.44 Β΅s Β± 10.1 Β΅s per loop (mean Β± std. dev. of 10 runs, 10 loops each)
Perceptron Pytorch Implementation
Label data
import torch
import matplotlib.pyplot as plt

c1_mean , c2_mean = -0.5 , 0.5

c1 = torch.distributions.uniform.Uniform(c1_mean-1,c1_mean+1).sample((200,2))
c2 = torch.distributions.uniform.Uniform(c2_mean-1,c2_mean+1).sample((200,2))
features = torch.cat([c1,c2], axis=0)

labels = torch.cat([torch.zeros((200,1)), torch.ones((200,1))], axis = 0)
data = torch.cat([features, labels],axis=1)

X, y = data[:, :2], data[:, 2]
y = y.to(torch.int)

print('X.shape:', X.shape)
print('y.shape:', y.shape)


X_train, X_test = X[:300], X[100:]
y_train, y_test = y[:300], y[100:]

# Normalize (mean zero, unit variance)
mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)
X_train = (X_train - mu) / sigma
X_test = (X_test - mu) / sigma

plt.scatter(X_train[y_train==0, 0], X_train[y_train==0, 1], label='class 0', marker='o')
plt.scatter(X_train[y_train==1, 0], X_train[y_train==1, 1], label='class 1', marker='s')
plt.title('Training set')
plt.xlabel('feature 1')
plt.ylabel('feature 2')
plt.xlim([-3, 3])
plt.ylim([-3, 3])
plt.legend()
plt.show()


plt.scatter(X_test[y_test==0, 0], X_test[y_test==0, 1], label='class 0', marker='o')
plt.scatter(X_test[y_test==1, 0], X_test[y_test==1, 1], label='class 1', marker='s')
plt.title('Test set')
plt.xlabel('feature 1')
plt.ylabel('feature 2')
plt.xlim([-3, 3])
plt.ylim([-3, 3])
plt.legend()
plt.show()
X.shape: torch.Size([400, 2])
y.shape: torch.Size([400])
_images/f4883a0678cf1992089d122cb742713fa3e357ce7fde9c070575ff0033a4c001.png _images/b87084ac852b651c6c14c57dc5ef0200fbdef85e736260ebf70443b2fc00c4fe.png
Train and evaluate
import torch
device = torch.device("cuda:0" if torch.cuda.is_available() else "mps")

class Perceptron:
    def __init__(self, num_features):
        self.num_features = num_features
        self.weights = torch.zeros(num_features, 1, 
                                   dtype=torch.float32, device=device)
        self.bias = torch.zeros(1, dtype=torch.float32, device=device)
        
        self.ones = torch.ones((1, 1), device=device)
        self.zeros = torch.zeros((1, 1), device=device)

    def forward(self, x):
        linear = torch.mm(x, self.weights) + self.bias
        predictions = torch.where(linear > 0., self.ones, self.zeros)
        return predictions
        
    def backward(self, x, y):  
        predictions = self.forward(x)
        errors = y - predictions
        return errors
        
    def train(self, x, y, epochs):
        for e in range(epochs):
            
            for i in range(y.shape[0]):
                errors = self.backward(x[i].reshape(1, self.num_features), y[i]).reshape(-1)
                self.weights += (errors * x[i]).reshape(self.num_features, 1)
                self.bias += errors
                
    def evaluate(self, x, y):
        predictions = self.forward(x).reshape(-1)
        accuracy = torch.sum(predictions == y).float() / y.shape[0]
        return accuracy



ppn = Perceptron(num_features=2)

X_train_tensor = torch.tensor(X_train, dtype=torch.float32, device=device)
y_train_tensor = torch.tensor(y_train, dtype=torch.float32, device=device)

ppn.train(X_train_tensor, y_train_tensor, epochs=5)

print('Model parameters:')
print('Weights: %s' % ppn.weights)
print('Bias: %s' % ppn.bias)

X_test_tensor = torch.tensor(X_test, dtype=torch.float32, device=device)
y_test_tensor = torch.tensor(y_test, dtype=torch.float32, device=device)

test_acc = ppn.evaluate(X_test_tensor, y_test_tensor)
print('Test set accuracy: %.2f%%' % (test_acc*100))

Vectors, Matrices, and Tensors

Scalars, vectors, matrices, and tensors are the fundamental data structures of deep learning. In this section, we will briefly review these concepts.

Vectors, Matrices, and Tensors

Scalar

rank-0 tensor
\(x \in \mathbb{R}\)
x = 1.23

Vector

rank-1 tensor
\( x \in \mathbb{R}^nx1 \)

\[\begin{split} \mathbf{x}=\left[\begin{array}{c} x_1 \\ x_2 \\ \vdots \\ x_n \end{array}\right] \end{split}\]

Matrix

rank-2 tensor
\( x \in \mathbb{R}^{nxm} \)

\[\begin{split} \mathbf{X}=\left[\begin{array}{cccc} x_{1,1} & x_{1,2} & \ldots & x_{1, n} \\ x_{2,1} & x_{2,2} & \ldots & x_{2, n} \\ \vdots & \vdots & \ddots & \vdots \\ x_{m, 1} & x_{m, 2} & \ldots & x_{m, n} \end{array}\right] \end{split}\]
Tensors Tensors
import torch

t = torch.tensor([ [1, 2, 3, 4,], [6, 7, 8, 9] ])

print(t)
print(t.shape)
print(t.ndim)
print(t.dtype)
tensor([[1, 2, 3, 4],
        [6, 7, 8, 9]])
torch.Size([2, 4])
2
torch.int64

Data onto the GPU

print(torch.cuda.is_available())
print(torch.backends.mps.is_available())

if torch.cuda.is_available():
  t = t.to(torch.device('cuda:0'))
  print(t)
False
False

Broadcasting

Making Vector and Matrix computations more convenient

Computing the Output From Multiple Training Examples at Once
  • The perceptron algorithm is typically considered an β€œonline” algorithm (i.e., it updates the weights after each training example)

  • However, during prediction (e.g., test set evaluation), we could pass all data points at once (so that we can get rid of the β€œforloop”)

  • Two opportunities for parallelism:

    1. computing the dot product in parallel

    2. computing multiple dot products at once

import torch

X = torch.arange(6).view(2, 3)

print(X)

w = torch.tensor([1, 2, 3])

print(w)

print(X.matmul(w))

w = w.view(-1, 1)

print(X.matmul(w))
tensor([[0, 1, 2],
        [3, 4, 5]])
tensor([1, 2, 3])
tensor([ 8, 26])
tensor([[ 8],
        [26]])
Tensors Tensors

This (general) feature is called β€œbroadcasting”

print(torch.tensor([1, 2, 3]) + 1)

t = torch.tensor([[4, 5, 6], [7, 8, 9]])

print(t)

print( t + torch.tensor([1, 2, 3]))
tensor([2, 3, 4])
tensor([[4, 5, 6],
        [7, 8, 9]])
tensor([[ 5,  7,  9],
        [ 8, 10, 12]])

Notational Linear Algebra

Tensors
X = torch.arange(50, dtype=torch.float).view(10, 5)

print(X)

fc = torch.nn.Linear(in_features=5, out_features=3)

print(fc.weight)

print(fc.bias)

print(f"X dim: {X.size()}")
print(f"Weights dim: {fc.weight.size()}")
print(f"bias dim: {fc.bias.size()}")

A = fc(X)

print(f"A dim: {A.size()}")
tensor([[ 0.,  1.,  2.,  3.,  4.],
        [ 5.,  6.,  7.,  8.,  9.],
        [10., 11., 12., 13., 14.],
        [15., 16., 17., 18., 19.],
        [20., 21., 22., 23., 24.],
        [25., 26., 27., 28., 29.],
        [30., 31., 32., 33., 34.],
        [35., 36., 37., 38., 39.],
        [40., 41., 42., 43., 44.],
        [45., 46., 47., 48., 49.]])
Parameter containing:
tensor([[ 0.3764, -0.1360, -0.3212, -0.3096,  0.3564],
        [-0.3580, -0.0980,  0.2572, -0.0816,  0.2260],
        [ 0.4385, -0.3840, -0.3494,  0.3915,  0.1394]], requires_grad=True)
Parameter containing:
tensor([0.1618, 0.4385, 0.3456], requires_grad=True)
X dim: torch.Size([10, 5])
Weights dim: torch.Size([3, 5])
bias dim: torch.Size([3])
A dim: torch.Size([10, 3])

Loss Functions

Introduction

Imagine the scenario, Once you developed your machine learning model that you believe, successfully identifying the cats and dogs but how do you know this is the best result?

we are looking for the metrics or a function that we can use to optimize our model performance.The loss function tells how good your model is in predictions.

  • If the model predictions are closer to the actual values the Loss will be minimum.

  • if the predictions are totally away from the original values the loss value will be the maximum.

\[ Loss = abs(predict – actual) \]

On the basis of the Loss value, you can update your model until you get the best result.

Classification

Cross-Entropy or Log Loss

Cross entropy loss is used mostly when we have a binary classification problem; that is, where the network outputs either 1 or 0.

Suppose we are given a training dataset, \(\mathbb{D}=\left\{\left(x_{i}, y_{i}\right), \cdots,\left(x_{N}, y_{N}\right)\right\}\) and \(y_{i} \in\{0,1\}\).
We can then write this in the following form:

\[ \hat{y}_{i}=f\left(x_{i} ; \theta\right) \]

Here, \(\theta\) is the parameters of the network (weights and biases). We can express this in terms of a Bernoulli distribution, as follows:

\[ P\left(x_{i} \rightarrow y_{i} \mid \theta\right)=\hat{y}_{i}^{y_{i}}\left(1-\hat{y}_{i}\right)^{1-y_{i}} \]

The probability, given the entire dataset, is then as follows:

\[ P\left(x_{1}, \cdots, x_{N}, y_{1}, \cdots, y_{N}\right)=\prod_{i=1}^{N} P\left(x_{i} \rightarrow y_{i} \mid \theta\right)=\prod_{i=1}^{N} \hat{y}_{i}^{y_{i}}\left(1-\hat{y}_{i}\right)^{1-y_{i}} \]

If we take its negative-log likelihood, we get the following:

\[ -\log P\left(x_{1}, \cdots, x_{N}, y_{1}, \cdots, y_{N}\right)=-\log \prod_{i=1}^{N} \hat{y}_{i}^{y_{i}}\left(1-\hat{y}_{i}\right)^{1-y_{i}} \]

So, we have the following:

\[ L(\hat{y}, y)=-\sum_{i=1}^{N} y_{i} \log \hat{y}_{i}+\left(1-y_{i}\right) \log \left(1-\hat{y}_{i}\right) \]
Cross-entropy loss

Cross entropy loss is a metric used to measure how well a classification model in machine learning performs. The loss (or error) is measured as a number between 0 and 1, with 0 being a perfect model. The goal is generally to get your model as close to 0 as possible.

Cross entropy loss measures the difference between the discovered probability distribution of a machine learning classification model and the predicted distribution.

Evaluation Metrics

Evaluation metrics are used to evaluate the performance of a machine learning model. They provide a way to quantitatively measure how well the model is performing on a given task.

Classification

In machine learning, classification is a supervised learning problem in which the model is trained to predict the class of an input data point.

Note

It is important to choose an appropriate evaluation metric for your problem. For example, in a binary classification problem, you may be more interested in minimizing false negatives than false positives, in which case you would want to use a metric like recall rather than precision.

Confusion Matrix

Confusion matrix is a performance measurement for machine learning classification problem where output can be two or more classes.

In a confusion matrix, the rows represent the actual class labels, and the columns represent the predicted class labels.

Confusion Matrix

Here, TP (true positive) is the number of times the classifier predicted β€œpositive” and the actual label was β€œpositive”. FP (false positive) is the number of times the classifier predicted β€œpositive” and the actual label was β€œnegative”. FN (false negative) is the number of times the classifier predicted β€œnegative” and the actual label was β€œpositive”. TN (true negative) is the number of times the classifier predicted β€œnegative” and the actual label was β€œnegative”.

Here is an example of how to compute the values in a confusion matrix:

               Predicted
               Positive  Negative
Actual
Positive        10       5
Negative         3       2

The values in the confusion matrix can be used to compute various performance metrics, such as precision, recall, and accuracy.

Confusion Matrix
Calculate Confusion Matrix for a 2 classes problem
Confusion Matrix

Individual Number

1

2

3

4

5

6

7

8

9

10

11

12

Actual Classification

1

1

1

1

1

1

1

1

0

0

0

0

Predicted Classification

0

0

1

1

1

1

1

1

1

0

0

0

Result

FN

FN

TP

TP

TP

TP

TP

TP

FP

TN

TN

TN

Example from wikipedia.com

import torch
from sklearn import metrics
import matplotlib.pyplot as plt
import torchmetrics

# simulate a classification problem
y_true = torch.randint(0,2, (7,))
y_pred = torch.randint(0,2, (7,))

print(f"{y_true=}")
print(f"{y_pred=}")
print(f"confusion_matrix {metrics.confusion_matrix(y_true, y_pred)} ")

tn, fp, fn, tp = metrics.confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)

disp = metrics.ConfusionMatrixDisplay(confusion_matrix=metrics.confusion_matrix(y_true, y_pred) )
disp.plot()
plt.show()

print(f"{torchmetrics.functional.confusion_matrix(y_true, y_pred,task='binary')=}")
y_true=tensor([1, 0, 0, 0, 0, 1, 1])
y_pred=tensor([1, 0, 0, 0, 1, 0, 1])
confusion_matrix [[3 1]
 [1 2]] 
3 1 1 2
_images/4eb65605eacb8ae6d5451a706ad4471838ef3bb3054897fb342b51a95d4afb97.png
torchmetrics.functional.confusion_matrix(y_true, y_pred,task='binary')=tensor([[3, 1],
        [1, 2]])
Confusion Matrix
Precision

The above equation can be explained by saying, from all the classes we have predicted as positive, how many are actually positive. Precision should be high as possible.

\[ \text{precision} = \frac{\text{TP}}{\text{TP} + \text{FP}} \]
Recall / Sensitivity / True Positive Rate

Sensitivity tells us what proportion of the positive class got correctly classified.

The above equation can be explained by saying, from all the positive classes, how many we predicted correctly. A simple example would be to determine what proportion of the actual sick people were correctly detected by the model.

Recall should be high as possible.

\[ \text{recall} = \frac{\text{TP}}{\text{TP} + \text{FN}} \]

Note

Precision is about your prediction. Recall is about reality.

If your job is to identify thieves.

False Negative Rate

False Negative Rate (FNR) tells us what proportion of the positive class got incorrectly classified by the classifier.

A higher TPR and a lower FNR is desirable since we want to correctly classify the positive class.

\[ \text{F N R} = \frac{\text{FN}}{\text{TP} + \text{FN}} \]
Specificity / True Negative Rate

Specificity tells us what proportion of the negative class got correctly classified.

Taking the same example as in Sensitivity, Specificity would mean determining the proportion of healthy people who were correctly identified by the model.

\[ \text{Specificity} = \frac{\text{TN}}{\text{TN} + \text{FP}} \]
False Positive Rate

FPR tells us what proportion of the negative class got incorrectly classified by the classifier.

A higher TNR and a lower FPR is desirable since we want to correctly classify the negative class.

Out of these metrics, Sensitivity and Specificity are perhaps the most important and we will see later on how these are used to build an evaluation metric. But before that, let’s understand why the probability of prediction is better than predicting the target class directly.

\[ \text{F P R} = \frac{\text{FP}}{\text{FP} + \text{TN}} \]
Confusion Matrix

The metrics change with the changing threshold values. We can generate different confusion matrices and compare the various metrics that we discussed in the previous section. But that would not be a prudent thing to do. Instead, what we can do is generate a plot between some of these metrics so that we can easily visualize which threshold is giving us a better result.

The AUC-ROC curve solves just that problem!

print(f"{metrics.precision_score(y_true, y_pred)=}")
print(f"{metrics.recall_score(y_true, y_pred)=}")
print(f"{metrics.accuracy_score(y_true, y_pred)=}")
print(f"{metrics.f1_score(y_true, y_pred)=}")
print(f"{metrics.fbeta_score(y_true, y_pred, beta=0.5)=}")
metrics.precision_score(y_true, y_pred)=0.6666666666666666
metrics.recall_score(y_true, y_pred)=0.6666666666666666
metrics.accuracy_score(y_true, y_pred)=0.7142857142857143
metrics.f1_score(y_true, y_pred)=0.6666666666666666
metrics.fbeta_score(y_true, y_pred, beta=0.5)=0.6666666666666666
F1-score

The F1-score is the harmonic mean of the precision and the recall. Using the harmonic mean has the effect that a good F1-score requires both a good precision and a good recall.

\[ F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \]

It is difficult to compare two models with low precision and high recall or vice versa. So to make them comparable, we use F-Score. F-score helps to measure Recall and Precision at the same time. It uses Harmonic Mean in place of Arithmetic Mean by punishing the extreme values more.

Drawbacks

One potential drawback of the F1 score as an evaluation metric is that it is sensitive to imbalanced class distributions. This means that if one class is much more prevalent in the data than the other, the F1 score may not be a reliable indicator of the classifier’s performance.

For example, consider a binary classification problem where the positive class is rare, with only 1% of the samples belonging to that class. In this case, a classifier that simply predicts the negative class all the time would have an F1 score of 0, even though it is making the correct prediction 99% of the time. On the other hand, a classifier that makes a small number of correct predictions for the positive class (e.g., 5 out of 100) would have a relatively high F1 score, even though it is performing poorly overall.

ROC Curve and AUC
Best explaintion
  • https://www.analyticsvidhya.com/blog/2020/06/auc-roc-curve-machine-learning/

  • https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc

  • https://stephenallwright.com/metric-choice/

  • https://mlu-explain.github.io/

Ranking | Recommendation | Information Retrieval

None

Language Model

ROUGE

ROUGE is actually a set of metrics that are used to evaluate the quality of a text summarization system.

ROUGE-N Overlap of n-grams[2] between the system and reference summaries.

ROUGE-1 refers to the overlap of unigram (each word) between the system and reference summaries.

ROUGE-2 refers to the overlap of bigrams between the system and reference summaries.

Linear Algebra

Multiplying Matrices and Vectors

Multiplication of two A x B matrices a third matrix C.

\[C = AB\]

The product operation is defined by

\[C_{i,j} = \sum_{k=1} A_{i,k} B_{k,j}\]

The value at index i,j of result matrix C is given by dot product of ith row of Matrix A with jth column of Matrix

E.g

\[ \begin{align}\begin{aligned}\begin{split}A = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}\end{split}\\\begin{split}B = \begin{bmatrix} 2 \\ 6 \end{bmatrix}\end{split}\\C_{1,1} = \sum_{k=1} A_{1,k} B_{k,1} = 1*2 + 2*6 = 14\end{aligned}\end{align} \]

Hadamard product & element-wise product

Element wise of multiplication to generate another matrix of same dimension.

\[ \begin{align}\begin{aligned}C = A * B\\\begin{split}\begin{bmatrix} a_{11} & a_{12} & a_{13}\\ a_{21} & a_{22} & a_{23}\\ a_{31} & a_{32} & a_{33} \end{bmatrix} \circ \begin{bmatrix} b_{11} & b_{12} & b_{13}\\ b_{21} & b_{22} & b_{23}\\ b_{31} & b_{32} & b_{33} \end{bmatrix} = \begin{bmatrix} a_{11}\, b_{11} & a_{12}\, b_{12} & a_{13}\, b_{13}\\ a_{21}\, b_{21} & a_{22}\, b_{22} & a_{23}\, b_{23}\\ a_{31}\, b_{31} & a_{32}\, b_{32} & a_{33}\, b_{33} \end{bmatrix}.\end{split}\end{aligned}\end{align} \]

Dot product

The dot product for two vectors to generate scalar value.

\[a \cdot b=\sum_{i=1}^n {a}_i{b}_i={a}_1{b}_1+{a}_2{b}_2+\cdots+{a}_n{b}_n\]

Identity and Inverse Matrices

Identity Matrix

An identity matrix is a matrix that does not change any vector when we multiply that vector by that matrix.

\[\begin{split}I = \begin{bmatrix} 1 & 0 & 0 \\\\ 0 & 1 & 0 \\\\ 0 & 0 & 1 \end{bmatrix}\end{split}\]

When β€˜apply’ the identity matrix to a vector the result is this same vector:

\[ \begin{align}\begin{aligned}I \cdot v = v\\\begin{split}\begin{bmatrix} 1 & 0 & 0 \\\\ 0 & 1 & 0 \\\\ 0 & 0 & 1 \end{bmatrix} \times \begin{bmatrix} x_{1} \\\\ x_{2} \\\\ x_{3} \end{bmatrix}= \begin{bmatrix} 1 \times x_1 + 0 \times x_2 + 0\times x_3 \\\\ 0 \times x_1 + 1 \times x_2 + 0\times x_3 \\\\ 0 \times x_1 + 0 \times x_2 + 1\times x_3 \end{bmatrix}= \begin{bmatrix} x_{1} \\\\ x_{2} \\\\ x_{3} \end{bmatrix}\end{split}\end{aligned}\end{align} \]
Inverse Matrix

The inverse of a matrix \(A\) is written as \(A^{-1}\).

\[{A}^{-1}{A}={I}_n\]

A matrix \(A\) is invertible if and only if there exists a matrix \(B\) such that \(AB = BA = I\).

The inverse can be found using:

  • Gaussian elimination

  • LU decomposition

  • Gauss-Jordan elimination

Singular Matrix

A square matrix that is not invertible is called singular or degenerate. A square matrix is singular if and only if its determinant is zero

Norm

Norm is function which measure the size of vector.

  • Norms are non-negative values. If you think of the norms as a length, you easily see why it can’t be negative.

  • Norms are 0 if and only if the vector is a zero vector

  • The triangle inequality** \(u+v \leq u+v\)

The norm is what is generally used to evaluate the error of a model.

Example

\[ \begin{align}\begin{aligned}u= \begin{bmatrix} 1 & 6 \end{bmatrix}\\v= \begin{bmatrix} 4 & 2 \end{bmatrix}\\u+v = \sqrt{(1+4)^2+(6+2)^2} = \sqrt{89} \approx 9.43\\u+v = \sqrt{1^2+6^2}+\sqrt{4^2+2^2} = \sqrt{37}+\sqrt{20} \approx 10.55\end{aligned}\end{align} \]

The p-norm (also called \(\ell_p\)) of vector x. Let p β‰₯ 1 be a real number.

\[ \begin{align}\begin{aligned}\left\|x\right\|_p := \left( \sum_{i=1}^n \left|x_i\right|^p\right)^{1/p}\\\left\| x \right\| _p = \left( |x_1|^p + |x_2|^p + \dotsb + |x_n|^p \right) ^{1/p}\end{aligned}\end{align} \]
  • L1 norm, Where p = 1 \(\| x \|_1 = \sum_{i=1}^n |x_i|\)

  • L2 norm and euclidean norm, Where p = 2 \(\left\| x \right\|_2 = \sqrt{\sum_{i=1}^n x_i^2}\)

  • L-max norm, Where p = infinity

\[ \begin{align}\begin{aligned}\begin{split}u=\begin{bmatrix} 3 \\\\ 4 \end{bmatrix}\end{split}\\u =\sqrt{|3|^2+|4|^2} =\sqrt{25} =5\end{aligned}\end{align} \]
Frobenius norm

Sometimes we may also wish to measure the size of a matrix. In the context of deep learning, the most common way to do this is with the Frobenius norm.

The Frobenius norm is the square root of the sum of the squares of all the elements of a matrix.

\[ \begin{align}\begin{aligned}\|A\|_F = \sqrt{\sum_{i=1}^m \sum_{j=1}^n A_{ij}^2}\\\|A\|_\text{F} = \sqrt{\sum_{i=1}^m \sum_{j=1}^n |a_{ij}|^2}\end{aligned}\end{align} \]
The squared Euclidean norm

The squared L^2 norm is convenient because it removes the square root and we end up with the simple sum of every squared values of the vector.

The squared Euclidean norm is widely used in machine learning partly because it can be calculated with the vector operation \(x^Tx\). There can be performance gain due to the optimization

\[ \begin{align}\begin{aligned}\begin{split}x=\begin{bmatrix} 2 \\\\ 5 \\\\ 3 \\\\ 3 \end{bmatrix}\end{split}\\x^T=\begin{bmatrix} 2 & 5 & 3 & 3 \end{bmatrix}\\\begin{split}x^Tx=\begin{bmatrix} 2 & 5 & 3 & 3 \end{bmatrix} \times \begin{bmatrix} 2 \\\\ 5 \\\\ 3 \\\\ 3 \end{bmatrix}\\\\ &= 2\times 2 + 5\times 5 + 3\times 3 + 3\times 3= 47\end{split}\end{aligned}\end{align} \]

The Trace Operator

The sum of the elements along the main diagonal of a square matrix.

\[ \begin{align}\begin{aligned}\operatorname{Tr}(A) = \sum_{i=1}^n a_{ii} = a_{11} + a_{22} + \dots + a_{nn}\\\begin{split}A=\begin{bmatrix} 2 & 9 & 8 \\\\ 4 & 7 & 1 \\\\ 8 & 2 & 5 \end{bmatrix}\end{split}\\\mathrm{Tr}(A) = 2 + 7 + 5 = 14\end{aligned}\end{align} \]

Satisfies the following properties:

\[ \begin{align}\begin{aligned}\text{tr}(A) = \text{tr}(A^T)\\\text{tr}(A + B) = \text{tr}(A) + \text{tr}(B)\\\text{tr}(cA) = c\text{tr}(A)\end{aligned}\end{align} \]

Transpose

\[(A^T)_{ij} = A_{ji}\]

Satisfies the following properties:

\[(A+B)^T = A^T + B^T (AB)^T = B^TA^T (A^T)^{-1} = (A^{-1})^T\]

Diagonal matrix

A matrix where \(A_{ij} = 0\) if \(i \neq j\).

Can be written as \(\text{diag}(a)\) where \(a\) is a vector of values specifying the diagonal entries.

Diagonal matrices have the following properties:

\[ \begin{align}\begin{aligned}\text{diag}(a) + \text{diag}(b) = \text{diag}(a + b)\\\text{diag}(a) \cdot \text{diag}(b) = \text{diag}(a * b)\\\text{diag}(a)^{-1} = \text{diag}(a_1^{-1},...,a_n^{-1})\\\text{det}(\text{diag}(a)) = \prod_i{a_i}\end{aligned}\end{align} \]

Example

\[\begin{split}\begin{bmatrix} 1 & 0 & 0\\ 0 & 4 & 0\\ 0 & 0 & -3\\ 0 & 0 & 0\\ \end{bmatrix} or \begin{bmatrix} 1 & 0 & 0 & 0 & 0\\ 0 & 4 & 0& 0 & 0\\ 0 & 0 & -3& 0 & 0\end{bmatrix}\end{split}\]

The eigenvalues of a diagonal matrix are the set of its values on the diagonal.

Symmetric matrix

A square matrix \(A\) where \(A = A^T\).

\[\begin{split}\begin{bmatrix} 1 & 7 & 3 \\ 7 & 4 & 5 \\ 3 & 5 & 0 \end{bmatrix} = A^T = A\end{split}\]

Some properties of symmetric matrices are:

  • All the eigenvalues of the matrix are real.

Unit Vector

A unit vector has unit Euclidean norm.

\[ \begin{align}\begin{aligned}\|x\|_2 := \sqrt{x_1^2 + \cdots + x_n^2} = 1\\\begin{split}\begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix} = \sqrt{1^2 + 0^2 + 0^2} = 1\end{split}\end{aligned}\end{align} \]

Orthogonal Matrix or Orthonormal Vectors

Orthogonal Vectors

Two vector x and y are orthogonal if they are perpendicular to each other or dot product is equal to zero.

\[ \begin{align}\begin{aligned}\begin{split}x=\begin{bmatrix} 2\\\\ 2 \end{bmatrix}\end{split}\\\begin{split}y=\begin{bmatrix} 2\\\\ -2 \end{bmatrix}\end{split}\\\begin{split}x^Ty= \begin{bmatrix} 2 & 2 \end{bmatrix} \begin{bmatrix} 2\\\\ -2 \end{bmatrix}= \begin{bmatrix} 2\times2 + 2\times-2 \end{bmatrix}=0\end{split}\end{aligned}\end{align} \]
Orthonormal Vectors

when the norm of orthogonal vectors is the unit norm they are called orthonormal.

Orthonormal Matrix

Orthogonal matrices are important because they have interesting properties. A matrix is orthogonal if columns are mutually orthogonal and have a unit norm (orthonormal) and rows are mutually orthonormal and have unit norm.

An orthogonal matrix is a square matrix whose columns and rows are orthonormal vectors.

\[ \begin{align}\begin{aligned}A^\mathrm{T} A = A A^\mathrm{T} = I\\A^\mathrm{T}=A^{-1}\end{aligned}\end{align} \]

where AT is the transpose of A and I is the identity matrix. This leads to the equivalent characterization: matrix A is orthogonal if its transpose is equal to its inverse.

so orthogonal matrices are of interest because their inverse is very cheap to compute.

Property 1

A orthogonal matrix has this property: \(A^T A = I\).

\[ \begin{align}\begin{aligned}\begin{split}A=\begin{bmatrix} a & b\\\\ c & d \end{bmatrix} & A^T=\begin{bmatrix} a & c\\\\ b & d \end{bmatrix}\end{split}\\\begin{split}A^TA=\begin{bmatrix} a & c\\\\ b & d \end{bmatrix} \begin{bmatrix} a & b\\\\ c & d \end{bmatrix} = \begin{bmatrix} aa + cc & ab + cd\\\\ ab + cd & bb + dd \end{bmatrix}\\\\ &= \begin{bmatrix} a^2 + c^2 & ab + cd\\\\ ab + cd & b^2 + d^2 \end{bmatrix}\end{split}\\\begin{split} A^TA=\begin{bmatrix} 1 & ab + cd\\\\ ab + cd & 1 \end{bmatrix}\end{split}\\\begin{split} \begin{bmatrix} a & c \end{bmatrix} \begin{bmatrix} b\\\\ d \end{bmatrix} = ab+cd\end{split}\\\begin{split}\begin{bmatrix} a & c \end{bmatrix} \begin{bmatrix} b\\\\ d \end{bmatrix}=0\end{split}\\\begin{split}A^TA=\begin{bmatrix} 1 & 0\\\\ 0 & 1 \end{bmatrix}\end{split}\end{aligned}\end{align} \]

that the norm of the vector \(\begin{bmatrix} a & c \end{bmatrix}\) is equal to \(a^2+c^2\) (squared L^2). In addtion, we saw that the rows of A have a unit norm because A is orthogonal. This means that \(a^2+c^2=1\) and \(b^2+d^2=1\).

Property 2

We can show that if \(A^TA=I\) then \(A^T=A^{-1}\)

\[ \begin{align}\begin{aligned}(A^TA)A^{-1}={I}A^{-1}\\(A^TA)A^{-1}=A^{-1}\\A^TAA^{-1}=A^{-1}\\A^TI=A^{-1}\\A^T=A^{-1}\end{aligned}\end{align} \]

You can refer to [this question](https://math.stackexchange.com/questions/1936020/why-is-the-inverse-of-an-orthogonal-matrix-equal-to-its-transpose).

Sine and cosine are convenient to create orthogonal matrices. Let’s take the following matrix:

\[\begin{split}A=\begin{bmatrix} cos(50) & -sin(50)\\\\ sin(50) & cos(50) \end{bmatrix}\end{split}\]

Eigendecomposition

The eigendecomposition is one form of matrix decomposition (only square matrices). Decomposing a matrix means that we want to find a product of matrices that is equal to the initial matrix. In the case of the eigendecomposition, we decompose the initial matrix into the product of its eigenvectors and eigenvalues.

\[A v = \lambda v\]
Eigenvectors and eigenvalues

Now imagine that the transformation of the initial vector gives us a new vector that has the exact same direction. The scale can be different but the direction is the same. Applying the matrix didn’t change the direction of the vector. This special vector is called an eigenvector of the matrix. We will see that finding the eigenvectors of a matrix can be very useful. Imagine that the transformation of the initial vector by the matrix gives a new vector with the exact same direction. This vector is called an eigenvector of 𝐴 . This means that 𝑣 is a eigenvector of 𝐴 if 𝑣 and 𝐴𝑣 are in the same direction or to rephrase it if the vectors 𝐴𝑣 and 𝑣 are parallel. The output vector is just a scaled version of the input vector. This scalling factor is πœ† which is called the eigenvalue of 𝐴 .

\[ \begin{align}\begin{aligned}\begin{split}A=\begin{bmatrix} 5 & 1\\\\ 3 & 3 \end{bmatrix}\end{split}\\\begin{split}v=\begin{bmatrix} 1\\\\ 1 \end{bmatrix}\end{split}\\Av = \lambda v\\\begin{split}\begin{bmatrix} 5 & 1\\\\ 3 & 3 \end{bmatrix} \begin{bmatrix} 1\\\\ 1 \end{bmatrix}=\begin{bmatrix} 6\\\\ 6 \end{bmatrix}\end{split}\\\begin{split}6\times \begin{bmatrix} 1\\\\ 1 \end{bmatrix} = \begin{bmatrix} 6\\\\ 6 \end{bmatrix}\end{split}\end{aligned}\end{align} \]

which means that v is well an eigenvector of A. Also, the corresponding eigenvalue is lambda=6.

Another eigenvector of 𝐴 is

\[ \begin{align}\begin{aligned}\begin{split}v=\begin{bmatrix} 1\\\\ -3 \end{bmatrix}\end{split}\\\begin{split}\begin{bmatrix} 5 & 1\\\\ 3 & 3 \end{bmatrix}\begin{bmatrix} 1\\\\ -3 \end{bmatrix} = \begin{bmatrix} 2\\\\ -6 \end{bmatrix}\end{split}\\\begin{split}2 \times \begin{bmatrix} 1\\\\ -3 \end{bmatrix} = \begin{bmatrix} 2\\\\ -6 \end{bmatrix}\end{split}\end{aligned}\end{align} \]

which means that v is an eigenvector of A. Also, the corresponding eigenvalue is lambda=2.

Rescaled vectors if v is an eigenvector of A, then any rescaled vector sv is also an eigenvector of A. The eigenvalue of the rescaled vector is the same.

\[ \begin{align}\begin{aligned}\begin{split}3v=\begin{bmatrix} 3\\\\ -9 \end{bmatrix}\end{split}\\\begin{split}\begin{bmatrix} 5 & 1\\\\ 3 & 3 \end{bmatrix} \begin{bmatrix} 3\\\\ -9 \end{bmatrix} = \begin{bmatrix} 6\\\\ -18 \end{bmatrix} = 2 \times \begin{bmatrix} 3\\\\ -9 \end{bmatrix}\end{split}\end{aligned}\end{align} \]

We have well A X 3v = lambda v and the eigenvalue is still lambda = 2 .

Concatenating eigenvalues and eigenvectors

Now that we have an idea of what eigenvectors and eigenvalues are we can see how it can be used to decompose a matrix. All eigenvectors of a matrix 𝐴 can be concatenated in a matrix with each column corresponding to each eigenvector.

\[\begin{split}v=\begin{bmatrix} 1 & 1\\\\ 1 & -3 \end{bmatrix}\end{split}\]

The first column [ 1 1 ] is the eigenvector of 𝐴 with lambda=6 and the second column [ 1 -3 ] with lambda=2.

The vector \(\lambda\) can be created from all eigenvalues:

\[\begin{split}\lambda= \begin{bmatrix} 6\\\\ 2 \end{bmatrix}\end{split}\]

Then the eigendecomposition is given by

\[A=V\cdot diag(\lambda) \cdot V^{-1}\]

Converting eigenvalues and eigenvectors to a matrix A.

\[ \begin{align}\begin{aligned}\begin{split}V^{-1}=\begin{bmatrix} 0.75 & 0.25\\\\ 0.25 & -0.25 \end{bmatrix}\end{split}\\\begin{split}&V\cdot diag(\lambda) \cdot V^{-1}\\\\ &= \begin{bmatrix} 1 & 1\\\\ 1 & -3 \end{bmatrix} \begin{bmatrix} 6 & 0\\\\ 0 & 2 \end{bmatrix} \begin{bmatrix} 0.75 & 0.25\\\\ 0.25 & -0.25 \end{bmatrix}\end{split}\\\begin{split}\begin{bmatrix} 1 & 1\\\\ 1 & -3 \end{bmatrix} \begin{bmatrix} 6 & 0\\\\ 0 & 2 \end{bmatrix} = \begin{bmatrix} 6 & 2\\\\ 6 & -6 \end{bmatrix}\end{split}\\\begin{split}&\begin{bmatrix} 6 & 2\\\\ 6 & -6 \end{bmatrix} \begin{bmatrix} 0.75 & 0.25\\\\ 0.25 & -0.25 \end{bmatrix}\\\\ &= \begin{bmatrix} 6\times0.75 + (2\times0.25) & 6\times0.25 + (2\times-0.25)\\\\ 6\times0.75 + (-6\times0.25) & 6\times0.25 + (-6\times-0.25) \end{bmatrix}\\\\ &= \begin{bmatrix} 5 & 1\\\\ 3 & 3 \end{bmatrix}= A\end{split}\end{aligned}\end{align} \]
Real symmetric matrix

In the case of real symmetric matrices, the eigendecomposition can be expressed as

\[A = Q\Lambda Q^T\]

where \(Q\) is the matrix with eigenvectors as columns and \(\Lambda\) is \(diag(\lambda)\).

\[\begin{split}A=\begin{bmatrix} 6 & 2\\\\ 2 & 3 \end{bmatrix}\end{split}\]

This matrix is symmetric because \(A=A^T\). Its eigenvectors are:

\[ \begin{align}\begin{aligned}\begin{split}Q=\begin{bmatrix} 0.89442719 & -0.4472136\\\\ 0.4472136 & 0.89442719 \end{bmatrix}\end{split}\\\begin{split}\Lambda=\begin{bmatrix} 7 & 0\\\\ 0 & 2 \end{bmatrix}\end{split}\\\begin{split} Q\Lambda&=\begin{bmatrix} 0.89442719 & -0.4472136\\\\ 0.4472136 & 0.89442719 \end{bmatrix} \begin{bmatrix} 7 & 0\\\\ 0 & 2 \end{bmatrix}\\\\ &= \begin{bmatrix} 0.89442719 \times 7 & -0.4472136\times 2\\\\ 0.4472136 \times 7 & 0.89442719\times 2 \end{bmatrix}\\\\ &= \begin{bmatrix} 6.26099033 & -0.8944272\\\\ 3.1304952 & 1.78885438 \end{bmatrix}\end{split}\\\begin{split}Q^T=\begin{bmatrix} 0.89442719 & 0.4472136\\\\ -0.4472136 & 0.89442719 \end{bmatrix}\end{split}\\\begin{split} Q\Lambda Q^T&=\begin{bmatrix} 6.26099033 & -0.8944272\\\\ 3.1304952 & 1.78885438 \end{bmatrix} \begin{bmatrix} 0.89442719 & 0.4472136\\\\ -0.4472136 & 0.89442719 \end{bmatrix}\\\\ &= \begin{bmatrix} 6 & 2\\\\ 2 & 3 \end{bmatrix}\end{split}\end{aligned}\end{align} \]

Singular Value Decomposition

The eigendecomposition can be done only for square matrices. The way to go to decompose other types of matrices that can’t be decomposed with eigendecomposition is to use Singular Value Decomposition (SVD).

SVD decompose 𝐴 into 3 matrices.

\(A = U D V^T\)

U,D,V where U is a matrix with eigenvectors as columns and D is a diagonal matrix with eigenvalues on the diagonal and V is the transpose of U.

The matrices U,D,V have the following properties:

  • U and V are orthogonal matrices U^T=U^{-1} and V^T=V^{-1}

  • D is a diagonal matrix However D is not necessarily square.

  • The columns of U are called the left-singular vectors of A while the columns of V are the right-singular vectors of A.The values along the diagonal of D are the singular values of A.

Intuition

I think that the intuition behind the singular value decomposition needs some explanations about the idea of matrix transformation. For that reason, here are several examples showing how the space can be transformed by 2D square matrices. Hopefully, this will lead to a better understanding of this statement: 𝐴 is a matrix that can be seen as a linear transformation. This transformation can be decomposed in three sub-transformations: 1. rotation, 2. re-scaling, 3. rotation. These three steps correspond to the three matrices π‘ˆ , 𝐷 , and 𝑉.

SVD and eigendecomposition

Now that we understand the kind of decomposition done with the SVD, we want to know how the sub-transformations are found. The matrices π‘ˆ , 𝐷 and 𝑉 can be found by transforming 𝐴 in a square matrix and by computing the eigenvectors of this square matrix. The square matrix can be obtain by multiplying the matrix 𝐴 by its transpose in one way or the other:

π‘ˆ corresponds to the eigenvectors of 𝐴𝐴^T 𝑉 corresponds to the eigenvectors of 𝐴^T𝐴 𝐷 corresponds to the eigenvalues 𝐴𝐴^T or 𝐴^T𝐴 which are the same.

The Moore-Penrose Pseudoinverse

We saw that not all matrices have an inverse because the inverse is used to solve system of equations. The Moore-Penrose pseudoinverse is a direct application of the SVD. the inverse of a matrix A can be used to solve the equation Ax=b.

\[A^{-1}Ax=A^{-1}b I_nx=A^{-1}b x=A^{-1}b\]

But in the case where the set of equations have 0 or many solutions the inverse cannot be found and the equation cannot be solved. The pseudoinverse is \(A^+\) where \(A^+\) is the pseudoinverse of \(A\).

\[ \begin{align}\begin{aligned}AA^+\approx I_n\\|| AA^+-I_n ||\end{aligned}\end{align} \]

The following formula can be used to find the pseudoinverse:

\[A^+= VD^+U^T\]

Principal Components Analysis (PCA)

The aim of principal components analysis (PCA) is generaly to reduce the number of dimensions of a dataset where dimensions are not completely decorelated.

Describing the problem

The problem can be expressed as finding a function that converts a set of data points from ℝ𝑛 to ℝ𝑙 . This means that we change the number of dimensions of our dataset. We also need a function that can decode back from the transformed dataset to the initial one.

_images/principal-components-analysis-PCA-change-coordinates.png

The first step is to understand the shape of the data. π‘₯(𝑖) is one data point containing 𝑛 dimensions. Let’s have π‘š data points organized as column vectors (one column per point):

\(x=\begin{bmatrix} x^{(1)} x^{(2)} \cdots x^{(m)} \end{bmatrix}\)

If we deploy the n dimensions of our data points we will have:

\[\begin{split}x=\begin{bmatrix} x_1^{(1)} & x_1^{(2)} & \cdots & x_1^{(m)}\\\\ x_2^{(1)} & x_2^{(2)} & \cdots & x_2^{(m)}\\\\ \cdots & \cdots & \cdots & \cdots\\\\ x_n^{(1)} & x_n^{(2)} & \cdots & x_n^{(m)} \end{bmatrix}\end{split}\]

We can also write:

\[\begin{split}x=\begin{bmatrix} x_1\\\\ x_2\\\\ \cdots\\\\ x_n \end{bmatrix}\end{split}\]

c will have the shape:

\[\begin{split}c=\begin{bmatrix} c_1\\\\ c_2\\\\ \cdots\\\\ c_l \end{bmatrix}\end{split}\]
Adding some constraints: the decoding function

The encoding function f(x) transforms x into c and the decoding function transforms back c into an approximation of x. To keep things simple, PCA will respect some constraints:

Constraint 1

The decoding function has to be a simple matrix multiplication:

$$g(c)=Dc$$

By applying the matrix D to the dataset from the new coordinates system we should get back to the initial coordinate system.

Constraint 2

The columns of D must be orthogonal.

Constraint 3

The columns of D must have unit norm.

Finding the encoding function

For now we will consider only one data point. Thus we will have the following dimensions for these matrices (note that x and c are column vectors)

_images/principal-components-analysis-PCA-encoding-function.png

We want a decoding function which is a simple matrix multiplication. For that reason, we have g(c)=Dc.

We will then find the encoding function from the decoding function. We want to minimize the error between the decoded data point and the actual data point.

With our previous notation, this means reducing the distance between x and g(c). As an indicator of this distance, we will use the squared L^2 norm.

\[||x-g(c)||_2^2\]

This is what we want to minimize. Let’s call \(c^*\) the optimal c. Mathematically it can be written:

$$ c^* = argmin ||x-g(c)||_2^2 $$

This means that we want to find the values of the vector c such that \(||x-g(c)||_2^2\) is as small as possible.

the squared \(L^2\) norm can be expressed as:

$$ ||y||_2^2 = y^Ty $$

We have named the variable y to avoid confusion with our x. Here \(y=x - g(c)\) Thus the equation that we want to minimize becomes:

$$ (x - g(c))^T(x - g(c)) $$

Since the transpose respects addition we have:

$$ (x^T - g(c)^T)(x - g(c)) $$

By the distributive property we can develop:

$$ x^Tx - x^Tg(c) - g(c)^Tx + g(c)^Tg(c) $$

The commutative property tells us that \(x^Ty = y^Tx\). We can use that in the previous equation: we have \(g(c)^Tx = x^Tg(c)\). So the equation becomes:

$$ x^Tx -x^Tg(c) -x^Tg(c) + g(c)^Tg(c) = x^Tx -2x^Tg(c) + g(c)^Tg(c) $$

The first term \(x^Tx\) does not depends on c and since we want to minimize the function according to c we can just get off this term. We simplify to:

$$ c^* = arg min -2x^T g(c) + g(c)^T g(c) $$

Since \(g(c)=Dc\):

$$ c^* = arg min -2x^T Dc + (Dc)^T Dc $$

With \((Dc)^T = c^T D^T\), we have:

$$ c^* = arg min -2x^T Dc + c^T D^T Dc $$

As we knew, \(D^T D= I_l\) because D is orthogonal. and their columns have unit norm. We can replace in the equation:

$$ c^* = arg min -2x^T Dc + c^T I_l c $$

$$ c^* = arg min -2x^T Dc + c^T c $$

Minimizing the function

Now the goal is to find the minimum of the function βˆ’2π‘₯T𝐷𝑐+𝑐T𝑐. One widely used way of doing that is to use the gradient descent algorithm. The main idea is that the sign of the derivative of the function at a specific value of π‘₯ tells you if you need to increase or decrease π‘₯ to reach the minimum. When the slope is near 0 , the minimum should have been reached.

Its mathematical notation is βˆ‡π‘₯𝑓(π‘₯).

Here we want to minimize through each dimension of c. We are looking for a slope of 0.

Statistics

Mean, Variance and Standard Deviation

Mean

The mean of a vector, usually denoted as \(\mu\) , is the mean of its elements, that is to say the sum of the components divided by the number of components

\[\bar{x} = \mu = \frac{1}{n} \sum_{i=1}^n x_i\]
Variance

The variance is the mean of the squared differences to the mean.

\[var(x) = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2\]

with \(var(x)\) being the variance of the variable \(x\), \(n\) the number of data samples, \(x_i\) the ith data sample and \(\bar{x}\) the mean of \(x\).

Standard Deviation

The standard deviation is simply the square root of the variance. It is usually denoted as \(\sigma\):

\[\sigma(x) = \sqrt{\frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2}\]

We square root the variance to go back to the units of the observations.

Both the variance and the standard deviation are dispersion indicators: they tell you if the observations are clustered around the mean.

Note also that the variance and the standard deviation are always positive (it is like a distance, measuring how far away the data points are from the mean):

\[ \begin{align}\begin{aligned}var(x) \geq 0\\\sigma(x) \geq 0\end{aligned}\end{align} \]

Covariance and Correlation

\[cov(x, y) = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})\]
Correlation

The correlation, usually refering to the Pearson’s correlation coefficient, is a normalized version of the covariance. It is scaled between -1 and 1

\[corr(x, y) = \frac{cov(x, y)}{\sigma_x \sigma_y}\]

Indices and tables