Welcome to my Machine Learning notesο
I am Ikram Ali, I specialize in building ML engineering and data science teams from the ground up. My current passion is NLP and MLOps
I approach my career by purposefully building domain knowledge in all the cross-functional disciplines required to deliver successful Data Science projects. This includes Research, Data Engineering, Machine Learning Engineering, Management, as well as dabbling in Agile Program Management and Product Management within other roles. This allows me to lead cross-functional teams and know the pain points of bringing a model from ideation to production. For further information regarding my credentials, please refer to my LinkedIn and Github profiles
I would like to offer concise definitions and comprehensible explanations of Machine Learning and Deep Learning.
Contentsο
ML Notation / Equationsο
Notationο
Symbol |
Formula |
Explained |
---|---|---|
\(\mu\) |
\(\sum_{x} k P(X=x) = \int_{-\infty}^{\infty} x f(x) d x\) |
|
\(V(X)\) or \(\sigma^2\) |
\(E[(X - E[X])^2] = E[(X - \mu)^2] = E[X^2] - E[X]^2\) |
|
\(\sigma\) |
\(\sqrt{V(X)}\) |
Standard deviation |
\(Cov(X,Y)\) |
Covariance of X and Y |
Covariance of X and Y |
\(\bar{X}\) |
The sample |
The sample mean is an average value |
\(\delta\) |
\(\delta(v)\) |
Activation fucntions, sigmoid, relu, etc. |
Equationsο
Cosine Similarityο
Cosine similarity is a metric used to measure the similarity between two vectors in a multi-dimensional space. Cosine similarity measures the cosine of the angle between two non-zero vectors in an n-dimensional space.
Formula = dot product / normalized sum of squares
Propertiesο
Scale Invariance Cosine similarity is scale-invariant, meaning it is not affected by the magnitude of the vectors, only by their orientations.
One hot and multi hot vectors easily.
import torch
from torch.nn import functional as F
v1 = torch.tensor([0, 0, 1], dtype=torch.float32)
v2 = torch.tensor([0, 1, 1],dtype=torch.float32)
print(F.cosine_similarity(v1, v2 , dim=0))
print(F.normalize(v1, dim=0) @ F.normalize(v2, dim=0))
print(torch.norm(v1) / torch.norm(v2))
print( torch.matmul(v1, v2.T) / ( torch.sqrt( torch.sum(v1 ** 2)) * torch.sqrt( torch.sum(v2 ** 2))) )
tensor(0.7071)
tensor(0.7071)
tensor(0.7071)
tensor(0.7071)
/tmp/ipykernel_1144/3833807425.py:10: UserWarning: The use of `x.T` on tensors of dimension other than 2 to reverse their shape is deprecated and it will throw an error in a future release. Consider `x.mT` to transpose batches of matrices or `x.permute(*torch.arange(x.ndim - 1, -1, -1))` to reverse the dimensions of a tensor. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:3637.)
print( torch.matmul(v1, v2.T) / ( torch.sqrt( torch.sum(v1 ** 2)) * torch.sqrt( torch.sum(v2 ** 2))) )
What is Probabilityο
Definitionο
Probability is the branch of mathematics that deals with the occurrence of a random event.
Probability is the measure of the likelihood of an event to happen.
Probability is the study of randomness and uncertainty. Probability theory is widely used in the area of studies such as statistics, finance, gambling, artificial intelligence, machine learning, computer science, game theory, and philosophy.

Applications of probabilityο
Some of the applications of probability are predicting results of the following events:
Minor
that a customer will buy milk if they are also buying bread.
Of getting at least 2 heads in 5 coin flips.
Getting 3 and 5 on throwing a die.
Pulling a green candy from a bag of red candies.
Winning a lottery 1 in many millions.
# of customers arriving at a bank in a week
Major
It is used for risk assessment and modelling in various industries
Weather forecasting or prediction of weather changes
Probability of a team winning in a sport based on players and strength of team
In the share market, chances of getting the hike of share prices
Probability Terminologyο
The first thing we do when we start thinking about the probability list a number of things that could possibly happen.
Some of the important probability terms are discussed here.
Sample Spaceο
Sample space of an experiment, denoted \(S\), is the set of all possible outcomes of an experiment or trial.
Suppose that we toss a die. Six numbers, from 1 to 6, can appear face up, but we do not yet know which one of them will appear. The sample space is S = {1,2,3,4,5,6}.
For tossing is a fair coin, the sample space is S = {H, T}
Image from byjus.com & medium.com
Experiment or Trialο
Experiment is any action or process that generates observations or outcomes.\ E.g. The tossing of a coin, selecting a card from a deck of cards, throwing a dice etc.
Outcome or Sample Pointο
An outcome is a possible result of an experiment or trial.
E.g. The outcome of tossing a coin is a head or a tail.
Roll a die, the outcome is a number between 1 and 6. Each of the six numbers is a sample point
Eventο
Event is any possible outcome, or combination of outcomes, of an experiment.
E.g. Getting a Head while tossing a coin is an event.
Cardinalityο
Cardinality of a sample space or an event, is the number of outcomes it contains. \(|S|\) represents the cardinality of the sample space.
Tossing a coin, \(|S|\) = 2, Rolling a die, \(|S|\) = 6
Flip a coin twice, S = {00,01,11,10} \(|S|\) = 4
Flip a coin until you get a tail. \(S=\{1,01,001,0001, \ldots\}\) \(|S| = \infty\)
Populationο
Those individuals or objects from which we want to acquire information or draw a conclusion.
E.g. All valves produced by a specific manufacturing plant.
All adult females in the United States.
All smokers
Sampleο
Most of the time, the population is so large, we can only collect data on a subset of it. We will call this our sample.
Sets and Subsetsο
A set is defined as a group of objects (i.e., sets can be made up of letters, numbers, names, etc.).
A subset is defined as a set within a set. set A is a subset of set B if and only if every element of A is also in B.
Empty Setο
The set that contains nothing, denoted \(\emptyset\).
Complementο
\(A^c\) = A complement. This is a shorthand way of saying when A does not occur. This set is made up of everything not in A.
Parameterο
Parameters are the unknown values of an entire population, such as the mean and standard deviation. Samples can estimate population parameters but their exact values are usually unknowable.
Interview Question
Q: What is the sample space of rolling Two Dice?
Ans: The total number of joint outcomes (a,b) is 6 times 6 which is 36.
Axioms of Probabilityο
- Axiom 1:
For any event, βAβ the probability of possible outcomes is either 0 or 1, where 0 is the event which never occurs, and 1 is the event will certainly occur. For any event \(A, 0 \leq P(A) \leq 1\).
- Axiom 2:
The sum of probabilities of all possible outcomes is 1.Probability of the sample space S is \(P(S)=1\).
- Axiom 3:
If \(A_{n}\) mutually exclusive events (intersection of any two is the empty set) then \(P\left(\bigcup_{i = 1}^k A_n\right) = \sum_{k=1}^{n} P\left(A_{k}\right)\)
- Axiom 4:
The complement of any event A is the event that consists of all the outcomes that are not in A.
- Axiom 5:
If both A and B are independent, then the conditional probability that event B occurs given that event A has already occurred. P ( A and B) = P (A) P (B | A). This is called the General rule of multiplication.
Countingο
Despite the trivial name of this topic, be assured that learning to count is not as easy as it sounds.
Naive Probabilityο
The probability of an event occurring, if the likelihood of each outcome is equal, is:
When we are working with probabilities, our notation will be P(A). this means the Probability that event A occurred. So, if A is the event of flipping heads in one flip of a fair coin, then P(A) = .5
This Naive Definition is a reasonable place to start, because itβs likely how you have calculated probabilities up to
this point. Of course, this is not always the correct approach for real world probabilities (hence the name naive
).
Multiplication Ruleο
To understand the Multiplication Rule, visualize a process that has multiple steps, where each step has multiple choices. For example, say that you are ordering a pizza.
Size (small, medium, or large)
Topping (pepperoni, meatball, sausage, extra cheese)
Order Type (delivery or pickup)
Using the multiplication rule, we can easily count the number of distinct pizzas that you could possibly order. Since there are 3 choices for size, 4 choices for toppings, and 2 choices for pickup.
we simply have 3 β 4 β 2 = 24 different pizza options.
Now that we have counted the total of number of possible pizzas, it is easy to solve various probability problems.
Interview Question
Q: What are the outcomes of flipping a fair coin and simultaneously rolling a fair die?
Ans: 6 x 2 = 12 outcomes.
Q: How many possible license plates could be stamped if each license plate were required to have exactly 3 letters
and 4 numbers?
Ans: 26 x 26 x 26 x 10 x 10 x 10 x 10 = 175,650,000
Factorialο
You may have used the factorial for simple arithmetic calculations.
Another use for the factorial function is to count how many ways you can choose things from a collection of things or find how many ways things can be arranged.
Exampleο
Counting the the number of ways to order the letters A, B, and C. We will define a specific arrangement or order as a permutation. You could likely figure this out by just writing out all of the permutations:
{ABC,ACB,BAC,BCA,CAB,CBA}
Itβs clear that there are 6 permutations. what if you had to do the same for all 26 letters in the alphabet? if you didnβt feel like writing out the 26 letters over and over and over, you could use the factorial for a more elegant solution.
the number of permutations when ordering A,B and C is 3!
3 β 2 β 1 = 6
Another example, In how many ways can 7 different books be arranged on a shelf?
We could use the Multiplication Principle to solve this problem. We have seven positions that we can fill with seven books. There are 7 possible books for the first position, 6 possible books for the second position, five possible books for the third position, and so on. The Multiplication Principle tells us therefore that the books can be arranged in:
Alternatively, we can use the simple rule for counting permutations. P = 7! = 5040
Python Solutionο
from math import factorial
print(factorial(3))
print(factorial(6))
6
720
Binomial Coefficientο
The binomial coefficient is a mathematical formula that counts the number of ways to choose k items from a collection of n items. This is perhaps the most useful counting tool. which in english is pronounced n choose x = \(\tbinom{n}{k}\).
With replacement
means the same item can be chosen more than once.
Without replacement
means the same item cannot be selected more than once.
Permutationο
Permutation relates to the act of arranging all the members of a set into some sequence or order.
Any ordered sequence of k objects taken from a set of n distinct objects is called a permutation of size k.
When selecting more than one item without replacement and order does matter
.
Exampleο
Suppose an organization has 60 members. One person is selected at random to be the president, another
person is selected as the vice-president, and a third is selected as the treasurer.
How many ways can this be done? (This would be the cardinality of the sample space.)
Combinationο
When selecting more than one item without replacement and order does not matter
.
Given n distinct objects, any unordered subset of size k of the objects is called a combination.
Exampleο
Suppose we have 60 people and want to choose a 3 person team (order is not important). How many combinations are possible?

Suppose we have the same 60 people, 35 are female and 25 are male. We need to select a committee of 11 people.
How many ways can such a committee be formed?
What is the probability that a randomly selected committee will contain at least 5 men and at least 5
women? (Assume each committee is equally likely.)
What is the probability of drawing the ace of spades twice in a row? (Assume that any card drawn on the first draw will
be put back in the deck before the second draw.)
You draw a card from a deck of cards. After replacing the drawn card back in the deck and shuffling thoroughly,
what is the probability of drawing the same card again?
Use $n \choose k$ to calculate the probability of throwing three heads in five coin tosses.
Twelve (12) patients are available for use in a research study. Only seven (7) should be assigned to receive the study
treatment. How many different subsets of seven patients can be selected?
Torch combinationsο
import torch
a = torch.tensor([1, 2, 3])
print(torch.combinations(a))
print(torch.combinations(a, r=3))
torch.combinations(a, with_replacement=True)
tensor([[1, 2],
[1, 3],
[2, 3]])
tensor([[1, 2, 3]])
tensor([[1, 1],
[1, 2],
[1, 3],
[2, 2],
[2, 3],
[3, 3]])
Difference Between Permutation and Combinationο
Permutation |
Combination |
---|---|
Order matters |
Order doesnβt matter |
Number of ways to arrange the elements of a set. |
Number of ways to choose k elements from a set of n elements. |
Arranging people, digits, numbers, alphabets, letters, and colours. |
Selection of menu, food, clothes, subjects, the team. |
Picking a President, VP and Waterboy from a group of 10. |
Picking a team of 3 people from a group of 10. |
Listing your 3 favorite desserts, in order, from a menu of 10. P(10,3) = 720. |
Choosing 3 desserts from a menu of 10. C(10,3) = 120. |
Sampling Tableο
Order Matters |
Order Doesnβt Matter |
|||
With Replacement |
\(n^k\) |
\({n+k-1 \choose k}\) |
||
Without Replacement |
\(\frac{n!}{k!(n-k)!}\) |
\(\binom nk\) |
Interview Questionsο
There are 25 students in a class. Find the number of ways in which a committee of 3 students is to be formed?
25 choose 3 2300In a meeting between two countries, each country has 12 delegates. All the delegates of one country shake hands with all delegates of the other country. Find the number of handshakes possible?
Total number of handshakes = 12 x 12 = 144How many groups of 6 persons can be formed from 8 men and 7 women?
Total number of person = 8 men + 7 women = 15
15 choose 6 = 5005
Bayes Theoremο
Definitionο
Bayes theorem is also known as the formula for the probability of causes.
Theorem states that the conditional probability of an event, based on the occurrence of another event, is equal to the likelihood of the second event given the first event multiplied by the probability of the first event.
Conditional Probabilityο
Two events A and B from the same sample space S
. Calculate the probability of event A knowing that event B has occurred.
B is the βconditioning eventβ. \(P(A|B)\)
Conditional Probability is \(P(A \mid B)=\frac{P(A \cap B)}{P(B)}, \quad P(B)>0\)
Multiplication Ruleο
This leads to the multiplication rule \(P(A \cap B) = P(B) P(A \mid B) = P(A) P(B \mid A)\)
Bayes Theoremο
Bayes Theorem \(P(A \mid B) = \frac{P(B \mid A)P(A)} {P(B)}\)
Exampleο
Bayes Theorem can detect spam e-mails.
Assume that the word offer occurs in 80% of the spam messages.
Also assume offer occurs in 10% of my desired e-mails.
Question
If 30% of the received e-mails are considered as a scam, and I will receive a new message which contains βofferβ, what is the probability that it is spam?
I received 100 e-mails.
I have 30 spam e-mails
70 desired e-mails.
The percentage of the word βofferβ that occurs in spam e-mails is 80%. It means 80% of 30 e-mail and it makes 24. Now, I know that 30 e-mails of 100 are spam and 24 of them contain βofferβ where 6 of them not contains βofferβ.
The percentage of the word βofferβ that occurs in the desired e-mails is 10%. It means 7 of them (10% of 70 desired e-mails) contain the word βofferβ and 63 of them not.

The question was what is the probability of spam where the mail contains the word βofferβ:
We need to find the total number of mails which contains βofferβ ; 24 +7 = 31 mail contain the word βofferβ
Find the probability of spam if the mail contains βofferβ ;
In 31 mails 24 contains βofferβ means 77.4% = 0.774 (probability)
NOTE: In this example, I choose the percentages which give integers after calculation. As a general approach, you can think that we have 100 units at the beginning so if the results are not an integer, it will not create a problem. Such that, we cannot say 15.3 e-mails but we can say 15.3 units.
Solution with Bayesβ Equation:
A = Spam
B = Contains the word βofferβ
P( contains offer|spam) = 0.8 (given in the question)
P(spam) = 0.3 (given in the question)
Now we will find the probability of e-mail with the word βofferβ. We can compute that by adding βofferβ in spam and desired e-mails. Such that;
P(contains offer) = 0.30.8 + 0.70.1 = 0.31
As it is seen in both ways the results are the same. In the first part, I solved the same question with a simple chart and for the second part, I solved the same question with Bayesβ theorem.
Law of Total Probabilityο
\(B=(B \cap A) \cup\left(B \cap A^{c}\right)\)
\(P(B)=P(B \cap A)+P\left(B \cap A^{c}\right)=P(B \mid A) P(A)+P\left(B \mid A^{c}\right) P\left(A^{c}\right)\)
Independence and Mutually Exclusive Eventsο
Two events are independent
if knowing the outcome of one event does not change the probability of the other.
Flip a two-sided coin repeatedly. Knowing the outcome of one flip does not change the probability of the next.
Two events, A and B, are independent if \(P(A|B) = P(A)\), or equivalently \(P(B|A) = P(B)\).
Recall:
\(P(A \mid B)=\frac{P(A \cap B)}{P(B)}\)
then, if A and B are independent, we get the multiplication rule for independent events:
\(P(A \cap B)=P(A) P(B)\)
Exampleο
Suppose your company has developed a new test for a disease. Let event A be the event that a randomly selected individual has the disease and, from other data, you know that 1 in 1000 people has the disease. Thus, P(A) = .001. Let B be the event that a positive test result is received for the randomly selected individual. Your company collects data on their new test and finds the following:
\(P(B|A)\) = .99
\(P(B^c |A)\) = .01
\(P(B|A^c )\) = .02
Calculate the probability that the person has the disease, given a positive test result. That is,
find \(P(A|B)\).
Random Variablesο
The first step to understand random variable is to do a fun experiment. Go outside in front of your house with a pen and paper. Take note of every person you pass and their hair color & height in centimeters. Spend about 10 minutes doing this.
Congratulations! You have conducted your first experiment! Now you will be able to answer some questions such as:
How many people walked past you?
Did many people who walked past you have blue hair?
How tall were the people who walked past you on average?
You pass 10 people in this experiment, 3 of whom have blue hair, and their average height may be 165.32 cm. In each of these questions, there was a number; a measurable quantity was attached.
Definitionο
A random variable rv is a real-valued function, whose domain is the entire sample space of an experiment.
Think of the domain as the set of all possible values that can go into a function. A function takes the domain/input,
processes it, and renders an output/range. This set of real values obtained from the random variable is called its
range
.
A random variable (rv) is a function that maps events (from the sample space S) to the real numbers. Itβs a function which performs the mapping of the outcomes of a random process to a numeric value.
The domain of a random variable is a sample space, which is represented as the collection of possible outcomes of a random event. For instance, when a coin is tossed, only two possible outcomes are acknowledged such as heads or tails.
Denoted byο
Random variables Denote by a capital letters near the end of the alphabet (e.g. X, Y ).
Note
Why is it called a random variable?
Because we think of it as a variable that take random value intuitively. Formally they are function.
Probability Distributionο
A Probability Distribution is a graph, table, or function that gives the probability for each value of the random variable.
Requirmentsο
The sum of the probabilities is 1. \(\sum f(x)=1 \).
Every probability \(p_i\) is a number between 0 and 1. \( 0 \leq f(x) \leq 1\)
Difference between random variables and probability distributionsο
A random variable is a numerical description of the outcome of a statistical experiment. The probability distribution for a random variable describes how the probabilities are distributed over the values of the random variable.
Types of Random Variablesο

- Discrete random variableο
A discrete random variable is a type of random variable that has a countable number of distinct values that can be assigned to it, such as in a coin toss.
- Continuous random variableο
A continuous random variable stands for any amount within a specific range or set of points and can reflect an infinite number of potential values, such as the average rainfall in a region.
Big Picture In statistics, we will model populations using random variables (e.g. mean, variance) of these random variables will tell us about the population we are studying.
Probability mass function (P.M.F)ο
The probability that a discrete random variable \(X\) takes on a particular value \(x\) that is \(P(X=x)\) is denoted by
Propertiesο
The probability mass function, \(P(X=x)=f(x)\), of a discrete random variable \(X\) is a function that satisfies the following properties:
All of the probabilities must be positive. \(P(X=x)=f(x)>0\), if \(x \in\) the support \(S\)
Sum of all probabilities of same sample space equals to 1. \(\sum_{x \in S} f(x)=1\)
\(P(X \in A)=\sum_{x \in A} f(x)\)
\(\text{Random variable}=X= \begin{cases} 1, & \text { if "Heads" } \\ 0, & \text { if "Tails" } \end{cases} = \begin{cases} P(X=1), & \text { if "Heads" } \\ P(X=0), & \text { if "Tails" } \end{cases}\)
\(PMF=f(x)=f_x(x)=P(X=x)= \begin{cases}1 / 2, & \text { if } x=0 \\ 1 / 2, & \text { if } x=1 \\ 0, & \text { otherwise }\end{cases}\)
Interview Question
Q: Let \(f(x)=c x^{2}\) for \(x=1,2,3\). Determine the constant \(c\) so that the function \(f(x)\) satisfies the conditions of being a probability mass function?
Answer: Using property no 2
Cumulative distribution function (CDF)ο
The cumulative distribution function (CDF or cdf) of the random variable X has the following definition:
Propertiesο
The cdf of random variable X has the following properties:
The cdf, \(F_{X}(t)\), ranges from 0 to 1 . This makes sense since \(F_{X}(t)\) is a probability.
If \(X\) is a discrete random variable whose minimum value is \(a\), then \(F_{X}(a)=P(X \leq a)=P(X=a)=f_{X}(a)\). If \(c\) is less than \(a\), then \(F_{X}(c)=0\).
If the maximum value of \(X\) is \(b\), then \(F_{X}(b)=1\).
Also called the distribution function.
Exampleο
Suppose X is a discrete random variable. Let the pmf of X be equal to
Suppose we want to find the cdf of \(X\). The cdf is \(F_{X}(t)=P(X \leq t)\).
For \(t=1, P(X \leq 1)=P(X=1)=f(1)=\frac{5-1}{10}=\frac{4}{10}\).
For \(t=2, P(X \leq 2)=P(X=1\) or \(X=2)=P(X=1)+P(X=2)=\frac{5-1}{10}+\frac{5-2}{10}=\frac{4+3}{10}=\frac{7}{10}\)
For \(t=3, P(X \leq 3)=\frac{5-1}{10}+\frac{5-2}{10}+\frac{5-3}{10}=\frac{4+3+1}{10}=\frac{9}{10}\).
For \(t=4, P(X \leq 4)=\frac{5-1}{10}+\frac{5-2}{10}+\frac{5-3}{10}+\frac{5-4}{10}=\frac{10}{10}=1\).
Probability density function (PDF)ο
X = f(x) is the probability density function of the continues random variable X.
f(x) = Curve under which area represent the probability \(P(a \leq X \leq b)=\int_{a}^{b} f(x) d x\)
Expected Value (Mean or Average)ο
The concept was first devised in the 17th century to analyze gambling games and answer questions such as:
How much do I gain - or lose - on average, if I repeatedly play a given gambling game?
How much can I expect to gain - or lose - by making a certain bet?
For example, if you play a game where you gain 2$ with probability 1/2 and you lose 1$ with probability 1/2, then the expected value of the game is half a dollar
it means that if you play this game many times, and the number of times each of the two possible outcomes occurs is proportional to its probability, then on average you gain 1/2$ each time you play the game.
Definitionο
The expected value or mean of a random variable is a weighted average of all possible outcomes. In the case of a continuum of possible outcomes, the expectation is defined by integration.
Denoted by \(\mu_x\) or \(E(X)\).
Exampleο
5 exams result : 70 +80 + 80 + 90 + 90
\(Avg = \frac{70+80+80+90+90}{5} = \frac{1}{5}(70)+\frac{2}{5}(80)+\frac{2}{5}(90) = 82.5 \)
Let X represent the outcome of a roll of a fair six-sided die. The possible values for X are 1, 2, 3, 4, 5, and 6, all of which are equally likely with a probability of \(1/6\) The Expected Value of X is
\(E[X] = 1\cdot\frac16 + 2\cdot\frac16 + 3\cdot\frac16 + 4\cdot\frac16 + 5\cdot\frac16 + 6\cdot\frac16 = (1+2+3+4+5+6) / 6= 3.5\)
x |
1 |
2 |
3 |
---|---|---|---|
P(X=x) |
1/4 |
1/4 |
1/2 |
\(E[X] =(1)(1 / 4)+(2)(1 / 4)+(3)(1 / 2) = 9/4 = 2.25 = \sum_{x} x P(X=x)\)
Imagine a game in which, on any play, a player has a 20% chance of winning \(3 and an 80% chance of losing \)1. The probability mass function of the random variable , the amount won or lost on a single play is:
so the average amount won (actually lost, since it is negative)
In the long run you guaranteed to lose no more than 20 cents.
Pytorch implementationο
import torch
# Create a tensor
T = torch.Tensor([2.453, 4.432, 0.754, -6.554])
print("T:", T)
# Compute the mean and standard deviation
mean = torch.mean(T)
print("mean:", mean)
T: tensor([ 2.4530, 4.4320, 0.7540, -6.5540])
mean: tensor(0.2713)
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(style="darkgrid")
data = torch.randn(25)
print(data)
print("Mean :", torch.mean(data))
_ =sns.displot(data,kde=True, )
plt.axvline(torch.mean(data), color='green')
plt.show()
tensor([ 0.8597, -0.6286, -1.7821, -0.4426, 1.1676, -0.2283, -1.2379, -1.2242,
-0.7144, 1.3822, 0.9386, 0.4657, -1.2674, 0.6671, 0.3613, -1.8003,
-0.3120, -1.0845, -1.9948, -0.5006, 0.2419, -1.2459, 0.3558, 1.7665,
-0.3110])
Mean : tensor(-0.2627)

Propertiesο
Expectation is a linear operator, which means for our purposes it has a couple of nice properties.
Expected value of a constantο
A perhaps obvious property is that the expected value of a constant is equal to the constant itself.
Scalar multiplication of a random variableο
If X is a random variable and a is a constant, then
Expectation of a product of random variablesο
Let X and Y be two random variables. In general, there is no easy rule or formula for computing the expected value of their product. However, if X and Y are statistically independent, then
Expectation of a sum of random variablesο
If random variables is functionο
\(E(a X+b)=\sum_{k}(a X+b) P(X=k)\)
\(E(a X+b)= a \sum_{k} k P(X=k)+b \sum_{k} P(X=k)\)
\(E(a X+b)= a E(x) + b * 1 = a E(x) + b\)
Law of the Unconscious Statisticianο
IF X with pdf \(f_x(x)\) and g is a function Find π€[π(π·)]
Let Y=g(X). The pdf for Y is:
\(f_{Y}(y)=f_{X}\left(g^{-1}(y)\right) \cdot\left|\frac{d}{d y} g^{-1}(y)\right| = \text { So, } E[g(X)]=E[Y]=\int_{-\infty}^{\infty} y \cdot f_{Y}(y) d y\)
\(=\int_{-\infty}^{\infty} y \cdot f_{x}\left(g^{-1}(y)\right) \cdot\left|\frac{d}{d y} g^{-1}(y)\right| d y\)
\(\text { Let } x=g^{-1}(y) \text {. Then } d x=\frac{d}{d y} g^{-1}(y) d y\)
\(E[g(X)]=\int_{-\infty}^{\infty} g(x) f_{X}(x)) d x\)
Varianceο
Measures how far we expect our random variable to be from the mean.
Measures of spread of a distribution.
Variance is a measure of dispersion.
Denoted byο
To better understand the definition of variance, we can break up its calculation in several steps:
Compute the expected value of \(X\), denoted by \(\mathrm{E}[X]\).
Construct a new random variable \(Y=X-\mathrm{E}[X]\) equal to the deviation of \(X\) from its expected value.
Take the square \( Y^{2}=(X-\mathrm{E}[X])^{2} \) which is a measure of distance of \(X\) from its expected value (the further \(X\) is from \(\mathrm{E}[X]\), the larger \(\left.Y^{2}\right)\)
Finally, compute the expectation of \(Y^{2}\) to know the average distance:
From these steps we can easily see thatο
variance is always positive because it is the expected value of a squared number.
the variance of a constant variable \(X\) (i.e., a variable that always takes on the same value) is zero; in this case, we have that \(X=\mathrm{E}[X], Y^{2}=0\) and \(\mathrm{E}\left[Y^{2}\right]=0\)
the larger the distance \(Y^{2}\) is on average, the higher the variance.
For continuous rvο
If X is a continuous random variable, the variance is defined by the integral of the probability density function. \(V(X)=\int_{-\infty}^{\infty} (x - \mu_x)^2 f(x) d x\)
\(V(X)=\int_{-\infty}^{\infty} (x - \mu_x)^2 f(x) d x\)
\(= \int_{-\infty}^{\infty}\left(x^{2}-2 \mu_{x} x+\mu_{x}^{2}\right) f(x) d x\)
\(= \int_{-\infty}^{\infty}x^{2} f(x) d x - 2 \mu_{x} \int_{-\infty}^{\infty}x f(x) d x + \mu_{x}^{2} \int_{-\infty}^{\infty}f(x) d x\)
\(V(X) = E(X^2)-E(X)^2\)
Propertiesο
Addition to a constantο
Let \(a \in \mathbb{R}\) be a constant and let \(X\) be a random variable.
Thanks to the fact that \(\mathrm{E}[a+X]=a+\mathrm{E}[X]\) (by linearity of the expected value), we have
Multiplication by a constantο
Let \(a \in \mathbb{R}\) be a constant and let \(x\) be a random variable.
Thanks to the fact that \(E[a X]=a E[X]\) (by linearity of the expected value), we obtain
Find Var[aX] = ?
Let Y = aX. Then, \(\mu_y = E[Y] = E[aX] = E[a\mu_x] = aE[\mu_x] = aE[X]\)
==> \(Var[aX] = Var[Y] = Var[(Y - \mu_y)^2] = a^2 Var[(X - \mu_x)^2] = a^2 V(X)\)
For Function
\(V(g(X))= \begin{cases}\sum_{k}(g(k)-E(g(X)))^{2} P(X=k), & X \text { discrete } \\ \int_{-\infty}^{\infty}(g(x)-E(g(X)))^{2} f(x) d x, & X \text { continuc }\end{cases}\)
Find V(a X+b)ο
\(V(a X+b)=E[(a X+b-E(a X+b))^2]\)
\(= E[(a x+ \not{b} -a E(x)- \not{b})^2]\)
\(= E[(a^2 (x - E(x))^2]\)
\(= a^2 E[(x - E(x)^2] = a^2 V(x)\)
Variance measure the spread the data B shift the data but doest not affect the spread.
Find Var[aX]ο
Let Y=aX. Then
Find Var[X + Y]ο
We will see that this is true if X and Y are independent.
Need concept of βcovarianceβ.
Standard Deviationο
The standard deviation is the square root of the variance. \(\sigma_x = \sqrt{V(X)}\)
Indicator functionο
The indicator function of an event is a random variable that takes
value 1 when the event happens;
value 0 when the event does not happen.
Let A = Set of real numbers
Other definition
The indicator function of a subset A of a set X is a function.
\(\text{Indicator function}_{A}(X) = \mathbf{1}_A(x) =\begin{cases} 1, & \text { if } A \cap X \neq \emptyset \\ 0, & \text { otherwise }\end{cases}\)
Notation= \(\mathbb{1} _{A}(x)\)
Random Sampleο
A collection of random variables is independent and identically distributed if each random variable has the same probability distribution as the others and all are mutually independent.
Suppose that \(X_1, X_2, X_3, ..., X_n\) is a random sample from the Normal distribution with parameters \(\mu\) and \(sigma^2\). Mu and sigma are same for all random variables
Suppose that \(X_1, X_2, X_3, ..., X_n\) is a random sample from the gamma distribution with parameters \(alpha\) and \(\beta\).
Exampleο
A good example is a succession of throws of a fair coin: The coin has no memory, so all the throws are independent. And every throw is 50:50 (heads:tails), so the coin is and stays fair - the distribution from which every throw is drawn, so to speak, is and stays the same: identically distributed.
Independent and identically distributed random variables (IID)ο
Random Sample == IID
Discrete Distributionsο
A discrete distribution is a distribution of data in statistics that has discrete values. Discrete values are countable, finite, non-negative integers, such as 1, 10, 15, etc.
Bernoulli Distributionο
The Bernoulli distribution is a univariate discrete distribution used to model random experiments that have binary outcomes.
Bernoulli Random Variableο
A Bernoulli RV \(X \sim Bern(p)\) is a random variable that is either 0 or 1 with probability \(p\) or \(1-p\) respectively. Suppose that you perform an experiment with two possible outcomes: either success or failure.
Let X be a discrete random variable. \( x \in {0,1} \)
P.M.Fο
Using the indicator function notation
\(I_{A}(x)= \begin{cases}1, & \text { if } x \in A \\ 0, & \text { if } x \notin A\end{cases}\)
Mean (Expected Value)ο
The expected value of a Bernoulli random variable X is
Proof
Varianceο
The variance of a Bernoulli random variable X is
Proofο
\(E(X^2) =\sum_{k} k^2 P(X=k) = 1^2 * p = p\)
Geometric Distributionο
The geometric distribution is a discrete probability distribution that calculates the probability of the first success occurring during a specific trial.
The geometric distribution is the probability distribution of the number of failures we get by repeating a Bernoulli experiment until we obtain the first success.
Geometric Random Variableο
Definition 1
A geometric rv \(X \sim Geom(p)\) consists of
independent Bernoulli trials,
each with the same probability of success p or Failure (1-p),
repeated until the first success is obtained.
Definition 2
The geometric rv is the distribution of the number of trials needed to get the first success in repeated independent Bernoulli trials.
Example If we toss a coin until we obtain head, the number of tails before the first head has a geometric distribution.
Attention
It can also define as number of failures before first success.
Parameterο
The geometric distribution has one parameter, p = the probability of success for each trial. You denote the distribution as G(p), which indicates a geometric distribution with a success probability of p.
Usesο
Six in a series of die rolls?
Person to support a law during a repeated sampling for an interview?
Product to have a defect in a random sample from an assembly line?
Successful attempt for a project or task?
Propertiesο
Each trial is identical, and can result in a success or failure.
The probability of success, p, is constant from one trial to the next.
The trials are independent, so the outcome on any particular trial does not influence the outcome of any other trial.
Trials are repeated until the first success.
P.M.Fο
The sample space of geometric random variable is
Bernoulli trail success = 1 = P
Bernoulli trail failure = 0 = 1 - P
\(P(X=1)=p\)
\(P(X=2)=(1-p)p\)
\(P(X=3)=(1-p)(1-p)p\)
\(P(X=4)=(1-p)(1-p)(1-p)p\) = failure, failure, failure, Success
\(P(X=5)=(1-p)^{4}p\)
\(P(X=x)=(1-p)^{x-1}p\)
Mean (Expected Value)ο
The expected value of a geometric random variable X is
Varianceο
The expected value of a geometric random variable X is
Interview Question
Q: On each day we play a lottery in which the probability of winning is \(1%\).
What is the expected value of the number of days that will elapse before we win for the first time?
Answer: Each time we play the lottery, the outcome is a Bernoulli random variable (equal to 1 if we win), with parameter \(p=0.01\). Therefore, the number of days before winning is a geometric random variable with parameter \(p=0.01\). Its expected value is
Binomial Distributionο
The binomial distribution is a discrete probability distribution that calculates the probability an event will occur a specific number of times in a set number of opportunities.
Binomial Random Variableο
Definition 1
A binomial rv \(X \sim Bin(n,p)\) is a random variable that is the number of successes in n independent Bernoulli trials, each with probability p. The probability of success is p. The probability of failure is 1-p. The number of trials is n.
Definition 2
The binomial distribution is the distribution of the number of successes = X
in a fixed number = n
of
independent Bernoulli trials.
Parametersο
The binomial distribution has two parameters, n and p.
n: the number of trials.
p: the event or success probability.
Usesο
Use the binomial distribution when your outcome is binary. Binary outcomes have only two possible values that are mutually exclusive.
Six heads when you toss the coin ten times?
12 women in a sample size of 20?
Three defective items in a batch of 100?
Two flu infections over 20 years?
Propertiesο
Experiment is n trials (n is fixed in advance)
Trials are identical and result in a success or a failure (i.e. Bernoulli trials) with P(success) = p and P(failure) = 1 - p.
Trials are independent (outcome of one trial does not influence any other)
P.M.Fο
The sample space of binomial random variable is
\(f(x)=P(X=0)=P(\{00 \cdots 0\})=(1-p)^{n}\)
\(f(x)=P(X=1)=P(\{10 \cdots 0,0100 \ldots,0 \cdots 01\}) = n*p*(1-p)^{n-1}\)
\(f(x)=P(X=2)=P(\{11 \cdots 0,0110 \ldots,00 \cdots 11\}) = \binom{n}{2}p^2(1-p)^{n-2}\)
Explanation P(X=2):
Among n number of fixed trials, we have 2 bernoulli trials successes with probability P and
rest are failures bernoulli trails with probability (1-p). So, we need to choose 2 from n to get the exact probability
of success.
Where k = 1 (success) and n-k = 0 (failure).
Suppose n = 4
\(\mathrm{P}(X=3)=\mathrm{P}(\mathrm{SSSF} \text { or } \mathrm{SSFS} \text { or SFSS or FSSS })\)
Binomial Theoremο
\(\sum_{k=0}^n {n \choose k}p^{k}(1-p)^{n-k} = 1\)
Mean (Expected Value)ο
The expected value of a binomial random variable X is
Proof
RECALL: Bern(p) has expected value p. x1, x2 β¦ xn are independent bern p. so \(sum_{k=1}^n X_n = sum_{k=1}^n E[X_n] = n * p\)
Varianceο
The variance of a binomial random variable X is
\(V(X)= E(X^2) - E(X)^2 = n * p * (1-p)\)
Recall:
Bern(p) has variance p * (1-p).
Negative Binomial Distributionο
The negative binomial distribution is almost the same as a binomial distribution with one difference
Binomial distribution has a fixed number of trials.
Repeat independent Bernoulli trials until a total of r successes is obtained. The negative binomial random variable X counts the number of failures before the rth success.
Negative Binomial Random Variableο
The negative binomial rv \(X \sim NB(r,p)\) is the distribution of the number of trials = X
needed to get a
fixed number of successes = r
.
Propertiesο
The number of successes r is fixed in advance.
Trials are identical and result in a success or a failure (Bernoulli trials with P(success) = p and P(failure) = 1-p.
Trials are independent (outcome of one trial does not influence any other)
PMFο
\(S = \left\{\left(x_{1}, x_{2}, \ldots, x_{n}\right) \mid x_{i}\right. =\left\{\begin{array}{l} 1 \text { if } \text { success on ith trail } \\ 0 \text { if failure ith trail }\end{array}\right. and \sum_{i=1} = r\)
\(P(y=0)=P(\{11111\})=(p)^{5}\)
\(P(Y=1)=P(\{011111,101111,110111,111011,111101\}) = \binom{5}{4}p^5(1-p)^{5-4}\)
\(P(Y=2) = \binom{6}{4}p^5(1-p)^{5-4}\)
\(P(X = k) = \binom{k+r-1}{r-1} (1-p)^kp^r\)
Mean (Expected Value)ο
\(E(X)=\sum_{k} k P(X=k)\)
\(E(X)= \frac{r(1-p)}{p}\)
Varianceο
\(V(X)= \frac{r(1-p)}{p^2}\)
Relationship between Geometric and Negative Binomial rvο
\(X \sim Geom(p)\)
= Repeated, independent, identical, Bernoulli trails util first successes.
\(Y \sim NB(1,p)\)
= Count the number of failure until first success util first successes. =
\(\underbrace{}_{Failure} \underbrace{}_{Failure} success\)
Note:
Y = X - 1. then E(Y) = E(X) - 1 = 1/p - 1 = \(\frac{1-p}{p}\)
\(NB(r,p)\) = \(\underbrace{}_{Failure} \underbrace{}_{Failure} success \underbrace{}_{Failure} \underbrace{}_{Failure} success \underbrace{}_{Failure} \underbrace{}_{Failure} rth success\)
means we have stack geometric rv in a row rth time. thatβs why we multiply by r in expected value and variance in NB rv.
Poisson Distributionο
The Poisson distribution is a discrete probability distribution that describes probabilities for counts of events that occur in a specified observation space. It is named after SimΓ©on Denis Poisson.
Suppose that an event can occur several times within a given unit of time. When the total number of occurrences of the event is unknown, we can think of it as a random variable.
Poisson Random Variableο
A Poisson rv \(X \sim Poisson(\lambda)\) is a discrete rv that describes the total number of events that happen in a certain time period.
Parameterο
The Poisson distribution is defined by a single parameter, lambda (Ξ»),
which is the mean number of occurrences during an observation unit. A rate of occurrence is simply the mean count per standard observation period. For example, a call center might receive an average of 32 calls per hour.
Usesο
# of vehicles crossing a bridge in one day
# of gamma rays hitting a satellite per hour
# of cookies sold at a bake sale in one hour
# of customers arriving at a bank in a week
PMFο
A discrete random variable X has Poisson distribution with parameter (\(\lambda\) > 0) if the probability mass function of X is
which may also be written as
where
k is the number of occurrences (\(k = 0,1,2\dots\)) It could be zero because nothing happened in that time period.
e} is (e = 2.71828..)
While this pmf might appear to be highly structured, it really is the epitome of randomness. Imagine taking a 20 acre plot of land and dividing it into 1 square foot sections. (There are 871,200 sections!) Suppose you were able to scatter 5 trillion grass seeds on this land in a completely random way that does not favor one section over another. One can show that the number of seeds that fall into any one section follows a Poisson distribution with some parameter Ξ». More specifically, one can show that the Poisson distribution is a limiting case of the binomial distribution when n gets really large and p get really small. βSuccessβ here is the event that any given seed falls into one particular section. We then want to count the number of successes in 5 trillion trials.
In general, the Poisson distribution is often used to describe the distribution of rare events in a large population.
All probabilities sum to 1
\(\sum_{k=0}^{\infty} P(X=k)=\sum_{k=0}^{\infty} \frac{\lambda^{k}}{k !} e^{-\lambda}=e^{-\lambda} \sum_{k=0}^{\infty} \frac{\lambda^{k}}{k!} = e^{-\lambda} * e^{\lambda} = 1\)
Mean (Expected Value)ο
\(E(X)=\sum_{k=0}^{\infty} k P(X=k)=\sum_{k=0}^{\infty} k \frac{\lambda^{k}}{k !} e^{-\lambda}=\lambda \sum_{k=1}^{\infty} \frac{\lambda^{k-1}}{(k-1) !} e^{-\lambda} = \lambda\)
\(E\left(X^{2}\right)=\sum_{k=0}^{\infty} k^{2} P(X=k)=\sum_{k=0}^{\infty} k^{2} \frac{\lambda^{k}}{k !} e^{-\lambda}=\lambda(\lambda+1)^{e}\)
Varianceο
\(V(X)=E\left(X^{2}\right)-(E(X))^{2}=\lambda(\lambda+1)-\lambda^{2}=\lambda\)
Continuous Distributionsο
Definitionο
A random variable is continuous if possible values comprise either a single interval on the number line or a union of disjoint intervals. X = f(x) is the probability density function of the continues random variable X.
We model a continuous random variable with a curve f(x), called a probability density function (pdf).


Applicationsο
In the study of the ecology of a lake, a rv X could be the depth measurements at randomly chosen locations.
In a study of a chemical reaction, Y could be the concentration level of a particular chemical in solution.
In a study of customer service, W could be the time a customer waits for service.
f(x) represents the height of the curve at point x.
For continuous random variables probabilities are areas under the curve.
Attention
We canβt model continuous random variable using discrete rv method.
Propertiesο
The probability density function \(f:(-\infty, \infty) \rightarrow[0, \infty) \text{ so } f (x) \geq 0\).
\(P(-\infty<X<\infty)=\int_{-\infty}^{\infty} f(x) d x=1=P(S)\)
\(P(a \leq X \leq b)=\int_{a}^{b} f(x) d x\)
Note
\(P(X=a)=\int_{a}^{a} f(x) d x=0 \text { for all real numbers } a\)
Uniform rvο
Random variable \(X \sim U[a,b]\) has the uniform distribution on the interval [a, b] if its density function is
import torch
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import uniform
sns.set_theme(style="darkgrid")
# random numbers from uniform distribution
n = 10000
start = 10
width = 20
data_uniform = uniform.rvs(size=n, loc = start, scale=width)
ax = sns.displot(data_uniform,
bins=100,
kde=True)
ax.set(xlabel='Uniform Distribution ', ylabel='Frequency')
plt.show()

CDFο
Expected Value and Varianceο
Exampleο
For random variable \(X \sim U(0,23)\). Find P(2 < X < 18)
\(P(2 < X < 18) = (18-2)\cdot \frac 1 {23-0} = \frac {16}{23}\)
Exponential Distributionο
The exponential distribution is a continuous probability distribution that often concerns the amount of time until some specific event happens. It is a process in which events happen continuously and independently at a constant average rate. The exponential distribution has the key property of being memoryless.
Applicationsο
The family of exponential distributions provides probability models that are widely used in engineering and science disciplines to describe time-to-event data.
Time until birth
Time until a light bulb fails
Waiting time in a queue
Length of service time
Time between customer arrivals
the amount of money spent by the customer
Calculating the time until the radioactive particle decays
PDFο
The continuous random variable, say X is said to have an exponential distribution, if it has the following probability density function:
Ξ» is called the distribution rate.
import torch
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import expon
sns.set_theme(style="darkgrid")
data_expon = expon.rvs(scale=1,loc=0,size=1000)
ax = sns.displot(data_expon,
kde=True,
bins=100)
ax.set(xlabel='Exponential Distribution', ylabel='Frequency')
plt.show()

Expected Valueο
The mean of the exponential distribution is calculated using the integration by parts.
Varianceο
To find the variance of the exponential distribution, we need to find the second moment of the exponential distribution
Propertiesο
The most important property of the exponential distribution is the memoryless property. This property is also applicable to the geometric distribution.
Normal (Gaussian) Distributionο
It is often called Gaussian distribution, in honor of Carl Friedrich Gauss (1777-1855), an eminent German mathematician who gave important contributions towards a better understanding of the normal distribution.
Normal Random Variableο
A continuous random variable \(X \sim N(\mu,\sigma^2)\) has the normal distribution with parameters \(\mu\) and \(\sigma^2\) if its density is given by
A normal distribution is a distribution that is solely dependent on two parameters of the data set: mean and the standard deviation of the sample.

Attention
This characteristic of the distribution makes it extremely simple for statisticians and hence any variable that exhibits normal distribution is feasible to be forecasted with higher accuracy. Essentially, it can help in simplying the model.
Parametersο
Mu is a location parameter. If you change the value of Mu, the entire bell curve is going to slide around.
If you increase Sigma squared, itβs going to get fatter and therefore shorter because the total area is one, So if it gets fatter, it has to come down. If Sigma squared gets smaller, itβs going to get really tall and thin.
Propertiesο
Normal distribution is simple to explain: Why?
The reasons are:
The mean, mode, and median of the distribution are equal.
We only need to use the mean and standard deviation to explain the entire distribution.
f(x) is symmetric around \(x=\mu\) as a consequence, deviations from the mean having the same magnitude.
f(x) > 0 for all \(x\) and \(\int_{-\infty}^{\infty} f(x) dx = 1\).
\(\mu + \sigma\) and \(\mu - \sigma\) are inflection points on f(x).
Mean and median are equal; both are located at the center of the distribution.
Why is it so importantο
The normal distribution is extremely important because:
many real-world phenomena involve random quantities that are approximately normal
it plays a crucial role in the Central Limit Theorem, one of the fundamental results in statistics;
its great analytical tractability makes it very popular in statistical modelling.
The following variables are close to normally distributed variables:
Height of a population
Blood pressure of adult human
Position of a particle that experiences diffusion
Measurement errors
Residuals in regression
Shoe size of a population
Amount of time it takes for employees to reach home
A large number of educational measures
But how are so many variables approximately normally distributed?ο
Letβs consider that the height of a population is a random variable. We can take a sample of heights, plot its distribution and calculate the sample mean. When we repeat this experiment whilst we increase the number of samples then the mean of the samples will end up being very close to normality.
This is known as the Central Limit Theorem.
Probability Density Functionο
If we plot the normal distribution density function, itβs curve has the following characteristics:

Mean is the center of the curve. This is the highest point of the curve as most of the points are at the mean.
There is an equal number of points on each side of the curve. The center of the curve has the most number of points.
The total area under the curve is the total probability of all of the values that the variable can take.
The total curve area is therefore 100%

Approximately 68.2% of all of the points are within the range -1 to 1 standard deviation.
About 95.5% of all of the points are within the range -2 to 2 standard deviations.
About 99.7% of all of the points are within the range -3 to 3 standard deviations.
This allows us to easily estimate how volatile a variable is and given a confidence level, what its likely value is going to be. As an instance, in the grey bell-shaped curve above, there is a 68.2% chance that the value of the variable will be within 101β99.
Normal Probability Distribution Functionο

import torch
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import norm
sns.set_theme(style="darkgrid")
sample = torch.normal(mean = 8, std = 16, size=(1,1000))
sns.displot(sample[0], kde=True, stat = 'density',)
plt.axvline(torch.mean(sample[0]), color='red', label='mean')
plt.show()

norm.pdf returns a PDF value. The following is the PDF value when π₯=1, π=0, π=1.
We graph a PDF of the normal distribution using scipy, numpy and matplotlib. We use the domain of β4<π₯<4, the range of 0<π(π₯)<0.45, the default values π=0 and π=1. plot(x-values,y-values) produces the graph.
print(norm.pdf(x=1.0, loc=0, scale=1))
x = torch.arange(-4,4,0.001)
fig, ax = plt.subplots()
ax.set_title('N(0,$1^2$)')
ax.set_xlabel('x')
ax.set_ylabel('f(x)')
ax.plot(x, norm.pdf(x))
ax.set_ylim(0,0.45)
plt.show()
0.24197072451914337

A normal curve is smooth bell-shaped. It is symmetrical about the π₯=π and has a maximum point at π₯=π.
Normal distribution PDF with different standard deviationsο
Letβs plot the probability distribution functions of a normal distribution where the mean has different standard deviations.
scipy.norm.pdf has keywords, loc and scale. The location (loc) keyword specifies the mean and the scale (scale) keyword specifies the standard deviation.
fig, ax = plt.subplots()
x = torch.arange(-4,4,0.001)
stdvs = [1.0, 2.0, 3.0, 4.0]
for s in stdvs:
ax.plot(x, norm.pdf(x,scale=s), label='stdv=%.1f' % s)
ax.set_xlabel('x')
ax.set_ylabel('pdf(x)')
ax.set_title('Normal Distribution')
ax.legend(loc='best', frameon=True)
ax.set_ylim(0,0.45)
ax.grid(True)

Normal distribution PDF with different meansο
Letβs plot the probability distribution functions of a normal distribution where the mean has different values.
fig, ax = plt.subplots()
x = torch.linspace(-10,10,100)
means = [0.0, 1.0, 2.0, 5.0]
for mean in means:
ax.plot(x, norm.pdf(x,loc=mean), label='mean=%.1f' % mean)
ax.set_xlabel('x')
ax.set_ylabel('pdf(x)')
ax.set_title('Normal Distribution')
ax.legend(loc='best', frameon=True)
ax.set_ylim(0,0.45)
ax.grid(True)

The mean of the distribution determines the location of the center of the graph. As you can see in the above graph, the shape of the graph does not change by changing the mean, but the graph is translated horizontally.
A cumulative normal distribution functionο
The cumulative distribution function of a random variable X, evaluated at x, is the probability that X will take a value less than or equal to x. Since the normal distribution is a continuous distribution, the shaded area of the curve represents the probability that X is less or equal than x.
fig, ax = plt.subplots()
# for distribution curve
x= torch.arange(-4,4,0.001)
ax.plot(x, norm.pdf(x))
ax.set_title("Cumulative normal distribution")
ax.set_xlabel('x')
ax.set_ylabel('pdf(x)')
ax.grid(True)
# for fill_between
px=torch.arange(-4,1,0.01)
ax.set_ylim(0,0.5)
ax.fill_between(px,norm.pdf(px),alpha=0.5, color='g')
# for text
ax.text(-1,0.1,"cdf(x)", fontsize=20)
plt.show()

Expected Value and Varianceο
\(E(X) = \mu\)
\(V(X) = \sigma^2\)
Problems With Normalityο
Assuming normality has its own flaws. As an instance, we cannot assume that the stock price follows normal distribution as the price cannot be negative. Therefore the stock price potentially follows a log of the normal distribution to ensure it is never below zero.
We know that the daily returns can be negative, therefore the returns can at times follow a normal distribution. It is not wise to assume that the variable follows a normal distribution without any analysis.
A variable can follow Poisson, Student-t, or Binomial distribution as an instance and falsely assuming that a variable follows normal distribution can lead to inaccurate results.
Identify the Normal RV?ο
Many methods exist for testing whether a variable has a normal distribution
1. Histogramο
The histogram is a data visualization that shows the distribution of a variable. It is a bar graph that shows the frequency of each value in the variable. The histogram is a graphical representation of the distribution of a variable.
sample = torch.normal(mean = 8, std = 16, size=(1,1000))
sample2 = torch.distributions.uniform.Uniform(2,3).sample([1,1000])
sns.displot(sample[0], kde=True,).set(title='Normal Distribution')
plt.axvline(torch.mean(sample[0]), color='red', label='mean')
sns.displot(sample2[0], kde=True,).set(title='Uniform Distribution')
plt.show()


2. Box Plotο
The Box Plot is another visualization technique that can be used for detecting non-normal samples.
The box plot is a graphical representation of the distribution of a variable. It is a graphical representation of the five-number summary of a variable. The five-number summary is the minimum, first quartile, median, third quartile, and maximum of a variable.
3. QQ Plotο
QQ Plot stands for Quantile vs Quantile Plot, which is exactly what it does: plotting theoretical quantiles against the actual quantiles of our variable.
The QQ Plot allows us to see deviation of a normal distribution much better than in a Histogram or Box Plot.
Standard Normal Distributionο
The normal distribution with parameter values \(\mu\) = 0 and \(\sigma^2\) = 1 is called the standard normal distribution.
A rv with the standard normal distribution is denoted by \(Z \sim N(0, 1)\)
If \(X \sim N\left(\mu, \sigma^2\right)\) then
If \(Z \sim N(0,1)\) then
PDFο
\(f_{Z}(x)=\frac{1}{\sqrt{2 \pi}} e^{-x^{2} / 2} \text { for }-\infty<x<\infty\)
Cumulative distribution functionο
We use special notation to denote the cdf of the standard normal curve
\(F(z)=\Phi(z)=P(Z \leq z)=\int_{-\infty}^{z} \frac{1}{\sqrt{2 \pi}} e^{-x^{2} / 2} d x\)

Propertiesο
The standard normal density function is symmetric about the y axis.
The standard normal distribution rarely occurs naturally.
Instead, it is a reference distribution from which information about other normal distributions can be obtained via a simple formula.
The cdf of the standard normal, \(\Phi(z)\) can be found in tables and it can also be computed with a single command in R.
As weβll see, sums of standard normal random variables play a large role in statistical analysis.
Propositionο
\(\text { If } X \sim N\left(\mu, \sigma^{2}\right), \text { then } \frac{X-\mu}{\sigma} \sim N(0,1)\)
\(\frac{X-\mu}{\sigma}\) Shifted by \(\mu\) or (Centered at zero) and scaled by \(\frac{1}{\sigma}\) that will give us variance of 1.
\(\mathrm{Z} \sim \mathrm{N}(0,1) \Rightarrow \sigma \mathrm{Z}+\mu \sim \mathrm{N}\left(\mu, \sigma^2\right)\)
Exampleο
Let \(X \sim N(2,3)\)
Proving this propositionο
For any continuous random variable. Suppose we have Y rv, with Desnity function \(f_{Y}(y)\)
We know
\(P(y \leqslant a)=\int_{-\infty}^{a} f_{y}(y) d y\)
What if
\(P(2y \leqslant a)\)
= Canβt really use the density function until we isolate y =
\(P\left(y \leq \frac{a}{2}\right) = \int_{-\infty}^{a / 2} f_{y}(y) d y\)
This true for all transformation of Y.
With \(P\left(\frac{x-\mu}{\sigma} \leq a\right)=P(x \leq a \sigma+\mu) = \int_{x}^{a \sigma+\mu} \frac{1}{\sqrt{2 \pi} \sigma} e^{-(x-\mu)^{2} / 2 \sigma^{2}}\)
U subsitution
Let
\(u=\frac{x-\mu}{\sigma}\)
\(d u=\frac{1}{\sigma} d x\)
SO \(= \int_{-\infty}^{a} \frac{1}{\sqrt{2 \pi}} e^{-u^{2} / 2} d u\) This is density function for N(0,1).
Examplesο
Find P(X<2) when N(3, 2)?ο
In norm.cdf, the location (loc) keyword specifies the mean and the scale (scale) keyword specifies the standard deviation.
from scipy.stats import norm
val = norm.cdf(x=2, loc=3, scale=2)
print(f"P(X<2) = {val}")
fig, ax = plt.subplots()
x= torch.arange(-4,10,0.001)
ax.plot(x, norm.pdf(x,loc=3,scale=2))
ax.set_title("N(3,$2^2$)")
ax.set_xlabel('x')
ax.set_ylabel('pdf(x)')
ax.grid(True)
# for fill_between
px=torch.arange(-4,2,0.01)
ax.set_ylim(0,0.25)
ax.fill_between(px,norm.pdf(px,loc=3,scale=2),alpha=0.5, color='g')
# for text
ax.text(-0.5,0.02,round(val,2), fontsize=20)
plt.show()
P(X<2) = 0.3085375387259869

R code: pnorm(1.2)
Find P(X<4.1) when N(2, 3)?ο
Let \(X \sim N(2,3)\). Then
R Code: pnorm(1.21)
z_score <- (4.1 - 2) / sqrt(3)
pnorm(z_score)
Interval between variablesο
To find the probability of an interval between certain variables, you need to subtract cdf from another cdf.
Letβs find π(0.5<π<2) with a mean of 1 and a standard deviation of 2.
print(norm(1, 2).cdf(2) - norm(1,2).cdf(0.5))
0.2901687869569368
fig, ax = plt.subplots()
# for distribution curve
x= torch.arange(-6,8,0.001)
ax.plot(x, norm.pdf(x,loc=1,scale=2))
ax.set_title("N(1,$2^2$)")
ax.set_xlabel('x')
ax.set_ylabel('pdf(x)')
ax.grid(True)
px=torch.arange(0.5,2,0.01)
ax.set_ylim(0,0.25)
ax.fill_between(px,norm.pdf(px,loc=1,scale=2),alpha=0.5, color='g')
pro=norm(1, 2).cdf(2) - norm(1,2).cdf(0.5)
ax.text(0.2,0.02,round(pro,2), fontsize=20)
plt.show()

P(Z > 1.25) ?ο
Letβs plot a graph.
fig, ax = plt.subplots()
x= torch.arange(-4,4,0.01)
gr4sf=norm.sf(x=1.25, loc=0, scale=1)
print(gr4sf)
ax.plot(x, norm.pdf(x,loc=0,scale=1))
ax.set_title("N(0,$1^2$)")
ax.set_xlabel('x')
ax.set_ylabel('pdf(x)')
ax.grid(True)
px=torch.arange(1.25,4,0.01)
#ax.set_ylim(0,0.15)
ax.fill_between(px,norm.pdf(px,loc=0,scale=1),alpha=0.5, color='g')
ax.text(1.25,0.02,"sf(x) %.2f" %(gr4sf), fontsize=20)
plt.show()
0.10564977366685535

we can use sf which is called the survival function and it returns 1-cdf.
If X = N(1, 4), find P(0 < X < 3.2)?ο
print(norm(1, 2).cdf(3.2) - norm(1,2).cdf(0))
0.5557964003276304
fig, ax = plt.subplots()
# for distribution curve
x= torch.arange(-6,8,0.001)
ax.plot(x, norm.pdf(x,loc=1,scale=2))
ax.set_title("N(1,$2^2$)")
ax.set_xlabel('x')
ax.set_ylabel('pdf(x)')
ax.grid(True)
px=torch.arange(0.5,2,0.01)
ax.set_ylim(0,0.25)
ax.fill_between(px,norm.pdf(px,loc=1,scale=2),alpha=0.5, color='g')
pro=norm(1, 2).cdf(3.5) - norm(1,2).cdf(0)
ax.text(0.2,0.02,round(pro,2), fontsize=20)
plt.show()

Gamma Distributionο
The gamma distribution term is mostly used as a distribution which is defined as two parameters β shape parameter and inverse scale parameter, having continuous probability distributions. Its importance is largely due to its relation to exponential and normal distributions.
Gamma distributions have two free parameters, named as alpha (Ξ±) and beta (Ξ²), where;
Ξ± = Shape parameter
Ξ² = Rate parameter (the reciprocal of the scale parameter)
The scale parameter Ξ² is used only to scale the distribution. This can be understood by remarking that wherever the random variable x appears in the probability density, then it is divided by Ξ². Since the scale parameter provides the dimensional data, it is seldom useful to work with the βstandardβ gamma distribution, i.e., with Ξ² = 1.
Gamma function:ο
The gamma function \([10]\), shown by \(\Gamma( x )\), is an extension of the factorial function to real (and complex) numbers. Specifically, if \(n \in\{1,2,3, \ldots\}\), then
Probability Density Functionο
Meanο
Varianceο
Chi-squared Distributionο
One Parameter:
degrees of freedom: \(n \geq 1\) ( \(n\) is an integer) \(X \sim \chi^2(n)\) is defined as \(\Gamma\left(\frac{n}{2}, \frac{1}{2}\right)\)
meanο
varianceο
T-distributionο
Let \(Z \sim N(0,1)\) and \(W \sim \chi^2(n)\) be independent random variables. Define
the t-distribution Write \(X \sim t ( n )\) One Parameter: degrees of freedom: \(n \geq 1\) ( \(n\) is an integer) The pdf:
Joint Distributionsο
Discrete Definitionο
Given two discrete random variables, X and Y , p(x, y) = P(X = x, Y = y) is the joint probability mass function for X and Y .
Important property X and Y are independent random variables if P(X = x, Y = y) = P(X = x)P(Y = y) for all possible values of x and y.
\(f(x,y) = P(X=x \, and \, Y=y) = P(X=x,Y=y)\)

Sum of all marginal probabilities is equal to 1. ( P(y=0) + P(y=100) = P(y = 200) = 1 )
Sum of all joint probabilities is equal to 1.

Marginal Probabilitiesο


Exampleο
An insurance agency services customers who have both a homeownerβs policy and an automobile policy. For each type of policy, a deductible amount must be specified. For an automobile policy, the choices are $100 or $250 and for the homeownerβs policy, the choices are $0, $100, or $200.

Recall
Two events are independent if P(A and B) = P(A)P(B) for all possible values of A and B.
X and y are not independent.
Continuous Definitionο
Definition: If X and Y are continuous random variables, then f(x, y) is the joint probability density function for X and Y if \(P(a \leq X \leq b, c \leq Y \leq d)=\int_{a}^{b} \int_{c}^{d} f(x, y) d x d y\) for all possible $a, b, c$, and $d$ Important property: $X$ and $Y$ are independent random variables if $f(x, y)=f(x) f(y)$ for all possible values of $x$ and $y$.
Exampleο
Example: Suppose a room is lit with two light bulbs. Let \(X_{1}\) be the lifetime of the first bulb and \(X_{2}\) be the lifetime of the second bulb. Suppose \(X_{1} \sim {Exp}\left(\lambda_{1}=1 / 2000\right)\) and \(X_{2} \sim {Exp}\left(\lambda_{2}=1 / 3000\right)\). If we assume the lifetimes of the light bulbs are independent of each other, find the probability that the room is dark after 4000 hours.
\(E\left(X_{1}\right)=\frac{1}{\lambda_{1}}=2000 \mathrm{hrs}, E\left(X_{2}\right)=\frac{1}{\lambda_{2}}=3000 \mathrm{hrs}\).
Light bulbs function independently,so
Covariance and Correlationο
Covarianceο
The covariance between two rvβs, X and Y, is defined as
\(\operatorname{Cov}(X, Y)=E[(X-E(X))(Y-E(Y))] = E[(X- \mu_x))(Y- \mu_y)]\)
The covariance depends on both the set of possible pairs and the probabilities for those pairs.
If both variables tend to deviate in the same direction (both go above their means or below their means at the same time), then the covariance will be positive.
If the opposite is true, the covariance will be negative.
If X and Y are not strongly (linearly) related, the covariance will be near 0.

Computational formula for Covarianceο
\(\operatorname{Cov}(X, Y)=E[XY] -E[X]E[Y]\)
Correlation Coefficientο
The correlation Coefficient of X and Y , denoted by Cor(X, Y ) Represented by the Greek letter ββΟββ (rho)
\(Cor(X, Y) = \rho_{X,Y}= \frac{\operatorname{cov}(X,Y)}{\sigma_X \sigma_Y}\)
It represents a βscaledβ covariance. The correlation is always between -1 and 1.
Transformations of Distributionsο
Discrete Distributionsο
Suppose that π· βΌ π»ππ(π, π) What is the distribution of Y = n-X?
\(f(x)=P(X=x)= \binom{n}{x}p^x(1-p)^{n-x} \cdot I_{\{1,2,3, \ldots\}}(x)\)
Just do it:
Continuous Distributionsο
Invertible functionsο
In the most general sense, are functions that βreverseβ each other. For example, if f takes a to b, then the inverse, \(f^{-1}\) must take b to a. a function is invertible only if each input has a unique output. That is, each output is paired with exactly one input. That way, when the mapping is reversed, it will still be a function!

For X discrete or continuous, the cumulative distribution function (cdf) Is denoted by F(x) and is defined by
\(F(X)= P(X < x)\)



Estimators and Sampling Distributionsο
We have learned many different distributions for random variables and all of those distributions had parameters: the numbers that you provide as input when you define a random variable.
What if we donβt know the values of the parameters. What if instead of knowing the random variables, we have a lot of examples of data generated with the same underlying distribution? In this chapter we are going to learn formal ways of estimating parameters from data.
These ideas are critical for artificial intelligence. Almost all modern machine learning algorithms work like this
specify a probabilistic model that has parameters.
Learn the value of those parameters from data.
Both of these schools of thought assume that your data are independent and identically distributed (IID) samples.
Random Sampleο
A collection of random variables is independent and identically distributed if each random variable has the same probability distribution as the others and all are mutually independent.
Random Sample = \(X_1, X_2, X_3, ..., X_n\)
Suppose that \(X_1, X_2, X_3, ..., X_n\) is a random sample from the gamma distribution with parameters \(alpha\) and \(\beta\).
E.g
A good example is a succession of throws of a fair coin: The coin has no memory, so all the throws are βindependentβ. And every throw is 50:50 (heads:tails), so the coin is and stays fair - the distribution from which every throw is drawn, so to speak, is and stays the same: βidentically distributedβ.
Independent and identically distributed random variables (IID)ο
Random Sample == IID
Note
What are biased and unbiased estimators? A biased estimator is one that deviates from the true population value. An unbiased estimator is one that does not deviate from the true population parameter.
Parametersο
Before we dive into parameter estimation, first letβs revisit the concept of parameters. Given a model, the parameters are the numbers that yield the actual distribution.
In the case of a Bernoulli random variable, the single parameter was the value p.
In the case of a Uniform random variable, the parameters are the a and b values that define the min and max value.
we are going to use the notation \(\theta\) to be a vector of all the parameters.
Distribution |
Parameters |
---|---|
Bernoulli(p) |
\(\theta = p\) |
Poisson(Ξ») |
\(\theta = \lambda\) |
Uniform(a,b) |
\(\theta = (a,b)\) |
Normal |
\(\theta = (\mu,\sigma)\) |
\(Y = wX + b\) |
\(\theta = (w,b)\) |
Sampling Distributionsο
\(\theta\) will denote a generic parameter.
E.g
\(\theta = \mu , \theta = p , \theta = \lambda , \theta = (\alpha, \beta)\)
Estimator
\(\hat{\theta}\) = a Random variable,
\(\hat{\theta}=\bar{X}\)
Estimate
\(\hat{\theta}\) = a observed number
\(\hat{\theta}=\bar{x} = 42.5\)

We want our estimator of to be correct βon average.
\(\bar{X}\) is a random variable with its owo distribution and its own mean or expected value.
We would like sample mean \(π€[\bar{π·}] = ΞΌ\) to be close to the true mean or population mean \(ΞΌ\).
Important
If this is true, we say that \(\bar{π·}\) is an unbiased estimator of \(\mu\).
In general, \(\bar{\theta}\) is an unbiased estimator of \(\theta\). if \(E[\bar{\theta}] = \theta\).
Thatβs is really good thing.
Meanο
Let X1, X2, β¦, Xn be random sample from any distribution with mean \(\mu\).
That is \(E[X_i] = \mu\) for i = 1,2,3,β¦, n. Then
We have shown that, no matter what distribution we are working with, if the mean is \(\mu\) , \(\bar{X}\) is an unbiased estimator for \(\mu\).
Attention
We have shown that, no matter what distribution we are working with, if the mean \(\mu\) is , \(\bar{X}\) is an unbiased estimator for \(\mu\) .
Let X1, X2, β¦, Xn be random sample from any πΎππ(rate = \(\lambda\))
Let \(\bar{X}=\frac{1}{n} \sum_{i=1}^{n} X_{i}\) is the sample mean. We know, for the exponential distribution, that \(E[X_i]=\frac{1}{\lambda}\).
Then \(E[\bar{X}] = \frac{1}{\lambda}\)
Varianceο
Let X1, X2, β¦, Xn be random sample from any distribution with mean \(\mu\) and variance \(\sigma^2\).
We already know that \(\bar{X}\) is an unbiased estimator for \(\mu\) .
What can we say about the variance of \(\bar{X}\)?
\(Var[\bar{X}]=Var\left[\frac{1}{n} \sum_{i=1}^{n} X_{i}\right]= =\frac{1}{n^{2}} Var\left[\sum_{i=1}^{n} X_{i}\right] = =\frac{1}{n^{2}} \sum_{i=1}^{n} Var\left[X_{i}\right]\)
\(=\frac{1}{n^{2}} \sum_{i=1}^{n} \sigma^{2} = \frac{1}{n^{2}} n \sigma^{2} =\frac{\sigma^{2}}{\mathrm{n}}\)
Moments Generating Functionsο
We are still, believe it or not, trying to estimate things from a larger population based on a sample. For example, sample mean, or maybe the sum of the values in the sample etc. Any function of your data is known as a statistic. And weβre going to use them to estimate other things. And in order to figure out how well weβre doing, weβre going to need to know often the distributions of some of these statistics.
Distributions of sumsο
A lot of them depend on sums, so weβre going to start out by talking about the distribution of sums of random variables.
Suppose That,
Each X_i take value success (P) and failure (1-P). So summing all X_i is equal to sum of all success gives the value of Y. Which is binomial distribution.
Caution
Not all random variables are so easily interpreted by methods of Distributions of sums. So we need a tool.
Moment generating functionsο
The moments generating functions are the functions that generate the moments of a random variable. The expected values \(E(X), E\left(X^{2}\right), E\left(X^{3}\right), \ldots E\left(X^{r}\right)\) are called moments.
|
|
which are functions of moments. moment-generating functions can sometimes make finding the mean and variance of a random variable simpler.
Let X be a random variable. Itβs moment generating function (mgf) is denoted and defined as
- Continuous Random Variables:
\(M_{X}(t)=E\left[e^{t X}\right]=\int_{-\infty}^{\infty} e^{t x} f_{X}(x) d x\)
- Discrete Random Variables:
\(M_{X}(t)=E\left[e^{t X}\right]=\sum_{x} e^{t x} f_{x}(x)\)
where \(f_{X}(x)\) is the distribution of X.
Propertiesο
Moment generating functions also uniquely identify distributions.
MGT of Famous Distributionsο
Bernoulli(p)ο
Binomial(n,p)ο
\(X \sim bin(n, p)\)
- Binomial Theorem:
\((a + b)^n =\sum_{k=0}^{n}\binom{n}{k}a^k b^{n-k}\)
Finding Distributionsο
A moment-generating function uniquely determines the probability distribution of a random variable. if two random variables have the same moment-generating function, then they must have the same probability distribution.

We have just seen that the moment generating function of the sum. Is the moment generating function of one of them raised to the nth power.
Key pointsο
sum of n iid Bernoulli(p) random variables is bin(n, p)
sum of n iid exp(rate =lambda) random variables is Gamma(n, lambda)
sum of m iid bin(n,p) is bin(nm,p)
sum of n iid Gamma(alpha, beta) is Gamma(n alpha, beta)
sum of n iid \(N\left(\mu, \sigma^{2}\right) is N\left(n \mu, n \sigma^{2}\right)\).
sum of $n$ independent normal random variable with \(\mathrm{X}_{\mathrm{i}} \sim \mathrm{N}\left(\mu_{\mathrm{i}}, \sigma_{\mathrm{i}}^{2}\right)$ is $\mathrm{N}\left(\sum_{\mathrm{i}=1}^{\mathrm{n}} \mu_{\mathrm{i}}, \sum_{\mathrm{i}=1}^{\mathrm{n}} \sigma_{\mathrm{i}}^{2}\right)\)



Method of Moments Estimators(MMEs)ο
Method of moments means you set sample moments equal to population/theoretical moments.
It totally makes sense if youβre trying to estimate the mean or average out there in the entire population. That you should use the sample mean or sample average of the values in the sample, but what about parameters with not such an obvious interpretation?
Idea: Equate population and sample moments and solve for the unknown parameters.
Attention
Recall that the βmomentsβ of a distribution are defined as π€[π·], π€[π·π€], π€[π·π₯ ], β¦ These are distribution or βpopulationβ moments
\(\mu=E[X]\) is a probability weighted average of the values in the population.
\(\bar{X}\) is the average of the values in the sample.
It was natural for us to think about estimating $mu$ with the average in our sample.
\(\mathrm{E}\left[\mathrm{X}^{2}\right]\) is a probability weighted average of the squares of the values in the population.
It is intuitively nice to estimate it with the average of the squared values in the sample:
Egο
Maximum Likelihood Estimationο
Ideaο
Choose the value in the parameter space that makes the observed data βmost likelyβ.
Which of 0.2, 0.3, or 0.8 seems βmost likelyβ?
What if we only flip the coin twice? For i=1,2, let \(X_{i}=\begin{cases}1 & \text { if we get "Heads" on the ith flip } \\ 0, & \text { if we get "Tails" on the ith flip }\end{cases}\)
Let p=P(βHeadsβ on any one flip) Then X1, X2 βΌ Bernoulli(P) iid where π β {π’ . π€, π’ . π₯, π’ . πͺ}
Joint pmf Due to independence of the variables, we can write the joint pmf as
Tabulated values of the joint pmf

When we observe the data to be (0,0) i.e. (βTailsβ, βTailsβ), the value of p that gives the highest joint probability
(0.64) is 0.2
.When we observe the data to be (0,1) or (1,0) i.e. (βTailsβ, βHeadsβ) or (βHeadsβ, βTailsβ), the value of p that gives the highest joint probability
(0.21) is 0.3
.When we observe the data to be (1,1) i.e. (βHeadsβ, βHeadsβ), the value of p that gives the highest joint probability
(0.64) is 0.8
.
The maximum likelihood estimator for p is:
Introductionο
Given data \(X_1, X_2 ... X_n\), a random sample (iid) from a distribution with unknown parameter ΞΈ, we want to find the value of ΞΈ in the parameter space that maximizes our probability of observing that data.
For Discreteβ¦
If \(X_1, X_2 ... X_n\) are discrete, we can look at \(P\left(X_{1}=x_{1}, X_{2}=x_{2}, \ldots, X_{n}=x_{n}\right)\) as a function of ΞΈ, and find the ΞΈ that maximizes it. This is the joint pmf for \(X_1, X_2 ... X_n\).
For Continuousβ¦
If \(X_1, X_2 ... X_n\) are continuous is to maximize the joint pdf with respect to ΞΈ.
For Discreteβ¦
If \(X_1, X_2 ... X_n\) are discrete, we can look at \(P\left(X_{1}=x_{1}, X_{2}=x_{2}, \ldots, X_{n}=x_{n}\right)\) as a function of ΞΈ, and find the ΞΈ that maximizes it. This is the joint pmf for \(X_1, X_2 ... X_n\).
For Continuousβ¦
If \(X_1, X_2 ... X_n\) are continuous is to maximize the joint pdf with respect to ΞΈ.
The pmf/pdf for any one of is denoted by f(x)
. We will emphasize the dependence of f on a parameter ΞΈ by writing it as
The data (the xβs) are fixed.
Think of the xβs as fixed and the joint pdf as a function of ΞΈ.
Given the joint PDF, the data, the Xs are fixed, and we think of it as a function of theta and we want to find the value of theta that maximizes the joint probability density function or probability mass function.
Likelihood functionο
If we think of this as a function of theta, and the xβs as fixed, weβre going to rename the joint PDF. Weβre going to call it a likelihood function and write it as a capital L of theta L(ΞΈ).

Attention
Because I can multiply or divide my likelihood by a constant and not change where the maximum occurs, then we can actually define the likelihood to be anything proportional to the joint pdf. So we can throw out multiplicative constants, including multiplicative constants that involve Xs.
Bernoulli distributionο
\(X_{1}, X_{2}, \ldots, X_{n} \stackrel{\text { iid }}{\sim} \text { Bernoulli }(p)\)
The joint probability mass function weβll get by multiplying the individual ones together, because these guys are IID independent and identically distributed. Now, fix the Xs. Those are stuck, fixed, not moving, and think of this as a function of p. The values of p that are allowed, the parameter space for this model, are all values of p between 0 and 1.
For example I have p^X_1 times p^X_2 times p^X_3 and thatβs going to be p to the sum of the Xs, and Iβve got 1 minus p^1 minus X_1, 1 minus p^1 minus X_2. If I add up those exponents, Iβm going to get an exponent of n minus the sum of the Xs, and I do have a product of indicators.
Drop the indicator stuff, so that is a multiplicative constant which is constant with respect to p. I think Iβm going to drop it. Why not make it simpler?
Log-likelihoodο
It is almost always easier to maximize the log-likelihood function due to properties of Logarithms.
\(ln(uv) = ln(u) + ln(v) \text{ and } ln(n)^V = v \times ln(n)\)
Important
The log function is an increasing function. So the log of the likelihood is going to have different values than the likelihood, but because log is increasing, this is not going to mess up the location of the maximum.
I want to maximize it with respect to p, so I'm going to take a derivative with respect to p and set it equal to 0.

This is our coin example again. But we have n flips, and we have the Bernoulliβs ones and zeros for heads and tails, and the value of p is unknown, itβs somewhere between 0 and 1. Weβre no longer restricted to 0.2, 0.3, and 0.8. The maximum likelihood estimator, is the sample mean of the ones and zeros. If you add up the ones and zeros, and divide by n, youβre really computing the proportion of ones in your sample. Youβre really computing the proportion of times you see heads in your sample. This maximum likelihood estimator, at least, in this case, makes a lot of sense.
Q/Aο
- Is maximum likelihood estimator Bernoulli unbiased?
the maximum likelihood estimator of is a biased estimator. Recall that if \(X_i\) is a Bernoulli random variable with parameter P, then \(E[X_i] = p\).
\[\begin{split}E(\hat{p})=E\left(\frac{1}{n} \sum_{i=1}^{n} X_{i}\right)=\frac{1}{n} \sum_{i=1}^{n} E\left(X_{i}\right)=\frac{1}{n} \sum_{i=1}^{n} p=\frac{1}{n}(n p)=p \\\end{split}\]
Exponential distributionο
\(X_{1}, X_{2}, \ldots, X_{n} \stackrel{\text { iid }}{\sim} \text { Exponential }(rate =\lambda)\)
Our goal is to maximize this as a function of Lambda.
I want to make everything capital, and throw a hat on it. Here is our first continuous maximum likelihood estimator for Theta or Lambda.
The maximum likelihood estimator for \(\lambda\) is
Warning
Same as method of moments. Biased!
This is exactly what we got with method of moments. Because if Lambda is the rate of this distribution, the true distribution mean is 1 over Lambda. If you equate that to the sample mean x bar and solve for Lambda, in the method of moments case, we got 1 over x bar. We werenβt that happy about it because it was a biased estimator. Iβm trying to convince you that MLEs are everything. But theyβre not unbiased.
Normal distributionο
MLEs for Multiple and Support Parameters
Weβre going to to consider two cases
One is when theta is higher dimensional, so theta might be the vector of mu and sigma squared.
Other cases when the parameter is in the indicator.
\(X_{1}, X_{2}, \ldots, X_{n} \stackrel{\text { iid }}{\sim} N(\mu, \sigma^2)\)
The parameter space : \(-\infty<\mu<\infty, \quad \sigma^{2}>0\)
Solve for ΞΌ and Ο simultaneouslyο





The Invariance Propertyο
Evaluationο
Comparing the quality of different estimators

Variance, MSE, and Biasο
Note
If \(\hat{\theta}\) is an unbiased estimator of \(\theta\), its mean squared error is simply the variance of \(\theta\)
Practiseο
Let \(X_{1}, X_{2}, \ldots, X_{n}\) be a random sample from the Poisson distribution with parameter \(\lambda>0\). Let \(\hat{\theta}\) be the MLE for \(\lambda\). What is the mean-squared error of \(\widehat{\lambda}\) as an estimator of \(\lambda\) ?
MLE Propertiesο
Confidence Intervalο
A 95% confidence for the mean ΞΌ is given by (-2.14,3.07).
Thereβs no probability, where does the probability come in?
it comes in from random sampling. The fact that if you select a random sample and compute your estimator and if you got to do it again or if someone else did it, theyβre going to get a different estimator based on the randomness of the sample.
Collect your sample.
Estimate the parameter.
Return a confidence interval.
If you did this again, you would not get the same results! Different samples and different confidence intervals.

Multiple samples give multiple confidence intervals. 95% of them will correctly capture ΞΌ and 5% miss it.
Hypothesis Testingο
Statistical inference is the process of learning about characteristics of a population based on what is observed in a relatively small sample from that population. A sample will never give us the entire picture though, and we are bound to make incorrect decisions from time to time.
We will learn how to derive and interpret appropriate tests to manage this error and how to evaluate when one test is better than another. we will learn how to construct and perform principled hypothesis tests for a wide range of problems and applications when they are not.
What is Hypothesisο
Hypothesis testing is an act in statistics whereby an analyst tests an assumption regarding a population parameter.
Hypothesis testing is a formal procedure for investigating our ideas about the world using statistics. It is most often used by scientists to test specific predictions, called hypotheses, that arise from theories.
Note
Due to random samples and randomness in the problem, we can different errors in our hypothesis testing. These errors are called Type I and Type II errors.
Type of hypothesis testingο
Let \(X_1, X_2, \ldots, X_n\) be a random sample from the normal distribution with mean \(\mu\) and variance \(\sigma^2\)
import torch
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import norm
sns.set_theme(style="darkgrid")
sample = torch.normal(mean = 0, std = 1, size=(1,1000))
sns.displot(sample[0], kde=True, stat = 'density',)
plt.axvline(torch.mean(sample[0]), color='red', label='mean')
plt.show()

Example of random sample after it is observed:
Based on what you are seeing, do you believe that the true population mean \(\mu\) is
This is below 3 , but can we say that \(\mu<3\) ?
This seems awfully dependent on the random sample we happened to get! Letβs try to work with the most generic random sample of size 8:
Let \(\mathrm{X}_1, \mathrm{X}_2, \ldots, \mathrm{X}_{\mathrm{n}}\) be a random sample of size \(\mathrm{n}\) from the \(\mathrm{N}\left(\mu, \sigma^2\right)\) distribution.
The Sample mean is
Weβre going to tend to think that \(\mu<3\) when \(\bar{X}\) is βsignificantlyβ smaller than 3.
Weβre going to tend to think that \(\mu>3\) when \(\bar{X}\) is βsignificantlyβ larger than 3.
Weβre never going to observe \(\bar{X}=3\), but we may be able to be convinced that \(\mu=3\) if \(\bar{X}\) is not too far away.
How do we formalize this stuff, We use hypothesis testing
Notationο
\(\mathrm{H}_0: \mu \leq 3\) <- Null hypothesis
\(\mathrm{H}_1: \mu>3 \quad\) Alternate hypothesis
Null hypothesisο
The null hypothesis is a hypothesis that is assumed to be true. We denote it with an \(H_0\).
Alternate hypothesisο
The alternate hypothesis is what we are out to show. The alternative hypothesis is a hypothesis that we are looking for evidence for or out to show. We denote it with an \(H_1\).
Note
Some people use the notation \(H_a\) here
Conclusion is either:
Reject \(\mathrm{H}_0 \quad\) OR \(\quad\) Fail to Reject \(\mathrm{H}_0\)
simple hypothesisο
A simple hypothesis is one that completely specifies the distribution. Do you know the exact distribution.
composite hypothesisο
You donβt know the exact distribution.
Means you know the distribution is normal but you donβt know the mean and variance.
Critical valuesο
Critical values for distributions are numbers that cut off specified areas under pdfs. For the N(0, 1) distribution, we will use the notation \(z_\alpha\) to denote the value that cuts off area \(\alpha\) to the right as depicted here.


Errors in Hypothesis Testingο
Let \(X_1, X_2, \ldots, X_n\) be a random sample from the normal distribution with mean \(\mu\) and variance \(\sigma^2=2\)
Idea: Look at \(\bar{X}\) and reject \(H_0\) in favor of \(H _1\) if \(\overline{ X }\) is βlargeβ.
i.e. Look at \(\bar{X}\) and reject \(H_0\) in favor of \(H _1\) if \(\overline{ X }> c\) for some value \(c\).


You are a potato chip manufacturer and you want to ensure that the mean amount in 15 ounce bags is at least 15 ounces. \(\mathrm{H}_0: \mu \leq 15 \quad \mathrm{H}_1: \mu>15\)
Type I Errorο
The true mean is \(\leq 15\) but you concluded i was \(>15\). You are going to save some money because you wonβt be adding chips but you are risking a lawsuit!
Type II Errorο
The true mean is \(> 15\) but you concluded it was \(\leq 15\) . You are going to be spending money increasing the amount of chips when you didnβt have to.
Developing a Testο
Let \(X_1, X_2, \ldots, X_n\) be a random sample from the normal distribution with mean \(\mu\) and known variance \(\sigma^2\).
Consider testing the simple versus simple hypotheses
level of significanceο
Let \(\alpha= P\) (Type I Error)
\(= P \left(\right.\) Reject \(H _0\) when itβs true \()\)
\(= P \left(\right.\) Reject \(H _0\) when \(\left.\mu=5\right)\)
\(\alpha\) is called the level of significance of the test. It is also sometimes referred to as the size of the test.
Power of the testο
\(1-\beta\) is known as the power of the test
Step Oneο
Choose an estimator for ΞΌ.
Step Twoο
Choose a test statistic or Give the βformβ of the test.
We are looking for evidence that \(H _1\) is true.
The \(N \left(3, \sigma^2\right)\) distribution takes on values from \(-\infty\) to \(\infty\).
\(\overline{ X } \sim N \left(\mu, \sigma^2 / n \right) \Rightarrow \overline{ X }\) also takes on values from \(-\infty\) to \(\infty\).
It is entirely possible that \(\bar{X}\) is very large even if the mean of its distribution is 3.
However, if \(\bar{X}\) is very large, it will start to seem more likely that \(\mu\) is larger than 3.
Eventually, a population mean of 5 will seem more likely than a population mean of 3.
Reject \(H _0\), in favor of \(H _1\), if \(\overline{ X }< c\) for some c to be determined.
Step Threeο
Find c.
If \(c\) is too large, we are making it difficult to reject \(H _0\). We are more likely to fail to reject when it should be rejected.
If \(c\) is too small, we are making it to easy to reject \(H _0\). We are more likely reject when it should not be rejected.
This is where \(\alpha\) comes in.
Step Fourο
Give a conclusion!
\(0.05= P (\) Type I Error)
\(= P \left(\right.\) Reject \(H _0\) when true \()\)
\(= P (\overline{ X }< \text{ c when } \mu=5)\)
\( = P \left(\frac{\overline{ X }-\mu_0}{\sigma / \sqrt{ n }}<\frac{ c -5}{2 / \sqrt{10}}\right.\) when \(\left.\mu=5\right)\)



Formulaο
Let \(X_1, X_2, \ldots, X_n\) be a random sample from the normal distribution with mean \(\mu\) and known variance \(\sigma^2\).
Consider testing the simple versus simple hypotheses
where \(\mu_0\) and \(\mu_1\) are fixed and known.
Type II Errorο
Composite vs Composite Hypothesisο
Step One Choose an estimator for ΞΌ
Step Two Choose a test statistic: Reject \(H_0\) , in favor of \(H_1\) if \(\bar{π·}\) > c, where c is to be determined.
Step Three Find c.
One-Tailed Testsο
Let \(X_1, X_2, \ldots, X_n\) be a random sample from the normal distribution with mean \(\mu\) and known variance \(\sigma^2\). Consider testing the hypotheses
where \(\mu_0\) is fixed and known.
Step Oneο
Choose an estimator for ΞΌ.
Step Twoο
Choose a test statistic or Give the βformβ of the test.
Reject \(H _0\), in favor of \(H _1\), if \(\overline{ X }< c\) for some c to be determined.
Step Threeο
Find c.
Step fourο
Reject \(H _0\), in favor of \(H _1\), if $\( \overline{ X }<\mu_0+ z _{1-\alpha} \frac{\sigma}{\sqrt{ n }} \)$
Exampleο
In 2019, the average health care annual premium for a family of 4 in the United States, was reported to be \(\$ 6,015\).
In a more recent survey, 100 randomly sampled families of 4 reported an average annual health care premium of \(\$ 6,537\). Can we say that the true average is currently greater than \(\$ 6,015\) for all families of 4?
Assume that annual health care premiums are normally distributed with a standard deviation of \(\$ 814\). Let \(\mu\) be the true average for all families of 4.
Step Zeroο
Set up the hypotheses.
Decide on a level of significance. \( \alpha=0.10\)
Step Oneο
Choose an estimator for \(\mu\).
Step Twoο
Give the form of the test. Reject \(H _0\), in favor of \(H _1\), if
for some \(c\) to be determined.
Step Threeο
Find c.
Step Fourο
Conclusion. Reject \(H _0\), in favor of \(H _1\), if
From the data, where \(\bar{x}=6537\), we reject \(H _0\) in favor of \(H _1\).
The data suggests that the true mean annual health care premium is greater than \(\$ 6015\).
Hypothesis Testing with P-Valuesο
Recall that p-values are defined as the following: A p-value is the probability that we observe a test statistic at least as extreme as the one we calculated, assuming the null hypothesis is true. It isnβt immediately obvious what that definition means, so letβs look at some examples to really get an idea of what p-values are, and how they work.
Letβs start very simple and say we have 5 data points: x = <1, 2, 3, 4, 5>. Letβs also assume the data were generated from some normal distribution with a known variance \(\sigma\) but an unknown mean \(\mu_0\). What would be a good guess for the true mean? We know that this data could come from any normal distribution, so letβs make two wild guesses:
The true mean is 100.
The true mean is 3.
Intuitively, we know that 3 is the better guess. But how do we actually determine which of these guesses is more likely? By looking at the data and asking βhow likely was the data to occur, assuming the guess is true?β
What is the probability that we observed x=<1,2,3,4,5> assuming the mean is 100? Probabiliy pretty low. And because the p-value is low, we βreject the null hypothesisβ that \(\mu_0 = 100\).
What is the probability that we observed x=<1,2,3,4,5> assuming the mean is 3? Seems reasonable. However, something to be careful of is that p-values do not prove anything. Just because it is probable for the true mean to be 3, does not mean we know the true mean is 3. If we have a high p-value, we βfail to reject the null hypothesisβ that \(\mu_0 = 3\).
What do βlowβ and βhighβ mean? That is where your significance level \(\alpha\) comes back into play. We consider a p-value low if the p-value is less than \(\alpha\), and high if it is greater than \(\alpha\).
Exampleο
From the above example.


This is the \(N\left(6015,814^2 / 100\right)\) pdf.
The red area is \(P (\overline{ X }>6537)\).
The P-Value is the area to the right (in this case) of the test statistic \(\bar{X}\).
The P-value being less than \(0.10\) puts \(\bar{X}\) in the rejection region.
The P-value is also less than \(0.05\) and \(0.01\).
It looks like we will reject \(H _0\) for the most typical values of \(\alpha\).
Power Functionsο
Let \(X_1, X_2, \ldots, X_n\) be a random sample from any distribution with unknown parameter \(\theta\) which takes values in a parameter space \(\Theta\)
We ultimately want to test
where \(\Theta_0\) is some subset of \(\Theta\).
So in other words, if the null hypothesis was for you to test for an exponential distribution, whether lambda was between 0 and 2, the complement of that is not the rest of the real number line because the space is only non-negative values. So the complement of the interval from 0 to 2 in that space is 2 to infinity.
\(\gamma(\theta)= P \left(\right.\) Reject \(H _0\) when the parameter is \(\left.\theta\right)\)
\(\theta\) is an argument that can be anywhere in the parameter space \(\Theta\). it could be a \(\theta\) from \(H _0\) it could be a \(\theta\) from \(H _1\)
Two Tailed Testsο
Let \(X_1, X_2, \ldots, X_n\) be a random sample from the normal distribution with mean \(\mu\) and known variance \(\sigma^2\).
Derive a hypothesis test of size \(\alpha\) for testing
We will look at the sample mean \(\bar{X} \ldots\) \(\ldots\) and reject if it is either too high or too low.
Step Oneο
Choose an estimator for ΞΌ.
Step Twoο
Choose a test statistic or Give the βformβ of the test.
Reject \(H _0\), in favor of \(H _1\) if either \(\overline{ X }< c\) or \(\bar{X}>d\) for some \(c\) and \(d\) to be determined.
Easier to make it symmetric! Reject \(H _0\), in favor of \(H _1\) if either
for some \(c\) to be determined.
Step Threeο
Find c.

Step Fourο
Conclusion
Reject \(H _0\), in favor of \(H _1\), if
Exampleο
In 2019, the average health care annual premium for a family of 4 in the United States, was reported to be \(\$ 6,015\).
In a more recent survey, 100 randomly sampled families of 4 reported an average annual health care premium of \(\$ 6,177\). Can we say that the true average, for all families of 4 , is currently different than the sample mean from 2019? $\( \sigma=814 \quad \text { Use } \alpha=0.05 \)$
Assume that annual health care premiums are normally distributed with a standard deviation of \(\$ 814\). Let \(\mu\) be the true average for all families of 4. Hypotheses:
We reject \(H _0\), in favor of \(H _1\). The data suggests that the true current average, for all families of 4 , is different than it was in 2019.

Hypothesis Tests for Proportionsο
A random sample of 500 people in a certain country which is about to have a national election were asked whether they preferred βCandidate Aβ or βCandidate Bβ. From this sample, 320 people responded that they preferred Candidate A.
Let \(p\) be the true proportion of the people in the country who prefer Candidate A.
Test the hypotheses \(H _0: p \leq 0.65\) versus \(H _1: p>0.65\) Use level of significance \(0.10\). We have an estimate
The Modelο
Take a random sample of size \(n\). Record \(X_1, X_2, \ldots, X_n\) where \(X_i= \begin{cases}1 & \text { person i likes Candidate A } \\ 0 & \text { person i likes Candidate B }\end{cases}\) Then \(X_1, X_2, \ldots, X_n\) is a random sample from the Bernoulli distribution with parameter \(p\).
Note that, with these 1βs and 0βs, $\( \begin{aligned} \hat{p} &=\frac{\# \text { in the sample who like A }}{\# \text { in the sample }} \\ &=\frac{\sum_{ i =1}^{ n } X _{ i }}{ n }=\overline{ X } \end{aligned} \)\( By the Central Limit Theorem, \)\hat{p}=\overline{ X }$ has, for large samples, an approximately normal distribution.
So, \(\quad \hat{p} \stackrel{\text { approx }}{\sim} N\left(p, \frac{p(1-p)}{n}\right)\)
In particular, $\( \frac{\hat{p}-p}{\sqrt{\frac{p(1-p)}{n}}} \)\( behaves roughly like a \)N(0,1)\( as \)n$ gets large.
\(n >30\) is a rule of thumb to apply to all distributions, but we can (and should!) do better with specific distributions.
\(\hat{p}\) lives between 0 and 1.
The normal distribution lives between \(-\infty\) and \(\infty\).
However, \(99.7 \%\) of the area under a \(N(0,1)\) curve lies between \(-3\) and 3 ,
Go forward using normality if the interval $\( \left(\hat{p}-3 \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}, \hat{p}+3 \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\right) \)\( is completely contained within \)[0,1]$.
Step Oneο
Choose a statistic. \(\widehat{p}=\) sample proportion for Candidate \(A\)
Step Twoο
Form of the test. Reject \(H _0\), in favor of \(H _1\), if \(\hat{ p }> c\).
Step Threeο
Use \(\alpha\) to find \(c\) Assume normality of \(\hat{p}\) ? It is a sample mean and \(n>30\).
The interval $\( \left(\hat{p}-3 \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}, \hat{p}+3 \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\right) \)\( is \)(0.5756,0.7044)$
Reject \(H _0\) if
Formula
T-Testsο
What is a t-test, and when do we use it? A t-test is used to compare the means of one or two samples, when the underlying population parameters of those samples (mean and standard deviation) are unknown. Like a z-test, the t-test assumes that the sample follows a normal distribution. In particular, this test is useful for when we have a small sample size, as we can not use the Central Limit Theorem to use a z-test.
There are two kinds of t-tests:
One Sample t-tests
Two Sample t-tests
Let \(X_1, X_2, \ldots, X_n\) be a random sample from the normal distribution with mean \(\mu\) and unknown variance \(\sigma^2\).
Consider testing the simple versus simple hypotheses $\( H _0: \mu=\mu_0 \quad H _1: \mu<\mu_0 \)\( where \)\mu_0$ is fixed and known.
Reject \(H _0\), in favor of \(H _1\), if
unknown!This is a useless test!
It was based on the fact that
What is we use the sample standard deviation \(S =\sqrt{ S ^2}\) in place of \(\sigma\) ?
Thus,
Step fourο
Conclusion! Reject \(H _0\), in favor of \(H _1\), if
Exampleο
In 2019, the average health care annual premium for a family of 4 in the United States, was reported to be \(\$ 6,015\).
In a more recent survey, 15 randomly sampled families of 4 reported an average annual health care premium of \(\$ 6,033\) and a sample variance of \(\$ 825\).
Can we say that the true average is currently greater than \(\$ 6,015\) for all families of 4 ?
Use \(\alpha=0.10\)
Assume that annual health care premiums are normally distributed. Let \(\mu\) be the true average for all families of 4.
Step Zeroο
Set up the hypotheses.
Step Oneο
Choose a test statistic
Step Twoο
Give the form of the test. Reject π§0 , in favor of h1, if π’ π§π£ π· > πΌ where c is to be determined.
Step Threeο
Find c

Step Fourο
Conclusion. Rejection Rule: Reject \(H _0\), in favor of \(H _1\) if
We had \(\bar{x}=6033\) so we reject \(H_0\).
There is sufficient evidence (at level \(0.10\) ) in the data to suggest that the true mean annual healthcare premium cost for a family of 4 is greater than \(\$ 6,015\).
P valueο
Two Sample Tests for Meansο
Fifth grade students from two neighboring counties took a placement exam.
Group 1, from County 1, consisted of 57 students. The sample mean score for these students was \(7 7 . 2\) and the true variance is known to be 15.3. Group 2, from County 2, consisted of 63 students and had a sample mean score of \(75.3\) and the true variance is known to be 19.7.
From previous years of data, it is believed that the scores for both counties are normally distributed.
Derive a test to determine whether or not the two population means are the same.
Suppose that \(X _{1,1}, X _{1,2}, \ldots, X _{1, n _1}\) is a random sample of size \(n_1\) from the normal distribution with mean \(\mu_1\) and variance \(\sigma_1^2\). Suppose that \(X_{2,1}, X_{2,2}, \ldots, X_{2, n_2}\) is a random sample of size \(n_2\) from the normal distribution with mean \(\mu_2\) and variance \(\sigma_2^2\).
Suppose that \(\sigma_1^2\) and \(\sigma_2^2\) are known and that the samples are independent.
Think of this as $\( \begin{gathered} \theta=0 \text { versus } \theta \neq 0 \\ \text { for } \\ \theta=\mu_1-\mu_2 \end{gathered} \)$
Step oneο
Choose an estimator for \(\theta=\mu_1-\mu_2\)
Step Twoο
Give the βformβ of the test. Reject \(H _0\), in favor of \(H _1\) if either \(\hat{\theta}>c\) or \(\hat{\theta}<-c\) for some c to be determined.
Step Threeο
Find \(c\) using \(\alpha\) Will be working with the random variable
We need to know its distributionβ¦
Step Threeο
Find c using \(\alpha\).
\(\bar{X}_1-\bar{X}_2\) is normally distributed

Step Fourο
Conclusion
Reject \(H _0\), in favor of \(H _1\), if
or
Exampleο
Suppose that \(\alpha=0.05\). $\( \begin{aligned} & z _{\alpha / 2}= z _{0.025}=1.96 \\ & z _{\alpha / 2} \sqrt{\frac{\sigma_1^2}{ n _1}+\frac{\sigma_2^2}{ n _2}}=1.49 \end{aligned} \)$
So,
and we reject \(H _0\). The data suggests that the true mean scores for the counties are different!
Two Sample t-Tests for a Difference of Meansο
Fifth grade students from two neighboring counties took a placement exam.
Group 1, from County A, consisted of 8 students. The sample mean score for these students was \(77.2\) and the sample variance is \(15.3\).
Group 2, from County B, consisted of 10 students and had a sample mean score of \(75.3\) and the sample variance is 19.7.
Pooled Varianceο
Step Fourο
Reject \(H _0\), in favor of \(H _1\), if
or
Since \(\bar{x}_1-\bar{x}_2=1.9\) is not above \(5.840\), or below \(-5.840\) we fail to reject \(H _0\), in favor of \(H _1\) at \(0.01\) level of significance.
The data do not indicate that there is a significant difference between the true mean scores for counties \(A\) and \(B\).
Welchβs Test and Paired Dataο
Two Populations: Test
Suppose that \(X_{1,1}, X_{1,2}, \ldots, X_{1, n_1}\) is a random sample of size \(n_1\) from the normal
distribution with mean \(\mu_1\) and variance \(\sigma_1^2\).Suppose that \(X_{2,1}, X_{2,2}, \ldots, X_{2, n}\) is a random sample of size \(n_2\) from the normal distribution with mean \(\mu_2\) and variance \(\sigma_2^2\).
Suppose that \(\sigma_1^2\) and \(\sigma_2^2\) are unknown and that the samples are independent. Donβt assume that \(\sigma_1^2\) and \(\sigma_2^2\) are equal!
Welch says that:
has an approximate t-distribution with \(r\) degrees of freedom where
rounded down.

Exampleο
A random sample of 6 studentsβ grades were recorded for Midterm 1 and Midterm 2. Assuming exam scores are normally distributed, test whether the true (total population of students) average grade on Midterm 2 is greater than Midterm 1. Ξ± = 0.05
Student |
Midterm 1 Grade |
Midterm 2 Grade |
---|---|---|
1 |
72 |
81 |
2 |
93 |
89 |
3 |
85 |
87 |
4 |
77 |
84 |
5 |
91 |
100 |
6 |
84 |
82 |
Student |
Midterm 1 Grade |
Midterm 2 Grade |
Differences: minus 2 Midterm 1 |
---|---|---|---|
1 |
72 |
81 |
9 |
2 |
93 |
89 |
-4 |
3 |
85 |
87 |
2 |
4 |
77 |
84 |
7 |
5 |
91 |
100. |
9 |
6 |
84 |
82 |
-2 |
The Hypotheses: Let \(\mu\) be the true average difference for all students.
This is simply a one sample t-test on the differences.
Data:
This is simply a one sample t-test on the differences.
This is simply a one sample t-test on the differences.
Reject \(H _0\), in favor of \(H _1\), if
3.5 > 4.6
Conclusion: We fail to reject h0 , in favor of h1 , at 0.05 level of significance.
These data do not indicate that Midterm 2 scores are higher than Midterm 1 scores
Comparing Two Population Proportionsο
A random sample of 500 people in a certain county which is about to have a national election were asked whether they preferred βCandidate Aβ or βCandidate Bβ. From this sample, 320 people responded that they preferred Candidate A.
A random sample of 400 people in a second county which is about to have a national election were asked whether they preferred βCandidate Aβ or βCandidate Bβ.
From this second county sample, 268 people responded that they preferred Candidate \(A\).
Test
Change to:
Estimate \(p_1-p_2\) with \(\hat{p}_1-\hat{p}_2\) For large enough samples,
and
Use estimators for p1 and p2 assuming they are the same.
Call the common value p.
Estimate by putting both groups together.
we have
Two-tailed test with z-critical valuesβ¦
= 0.9397
qnorm(1-0.05/2)
\(Z=-0.9397\) does not fall in the rejection region!
Hypothesis Tests for the Exponentialο
Suppose that \(X_1, X_2, \ldots, X_n\) is a random sample from the exponential distribution with rate \(\lambda>0\). Derive a hypothesis test of size \(\alpha\) for
What statistic should we use?
Test 1: Using the Sample Meanο
Step Oneο
Choose a statistic.
Step Twoο
Give the form of the test Reject π§0 , in favor of h1 , if π·_bar < πΌ
for some c to be determined.

Step Threeο

Step Fourο
Reject \(H _0\), in favor of \(H _1\), if
\(\chi_{\alpha, n }^2\) In R, get \(\chi_{0.10,6}^2\)
by typing qchisq(0.90,6)
Best Testο
UMP Testsο
Suppose that \(X_1, X_2, \ldots, X_n\) is a random sample from the exponential distribution with rate \(\lambda>0\).
Derive a uniformly most powerful hypothesis test of size \(\alpha\) for
Step Oneο
Consider the simple versus simple hypotheses
for some fixed \(\lambda_1>\lambda_0\).
###Steps Two, Three, and Four
Find the best test of size \(\alpha\) for
for some fixed \(\lambda_1>\lambda_0\). This test is to reject \(H _0\), in favor of \(H _1\) if
Note that this test does not depend on the particular value of \(\lambda_1\). -It does, however, depend on the fact that \(\lambda_1>\lambda_0\)
The βUMPβ test for
is to reject \(H_0\), in favor of \(H_1\) if
The βUMPβ test for
is to reject \(H_0\), in favor of \(H_1\) if
Test for the Variance of the Normal Distributionο
Suppose that \(X_1, X_2, \ldots, X_n\) is a random sample from the normal distribution with mean \(\mu\) and variance \(\sigma^2\). Derive a test of size/level \(\alpha\) for
step 1ο
Choose a statistic/estimator for \(\sigma^2\)
step 2ο
Give the form of the test. Reject \(H_0\), in favor of \(H_1\), if
for some \(c\) to be determined.
step 3ο
find c using alpha

Step 4ο
Reject \(H _0\), in favor of \(H _1\), if
Exampleο
A lawn care company has developed and wants to patent a new herbicide applicator spray nozzle. Example: For safety reasons, they need to ensure that the application is consistent and not highly variable. The company selected a random sample of 10 nozzles and measured the application rate of the herbicide in gallons per acre
The measurements were recorded as
\(0.213,0.185,0.207,0.163,0.179\)
\(0.161,0.208,0.210,0.188,0.195\)
Assuming that the application rates are normally distributed, test the following hypotheses at level \(0.04\).
Get sample variance in \(R\).
or
Hit
Compute variance by typing
or \(\left(\left(\operatorname{sum}\left(x^{\wedge} 2\right)-\left(\operatorname{sum}(x)^{\wedge} 2\right) / 10\right) / 9\right.\) Result: \(0.000364\)
Reject \(H_0\), in favor of \(H_1\), if \(S^2>c\).
Reject \(H _0\), in favor of \(H _1\), if \(S ^2> c\)
Reject \(H_0\), in favor of \(H_1\), if \(S^2>c\)
Fail to reject \(H _0\), in favor of \(H _1\), at level 0.04. There is not sufficient evidence in the data to suggest that \(\sigma^2>0.01\).
Introductionο
Calculus is a branch of mathematics that gives tools to study rate of change of functions trough two main areas: derivatives and integrals.
In the context of machine learning and data science, you can for instance use derivatives to optimize the parameters of a model with gradient descent. You might use integrals to calculate area under the curve.
Functionsο
A function is a rule that takes one or more inputs and produces a single output. For example, the function \(f(x) = x + 1\) takes a single input \(x\), adds one to it, and produces a single output. In algebra, functions are written using symbols and formulas. For example, the function \(f(x) = x + 1\) can be written as \(f:x \rightarrow x + 1\). The input to a function is called the argument or input variable. The output is called the value or output variable.
Functions are often written using the following notation:
The notation above is read as β\(y\) equals \(f\) of \(x\)β or β\(y\) is a function of \(x\)β. The notation above is useful because it allows us to define a function without specifying its name. For example, we can define a function \(f\) as follows:
We can then use the function \(f\) to compute the square of any number. For example, \(f(2) = 2^2 = 4\) and \(f(3) = 3^2 = 9\).
Domain and Range of a Functionο
The domain of a function is the set of all possible inputs to the function. The range of a function is the set of all possible outputs of the function. For example, the function \(f(x) = x^2\) has a domain of all real numbers and a range of all non-negative real numbers. The domain of a function is often written as \(D(f)\) and the range is often written as \(R(f)\).
import seaborn as sb
func = lambda x: x ** 2
x = [-1,-2,-3, -4, 1, 2, 3, 4]
y = [func(i) for i in x]
sb.lineplot(x=x, y=y)
<Axes: >

Piecewise Functionsο
A piecewise function is a function that is defined by multiple sub-functions, each sub-function applying to a different interval of the main functionβs domain. For example, the function \(f(x) = |x|\) is defined by two sub-functions:
Expoentsο
An exponent is a number that indicates how many times a base number is multiplied by itself. For example, \(2^3\) is the same as \(2 \times 2 \times 2\) and \(2^4\) is the same as \(2 \times 2 \times 2 \times 2\). The number \(2\) is called the base and the number \(3\) is called the exponent. Exponents are often written using the following notation:
The notation above is read as βtwo to the power of threeβ or βtwo cubedβ.
Negative Exponentsο
A negative exponent indicates that the base number should be divided by itself a certain number of times. For example, \(2^{-3}\) is the same as \(\frac{1}{2^3}\) and \(2^{-4}\) is the same as \(\frac{1}{2^4}\). The number \(2\) is called the base and the number \(-3\) is called the exponent. Negative exponents are often written using the following notation:
The notation above is read as βtwo to the power of negative threeβ or βtwo to the power of minus threeβ.
Fractional Exponentsο
A fractional exponent indicates that the base number should be multiplied by itself a certain number of times. For example, \(2^{\frac{1}{2}}\) is the same as \(\sqrt{2}\) and \(2^{\frac{1}{3}}\) is the same as \(\sqrt[3]{2}\). The number \(2\) is called the base and the number \(\frac{1}{2}\) is called the exponent. Fractional exponents are often written using the following notation:
The notation above is read as βtwo to the power of one halfβ or βtwo to the power of one over twoβ.
Logarithmsο
A logarithm is the inverse of an exponent. For example, the logarithm of \(2^3\) is \(3\). The logarithm of a number \(x\) to the base \(b\) is written as \(\log_b(x)\). For example, \(\log_2(8) = 3\) because \(2^3 = 8\).
Common Logarithmsο
The common logarithm of a number \(x\) is the logarithm of \(x\) to the base \(10\). The common logarithm of \(x\) is written as \(\log(x)\). For example, \(\log(100) = 2\) because \(10^2 = 100\).
Natural Logarithmsο
The natural logarithm of a number \(x\) is the logarithm of \(x\) to the base \(e\). The natural logarithm of \(x\) is written as \(\ln(x)\). For example, \(\ln(100) = 4.60517\) because \(e^{4.60517} = 100\).
Polynomialsο
A polynomial is an expression consisting of variables and coefficients, that involves only the operations of addition, subtraction, multiplication, and non-negative integer exponents.
For example, \(x^2 + 2x + 1\) is a polynomial because it consists of the variables \(x\) and the coefficients \(1\) and \(2\).
The degree of a polynomial is the highest degree of its terms. For example, the polynomial \(x^2 + 2x + 1\) has a degree of \(2\) because its highest degree term is \(x^2\).
Derivatives and Partial Derivativesο
Everything around us is changing, the universe is expanding, planets are moving, people are aging, even atoms donβt stay in the same state, they are always moving or changing. Everything is changing with time. So how do we measure it?
How things change?
Suppose we are going on a car trip with our family. The speed of the car is constantly changing. Similarly the temperature at any given point on a day is changing. The overall temperature of Earth is changing. So we need a way to measure that change. Letβs take the example of a family trip. Suppose the overall journey of our trip looks like this:
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_style('darkgrid')
params = {'legend.fontsize': 'medium',
'figure.figsize': (10, 8),
'figure.dpi': 100,
'axes.labelsize': 'medium',
'axes.titlesize':'medium',
'xtick.labelsize':'medium',
'ytick.labelsize':'medium'}
plt.rcParams.update(params)
# Sample data (replace this with your own data)
time = [1, 2, 3, 4, 5]
distance = [10, 20, 25, 35, 40]
# Create a scatter plot
sns.scatterplot(x=time, y=distance, label='Distance vs. Time')
# Create a line plot on top of the scatter plot
sns.lineplot(x=time, y=distance, color='red', label='Distance Line')
# Add labels and title
plt.xlabel('Time')
plt.ylabel('Distance')
plt.title('Distance vs. Time with Distance Line')
# Show legend
plt.legend()
# Show the plot
plt.show()

Average vs Instantaneous rate of changeο
If we look at the graph, we can see that the car covered 40 miles in 5 hours. Now the average speed or the average rate of change will be the total distance divided by total time of the whole journey which in case is
But what if we want to find the rate of change at any given time, this is where the instantaneous rate of change comes into play. The instantaneous rate of change is given by the change at any given time or point. Take the example of a speedometer which gives you the change in speed every moment. We can also calculate the instantaneous change using a graph using the concept of a slope.
Slope of a lineο
Slope of a line is simply defined as the rate of change in the vertical direction due to rate of change in the horizontal direction or simply
But what about the instantaneous rate of change? Well, if you look at a curved line, the slope will be different for different points.
This instantaneous rate of change at any point is called the derivative at that point and is defined as:
Derivative Explainedο
The derivative of a function is related to its rate of change. The rate of change tells you how much the output of the function changes when a change is done to the input. It is calculated as the ratio between a change in the output and the corresponding change in the input.
Graphically, it is the slope of the tangent at a given point of the function.
Letβs take an example. Suppose we have a function like this:
The graph of this function looks like this
# seaborn grah of x squared function
x = [-5, -4, -3, -2, -1 , 0 , 1, 2, 3, 4, 5]
func = lambda i: i**2
y = [func(i) for i in x]
sns.lineplot(x=x, y=y)
plt.show()

Now letβs say we want to find the slope or instantaneous change in y due to x. Letβs say x change from x to x+h then:
Lets say y changes due change in x so
then
since h -> 0 so h will be zero.
Partial Derivativesο
Now considering the derivative of a function with a single input, a partial derivative of a function is just the derivative of a function with multiple inputs with respect to a single variable i.e. the change in that function caused by the change in a single input. Letβs suppose a function
Now we cannot find the derivative of this function directly since it depends on two inputs. So what we do is we find the derivative of this function assuming that one of the inputs is constant. Or simply that what change in the function is caused by the slight change in that single input. Letβs find the partial derivative of this function with respect to both inputs one by 1
To calculate the partial derivatives of the given function \((f(x, y) = x^2y + \sin(y)) \) with respect to x and y, we will find the derivative of each term separately and then combine them using the rules of partial differentiation.
Partial derivative with respect to x
denoted as \(\frac{\partial f}{\partial x}\):
We treat \(y\) as a constant when taking the derivative with respect to \(x\). Therefore, we differentiate \(x^2y\) with respect to \(x\) while keeping \(y\) constant:
The derivative of \(\sin(y)\) with respect to \(x\) is 0 because \(\sin(y)\) does not depend on \(x\). So, \(\frac{\partial}{\partial x}(\sin(y)) = 0\).
Now, we can combine these partial derivatives:
So, the partial derivative of \(f(x, y)\) with respect to \(x is \)2xy$.
Partial derivative with respect to y
denoted as \(\frac{\partial f}{\partial y}\)
Now, we treat x as a constant when taking the derivative with respect to y. Therefore, we differentiate \((x^2y\) with respect to \(y\) while keeping \(x\) constant:
The derivative of \(\sin(y)\) with respect to y is \(\cos(y)\), so \(\frac{\partial}{\partial y}(\sin(y)) = \cos(y)\).
Now, we can combine these partial derivatives:
So, the partial derivative of \(f(x, y)\) with respect to y is \(x^2 + \cos(y)\).
In summary, the partial derivatives of the given function \(f(x, y) = x^2y + \sin(y)\) are:
and
Now if we want to find the partial derivative of the function at the point (-1, 2). We can just chug in values in both partial equations and find the change as:
Similarly
So we say that the change in the function with respect to input x is -4 times and with respect to y is +0.5838.
This means that the function is more sensitive to x than to y.
E.g https://www.youtube.com/watch?v=dfvnCHqzK54
https://www.youtube.com/watch?v=wqPt3qjB6uA&ab_channel=Dr.DataScience
https://www.youtube.com/watch?v=sIX_9n-1UbM
Derivative rulesο
Constant Ruleο
If you have a number (like 5 or 10) all by itself, its derivative is always 0. This means it doesnβt change when you take the derivative.
A constant represents a horizontal line on the graph, which has no slope (i.e., itβs perfectly flat). Therefore, the rate of change (derivative) is zero.
Proof: Let \(f(x) = c\), where c is a constant. Then, by definition, the derivative of \(f(x)\) is
Power Ruleο
If you have a number with an exponent (like \(x^2\) or \(x^3\)), you can bring the exponent down and subtract 1 from it. For example, if you have \(x^2\), the derivative is 2x because 2 times \(x^1\) is 2x.
The derivative of \(x^n\) with respect to x is \(nx^{n-1}\), where n is a constant. This rule is derived using the limit definition of the derivative and the binomial theorem.
Proof: Start with the limit definition of the derivative
Use the binomial theorem to expand \((x+h)^n\) $\( (x+h)^n = x^n + nx^{n-1}h + \text{higher order terms in } h \)$ Substitute this into the limit definition
Cancel the \(x^n\) terms and divide by \(h\)
Simplify and take the limit as h approaches 0: $\( \frac{d}{dx}(x^n) = nx^{n-1} \)$
Sum Ruleο
If youβre adding or subtracting two things, like f(x) + g(x) or f(x) - g(x), you can take the derivative of each thing separately and keep them separate. For example, if you have \(3x^2 + 4x\), you can find the derivative of \(3x^2\) (which is 6x) and the derivative of 4x (which is 4), and then you keep them together as 6x + 4.
Chain Ruleο
Sometimes, you have functions inside of functions. Imagine you have \(f(g(x))\). To find the derivative of that, you first find the derivative of the outer function (f) and then the derivative of the inner function (g). You multiply them together. Itβs like doing things step by step.
The chain rule is a fundamental rule in calculus that allows us to find the derivative of a composite function. In other words, it tells us how to differentiate a function that is composed of two or more functions. The chain rule is often stated as follows:
If f(u) and g(x) are differentiable functions, then the derivative of their composition f(g(x)) is given by:
Hereβs an explanation of the chain rule with an example and a proof:
Explanation:
The chain rule essentially states that to find the derivative of a composite function, you first take the derivative of the outer function with respect to its inner function and then multiply it by the derivative of the inner function with respect to the variable of interest (in this case, x).
Exampleο
Letβs use an example to illustrate the chain rule. Consider the function \(y = f(u) = u^2\) and \(u = g(x) = x^3\). We want to find the derivative of \(y\) with respect to \(x\) which is \(\frac{dy}{dx}\).
Find \(\frac{dy}{du}\): This is the derivative of the outer function \(f(u)\) with respect to its inner function \(u\), which is \(2u\).
Find \(\frac{du}{dx}\): This is the derivative of the inner function \(g(x)\) with respect to \(x\), which is \(3x^2\).
Apply the chain rule:
Now, substitute back \(u = x^3\):
So, the derivative of \(y = x^6\) with respect to \(x\) is \(6x^5\).
Proof:
The proof of the chain rule relies on the definition of the derivative and the limit concept. Letβs prove it step by step:
Start with the definition of the derivative of a function:
Now, we want to find the derivative of the composition \(f(g(x))\). Let \(v = g(x)\). So, we can write:
Using the definition of the derivative for \(f(v)\), we have:
Rewrite \(v + h\) as \(g(x + h)\) because \(v = g(x)\):
Now, we can see that this is precisely the definition of the derivative of \(f(g(x))\). So, weβve shown that:
And this can be simplified to:
So, the chain rule is proven. It tells us how to find the derivative of a composite function by considering the derivatives of its components.
Backpropagation Chain ruleο
The chain rule is a crucial concept in neural network backpropagation, which is the algorithm used to train neural networks. It allows us to efficiently calculate the gradients of the loss function with respect to the networkβs parameters (weights and biases) by decomposing the overall gradient into smaller gradients associated with each layer of the network. Letβs explain how the chain rule is used in neural network backpropagation with an example.
Suppose we have a simple feedforward neural network with one hidden layer. Hereβs a simplified network architecture:
Input layer with \(n\) neurons
Hidden layer with \(m\) neurons
Output layer with \(k\) neurons
The network has weights \(W^{(1)}\) for the connections between the input and hidden layers and weights \(W^{(2)}\) for the connections between the hidden and output layers.
The forward pass of the network involves the following steps:
Compute the weighted sum and apply an activation function to the hidden layer:
where \(X\) is the input, \(W^{(1)}\) are the weights of the first layer, \(b^{(1)}\) are the biases of the first layer, \(\sigma(\cdot)\) is the activation function (e.g., sigmoid or ReLU), and \(a^{(1)}\) is the output of the hidden layer.
Compute the weighted sum and apply an activation function to the output layer:
where \(W^{(2)}\) are the weights of the second (output) layer, \(b^{(2)}\) are the biases of the second layer, and \(a^{(2)}\) is the final output of the network.
Now, letβs assume we have a loss function \(L\) that measures the error between the predicted output \(a^{(2)}\) and the true target values \(Y\). The goal of backpropagation is to update the networkβs weights and biases to minimize this loss.
To do this, we need to compute the gradients of the loss with respect to the networkβs parameters. The chain rule comes into play during this step. We calculate the gradients layer by layer, propagating the gradient backward through the network:
Compute the gradient of the loss with respect to the output layerβs activations: $\( \frac{\partial L}{\partial a^{(2)}} \)$
Use the chain rule to calculate the gradient of the loss with respect to the output layerβs weighted sum (\(z^{(2)}\)): $\( \frac{\partial L}{\partial z^{(2)}} = \frac{\partial L}{\partial a^{(2)}} \cdot \frac{\partial a^{(2)}}{\partial z^{(2)}} \)$
Compute the gradient of the loss with respect to the second layerβs weights and biases (\(W^{(2)}\) and \(b^{(2)}\)): $\( \frac{\partial L}{\partial W^{(2)}} = \frac{\partial L}{\partial z^{(2)}} \cdot \frac{\partial z^{(2)}}{\partial W^{(2)}} \)\( \)\( \frac{\partial L}{\partial b^{(2)}} = \frac{\partial L}{\partial z^{(2)}} \cdot \frac{\partial z^{(2)}}{\partial b^{(2)}} \)$
Use the chain rule again to calculate the gradient of the loss with respect to the hidden layerβs activations (\(a^{(1)}\)): $\( \frac{\partial L}{\partial a^{(1)}} = \frac{\partial L}{\partial z^{(2)}} \cdot \frac{\partial z^{(2)}}{\partial a^{(1)}} \)$
Compute the gradient of the loss with respect to the hidden layerβs weighted sum (\(z^{(1)}\)): $\( \frac{\partial L}{\partial z^{(1)}} = \frac{\partial L}{\partial a^{(1)}} \cdot \frac{\partial a^{(1)}}{\partial z^{(1)}} \)$
Finally, calculate the gradient of the loss with respect to the first layerβs weights and biases (\(W^{(1)}\) and \(b^{(1)}\)): $\( \frac{\partial L}{\partial W^{(1)}} = \frac{\partial L}{\partial z^{(1)}} \cdot \frac{\partial z^{(1)}}{\partial W^{(1)}} \)\( \)\( \frac{\partial L}{\partial b^{(1)}} = \frac{\partial L}{\partial z^{(1)}} \cdot \frac{\partial z^{(1)}}{\partial b^{(1)}} \)$
The chain rule allows us to compute these gradients efficiently by breaking down the overall gradient into smaller gradients associated with each layer. Once we have these gradients, we can use them to update the networkβs weights and biases using optimization algorithms like gradient descent. This iterative process of forward and backward passes, driven by the chain rule, is how neural networks are trained to learn from data.
# Importing Required Libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import pandas as pd
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt
%matplotlib inline
#Load the dataset
data = load_iris()
data
{'data': array([[5.1, 3.5, 1.4, 0.2],
[4.9, 3. , 1.4, 0.2],
[4.7, 3.2, 1.3, 0.2],
[4.6, 3.1, 1.5, 0.2],
[5. , 3.6, 1.4, 0.2],
[5.4, 3.9, 1.7, 0.4],
[4.6, 3.4, 1.4, 0.3],
[5. , 3.4, 1.5, 0.2],
[4.4, 2.9, 1.4, 0.2],
[4.9, 3.1, 1.5, 0.1],
[5.4, 3.7, 1.5, 0.2],
[4.8, 3.4, 1.6, 0.2],
[4.8, 3. , 1.4, 0.1],
[4.3, 3. , 1.1, 0.1],
[5.8, 4. , 1.2, 0.2],
[5.7, 4.4, 1.5, 0.4],
[5.4, 3.9, 1.3, 0.4],
[5.1, 3.5, 1.4, 0.3],
[5.7, 3.8, 1.7, 0.3],
[5.1, 3.8, 1.5, 0.3],
[5.4, 3.4, 1.7, 0.2],
[5.1, 3.7, 1.5, 0.4],
[4.6, 3.6, 1. , 0.2],
[5.1, 3.3, 1.7, 0.5],
[4.8, 3.4, 1.9, 0.2],
[5. , 3. , 1.6, 0.2],
[5. , 3.4, 1.6, 0.4],
[5.2, 3.5, 1.5, 0.2],
[5.2, 3.4, 1.4, 0.2],
[4.7, 3.2, 1.6, 0.2],
[4.8, 3.1, 1.6, 0.2],
[5.4, 3.4, 1.5, 0.4],
[5.2, 4.1, 1.5, 0.1],
[5.5, 4.2, 1.4, 0.2],
[4.9, 3.1, 1.5, 0.2],
[5. , 3.2, 1.2, 0.2],
[5.5, 3.5, 1.3, 0.2],
[4.9, 3.6, 1.4, 0.1],
[4.4, 3. , 1.3, 0.2],
[5.1, 3.4, 1.5, 0.2],
[5. , 3.5, 1.3, 0.3],
[4.5, 2.3, 1.3, 0.3],
[4.4, 3.2, 1.3, 0.2],
[5. , 3.5, 1.6, 0.6],
[5.1, 3.8, 1.9, 0.4],
[4.8, 3. , 1.4, 0.3],
[5.1, 3.8, 1.6, 0.2],
[4.6, 3.2, 1.4, 0.2],
[5.3, 3.7, 1.5, 0.2],
[5. , 3.3, 1.4, 0.2],
[7. , 3.2, 4.7, 1.4],
[6.4, 3.2, 4.5, 1.5],
[6.9, 3.1, 4.9, 1.5],
[5.5, 2.3, 4. , 1.3],
[6.5, 2.8, 4.6, 1.5],
[5.7, 2.8, 4.5, 1.3],
[6.3, 3.3, 4.7, 1.6],
[4.9, 2.4, 3.3, 1. ],
[6.6, 2.9, 4.6, 1.3],
[5.2, 2.7, 3.9, 1.4],
[5. , 2. , 3.5, 1. ],
[5.9, 3. , 4.2, 1.5],
[6. , 2.2, 4. , 1. ],
[6.1, 2.9, 4.7, 1.4],
[5.6, 2.9, 3.6, 1.3],
[6.7, 3.1, 4.4, 1.4],
[5.6, 3. , 4.5, 1.5],
[5.8, 2.7, 4.1, 1. ],
[6.2, 2.2, 4.5, 1.5],
[5.6, 2.5, 3.9, 1.1],
[5.9, 3.2, 4.8, 1.8],
[6.1, 2.8, 4. , 1.3],
[6.3, 2.5, 4.9, 1.5],
[6.1, 2.8, 4.7, 1.2],
[6.4, 2.9, 4.3, 1.3],
[6.6, 3. , 4.4, 1.4],
[6.8, 2.8, 4.8, 1.4],
[6.7, 3. , 5. , 1.7],
[6. , 2.9, 4.5, 1.5],
[5.7, 2.6, 3.5, 1. ],
[5.5, 2.4, 3.8, 1.1],
[5.5, 2.4, 3.7, 1. ],
[5.8, 2.7, 3.9, 1.2],
[6. , 2.7, 5.1, 1.6],
[5.4, 3. , 4.5, 1.5],
[6. , 3.4, 4.5, 1.6],
[6.7, 3.1, 4.7, 1.5],
[6.3, 2.3, 4.4, 1.3],
[5.6, 3. , 4.1, 1.3],
[5.5, 2.5, 4. , 1.3],
[5.5, 2.6, 4.4, 1.2],
[6.1, 3. , 4.6, 1.4],
[5.8, 2.6, 4. , 1.2],
[5. , 2.3, 3.3, 1. ],
[5.6, 2.7, 4.2, 1.3],
[5.7, 3. , 4.2, 1.2],
[5.7, 2.9, 4.2, 1.3],
[6.2, 2.9, 4.3, 1.3],
[5.1, 2.5, 3. , 1.1],
[5.7, 2.8, 4.1, 1.3],
[6.3, 3.3, 6. , 2.5],
[5.8, 2.7, 5.1, 1.9],
[7.1, 3. , 5.9, 2.1],
[6.3, 2.9, 5.6, 1.8],
[6.5, 3. , 5.8, 2.2],
[7.6, 3. , 6.6, 2.1],
[4.9, 2.5, 4.5, 1.7],
[7.3, 2.9, 6.3, 1.8],
[6.7, 2.5, 5.8, 1.8],
[7.2, 3.6, 6.1, 2.5],
[6.5, 3.2, 5.1, 2. ],
[6.4, 2.7, 5.3, 1.9],
[6.8, 3. , 5.5, 2.1],
[5.7, 2.5, 5. , 2. ],
[5.8, 2.8, 5.1, 2.4],
[6.4, 3.2, 5.3, 2.3],
[6.5, 3. , 5.5, 1.8],
[7.7, 3.8, 6.7, 2.2],
[7.7, 2.6, 6.9, 2.3],
[6. , 2.2, 5. , 1.5],
[6.9, 3.2, 5.7, 2.3],
[5.6, 2.8, 4.9, 2. ],
[7.7, 2.8, 6.7, 2. ],
[6.3, 2.7, 4.9, 1.8],
[6.7, 3.3, 5.7, 2.1],
[7.2, 3.2, 6. , 1.8],
[6.2, 2.8, 4.8, 1.8],
[6.1, 3. , 4.9, 1.8],
[6.4, 2.8, 5.6, 2.1],
[7.2, 3. , 5.8, 1.6],
[7.4, 2.8, 6.1, 1.9],
[7.9, 3.8, 6.4, 2. ],
[6.4, 2.8, 5.6, 2.2],
[6.3, 2.8, 5.1, 1.5],
[6.1, 2.6, 5.6, 1.4],
[7.7, 3. , 6.1, 2.3],
[6.3, 3.4, 5.6, 2.4],
[6.4, 3.1, 5.5, 1.8],
[6. , 3. , 4.8, 1.8],
[6.9, 3.1, 5.4, 2.1],
[6.7, 3.1, 5.6, 2.4],
[6.9, 3.1, 5.1, 2.3],
[5.8, 2.7, 5.1, 1.9],
[6.8, 3.2, 5.9, 2.3],
[6.7, 3.3, 5.7, 2.5],
[6.7, 3. , 5.2, 2.3],
[6.3, 2.5, 5. , 1.9],
[6.5, 3. , 5.2, 2. ],
[6.2, 3.4, 5.4, 2.3],
[5.9, 3. , 5.1, 1.8]]),
'target': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]),
'frame': None,
'target_names': array(['setosa', 'versicolor', 'virginica'], dtype='<U10'),
'DESCR': '.. _iris_dataset:\n\nIris plants dataset\n--------------------\n\n**Data Set Characteristics:**\n\n:Number of Instances: 150 (50 in each of three classes)\n:Number of Attributes: 4 numeric, predictive attributes and the class\n:Attribute Information:\n - sepal length in cm\n - sepal width in cm\n - petal length in cm\n - petal width in cm\n - class:\n - Iris-Setosa\n - Iris-Versicolour\n - Iris-Virginica\n\n:Summary Statistics:\n\n============== ==== ==== ======= ===== ====================\n Min Max Mean SD Class Correlation\n============== ==== ==== ======= ===== ====================\nsepal length: 4.3 7.9 5.84 0.83 0.7826\nsepal width: 2.0 4.4 3.05 0.43 -0.4194\npetal length: 1.0 6.9 3.76 1.76 0.9490 (high!)\npetal width: 0.1 2.5 1.20 0.76 0.9565 (high!)\n============== ==== ==== ======= ===== ====================\n\n:Missing Attribute Values: None\n:Class Distribution: 33.3% for each of 3 classes.\n:Creator: R.A. Fisher\n:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)\n:Date: July, 1988\n\nThe famous Iris database, first used by Sir R.A. Fisher. The dataset is taken\nfrom Fisher\'s paper. Note that it\'s the same as in R, but not as in the UCI\nMachine Learning Repository, which has two wrong data points.\n\nThis is perhaps the best known database to be found in the\npattern recognition literature. Fisher\'s paper is a classic in the field and\nis referenced frequently to this day. (See Duda & Hart, for example.) The\ndata set contains 3 classes of 50 instances each, where each class refers to a\ntype of iris plant. One class is linearly separable from the other 2; the\nlatter are NOT linearly separable from each other.\n\n|details-start|\n**References**\n|details-split|\n\n- Fisher, R.A. "The use of multiple measurements in taxonomic problems"\n Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to\n Mathematical Statistics" (John Wiley, NY, 1950).\n- Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.\n (Q327.D83) John Wiley & Sons. ISBN 0-471-22361-1. See page 218.\n- Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System\n Structure and Classification Rule for Recognition in Partially Exposed\n Environments". IEEE Transactions on Pattern Analysis and Machine\n Intelligence, Vol. PAMI-2, No. 1, 67-71.\n- Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule". IEEE Transactions\n on Information Theory, May 1972, 431-433.\n- See also: 1988 MLC Proceedings, 54-64. Cheeseman et al"s AUTOCLASS II\n conceptual clustering system finds 3 classes in the data.\n- Many, many more ...\n\n|details-end|\n',
'feature_names': ['sepal length (cm)',
'sepal width (cm)',
'petal length (cm)',
'petal width (cm)'],
'filename': 'iris.csv',
'data_module': 'sklearn.datasets.data'}
# Making it a 2 class problem
X_data = data["data"][:100]
Y_data = data["target"][:100]
len(X_data)
100
# Splitting the data into training and testing with 80 and 20%
X_train, X_test, Y_train, Y_test = train_test_split(X_data, Y_data, test_size = 0.2)
print(X_train.shape, X_test.shape)
(80, 4) (20, 4)
# Making their tensors
X_train_tensor = torch.tensor(X_train, dtype=torch.float32).t()
Y_train_tensor = torch.tensor(Y_train, dtype=torch.float32)
X_test_tensor = torch.tensor(X_test, dtype=torch.float32).t()
Y_test_tensor = torch.tensor(Y_test, dtype=torch.float32)
X_train_tensor.shape[1]
80
def init_params(n_x, n_h, n_y):
'''
This function is used to initializw the weights for the NN.
n_x : input units
n_h : hidden units
n_y : output units
It returns a dictionary which contains all the parameters
'''
W1 = torch.rand(n_x, n_h)
b1 = torch.rand(n_h, 1)
W2 = torch.rand(n_y, n_h)
b2 = torch.rand(n_y, 1)
params = {
'W1': W1,
'b1': b1,
'W2': W2,
'b2': b2
}
return params
def compute_cost(y_pred, y_actual):
'''
Uses the binary cross entropy loss to compute the cost
'''
return -(y_pred.log()* y_actual + (1-y_actual)*(1-y_pred).log()).mean()
return -1/len(y_pred) * (y_actual * torch.log(y_pred) + (1 - y_actual) * torch.log(1 - y_pred)).sum()
def forward_propagation(params, x_input):
'''
Performs the forward propagation step. Uses the parameters to predict the output A2
'''
#Extractng the parameters
W1 = params['W1']
b1 = params['b1']
W2 = params['W2']
b2 = params['b2']
# Computing the first layer
Z1 = torch.mm(W1.t(), x_input) + b1
A1 = torch.sigmoid(Z1)
# Computing the second layer
Z2 = torch.mm(W2, A1) + b2
A2 = torch.sigmoid(Z2)
# Returning the data
data = {
'Z1' : Z1,
'A1' : A1,
'Z2' : Z2,
'A2' : A2
}
return A2, data
def back_propagation(params, data, x_input, y_input, learning_rate):
'''
Performs the back propagation step. Computes the gradients and updates the parameters
'''
m = x_input.shape[1]
# Extracting the parameters
W1 = params['W1']
W2 = params['W2']
b1 = params['b1']
b2 = params['b2']
# Extrcting the required predictions of first and second layers
A1 = data['A1']
A2 = data['A2']
# Calculating the Gradients
dZ2 = A2 - y_input
dW2 = 1/m*(torch.mm(dZ2, A1.t()))
db2 = 1/m*(torch.sum(dZ2, keepdims=True, axis=1))
dZ1 = torch.mm(W2.t(), dZ2)*(1-torch.pow(A1,2))
dW1 = 1/m*(torch.mm(dZ1, x_input.t()))
db1 = 1/m*(torch.sum(dZ1, keepdims=True, axis=1))
# Updating the parameters
W1 -= learning_rate*dW1
W2 -= learning_rate*dW2
b1 -= learning_rate*db1
b2 -= learning_rate*db2
parameters = {"W1": W1,
"b1": b1,
"W2": W2,
"b2": b2}
return parameters
def model(x_input, y_input, learning_rate = 0.01, no_iterations = 20000):
'''
Putting everything together and making the model
'''
n_x = x_input.shape[0]
n_h = 4
n_y = 1
parameters = init_params(n_x, n_h, n_y)
costs, iterations = [], []
for i in range(no_iterations):
A2, data = forward_propagation(parameters, x_input)
cost = compute_cost(A2, y_input)
cost_torch = F.binary_cross_entropy(A2, y_input.view(1, 80))
parameters = back_propagation(parameters, data, x_input, y_input, learning_rate)
if i%(no_iterations/20) == 0:
print(f'Cost at iteration {i} is {cost}')
print(f'Cost_torch at iteration {i} is {cost_torch.item()}')
costs.append(cost)
iterations.append(i)
return parameters, costs, iterations
parameters, costs, iterations = model(X_train_tensor, Y_train_tensor, 1e-4, 20000)
plt.plot(iterations, costs)
Cost at iteration 0 is 1.0869792699813843
Cost_torch at iteration 0 is 1.0869792699813843
Cost at iteration 1000 is 1.0277596712112427
Cost_torch at iteration 1000 is 1.0277596712112427
Cost at iteration 2000 is 0.9747225642204285
Cost_torch at iteration 2000 is 0.9747225642204285
Cost at iteration 3000 is 0.9278743863105774
Cost_torch at iteration 3000 is 0.9278743863105774
Cost at iteration 4000 is 0.8870576620101929
Cost_torch at iteration 4000 is 0.8870576620101929
Cost at iteration 5000 is 0.8519670367240906
Cost_torch at iteration 5000 is 0.8519670367240906
Cost at iteration 6000 is 0.8221769332885742
Cost_torch at iteration 6000 is 0.8221769332885742
Cost at iteration 7000 is 0.7971805334091187
Cost_torch at iteration 7000 is 0.7971805334091187
Cost at iteration 8000 is 0.7764268517494202
Cost_torch at iteration 8000 is 0.7764268517494202
Cost at iteration 9000 is 0.7593569755554199
Cost_torch at iteration 9000 is 0.7593569755554199
Cost at iteration 10000 is 0.7454321980476379
Cost_torch at iteration 10000 is 0.7454321980476379
Cost at iteration 11000 is 0.7341533899307251
Cost_torch at iteration 11000 is 0.7341533899307251
Cost at iteration 12000 is 0.7250728011131287
Cost_torch at iteration 12000 is 0.7250728011131287
Cost at iteration 13000 is 0.717799723148346
Cost_torch at iteration 13000 is 0.717799723148346
Cost at iteration 14000 is 0.7119990587234497
Cost_torch at iteration 14000 is 0.7119990587234497
Cost at iteration 15000 is 0.7073894143104553
Cost_torch at iteration 15000 is 0.7073894143104553
Cost at iteration 16000 is 0.703737199306488
Cost_torch at iteration 16000 is 0.703737199306488
Cost at iteration 17000 is 0.700850784778595
Cost_torch at iteration 17000 is 0.700850784778595
Cost at iteration 18000 is 0.6985741853713989
Cost_torch at iteration 18000 is 0.6985741853713989
Cost at iteration 19000 is 0.6967815160751343
Cost_torch at iteration 19000 is 0.6967815160751343
[<matplotlib.lines.Line2D at 0x7fac1775d290>]

Algebra Introductionο
This section introduces the basic concepts of algebra, including variables, constants, and functions
Functionsο
A function is a rule that takes one or more inputs and produces a single output. For example, the function \(f(x) = x + 1\) takes a single input \(x\), adds one to it, and produces a single output. In algebra, functions are written using symbols and formulas. For example, the function \(f(x) = x + 1\) can be written as \(f:x \rightarrow x + 1\). The input to a function is called the argument or input variable. The output is called the value or output variable.
Functions are often written using the following notation:
The notation above is read as β\(y\) equals \(f\) of \(x\)β or β\(y\) is a function of \(x\)β. The notation above is useful because it allows us to define a function without specifying its name. For example, we can define a function \(f\) as follows:
We can then use the function \(f\) to compute the square of any number. For example, \(f(2) = 2^2 = 4\) and \(f(3) = 3^2 = 9\).
Domain and Range of a Functionο
The domain of a function is the set of all possible inputs to the function. The range of a function is the set of all possible outputs of the function. For example, the function \(f(x) = x^2\) has a domain of all real numbers and a range of all non-negative real numbers. The domain of a function is often written as \(D(f)\) and the range is often written as \(R(f)\).
import seaborn as sb
func = lambda x: x ** 2
x = [-1,-2,-3, -4, 1, 2, 3, 4]
y = [func(i) for i in x]
sb.lineplot(x=x, y=y)
<Axes: >

Piecewise Functionsο
A piecewise function is a function that is defined by multiple sub-functions, each sub-function applying to a different interval of the main functionβs domain. For example, the function \(f(x) = |x|\) is defined by two sub-functions:
Expoentsο
An exponent is a number that indicates how many times a base number is multiplied by itself. For example, \(2^3\) is the same as \(2 \times 2 \times 2\) and \(2^4\) is the same as \(2 \times 2 \times 2 \times 2\). The number \(2\) is called the base and the number \(3\) is called the exponent. Exponents are often written using the following notation:
The notation above is read as βtwo to the power of threeβ or βtwo cubedβ.
Negative Exponentsο
A negative exponent indicates that the base number should be divided by itself a certain number of times. For example, \(2^{-3}\) is the same as \(\frac{1}{2^3}\) and \(2^{-4}\) is the same as \(\frac{1}{2^4}\). The number \(2\) is called the base and the number \(-3\) is called the exponent. Negative exponents are often written using the following notation:
The notation above is read as βtwo to the power of negative threeβ or βtwo to the power of minus threeβ.
Fractional Exponentsο
A fractional exponent indicates that the base number should be multiplied by itself a certain number of times. For example, \(2^{\frac{1}{2}}\) is the same as \(\sqrt{2}\) and \(2^{\frac{1}{3}}\) is the same as \(\sqrt[3]{2}\). The number \(2\) is called the base and the number \(\frac{1}{2}\) is called the exponent. Fractional exponents are often written using the following notation:
The notation above is read as βtwo to the power of one halfβ or βtwo to the power of one over twoβ.
Logarithmsο
A logarithm is the inverse of an exponent. For example, the logarithm of \(2^3\) is \(3\). The logarithm of a number \(x\) to the base \(b\) is written as \(\log_b(x)\). For example, \(\log_2(8) = 3\) because \(2^3 = 8\).
Common Logarithmsο
The common logarithm of a number \(x\) is the logarithm of \(x\) to the base \(10\). The common logarithm of \(x\) is written as \(\log(x)\). For example, \(\log(100) = 2\) because \(10^2 = 100\).
Natural Logarithmsο
The natural logarithm of a number \(x\) is the logarithm of \(x\) to the base \(e\). The natural logarithm of \(x\) is written as \(\ln(x)\). For example, \(\ln(100) = 4.60517\) because \(e^{4.60517} = 100\).
Polynomialsο
A polynomial is an expression consisting of variables and coefficients, that involves only the operations of addition, subtraction, multiplication, and non-negative integer exponents.
For example, \(x^2 + 2x + 1\) is a polynomial because it consists of the variables \(x\) and the coefficients \(1\) and \(2\).
The degree of a polynomial is the highest degree of its terms. For example, the polynomial \(x^2 + 2x + 1\) has a degree of \(2\) because its highest degree term is \(x^2\).
Proof by Inductionο
A proof by induction consists of two cases. The first, the base case, proves the statement for n = 0 without assuming any knowledge of other cases. The second case, the induction step, proves that if the statement holds for any given case n = k, then it must also hold for the next case n = k + 1. These two steps establish that the statement holds for every natural number n.
What is Machine Learning?ο
What is Deep Learningο

Overviewο
Deep Learning
a type of machine learning based on artificial neural networks in which multiple layers of processing are used to extract progressively higher level features from data.
Machine Learning
development of computer systems that can learn to more accurately predict the outcomes without following explicit instructions, by using algorithms and statistical models to draw inferences from patterns in data.
Differences between Deep Learning and Machine Learningο
Machine Learningο
uses algorithms to parse data, learn from that data, and make informed decisions based on what it has learned.
needs a human to identify and hand-code the applied features based on the data type. | tries to learn features extraction and representation as well.
tend to parse data in parts, then combined those into a result (e.g. first number plate localization and then recognition).
requires relatively less data and training time
Deep learningο
structures algorithms in layers to create an βartificial neural networkβ that can learn and make intelligent decisions on its own.
tries to learn features extraction and representation as well.
Deep learning systems look at an entire problem and generate the final result in one go (e.g. outputs the coordinates and the class of object together).
requires a lot more data and training time
Applications Of Machine Learning/Deep Learningο
Email spam detection
Fingerprint / face detection & matching (e.g., phones)
Web search (e.g., DuckDuckGo, Bing, Google)
Sports predictions
ATMs (e.g., reading checks)
Credit card fraud
Stock predictions
Broad categories of Deep learningο

Perceptronο
Definitionο
Simplest artificial neuron that takes binary inputs and based on their weighted sum reaching a threshold, generates a binary output.
Artificial neuronsο
Takes in multiple inputs and learns what should be the appropriate output
Essentially a mathematical function where the weights multiplied with the inputs are learnable
Acts like a logic gate but the operation performed adjusts according to the data

connect them in a network to create an artificial brain(let)
History of the Perceptronο
Invented in 1957 by Frank Rosenblatt to binary classify an input data.
An attempt to replicate the process and ability of human nervous system.
A Biological Neuronο

McCulloch & Pitts Neuron Modelο

Computational Model of a Biological Neuronο

Terminologyο
Net input \(=\) weighted inputs, \(z\)
Activations = activation function(net input); \(a=\sigma(z)\)
Label output \(=\) threshold(activations of last layer); \(\hat{y}=f(a)\)
Special cases:
In perceptron: activation function = threshold function
In linear regression: activation \(=\) net input \(=\) output
Often more convenient notation: define bias unit as \(w_0\) and prepend a 1 to each input vector as an additional feature value
Perceptron Learning Algorithmο
Let \(\mathcal{D}=\left(\left\langle\mathbf{x}^{[1]}, y^{[1]}\right\rangle,\left\langle\mathbf{x}^{[2]}, y^{[2]}\right\rangle, \ldots,\left\langle\mathbf{x}^{[n]}, y^{[n]}\right\rangle\right) \in\left(\mathbb{R}^m \times\{0,1\}\right)^n\)
Initialize \(\mathbf{w}:=0^m \quad\) (assume notation where weight incl. bias)
For every training epoch:
For every \(\left\langle\mathbf{x}^{[i]}, y^{[i]}\right\rangle \in \mathcal{D}\) :
\(\hat{y}^{[i]}:=\sigma\left(\mathbf{x}^{[i] \top} \mathbf{w}\right)\)
err \(:=\left(y^{[i]}-\hat{y}^{[i]}\right)\)
\(\mathbf{w}:=\mathbf{w}+e r r \times \mathbf{x}^{[i]}\)
Vectorization in Pythonο
Running Computations is a Big Part of Deep Learning!
import torch
def forloop(x, w):
z = 0.
for i in range(len(x)):
z += x[i] * w[i]
return z
def listcomprehension(x, w):
return sum(x_i*w_i for x_i, w_i in zip(x, w))
def vectorized(x, w):
return x.dot(w)
x, w = torch.rand(1000), torch.rand(1000)
%timeit -r 10 -n 10 forloop(x, w)
%timeit -r 10 -n 10 listcomprehension(x, w)
%timeit -r 10 -n 10 vectorized(x, w)
8.34 ms Β± 70.5 Β΅s per loop (mean Β± std. dev. of 10 runs, 10 loops each)
6.84 ms Β± 92.6 Β΅s per loop (mean Β± std. dev. of 10 runs, 10 loops each)
The slowest run took 12.19 times longer than the fastest. This could mean that an intermediate result is being cached.
6.44 Β΅s Β± 10.1 Β΅s per loop (mean Β± std. dev. of 10 runs, 10 loops each)
Perceptron Pytorch Implementationο
Label dataο
import torch
import matplotlib.pyplot as plt
c1_mean , c2_mean = -0.5 , 0.5
c1 = torch.distributions.uniform.Uniform(c1_mean-1,c1_mean+1).sample((200,2))
c2 = torch.distributions.uniform.Uniform(c2_mean-1,c2_mean+1).sample((200,2))
features = torch.cat([c1,c2], axis=0)
labels = torch.cat([torch.zeros((200,1)), torch.ones((200,1))], axis = 0)
data = torch.cat([features, labels],axis=1)
X, y = data[:, :2], data[:, 2]
y = y.to(torch.int)
print('X.shape:', X.shape)
print('y.shape:', y.shape)
X_train, X_test = X[:300], X[100:]
y_train, y_test = y[:300], y[100:]
# Normalize (mean zero, unit variance)
mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)
X_train = (X_train - mu) / sigma
X_test = (X_test - mu) / sigma
plt.scatter(X_train[y_train==0, 0], X_train[y_train==0, 1], label='class 0', marker='o')
plt.scatter(X_train[y_train==1, 0], X_train[y_train==1, 1], label='class 1', marker='s')
plt.title('Training set')
plt.xlabel('feature 1')
plt.ylabel('feature 2')
plt.xlim([-3, 3])
plt.ylim([-3, 3])
plt.legend()
plt.show()
plt.scatter(X_test[y_test==0, 0], X_test[y_test==0, 1], label='class 0', marker='o')
plt.scatter(X_test[y_test==1, 0], X_test[y_test==1, 1], label='class 1', marker='s')
plt.title('Test set')
plt.xlabel('feature 1')
plt.ylabel('feature 2')
plt.xlim([-3, 3])
plt.ylim([-3, 3])
plt.legend()
plt.show()
X.shape: torch.Size([400, 2])
y.shape: torch.Size([400])


Train and evaluateο
import torch
device = torch.device("cuda:0" if torch.cuda.is_available() else "mps")
class Perceptron:
def __init__(self, num_features):
self.num_features = num_features
self.weights = torch.zeros(num_features, 1,
dtype=torch.float32, device=device)
self.bias = torch.zeros(1, dtype=torch.float32, device=device)
self.ones = torch.ones((1, 1), device=device)
self.zeros = torch.zeros((1, 1), device=device)
def forward(self, x):
linear = torch.mm(x, self.weights) + self.bias
predictions = torch.where(linear > 0., self.ones, self.zeros)
return predictions
def backward(self, x, y):
predictions = self.forward(x)
errors = y - predictions
return errors
def train(self, x, y, epochs):
for e in range(epochs):
for i in range(y.shape[0]):
errors = self.backward(x[i].reshape(1, self.num_features), y[i]).reshape(-1)
self.weights += (errors * x[i]).reshape(self.num_features, 1)
self.bias += errors
def evaluate(self, x, y):
predictions = self.forward(x).reshape(-1)
accuracy = torch.sum(predictions == y).float() / y.shape[0]
return accuracy
ppn = Perceptron(num_features=2)
X_train_tensor = torch.tensor(X_train, dtype=torch.float32, device=device)
y_train_tensor = torch.tensor(y_train, dtype=torch.float32, device=device)
ppn.train(X_train_tensor, y_train_tensor, epochs=5)
print('Model parameters:')
print('Weights: %s' % ppn.weights)
print('Bias: %s' % ppn.bias)
X_test_tensor = torch.tensor(X_test, dtype=torch.float32, device=device)
y_test_tensor = torch.tensor(y_test, dtype=torch.float32, device=device)
test_acc = ppn.evaluate(X_test_tensor, y_test_tensor)
print('Test set accuracy: %.2f%%' % (test_acc*100))
Vectors, Matrices, and Tensorsο
Scalars, vectors, matrices, and tensors are the fundamental data structures of deep learning. In this section, we will briefly review these concepts.

Scalar
rank-0 tensor
\(x \in \mathbb{R}\)
x = 1.23
Vector
rank-1 tensor
\( x \in \mathbb{R}^nx1 \)
Matrix
rank-2 tensor
\( x \in \mathbb{R}^{nxm} \)


import torch
t = torch.tensor([ [1, 2, 3, 4,], [6, 7, 8, 9] ])
print(t)
print(t.shape)
print(t.ndim)
print(t.dtype)
tensor([[1, 2, 3, 4],
[6, 7, 8, 9]])
torch.Size([2, 4])
2
torch.int64
Data onto the GPUο
print(torch.cuda.is_available())
print(torch.backends.mps.is_available())
if torch.cuda.is_available():
t = t.to(torch.device('cuda:0'))
print(t)
False
False
Broadcastingο
Making Vector and Matrix computations more convenient
Computing the Output From Multiple Training Examples at Onceο
The perceptron algorithm is typically considered an βonlineβ algorithm (i.e., it updates the weights after each training example)
However, during prediction (e.g., test set evaluation), we could pass all data points at once (so that we can get rid of the βforloopβ)
Two opportunities for parallelism:
computing the dot product in parallel
computing multiple dot products at once
import torch
X = torch.arange(6).view(2, 3)
print(X)
w = torch.tensor([1, 2, 3])
print(w)
print(X.matmul(w))
w = w.view(-1, 1)
print(X.matmul(w))
tensor([[0, 1, 2],
[3, 4, 5]])
tensor([1, 2, 3])
tensor([ 8, 26])
tensor([[ 8],
[26]])


This (general) feature is called βbroadcastingβ
print(torch.tensor([1, 2, 3]) + 1)
t = torch.tensor([[4, 5, 6], [7, 8, 9]])
print(t)
print( t + torch.tensor([1, 2, 3]))
tensor([2, 3, 4])
tensor([[4, 5, 6],
[7, 8, 9]])
tensor([[ 5, 7, 9],
[ 8, 10, 12]])
Notational Linear Algebraο

X = torch.arange(50, dtype=torch.float).view(10, 5)
print(X)
fc = torch.nn.Linear(in_features=5, out_features=3)
print(fc.weight)
print(fc.bias)
print(f"X dim: {X.size()}")
print(f"Weights dim: {fc.weight.size()}")
print(f"bias dim: {fc.bias.size()}")
A = fc(X)
print(f"A dim: {A.size()}")
tensor([[ 0., 1., 2., 3., 4.],
[ 5., 6., 7., 8., 9.],
[10., 11., 12., 13., 14.],
[15., 16., 17., 18., 19.],
[20., 21., 22., 23., 24.],
[25., 26., 27., 28., 29.],
[30., 31., 32., 33., 34.],
[35., 36., 37., 38., 39.],
[40., 41., 42., 43., 44.],
[45., 46., 47., 48., 49.]])
Parameter containing:
tensor([[ 0.3764, -0.1360, -0.3212, -0.3096, 0.3564],
[-0.3580, -0.0980, 0.2572, -0.0816, 0.2260],
[ 0.4385, -0.3840, -0.3494, 0.3915, 0.1394]], requires_grad=True)
Parameter containing:
tensor([0.1618, 0.4385, 0.3456], requires_grad=True)
X dim: torch.Size([10, 5])
Weights dim: torch.Size([3, 5])
bias dim: torch.Size([3])
A dim: torch.Size([10, 3])
Loss Functionsο
Introductionο
Imagine the scenario, Once you developed your machine learning model that you believe, successfully identifying the cats and dogs but how do you know this is the best result?
we are looking for the metrics or a function that we can use to optimize our model performance.The loss function tells how good your model is in predictions.
If the model predictions are closer to the actual values the Loss will be minimum.
if the predictions are totally away from the original values the loss value will be the maximum.
On the basis of the Loss value, you can update your model until you get the best result.
Classificationο
Cross-Entropy or Log Lossο
Cross entropy loss is used mostly when we have a binary classification problem; that is, where the network outputs either 1 or 0.
Suppose we are given a training dataset, \(\mathbb{D}=\left\{\left(x_{i}, y_{i}\right), \cdots,\left(x_{N}, y_{N}\right)\right\}\) and \(y_{i} \in\{0,1\}\).
We can then write this in the following form:
Here, \(\theta\) is the parameters of the network (weights and biases). We can express this in terms of a Bernoulli distribution, as follows:
The probability, given the entire dataset, is then as follows:
If we take its negative-log likelihood, we get the following:
So, we have the following:
Cross-entropy lossο
Cross entropy loss is a metric used to measure how well a classification model in machine learning performs. The loss (or error) is measured as a number between 0 and 1, with 0 being a perfect model. The goal is generally to get your model as close to 0 as possible.
Cross entropy loss measures the difference between the discovered probability distribution of a machine learning classification model and the predicted distribution.
Evaluation Metricsο
Evaluation metrics are used to evaluate the performance of a machine learning model. They provide a way to quantitatively measure how well the model is performing on a given task.
Classificationο
In machine learning, classification is a supervised learning problem in which the model is trained to predict the class of an input data point.
Note
It is important to choose an appropriate evaluation metric for your problem. For example, in a binary classification problem, you may be more interested in minimizing false negatives than false positives, in which case you would want to use a metric like recall rather than precision.
Confusion Matrixο
Confusion matrix is a performance measurement for machine learning classification problem where output can be two or more classes.
In a confusion matrix, the rows represent the actual class labels, and the columns represent the predicted class labels.

Here, TP (true positive) is the number of times the classifier predicted βpositiveβ and the actual label was βpositiveβ. FP (false positive) is the number of times the classifier predicted βpositiveβ and the actual label was βnegativeβ. FN (false negative) is the number of times the classifier predicted βnegativeβ and the actual label was βpositiveβ. TN (true negative) is the number of times the classifier predicted βnegativeβ and the actual label was βnegativeβ.
Here is an example of how to compute the values in a confusion matrix:
Predicted
Positive Negative
Actual
Positive 10 5
Negative 3 2
The values in the confusion matrix can be used to compute various performance metrics, such as precision, recall, and accuracy.

Calculate Confusion Matrix for a 2 classes problemο

Individual Number |
1 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
9 |
10 |
11 |
12 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Actual Classification |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
0 |
0 |
0 |
0 |
Predicted Classification |
0 |
0 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
0 |
0 |
0 |
Result |
FN |
FN |
TP |
TP |
TP |
TP |
TP |
TP |
FP |
TN |
TN |
TN |
Example from wikipedia.com
import torch
from sklearn import metrics
import matplotlib.pyplot as plt
import torchmetrics
# simulate a classification problem
y_true = torch.randint(0,2, (7,))
y_pred = torch.randint(0,2, (7,))
print(f"{y_true=}")
print(f"{y_pred=}")
print(f"confusion_matrix {metrics.confusion_matrix(y_true, y_pred)} ")
tn, fp, fn, tp = metrics.confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)
disp = metrics.ConfusionMatrixDisplay(confusion_matrix=metrics.confusion_matrix(y_true, y_pred) )
disp.plot()
plt.show()
print(f"{torchmetrics.functional.confusion_matrix(y_true, y_pred,task='binary')=}")
y_true=tensor([1, 0, 0, 0, 0, 1, 1])
y_pred=tensor([1, 0, 0, 0, 1, 0, 1])
confusion_matrix [[3 1]
[1 2]]
3 1 1 2

torchmetrics.functional.confusion_matrix(y_true, y_pred,task='binary')=tensor([[3, 1],
[1, 2]])

Precisionο
The above equation can be explained by saying, from all the classes we have predicted as positive, how many are actually positive. Precision should be high as possible.
Recall / Sensitivity / True Positive Rateο
Sensitivity tells us what proportion of the positive class got correctly classified.
The above equation can be explained by saying, from all the positive classes, how many we predicted correctly. A simple example would be to determine what proportion of the actual sick people were correctly detected by the model.
Recall should be high as possible.
Note
Precision is about your prediction. Recall is about reality.
If your job is to identify thieves.
False Negative Rateο
False Negative Rate (FNR) tells us what proportion of the positive class got incorrectly classified by the classifier.
A higher TPR and a lower FNR is desirable since we want to correctly classify the positive class.
Specificity / True Negative Rateο
Specificity tells us what proportion of the negative class got correctly classified.
Taking the same example as in Sensitivity, Specificity would mean determining the proportion of healthy people who were correctly identified by the model.
False Positive Rateο
FPR tells us what proportion of the negative class got incorrectly classified by the classifier.
A higher TNR and a lower FPR is desirable since we want to correctly classify the negative class.
Out of these metrics, Sensitivity and Specificity are perhaps the most important and we will see later on how these are used to build an evaluation metric. But before that, letβs understand why the probability of prediction is better than predicting the target class directly.

The metrics change with the changing threshold values. We can generate different confusion matrices and compare the various metrics that we discussed in the previous section. But that would not be a prudent thing to do. Instead, what we can do is generate a plot between some of these metrics so that we can easily visualize which threshold is giving us a better result.
The AUC-ROC curve solves just that problem!
print(f"{metrics.precision_score(y_true, y_pred)=}")
print(f"{metrics.recall_score(y_true, y_pred)=}")
print(f"{metrics.accuracy_score(y_true, y_pred)=}")
print(f"{metrics.f1_score(y_true, y_pred)=}")
print(f"{metrics.fbeta_score(y_true, y_pred, beta=0.5)=}")
metrics.precision_score(y_true, y_pred)=0.6666666666666666
metrics.recall_score(y_true, y_pred)=0.6666666666666666
metrics.accuracy_score(y_true, y_pred)=0.7142857142857143
metrics.f1_score(y_true, y_pred)=0.6666666666666666
metrics.fbeta_score(y_true, y_pred, beta=0.5)=0.6666666666666666
F1-scoreο
The F1-score is the harmonic mean of the precision and the recall. Using the harmonic mean has the effect that a good F1-score requires both a good precision and a good recall.
It is difficult to compare two models with low precision and high recall or vice versa. So to make them comparable, we use F-Score. F-score helps to measure Recall and Precision at the same time. It uses Harmonic Mean in place of Arithmetic Mean by punishing the extreme values more.
Drawbacksο
One potential drawback of the F1 score as an evaluation metric is that it is sensitive to imbalanced class distributions. This means that if one class is much more prevalent in the data than the other, the F1 score may not be a reliable indicator of the classifierβs performance.
For example, consider a binary classification problem where the positive class is rare, with only 1% of the samples belonging to that class. In this case, a classifier that simply predicts the negative class all the time would have an F1 score of 0, even though it is making the correct prediction 99% of the time. On the other hand, a classifier that makes a small number of correct predictions for the positive class (e.g., 5 out of 100) would have a relatively high F1 score, even though it is performing poorly overall.
ROC Curve and AUCο
Best explaintionο
https://www.analyticsvidhya.com/blog/2020/06/auc-roc-curve-machine-learning/
https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc
https://stephenallwright.com/metric-choice/
https://mlu-explain.github.io/
Ranking | Recommendation | Information Retrievalο
None
Language Modelο
ROUGEο
ROUGE is actually a set of metrics that are used to evaluate the quality of a text summarization system.
ROUGE-N Overlap of n-grams[2] between the system and reference summaries.
ROUGE-1 refers to the overlap of unigram (each word) between the system and reference summaries.
ROUGE-2 refers to the overlap of bigrams between the system and reference summaries.
Linear Algebraο
Multiplying Matrices and Vectorsο
Multiplication of two A x B matrices a third matrix C.
The product operation is defined by
The value at index i,j of result matrix C is given by dot product of ith row of Matrix A with jth column of Matrix
E.g
Hadamard product & element-wise productο
Element wise of multiplication to generate another matrix of same dimension.
Dot productο
The dot product for two vectors to generate scalar value.
Identity and Inverse Matricesο
Identity Matrixο
An identity matrix is a matrix that does not change any vector when we multiply that vector by that matrix.
When βapplyβ the identity matrix to a vector the result is this same vector:
Inverse Matrixο
The inverse of a matrix \(A\) is written as \(A^{-1}\).
A matrix \(A\) is invertible if and only if there exists a matrix \(B\) such that \(AB = BA = I\).
The inverse can be found using:
Gaussian elimination
LU decomposition
Gauss-Jordan elimination
Singular Matrixο
A square matrix that is not invertible is called singular or degenerate. A square matrix is singular if and only if its determinant is zero
Normο
Norm is function which measure the size of vector.
Norms are non-negative values. If you think of the norms as a length, you easily see why it canβt be negative.
Norms are 0 if and only if the vector is a zero vector
The triangle inequality** \(u+v \leq u+v\)
The norm is what is generally used to evaluate the error of a model.
Example
The p-norm (also called \(\ell_p\)) of vector x. Let p β₯ 1 be a real number.
L1 norm, Where p = 1 \(\| x \|_1 = \sum_{i=1}^n |x_i|\)
L2 norm and euclidean norm, Where p = 2 \(\left\| x \right\|_2 = \sqrt{\sum_{i=1}^n x_i^2}\)
L-max norm, Where p = infinity
Frobenius normο
Sometimes we may also wish to measure the size of a matrix. In the context of deep learning, the most common way to do this is with the Frobenius norm.
The Frobenius norm is the square root of the sum of the squares of all the elements of a matrix.
The squared Euclidean normο
The squared L^2 norm is convenient because it removes the square root and we end up with the simple sum of every squared values of the vector.
The squared Euclidean norm is widely used in machine learning partly because it can be calculated with the vector operation \(x^Tx\). There can be performance gain due to the optimization
The Trace Operatorο
The sum of the elements along the main diagonal of a square matrix.
Satisfies the following properties:
Transposeο
Satisfies the following properties:
Diagonal matrixο
A matrix where \(A_{ij} = 0\) if \(i \neq j\).
Can be written as \(\text{diag}(a)\) where \(a\) is a vector of values specifying the diagonal entries.
Diagonal matrices have the following properties:
Example
The eigenvalues of a diagonal matrix are the set of its values on the diagonal.
Symmetric matrixο
A square matrix \(A\) where \(A = A^T\).
Some properties of symmetric matrices are:
All the eigenvalues of the matrix are real.
Unit Vectorο
A unit vector has unit Euclidean norm.
Orthogonal Matrix or Orthonormal Vectorsο
Orthogonal Vectorsο
Two vector x and y are orthogonal if they are perpendicular to each other or dot product is equal to zero.
Orthonormal Vectorsο
when the norm of orthogonal vectors is the unit norm they are called orthonormal.
Orthonormal Matrixο
Orthogonal matrices are important because they have interesting properties. A matrix is orthogonal if columns are mutually orthogonal and have a unit norm (orthonormal) and rows are mutually orthonormal and have unit norm.
An orthogonal matrix is a square matrix whose columns and rows are orthonormal vectors.
where AT is the transpose of A and I is the identity matrix. This leads to the equivalent characterization: matrix A is orthogonal if its transpose is equal to its inverse.
so orthogonal matrices are of interest because their inverse is very cheap to compute.
Property 1
A orthogonal matrix has this property: \(A^T A = I\).
that the norm of the vector \(\begin{bmatrix} a & c \end{bmatrix}\) is equal to \(a^2+c^2\) (squared L^2). In addtion, we saw that the rows of A have a unit norm because A is orthogonal. This means that \(a^2+c^2=1\) and \(b^2+d^2=1\).
Property 2
We can show that if \(A^TA=I\) then \(A^T=A^{-1}\)
You can refer to [this question](https://math.stackexchange.com/questions/1936020/why-is-the-inverse-of-an-orthogonal-matrix-equal-to-its-transpose).
Sine and cosine are convenient to create orthogonal matrices. Letβs take the following matrix:
Eigendecompositionο
The eigendecomposition is one form of matrix decomposition (only square matrices). Decomposing a matrix means that we want to find a product of matrices that is equal to the initial matrix. In the case of the eigendecomposition, we decompose the initial matrix into the product of its eigenvectors and eigenvalues.
Eigenvectors and eigenvaluesο
Now imagine that the transformation of the initial vector gives us a new vector that has the exact same direction. The scale can be different but the direction is the same. Applying the matrix didnβt change the direction of the vector. This special vector is called an eigenvector of the matrix. We will see that finding the eigenvectors of a matrix can be very useful. Imagine that the transformation of the initial vector by the matrix gives a new vector with the exact same direction. This vector is called an eigenvector of π΄ . This means that π£ is a eigenvector of π΄ if π£ and π΄π£ are in the same direction or to rephrase it if the vectors π΄π£ and π£ are parallel. The output vector is just a scaled version of the input vector. This scalling factor is π which is called the eigenvalue of π΄ .
which means that v is well an eigenvector of A. Also, the corresponding eigenvalue is lambda=6.
Another eigenvector of π΄ is
which means that v is an eigenvector of A. Also, the corresponding eigenvalue is lambda=2.
Rescaled vectors if v is an eigenvector of A, then any rescaled vector sv is also an eigenvector of A. The eigenvalue of the rescaled vector is the same.
We have well A X 3v = lambda v and the eigenvalue is still lambda = 2 .
Concatenating eigenvalues and eigenvectorsο
Now that we have an idea of what eigenvectors and eigenvalues are we can see how it can be used to decompose a matrix. All eigenvectors of a matrix π΄ can be concatenated in a matrix with each column corresponding to each eigenvector.
The first column [ 1 1 ] is the eigenvector of π΄ with lambda=6 and the second column [ 1 -3 ] with lambda=2.
The vector \(\lambda\) can be created from all eigenvalues:
Then the eigendecomposition is given by
Converting eigenvalues and eigenvectors to a matrix A.
Real symmetric matrixο
In the case of real symmetric matrices, the eigendecomposition can be expressed as
where \(Q\) is the matrix with eigenvectors as columns and \(\Lambda\) is \(diag(\lambda)\).
This matrix is symmetric because \(A=A^T\). Its eigenvectors are:
Singular Value Decompositionο
The eigendecomposition can be done only for square matrices. The way to go to decompose other types of matrices that canβt be decomposed with eigendecomposition is to use Singular Value Decomposition (SVD).
SVD decompose π΄ into 3 matrices.
\(A = U D V^T\)
U,D,V where U is a matrix with eigenvectors as columns and D is a diagonal matrix with eigenvalues on the diagonal and V is the transpose of U.
The matrices U,D,V have the following properties:
U and V are orthogonal matrices U^T=U^{-1} and V^T=V^{-1}
D is a diagonal matrix However D is not necessarily square.
The columns of U are called the left-singular vectors of A while the columns of V are the right-singular vectors of A.The values along the diagonal of D are the singular values of A.
Intuitionο
I think that the intuition behind the singular value decomposition needs some explanations about the idea of matrix transformation. For that reason, here are several examples showing how the space can be transformed by 2D square matrices. Hopefully, this will lead to a better understanding of this statement: π΄ is a matrix that can be seen as a linear transformation. This transformation can be decomposed in three sub-transformations: 1. rotation, 2. re-scaling, 3. rotation. These three steps correspond to the three matrices π , π· , and π.
SVD and eigendecompositionο
Now that we understand the kind of decomposition done with the SVD, we want to know how the sub-transformations are found. The matrices π , π· and π can be found by transforming π΄ in a square matrix and by computing the eigenvectors of this square matrix. The square matrix can be obtain by multiplying the matrix π΄ by its transpose in one way or the other:
π corresponds to the eigenvectors of π΄π΄^T π corresponds to the eigenvectors of π΄^Tπ΄ π· corresponds to the eigenvalues π΄π΄^T or π΄^Tπ΄ which are the same.
The Moore-Penrose Pseudoinverseο
We saw that not all matrices have an inverse because the inverse is used to solve system of equations. The Moore-Penrose pseudoinverse is a direct application of the SVD. the inverse of a matrix A can be used to solve the equation Ax=b.
But in the case where the set of equations have 0 or many solutions the inverse cannot be found and the equation cannot be solved. The pseudoinverse is \(A^+\) where \(A^+\) is the pseudoinverse of \(A\).
The following formula can be used to find the pseudoinverse:
Principal Components Analysis (PCA)ο
The aim of principal components analysis (PCA) is generaly to reduce the number of dimensions of a dataset where dimensions are not completely decorelated.
Describing the problemο
The problem can be expressed as finding a function that converts a set of data points from βπ to βπ . This means that we change the number of dimensions of our dataset. We also need a function that can decode back from the transformed dataset to the initial one.

The first step is to understand the shape of the data. π₯(π) is one data point containing π dimensions. Letβs have π data points organized as column vectors (one column per point):
\(x=\begin{bmatrix} x^{(1)} x^{(2)} \cdots x^{(m)} \end{bmatrix}\)
If we deploy the n dimensions of our data points we will have:
We can also write:
c will have the shape:
Adding some constraints: the decoding functionο
The encoding function f(x) transforms x into c and the decoding function transforms back c into an approximation of x. To keep things simple, PCA will respect some constraints:
Constraint 1
The decoding function has to be a simple matrix multiplication:
$$g(c)=Dc$$
By applying the matrix D to the dataset from the new coordinates system we should get back to the initial coordinate system.
Constraint 2
The columns of D must be orthogonal.
Constraint 3
The columns of D must have unit norm.
Finding the encoding functionο
For now we will consider only one data point. Thus we will have the following dimensions for these matrices (note that x and c are column vectors)

We want a decoding function which is a simple matrix multiplication. For that reason, we have g(c)=Dc.
We will then find the encoding function from the decoding function. We want to minimize the error between the decoded data point and the actual data point.
With our previous notation, this means reducing the distance between x and g(c). As an indicator of this distance, we will use the squared L^2 norm.
This is what we want to minimize. Letβs call \(c^*\) the optimal c. Mathematically it can be written:
$$ c^* = argmin ||x-g(c)||_2^2 $$
This means that we want to find the values of the vector c such that \(||x-g(c)||_2^2\) is as small as possible.
the squared \(L^2\) norm can be expressed as:
$$ ||y||_2^2 = y^Ty $$
We have named the variable y to avoid confusion with our x. Here \(y=x - g(c)\) Thus the equation that we want to minimize becomes:
$$ (x - g(c))^T(x - g(c)) $$
Since the transpose respects addition we have:
$$ (x^T - g(c)^T)(x - g(c)) $$
By the distributive property we can develop:
$$ x^Tx - x^Tg(c) - g(c)^Tx + g(c)^Tg(c) $$
The commutative property tells us that \(x^Ty = y^Tx\). We can use that in the previous equation: we have \(g(c)^Tx = x^Tg(c)\). So the equation becomes:
$$ x^Tx -x^Tg(c) -x^Tg(c) + g(c)^Tg(c) = x^Tx -2x^Tg(c) + g(c)^Tg(c) $$
The first term \(x^Tx\) does not depends on c and since we want to minimize the function according to c we can just get off this term. We simplify to:
$$ c^* = arg min -2x^T g(c) + g(c)^T g(c) $$
Since \(g(c)=Dc\):
$$ c^* = arg min -2x^T Dc + (Dc)^T Dc $$
With \((Dc)^T = c^T D^T\), we have:
$$ c^* = arg min -2x^T Dc + c^T D^T Dc $$
As we knew, \(D^T D= I_l\) because D is orthogonal. and their columns have unit norm. We can replace in the equation:
$$ c^* = arg min -2x^T Dc + c^T I_l c $$
$$ c^* = arg min -2x^T Dc + c^T c $$
Minimizing the functionο
Now the goal is to find the minimum of the function β2π₯Tπ·π+πTπ. One widely used way of doing that is to use the gradient descent algorithm. The main idea is that the sign of the derivative of the function at a specific value of π₯ tells you if you need to increase or decrease π₯ to reach the minimum. When the slope is near 0 , the minimum should have been reached.
Its mathematical notation is βπ₯π(π₯).
Here we want to minimize through each dimension of c. We are looking for a slope of 0.
Statisticsο
Mean, Variance and Standard Deviationο
Meanο
The mean of a vector, usually denoted as \(\mu\) , is the mean of its elements, that is to say the sum of the components divided by the number of components
Varianceο
The variance is the mean of the squared differences to the mean.
with \(var(x)\) being the variance of the variable \(x\), \(n\) the number of data samples, \(x_i\) the ith data sample and \(\bar{x}\) the mean of \(x\).
Standard Deviationο
The standard deviation is simply the square root of the variance. It is usually denoted as \(\sigma\):
We square root the variance to go back to the units of the observations.
Both the variance and the standard deviation are dispersion indicators: they tell you if the observations are clustered around the mean.
Note also that the variance and the standard deviation are always positive (it is like a distance, measuring how far away the data points are from the mean):
Covariance and Correlationο
Correlationο
The correlation, usually refering to the Pearsonβs correlation coefficient, is a normalized version of the covariance. It is scaled between -1 and 1