Intro
What is the role of machine learning? We have a dataset and want to know the underlying probability distribution.
But the amount of data is limited, so it is impossible to recover the exact distribution. Instead, we build a machine learning model determined by parameters, and we adjust the values of those parameters. The aim of machine learning is to make the model's distribution as close as possible to the real probability distribution.
Contents
- aim of machine learning
- model parameter as a probability distribution
- relation between posterior, prior, likelihood
- likelihood and ML
- MLE: maximum likelihood estimation
- MLE solution
- MAP: Maximum A Posteriori estimation
- MLE vs MAP
Model parameters as a probability distribution
Let's consider a linear function:
$y = f(x) = ax+b, \quad a, b \in \mathbb R$
For any point $(a, b) \in \mathbb R^2$, the function $y = ax + b$ is uniquely determined. In other words, each point in $\mathbb R^2$ corresponds to an element of the function space of linear functions.
We call this $\mathbb R^2$ the parameter space.
# Drawing y = ax + b (-10 ~ 10)
# to search: gca, tight_layout, flatten why
import numpy as np
import matplotlib.pyplot as plt

parameter_points = []
fig1, axes1 = plt.subplots(2, 5, figsize=(10, 4))
for ax in axes1.flatten():
    # np.random.uniform: draws numbers uniformly at random from the given interval
    a, b = np.random.uniform(-10, 10, size=2)
    a = round(a, 3)
    b = round(b, 3)
    parameter_points.append((a, b))
    x = np.linspace(-10, 10, 50)
    y = a*x + b
    ax.plot(x, y)
    ax.set_title('y='+str(a)+'x'+'{0:+.03f}'.format(b))
    ax.set_xlim(-10, 10)
    ax.set_ylim(-10, 10)
plt.tight_layout()

px, py = np.split(np.array(parameter_points), 2, axis=1)
fig2 = plt.figure()
axes2 = plt.gca()
axes2.set_title('samples from parameter space')
axes2.set_xlim(-10, 10)
axes2.set_ylim(-10, 10)
plt.scatter(px, py)
plt.show()
# Drawing y = ax + b (mean=(1, 0), standard deviation=0.5)
parameter_points = []
fig, axes1 = plt.subplots(2, 5, figsize=(10, 4))
for ax in axes1.flatten():
    # np.random.normal: returns random values drawn from a normal distribution
    a, b = np.random.normal(loc=[1, 0], scale=0.5)
    a = round(a, 3)
    b = round(b, 3)
    parameter_points.append((a, b))
    x = np.linspace(-10, 10, 50)
    y = a*x + b
    ax.plot(x, y)
    ax.set_title('y='+str(a)+'x'+'{0:+.03f}'.format(b))
    ax.set_xlim(-10, 10)
    ax.set_ylim(-10, 10)
plt.tight_layout()

px, py = np.split(np.array(parameter_points), 2, axis=1)
fig2 = plt.figure()
axes2 = plt.gca()
axes2.set_title('samples from parameter space')
axes2.set_xlim(-10, 10)
axes2.set_ylim(-10, 10)
plt.scatter(px, py)
plt.show()
Posterior, Prior, Likelihood
A Bayesian machine learning model learns a probability distribution from data. The key idea of Bayesian machine learning is to treat the model parameters not as fixed values but as random variables that carry uncertainty.
Suppose we have a dataset $X$ with underlying probability distribution $p(x)$. Our aim is to find the linear model $y = \theta^T x$ that best represents $p(x)$.
- Prior probability: in Bayesian statistical inference, the probability of an event before new data is collected.
- Likelihood: probability describes possible outcomes given fixed parameters; likelihood describes how plausible different parameter values are given an observed outcome.
  - It is the probability of the data once the parameter values are fixed.
  - High likelihood = high probability of observing our data under the given parameters.
- MLE: train the model to maximize the likelihood of the data.
- MAP: train the model to maximize the posterior of the parameters given the data.
Relation between Posterior, Prior, Likelihood
By the multiplication theorem (the product rule), the joint probability of the data $X$ and the parameter $\theta$ can be written in two ways:
$p(X, \theta) = p(\theta \mid X)\,p(X) = p(X \mid \theta)\,p(\theta)$
Dividing both sides by $p(X)$ gives Bayes' theorem:
$p(\theta \mid X) = \dfrac{p(X \mid \theta)\,p(\theta)}{p(X)}$
that is, posterior = likelihood × prior / evidence.
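As a sanity check, Bayes' theorem can be verified numerically on a toy discrete example (the numbers below are made up purely for illustration):

```python
import numpy as np

# Toy setting: two parameter hypotheses, three possible data values.
prior = np.array([0.6, 0.4])                  # p(theta)
likelihood = np.array([[0.7, 0.2, 0.1],       # p(x | theta = 0)
                       [0.1, 0.3, 0.6]])      # p(x | theta = 1)

x = 2                                         # observed data value
evidence = np.sum(likelihood[:, x] * prior)   # p(x) = sum over theta of p(x|theta) p(theta)
posterior = likelihood[:, x] * prior / evidence

print(posterior)  # p(theta | x); observing x = 2 favors theta = 1
```

Note how the posterior sums to 1 by construction: dividing by the evidence is exactly what normalizes likelihood × prior into a distribution.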
Likelihood & machine learning
We model the noise as the label minus the prediction: $e_n = y_n - \theta^T x_n$. If $X$ is the set of input data and $Y$ the set of labels, the likelihood is $p(Y \mid \theta, X)$.
Now consider the likelihood of a single data point, $p(y_n \mid \theta, x_n)$. What distribution should the output follow?
If we assume the noise follows a normal distribution with mean 0 and standard deviation $\sigma$, then $y_n$ is normally distributed with mean $\theta^T x_n$ and standard deviation $\sigma$.
Likelihood
import math
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(321)
input_data = np.linspace(-2, 2, 5)
label = input_data + 1 + np.random.normal(0, 1, size=5)
plt.scatter(input_data, label)
plt.show()
# model: y = ax + b
# try changing the values of a and b and re-running
#-------------------------------#
a = 1
b = 1
#-------------------------------#
# model predictions
model_output = a*input_data + b
likelihood = []
# x: input data, y: data label
# the squared difference between label and prediction goes into exp
# (Gaussian pdf with mean = prediction, standard deviation = 0.1)
for x, y, output in zip(input_data, label, model_output):
    likelihood.append(1/(math.sqrt(2*math.pi*0.1*0.1))*math.exp(-pow(y-output, 2)/(2*0.1*0.1)))

model_x = np.linspace(-2, 2, 50)
model_y = a*model_x + b
fig, ax = plt.subplots()
ax.scatter(input_data, label)
ax.plot(model_x, model_y)
for i, text in enumerate(likelihood):
    ax.annotate('%.3e' % text, (input_data[i], label[i]))
plt.show()
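The hand-rolled Gaussian density above can be cross-checked against a library implementation; a minimal sketch, assuming SciPy is available (the label, prediction, and noise level are made-up values):

```python
import math
from scipy.stats import norm

y, output, sigma = 1.3, 1.0, 0.1  # label, prediction, assumed noise std
manual = 1/math.sqrt(2*math.pi*sigma**2) * math.exp(-(y - output)**2/(2*sigma**2))
library = norm.pdf(y, loc=output, scale=sigma)
print(abs(manual - library) < 1e-12)  # the two densities agree
```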
Why is likelihood important?
As the example shows, the further a data point is from the model function, the lower its likelihood, and the drop-off is exponential.
So MLE (maximum likelihood estimation) tries to find the parameters that maximize the likelihood of the data.
MLE
Likelihood of the whole dataset
A good machine learning model maximizes the likelihood not just of each data point but of the entire dataset. How do we compute the likelihood of the whole dataset?
First, assume the data points are i.i.d. (independent and identically distributed). Because the points are independent, the likelihood of the dataset is the product of the individual likelihoods:
$p(Y \mid \theta, X) = \prod_n p(y_n \mid \theta, x_n)$
Usually we maximize the log likelihood instead, because differentiation is much easier: the product turns into a sum. And since log is monotonically increasing, this does not change the result of training.
Sometimes we use the negative log likelihood, which we minimize instead.
Substituting the Gaussian likelihood, the negative log likelihood becomes
$-\log p(Y \mid \theta, X) = \dfrac{1}{2\sigma^2} \sum_n (y_n - \theta^T x_n)^2 + \text{const}$
This is exactly least squares! Finding the parameter that minimizes the negative log likelihood is the same as minimizing the sum of squared errors. This is the function $L(\theta)$ that we should minimize.
The minimizing $\theta$ is the solution of $L'(\theta) = 0$. Since $L(\theta)$ is quadratic in $\theta$, there is a single minimum.
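To illustrate the equivalence numerically, the sketch below scans candidate slopes and checks that the slope minimizing the sum of squared errors is also the slope minimizing the negative log likelihood (the data, the fixed intercept, and the noise level are made up for this example):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-2, 2, 20)
y = 1.0*x + 1.0 + rng.normal(0, 0.5, size=20)
sigma = 0.5

# Scan slope values, keeping the intercept fixed at 1 for simplicity.
slopes = np.linspace(0.0, 2.0, 201)
sse = np.array([np.sum((y - (a*x + 1.0))**2) for a in slopes])
nll = len(y)/2*np.log(2*np.pi*sigma**2) + sse/(2*sigma**2)

# The negative log likelihood is an increasing affine function of the SSE,
# so both criteria pick the same slope.
print(np.argmin(sse) == np.argmin(nll))  # True
```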
Maximum log likelihood
Setting the derivative $L'(\theta)$ to zero gives the optimal parameter in closed form:
$\theta^* = (X^T X)^{-1} X^T y$
where $X$ is the input matrix (with a constant column for the bias) and $y$ is the label vector.
MLE optimal solution
Make dataset
import math
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(0)
num_samples = 20
input_data = np.linspace(-2, 2, num_samples)
labels = input_data + 1 + np.random.normal(0, 0.5, size=num_samples)
plt.scatter(input_data, labels)
plt.show()
Likelihood and parameters
Let's assume the noise standard deviation is 0.1.
One thing to be careful about: each data point is not a scalar but a two-dimensional column vector (the input value plus a constant 1 for the bias term), so X is a 20 x 2 matrix.
def likelihood(labels, preds):
    result = 1/(np.sqrt(2*math.pi*0.1*0.1))*np.exp(-np.power(labels-preds, 2)/(2*0.1*0.1))
    return np.prod(result)

def neg_log_likelihood(labels, preds):
    const_term = len(labels)*math.log(1/math.sqrt(2*math.pi*0.1*0.1))
    return (-1)*(const_term + 1/(2*0.1*0.1)*np.sum(-np.power(labels-preds, 2)))
# searched: linalg, inv
# X: 20x2 matrix, y: 20x1 matrix
# reshape the input_data list into a column vector, then append a constant column with np.append
X = np.append(input_data.reshape((-1, 1)), np.ones((num_samples, 1)), axis=1)
y = labels
# closed-form MLE solution: theta = (X^T X)^{-1} X^T y
theta_1, theta_0 = np.dot(np.dot(np.linalg.inv(np.dot(X.T, X)), X.T), y)  # linalg = linear algebra, inv = inverse
print('slope: '+'%.4f'%theta_1+' bias: '+'%.4f'%theta_0)
predictions = theta_1 * input_data + theta_0
print('likelihood: '+'%.4e'%likelihood(labels, predictions))
print('negative log likelihood: '+'%.4e'%neg_log_likelihood(labels, predictions))
model_x = np.linspace(-2, 2, 50)
model_y = theta_1 * model_x + theta_0
plt.scatter(input_data, labels)
plt.plot(model_x, model_y)
plt.show()
slope: 0.8578 bias: 1.2847 likelihood: 2.9724e-54 negative log likelihood: 1.2325e+02
MAP: Maximum A Posteriori Estimation
Prior distribution
There is one thing we can see in this equation: the MLE solution depends only on the observed data. If the dataset contains many outliers, the model will be unstable.
MAP instead finds the parameter that maximizes the posterior $p(\theta \mid X)$ (in supervised learning, $p(\theta \mid X, Y)$).
The prior distribution $p(\theta)$ is the distribution of the parameter before any data is seen. Let's assume $p(\theta)$ is a Gaussian with mean $(0, 0)$ and covariance $\Sigma = \alpha^2 I$.
As in MLE, we find the parameter that minimizes the negative log posterior. The one difference from the MLE solution is that MAP's solution contains an extra $(\sigma^2/\alpha^2) I$ term:
$\theta^* = \left(X^T X + (\sigma^2/\alpha^2) I\right)^{-1} X^T y$
What does this term mean?
MAP as L2 Regularization
We already saw the first term when we looked at MLE. Now look at the last term, which comes from the negative log prior: it is proportional to $\theta^T \theta$, which is exactly the square of the L2 norm, $\|\theta\|^2$. If we write $\lambda = 1/2\alpha^2$, the objective is least squares plus $\lambda \|\theta\|^2$, i.e., L2 (ridge) regularization.
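As a quick numerical check (with made-up data, and the same $\sigma$ and $\alpha$ values used later in this post), the sketch below verifies that the MAP closed form is exactly the minimizer of the L2-regularized least-squares objective, by confirming that its gradient vanishes:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20
x = np.linspace(-2, 2, n)
y = x + 1 + rng.normal(0, 0.5, size=n)
X = np.column_stack([x, np.ones(n)])   # inputs plus a constant column for the bias

sigma, alpha = 0.1, 0.04
lam = sigma**2 / alpha**2

# MAP closed form: theta = (X^T X + lam I)^{-1} X^T y
theta_map = np.linalg.solve(X.T @ X + lam*np.eye(2), X.T @ y)

# Gradient of J(theta) = ||y - X theta||^2 / (2 sigma^2) + ||theta||^2 / (2 alpha^2)
grad = -(X.T @ (y - X @ theta_map))/sigma**2 + theta_map/alpha**2

print(np.allclose(grad, 0))  # the closed form zeroes the gradient
```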
MLE vs MAP
Making dataset
MAP is similar to MLE but adds the negative log prior term, which is why a MAP model is more stable than an MLE model. Let's compare them on a dataset that contains outliers.
import math
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(0)
num_samples = 10
input_data = np.linspace(-2, 2, num_samples)
labels = input_data + 1 + np.random.normal(0, 0.5, size=num_samples)
input_data = np.append(input_data, [0.5, 1.5])  # append outlier inputs to input_data
labels = np.append(labels, [9.0, 10.0])         # append outlier labels
plt.scatter(input_data, labels)
plt.show()
Calculate parameters
These are the optimal parameters for each objective.
- Assume: noise $\sigma = 0.1$, standard deviation of the parameter distribution $\alpha = 0.04$
The smaller $\alpha$ is, the stronger the constraint on the parameters.
def likelihood(labels, preds):
    result = 1/(np.sqrt(2*math.pi*0.1*0.1))*np.exp(-np.power(labels-preds, 2)/(2*0.1*0.1))
    return np.prod(result)

def neg_log_likelihood(labels, preds):
    const_term = len(labels)*math.log(1/math.sqrt(2*math.pi*0.1*0.1))
    return (-1)*(const_term + 1/(2*0.1*0.1)*np.sum(-np.power(labels-preds, 2)))
# X: 21x2 matrix, y: 21x1 matrix
# reshape the input_data list into a column vector, then append a constant column with np.append
X = np.append(input_data.reshape((-1, 1)), np.ones((num_samples+2, 1)), axis=1)
y = labels
# MLE parameter formula: theta = (X^T X)^{-1} X^T y
mle_theta_1, mle_theta_0 = np.dot(np.dot(np.linalg.inv(np.dot(X.T, X)), X.T), y)
# MAP parameter formula: theta = (X^T X + (sigma^2/alpha^2) I)^{-1} X^T y
map_theta_1, map_theta_0 = np.dot(np.dot(np.linalg.inv(np.dot(X.T, X)+(0.1*0.1)/(0.04*0.04)*np.eye(2)), X.T), y)
print('[MLE result] (blue)')
print('slope: '+'%.4f'%mle_theta_1+' bias: '+'%.4f'%mle_theta_0)
mle_preds = mle_theta_1 * input_data + mle_theta_0
print('likelihood: '+'%.4e'%likelihood(labels, mle_preds))
print('negative log likelihood: '+'%.4e\n'%neg_log_likelihood(labels, mle_preds))
print('[MAP result] (orange)')
print('slope: '+'%.4f'%map_theta_1+' bias: '+'%.4f'%map_theta_0)
map_preds = map_theta_1 * input_data + map_theta_0
print('likelihood: '+'%.4e'%likelihood(labels, map_preds))
print('negative log likelihood: '+'%.4e'%neg_log_likelihood(labels, map_preds))
model_x = np.linspace(-2, 2, 50)
mle_model_y = mle_theta_1 * model_x + mle_theta_0
map_model_y = map_theta_1 * model_x + map_theta_0
plt.scatter(input_data, labels)
plt.plot(model_x, mle_model_y)
plt.plot(model_x, map_model_y)
plt.show()
[MLE result] (blue) slope: 1.4748 bias: 2.4784 likelihood: 0.0000e+00 negative log likelihood: 4.1298e+03 [MAP result] (orange) slope: 1.1719 bias: 1.6628 likelihood: 0.0000e+00 negative log likelihood: 4.6645e+03
Compare the negative log likelihoods of MLE and MAP. MAP's value is bigger than MLE's (that is, MAP's likelihood is smaller), but when the outliers are appended, MAP's solution changes much less than MLE's!