Deep neural networks trained on large datasets often achieve remarkable accuracy. But sometimes their predictions are not accurate, whether due to limited training data, poor generalization, or noise in the data. In such cases, and especially for models whose predictions will have a long-term impact on decision making, representing uncertainty is important.
Understanding what a model does not know is a critical part of many machine learning systems. Unfortunately, today’s deep learning algorithms are usually unable to understand their uncertainty. These models are often taken blindly and assumed to be accurate, which is not always the case.
For example: an image classification system erroneously identified two African American people as gorillas, raising concerns of racial discrimination. Read the report here.
BDL (Bayesian Deep Learning)
Most of the material in this post has been taken from the two papers below by Alex Kendall and Yarin Gal.
In this post we will focus mainly on the latter kind of uncertainty, i.e., the uncertainty generated by the model itself.
Epistemic Uncertainty is caused when the model ignores certain effects or when a particular part of the data is hidden from it. It arises mostly from low variation in the training samples.
Useful in: safety-critical applications, and small datasets where the training data is sparse.
Aleatoric Uncertainty captures the uncertainty with respect to information which our data cannot explain. For example, aleatoric uncertainty in images can be attributed to occlusions (because cameras can’t see through objects).
Useful in: large-data situations, where epistemic uncertainty is mostly explained away, and real-time applications, where repeated Monte Carlo sampling would be too expensive.
Illustrating the difference between aleatoric and epistemic uncertainty for semantic segmentation. You can notice that aleatoric uncertainty captures object boundaries where labels are noisy. The bottom row shows a failure case of the segmentation model, when the model is unfamiliar with the footpath, and the corresponding increased epistemic uncertainty.
Bayesian deep learning is a field at the intersection between deep learning and Bayesian probability theory. Bayesian deep learning models typically form uncertainty estimates by either placing distributions over model weights, or by learning a direct mapping to probabilistic outputs.
Heteroscedastic uncertainty model

In this model we replace the Euclidean loss $\text{Loss} = \|y - \hat{y}\|^2$ with

$$\text{Loss} = \frac{\|y - \hat{y}\|^2}{2\sigma^2} + \frac{1}{2}\log\sigma^2$$
The model predicts a mean $\hat{y}$ and a variance $\sigma^2$. As you can see from this equation, if the model predicts something very wrong, it will be encouraged to attenuate the residual term by increasing the uncertainty $\sigma^2$. However, the $\log\sigma^2$ term prevents the uncertainty from growing infinitely large. This can be thought of as learned loss attenuation.
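As a rough sketch (not code from the original post), this loss can be written as a custom Keras loss in which the network's final layer outputs both the mean and the log-variance $s = \log\sigma^2$; predicting the log-variance keeps training numerically stable:

```python
import tensorflow as tf

def heteroscedastic_loss(y_true, y_pred):
    # Assumed layout: y_pred packs [mean, log_variance] along the last axis
    mean, log_var = y_pred[..., :1], y_pred[..., 1:]
    precision = tf.exp(-log_var)  # 1 / sigma^2
    # ||y - y_hat||^2 / (2 * sigma^2) + 0.5 * log(sigma^2)
    return tf.reduce_mean(0.5 * precision * tf.square(y_true - mean) + 0.5 * log_var)
```

The model can then be compiled as usual, e.g. `model.compile(loss=heteroscedastic_loss, optimizer='adam')`, with a two-unit output layer.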
Epistemic uncertainty is much harder to model. This requires us to model distributions over models and their parameters which is much harder to achieve at scale. A popular technique to model this is Monte Carlo dropout sampling which places a Bernoulli distribution over the network’s weights.
Next, an example of dealing with model uncertainty in an ensemble model for SPY 500 prediction, using the Keras backend. Find the full notebook here: MLSPY500.nb
```python
import numpy as np
import scipy.stats as st
from keras import backend as K
from sklearn.metrics import mean_squared_error


def model_uncertainity2(model, x_test, y_test, B, confidence):
    """Estimate an error margin for a model's predictions via Monte Carlo dropout.

    :param model: trained Keras model (e.g. LSTM or GRU) containing dropout layers
    :param x_test: array, test X sample
    :param y_test: array, test y sample
    :param B: int, number of Monte Carlo forward passes
    :param confidence: int, confidence level in percent (e.g. 75)
    """
    # Keras function that runs a forward pass with the learning phase exposed
    MC_output = K.function([model.layers[0].input, K.learning_phase()],
                           [model.layers[-1].output])
    learning_phase = True  # keep dropout active at test time
    MC_samples = np.array([MC_output([x_test, learning_phase])[0] for _ in range(B)])

    # Spread of the stochastic passes: model misspecification and model uncertainty
    eta1 = np.std(MC_samples)
    # Root-mean-squared residual on the test set: inherent noise
    eta2 = np.sqrt(mean_squared_error(y_test, model.predict(x_test)))

    model_uncer = np.sqrt(eta1 ** 2 + eta2 ** 2)  # combine both sources of error
    Merror = st.norm.ppf((1 + confidence / 100) / 2) * model_uncer
    return Merror
```
Getting uncertainty in any model.
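A minimal usage sketch (the names `lstm_model`, `X_test` and `y_test` below are placeholders for whatever your own pipeline produces, not objects from the notebook): the returned margin is wrapped around the point predictions to form an interval.

```python
# Hypothetical usage: wrap the margin around the point predictions
Merror = model_uncertainity2(lstm_model, X_test, y_test, B=100, confidence=75)

point_pred = lstm_model.predict(X_test)
lower, upper = point_pred - Merror, point_pred + Merror  # 75% uncertainty band
```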
Theory:
$$p(y_i \mid x_i, X_{\text{train}}, Y_{\text{train}}) = \int p(y_i \mid x_i, \omega)\, p(\omega \mid X_{\text{train}}, Y_{\text{train}})\, d\omega \approx \int p(y_i \mid x_i, \omega)\, q_\theta(\omega)\, d\omega =: q_\theta(y_i \mid x_i)$$
We have that $y_i$ is a draw from an approximation to the predictive distribution.
This process is equivalent to drawing a new function for each test point, which results in extremely erratic depictions that have peaks at different locations.
Drawing a new function for each test point makes no difference if all we care about is obtaining the predictive mean and predictive variance, but this process does not result in draws from the induced distribution over functions.
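As a sketch of how this is used in practice (these are the standard Monte Carlo dropout estimators; the symbols $T$, $\hat{\omega}_t$, $\hat{y}_t$ and $\sigma^2$ below are my notation, not from the original post), the predictive distribution is approximated by averaging $T$ stochastic forward passes, and its first two moments are what the code above estimates:

$$q_\theta(y_i \mid x_i) \approx \frac{1}{T}\sum_{t=1}^{T} p(y_i \mid x_i, \hat{\omega}_t), \qquad \hat{\omega}_t \sim q_\theta(\omega)$$

$$\widehat{\mathbb{E}}[y_i] \approx \frac{1}{T}\sum_{t=1}^{T}\hat{y}_t, \qquad \widehat{\operatorname{Var}}[y_i] \approx \underbrace{\frac{1}{T}\sum_{t=1}^{T}\hat{y}_t^{2} - \Big(\frac{1}{T}\sum_{t=1}^{T}\hat{y}_t\Big)^{2}}_{\text{model uncertainty}} \;+\; \underbrace{\sigma^{2}}_{\text{inherent noise}}$$

where $\hat{y}_t$ is the prediction from the $t$-th dropout pass.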
The result is shown below.
Here we use a confidence level of 75% as input, and the output follows.
To get uncertainty intervals you can either assume a Gaussian predictive distribution and scale the estimated standard deviation by a z-score (as the code above does with `st.norm.ppf`), or take empirical percentiles of the Monte Carlo samples directly.
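For the second option, a hypothetical sketch (assuming you keep the raw `MC_samples` array from the dropout passes rather than only the summary statistic):

```python
import numpy as np

# MC_samples: shape (B, n_test, ...) collected from the stochastic forward passes
alpha = 100 - 75                                            # for a 75% interval
lower = np.percentile(MC_samples, alpha / 2, axis=0)        # 12.5th percentile
upper = np.percentile(MC_samples, 100 - alpha / 2, axis=0)  # 87.5th percentile
```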
References: