TensorFlow: Train a Network, Modify It, and Train It Again

In this tutorial, you will learn how to use Keras to train a neural network, stop training, update your learning rate, and then resume training from where you left off using the new learning rate. Using this method you can increase your accuracy while decreasing model loss.

Today's tutorial is inspired by a question I received from PyImageSearch reader Zhang Min.

Zhang Min writes:

Hi Adrian, thanks for the PyImageSearch blog. I have two questions:

First, I am working on my graduation project and my university is allowing me to share time on their GPU machines. The problem is that I can only access a GPU machine in two-hour increments — after my two hours is up, I'm automatically booted off the GPU. How can I save my training progress, safely stop training, and then resume training from where I left off?

Secondly, my initial experiments aren't going very well. My model quickly jumps to 80%+ accuracy but then stays there for another 50 epochs. What else can I be doing to improve my model accuracy? My advisor said I should look into adjusting the learning rate but I'm not really sure how to do that.

Thanks Adrian!

Learning how to start, stop, and resume training a deep learning model is a super important skill to master — at some point in your deep learning practitioner career you'll encounter a situation similar to Zhang Min's where:

  • You have limited time on a GPU instance (which can happen on Google Colab or when using Amazon EC2's cheaper spot instances).
  • Your SSH connection is broken and you forgot to use a terminal multiplexer to save your session (such as screen or tmux).
  • Your deep learning rig locks up and forcibly shuts down.

Just imagine spending an entire week training a state-of-the-art deep neural network…only to have your model lost due to a power failure!

Luckily, there's a solution — but when those situations happen you need to know how to:

  1. Have a snapshotted model that was saved/serialized to disk during training.
  2. Load the model into memory.
  3. Resume training from where you left off (a minimal sketch of all three steps follows below).
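
Here is a minimal, self-contained sketch of that save/load/resume cycle using plain Keras calls. The toy model, the random data, and the snapshot.h5 filename are placeholders for illustration only; the tutorial's actual script (covered below) trains ResNet on Fashion MNIST and handles the snapshots with custom callbacks.

import numpy as np
from tensorflow.keras.models import Sequential, load_model
from tensorflow.keras.layers import Dense

# toy data and a toy model purely for illustration
X = np.random.rand(256, 8).astype("float32")
y = np.random.randint(0, 2, (256, 1))
model = Sequential([
	Dense(16, activation="relu", input_shape=(8,)),
	Dense(1, activation="sigmoid")])
model.compile(optimizer="sgd", loss="binary_crossentropy")

# 1. train for a while and snapshot the model to disk
model.fit(X, y, epochs=5, verbose=0)
model.save("snapshot.h5")

# 2. later (e.g. after a crash or a forced logout), load the snapshot
model = load_model("snapshot.h5")

# 3. resume training from where you left off (5 more epochs here)
model.fit(X, y, epochs=10, initial_epoch=5, verbose=0)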

Secondly, starting, stopping, and resuming training is standard practice when manually adjusting the learning rate:

  1. Start training your model until loss/accuracy plateau
  2. Snapshot your model every N epochs (typically N={1, 5, 10})
  3. Stop training, normally by force exiting via ctrl + c
  4. Open your code editor and adjust your learning rate (typically lowering it by an order of magnitude)
  5. Go back to your terminal and restart the training script, picking up from the last snapshot of model weights

Using this ctrl + c method of training you can boost your model accuracy while simultaneously driving down loss, leading to a more accurate model.

The ability to adjust the learning rate is a critical skill for any deep learning practitioner to master, so take the time now to study and practice it!

To learn how to start, stop, and resume training with Keras, just keep reading!


Keras: Starting, stopping, and resuming training

2020-06-05 Update: This blog post is now TensorFlow 2+ compatible!

In the first part of this blog post, we'll discuss why we would want to start, stop, and resume training of a deep learning model.

We'll also discuss how stopping training to lower your learning rate can improve your model accuracy (and why a learning rate schedule/decay may not be sufficient).

From there we'll implement a Python script to handle starting, stopping, and resuming training with Keras.

I'll then walk you through the entire training process, including:

  1. Starting the initial training script
  2. Monitoring loss/accuracy
  3. Noticing when loss/accuracy is plateauing
  4. Stopping training
  5. Lowering your learning rate
  6. Resuming training from where you left off with the new, lowered learning rate

Using this method of training you'll often be able to improve your model accuracy.

Let's go ahead and get started!

Why do we need to start, stop, and resume training?

There are a number of reasons you may need to start, stop, and resume training of your deep learning model, but the two primary reasons include:

  1. Your training session being terminated and training stopping (due to a power outage, GPU session timing out, etc.).
  2. Needing to adjust your learning rate to improve model accuracy (typically by lowering the learning rate by an order of magnitude).

The second point is particularly important — if you go back and read the seminal AlexNet, SqueezeNet, ResNet, etc. papers you'll find that the authors all say something along the lines of:

We started training our model with the SGD optimizer and an initial learning rate of 1e-1. We reduced our learning rate by an order of magnitude on epochs 30 and 50, respectively.

Why is the drop in learning rate so important? And how can it lead to a more accurate model?

To explore that question, take a look at the following plot of ResNet-18 trained on the CIFAR-10 dataset:

Figure 1: Training ResNet-18 on the CIFAR-10 dataset. The characteristic drops in loss and increases in accuracy are evidence of learning rate changes. Here, (1) training was stopped on epochs 30 and 50, (2) the learning rate was lowered, and (3) training was resumed. (image source)

Notice that for epochs 1-29 there is a fairly "standard" curve that you come across when training a network:

  1. Loss starts off very high but then quickly drops
  2. Accuracy starts off very low but then quickly rises
  3. Eventually loss and accuracy plateau out

But what is going on around epoch 30?

Why does the loss drop so dramatically? And why does the accuracy rise so considerably?

The reason for this behavior is that:

  1. Training was stopped
  2. The learning rate was lowered by an order of magnitude
  3. Then training was resumed

The same goes for epoch 50 as well — again, training was stopped, the learning rate lowered, and then training resumed.

Each time we see a characteristic drop in loss and then a small increase in accuracy.

As the learning rate becomes smaller, each learning rate reduction has less and less impact.

Eventually, we run into two issues:

  1. The learning rate becomes very small, which in turn makes the weight updates very small, and thus the model cannot make any meaningful progress.
  2. We begin to overfit due to the small learning rate. The model descends into areas of lower loss in the loss landscape, overfitting to the training data and not generalizing to the validation data.

The overfitting behavior is evident by epoch 50 in Figure 1 above.

Notice how validation loss has plateaued and has even started to rise a bit. At the same time training loss is continuing to drop, a clear sign of overfitting.

Dropping your learning rate is a great way to boost the accuracy of your model during training, but realize there is (1) a point of diminishing returns, and (2) a chance of overfitting if training is not properly monitored.

Why not use learning rate schedulers or decay?

Figure 2: Learning rate schedulers are great for some training applications; however, starting/stopping Keras training typically leads to more control over your deep learning model.

You might be wondering, "Why not use a learning rate scheduler?"

There are a number of learning rate schedulers available to us, including:

  • Linear and polynomial decay
  • Cyclical Learning Rates (CLRs)
  • Keras' ReduceLROnPlateau class (a brief example follows below)
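
As a concrete example of the last option, here is a quick sketch of how Keras' ReduceLROnPlateau callback is typically configured. The monitor, patience, and min_lr values below are illustrative choices, not values used anywhere in this tutorial.

from tensorflow.keras.callbacks import ReduceLROnPlateau

# drop the learning rate by an order of magnitude (factor=0.1) whenever
# validation loss has not improved for 5 consecutive epochs
reduce_lr = ReduceLROnPlateau(monitor="val_loss", factor=0.1,
	patience=5, min_lr=1e-4, verbose=1)

# the callback is then passed to .fit alongside any others:
# model.fit(..., callbacks=[reduce_lr])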

If the goal is to improve model accuracy by dropping the learning rate, then why not just rely on those respective schedules and classes?

Great question.

The problem is that you may not have a good idea of:

  • The approximate number of epochs to train for
  • What a proper initial learning rate is
  • What learning rate range to use for CLRs

Additionally, one of the benefits of using what I call ctrl + c training is that it gives you more fine-grained control over your model.

Being able to manually stop your training at a specific epoch, adjust your learning rate, and then resume training from where you left off (and with the new learning rate) is something most learning rate schedulers will not allow you to do.

Once you've run a few experiments with ctrl + c training you'll have a good idea of what your hyperparameters should be — when that happens, you can then start incorporating hardcoded learning rate schedules to boost your accuracy even further.

Finally, keep in mind that nearly all seminal CNN papers that were trained on ImageNet used a method to start/stop/resume training.

Just because other methods exist doesn't make them inherently better — as a deep learning practitioner, you need to learn how to use ctrl + c training along with learning rate scheduling (don't rely strictly on the latter).

If you're interested in learning more about ctrl + c training, along with my tips, suggestions, and best practices when training your own models, be sure to refer to my book, Deep Learning for Computer Vision with Python.

Configuring your development environment

To configure your system for this tutorial, I first recommend following either of these tutorials:

  • How to install TensorFlow 2.0 on Ubuntu
  • How to install TensorFlow 2.0 on macOS

Either tutorial will help you configure your system with all the necessary software for this blog post in a convenient Python virtual environment.

Please note that PyImageSearch does not recommend or support Windows for CV/DL projects.

Project structure

Let's review our project structure:

$ tree --dirsfirst
.
├── output
│   ├── checkpoints
│   └── resnet_fashion_mnist.png
├── pyimagesearch
│   ├── callbacks
│   │   ├── __init__.py
│   │   ├── epochcheckpoint.py
│   │   └── trainingmonitor.py
│   ├── nn
│   │   ├── __init__.py
│   │   └── resnet.py
│   └── __init__.py
└── train.py

5 directories, 8 files

Today we will review train.py, our training script. This script trains ResNet on the Fashion MNIST dataset.

The key to this training script is that it uses two "callbacks", epochcheckpoint.py and trainingmonitor.py. I review these callbacks in detail inside Deep Learning for Computer Vision with Python — they aren't covered today, but I encourage you to review the code.

These two callbacks allow us to (1) save our model at the end of every N-th epoch so we can resume training on demand, and (2) output our training plot at the conclusion of each epoch, ensuring we can easily monitor our model for signs of overfitting.

The models are checkpointed (i.e. saved) in the output/checkpoints/ directory.

2020-06-05 Update: There is no longer an accompanying JSON file in the output/ folder for this tutorial. For TensorFlow 2+, it is not necessary and it introduces an error.

The training plot is overwritten at the end of each epoch as resnet_fashion_mnist.png. We'll be paying close attention to the training plot to determine when to stop training.

Implementing the training script

Let's get started implementing our Python script that will be used for starting, stopping, and resuming training with Keras.

This guide is written for intermediate practitioners, even though it teaches an essential skill. If you are new to Keras or deep learning, or perhaps you just need to brush up on the basics, definitely check out my Keras Tutorial first.

Open up a new file, name it train.py, and insert the following code:

# set the matplotlib backend so figures can be saved in the background
import matplotlib
matplotlib.use("Agg")

# import the necessary packages
from pyimagesearch.callbacks.epochcheckpoint import EpochCheckpoint
from pyimagesearch.callbacks.trainingmonitor import TrainingMonitor
from pyimagesearch.nn.resnet import ResNet
from sklearn.preprocessing import LabelBinarizer
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.datasets import fashion_mnist
from tensorflow.keras.models import load_model
import tensorflow.keras.backend as K
import numpy as np
import argparse
import cv2
import sys
import os

Lines 2-19 import our required packages, namely our EpochCheckpoint and TrainingMonitor callbacks. We also import our fashion_mnist dataset and ResNet CNN. The tensorflow.keras.backend as K import will allow us to retrieve and set our learning rate.

Now let's go ahead and parse command line arguments:

# construct the argument parser and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-c", "--checkpoints", required=True,
	help="path to output checkpoint directory")
ap.add_argument("-m", "--model", type=str,
	help="path to *specific* model checkpoint to load")
ap.add_argument("-s", "--start-epoch", type=int, default=0,
	help="epoch to restart training at")
args = vars(ap.parse_args())

Our command line arguments include:

  • --checkpoints : The path to our output checkpoints directory.
  • --model : The optional path to a specific model checkpoint to load when resuming training.
  • --start-epoch : The optional starting epoch can be provided if you are resuming training; by default, training starts at epoch 0 (both invocation patterns are shown below).
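
For reference, these arguments produce the two invocation patterns used later in this post (one for starting fresh, one for resuming from a specific checkpoint):

$ python train.py --checkpoints output/checkpoints
$ python train.py --checkpoints output/checkpoints \
	--model output/checkpoints/epoch_40.hdf5 --start-epoch 40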

Let's go ahead and load our dataset:

# grab the Fashion MNIST dataset (if this is your first time running
# this the dataset will be automatically downloaded)
print("[INFO] loading Fashion MNIST...")
((trainX, trainY), (testX, testY)) = fashion_mnist.load_data()

# Fashion MNIST images are 28x28 but the network we will be training
# is expecting 32x32 images
trainX = np.array([cv2.resize(x, (32, 32)) for x in trainX])
testX = np.array([cv2.resize(x, (32, 32)) for x in testX])

# scale data to the range of [0, 1]
trainX = trainX.astype("float32") / 255.0
testX = testX.astype("float32") / 255.0

# reshape the data matrices to include a channel dimension (required
# for training)
trainX = trainX.reshape((trainX.shape[0], 32, 32, 1))
testX = testX.reshape((testX.shape[0], 32, 32, 1))

Line 34 loads Fashion MNIST.

Lines 38-48 then preprocess the data, including (1) resizing to 32×32, (2) scaling pixel intensities to the range [0, 1], and (3) adding a channel dimension.

From here we'll (1) binarize our labels, and (2) initialize our data augmentation object:

# convert the labels from integers to vectors
lb = LabelBinarizer()
trainY = lb.fit_transform(trainY)
testY = lb.transform(testY)

# construct the image generator for data augmentation
aug = ImageDataGenerator(width_shift_range=0.1,
	height_shift_range=0.1, horizontal_flip=True,
	fill_mode="nearest")

And now to the code for loading model checkpoints:

# if there is no specific model checkpoint supplied, then initialize
# the network (ResNet-56) and compile the model
if args["model"] is None:
	print("[INFO] compiling model...")
	opt = SGD(lr=1e-1)
	model = ResNet.build(32, 32, 1, 10, (9, 9, 9),
		(64, 64, 128, 256), reg=0.0001)
	model.compile(loss="categorical_crossentropy", optimizer=opt,
		metrics=["accuracy"])

# otherwise, we're using a checkpoint model
else:
	# load the checkpoint from disk
	print("[INFO] loading {}...".format(args["model"]))
	model = load_model(args["model"])

	# update the learning rate
	print("[INFO] old learning rate: {}".format(
		K.get_value(model.optimizer.lr)))
	K.set_value(model.optimizer.lr, 1e-2)
	print("[INFO] new learning rate: {}".format(
		K.get_value(model.optimizer.lr)))

If no model checkpoint is supplied, then we need to initialize the model (Lines 62-68). Notice that we specify our initial learning rate as 1e-1 on Line 64.

Otherwise, Lines 71-81 load the model checkpoint (i.e. a model that was previously stopped via ctrl + c) and update the learning rate. Line 79 will be the line you edit whenever you want to update the learning rate.
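
If you would rather not hardcode the new value each time, a small variation on Line 79 (relying only on the K backend import already in the script) is to derive the new rate from the old one, always dropping it by an order of magnitude:

	# variation on Line 79: derive the new learning rate from the old one
	# instead of hardcoding it (always a one order of magnitude drop)
	oldLR = K.get_value(model.optimizer.lr)
	K.set_value(model.optimizer.lr, oldLR / 10.0)
	print("[INFO] learning rate lowered: {} -> {}".format(
		oldLR, K.get_value(model.optimizer.lr)))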

Next, we'll construct our callbacks:

# build the path to the training plot and training history
plotPath = os.path.sep.join(["output", "resnet_fashion_mnist.png"])
jsonPath = os.path.sep.join(["output", "resnet_fashion_mnist.json"])

# construct the set of callbacks
callbacks = [
	EpochCheckpoint(args["checkpoints"], every=5,
		startAt=args["start_epoch"]),
	TrainingMonitor(plotPath,
		jsonPath=jsonPath,
		startAt=args["start_epoch"])]

Lines 84 and 85 specify our plot and JSON paths.

Lines 88-93 construct our two callbacks, putting them directly into a list:

  • EpochCheckpoint : This callback is responsible for saving our model as it currently stands at the conclusion of every epoch. That way, if we stop training via ctrl + c (or an unforeseeable power failure), we don't lose our machine's work — for training complex models on huge datasets, this could quite literally save you days of time.
  • TrainingMonitor : A callback that saves our training accuracy/loss information as a PNG image plot and JSON dictionary. We'll be able to open our training plot at any time to see our training progress — valuable data to you as the practitioner, especially for multi-day training processes.

Again, please review epochcheckpoint.py and trainingmonitor.py on your own time for the details and/or if you need to add functionality. I cover these callbacks in detail inside Deep Learning for Computer Vision with Python.
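
To give you a feel for what such a callback involves, below is a heavily simplified stand-in sketch built on the standard tf.keras.callbacks.Callback API. It is not the actual EpochCheckpoint implementation from the book, just the core mechanic of serializing the model every N epochs:

import os
from tensorflow.keras.callbacks import Callback

class SimpleEpochCheckpoint(Callback):
	# save the model to disk every N epochs so training can be resumed
	def __init__(self, outputPath, every=5, startAt=0):
		super().__init__()
		self.outputPath = outputPath
		self.every = every
		self.intEpoch = startAt

	def on_epoch_end(self, epoch, logs=None):
		# only serialize the model on every N-th epoch
		if (self.intEpoch + 1) % self.every == 0:
			p = os.path.sep.join([self.outputPath,
				"epoch_{}.hdf5".format(self.intEpoch + 1)])
			self.model.save(p, overwrite=True)

		# bump the internal epoch counter (which may not start at zero
		# when training is resumed)
		self.intEpoch += 1

Roughly speaking, the real TrainingMonitor follows the same on_epoch_end pattern, except it appends the values in logs to a running history and re-renders the plot and JSON files described above.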

Finally, we have everything we need to start, stop, and resume training. This last block actually starts or resumes training:

# train the network
print("[INFO] training network...")
model.fit(
	x=aug.flow(trainX, trainY, batch_size=128),
	validation_data=(testX, testY),
	steps_per_epoch=len(trainX) // 128,
	epochs=80,
	callbacks=callbacks,
	verbose=1)

2020-06-05 Update: Formerly, TensorFlow/Keras required use of a method called .fit_generator in order to accomplish data augmentation. Now, the .fit method can handle data augmentation as well, making for more consistent code. This also applies to the migration from .predict_generator to .predict (not used in this example). Be sure to check out my articles about fit and fit_generator as well as data augmentation.

Our call to .fit trains our model using our data augmentation object and our callbacks (Lines 97-103). Be sure to review my tutorial on Keras' fit method for more details on how the .fit function is used to train our model.

I'd like to call your attention to the epochs parameter (Line 101) — when you adjust your learning rate you'll typically want to update the epochs as well. Typically you should over-estimate the number of epochs, as you'll see in the next three sections.

For a more detailed explanation of starting, stopping, and resuming training (along with the implementations of my EpochCheckpoint and TrainingMonitor classes), be sure to refer to Deep Learning for Computer Vision with Python.

Phase #1: 40 epochs at 1e-1

Make sure you've used the "Downloads" section of this blog post to download the source code to this tutorial.

From there, open up a terminal and execute the following command:

$ python train.py --checkpoints output/checkpoints
[INFO] loading Fashion MNIST...
[INFO] compiling model...
[INFO] training network...
Epoch 1/40
468/468 [==============================] - 46s 99ms/step - loss: 1.2367 - accuracy: 0.7153 - val_loss: 1.0503 - val_accuracy: 0.7712
Epoch 2/40
468/468 [==============================] - 46s 99ms/step - loss: 0.8753 - accuracy: 0.8427 - val_loss: 0.8914 - val_accuracy: 0.8356
Epoch 3/40
468/468 [==============================] - 45s 97ms/step - loss: 0.7974 - accuracy: 0.8683 - val_loss: 0.8175 - val_accuracy: 0.8636
Epoch 4/40
468/468 [==============================] - 46s 98ms/step - loss: 0.7490 - accuracy: 0.8850 - val_loss: 0.7533 - val_accuracy: 0.8855
Epoch 5/40
468/468 [==============================] - 46s 98ms/step - loss: 0.7232 - accuracy: 0.8922 - val_loss: 0.8021 - val_accuracy: 0.8587
...
Epoch 36/40
468/468 [==============================] - 44s 94ms/step - loss: 0.4111 - accuracy: 0.9466 - val_loss: 0.4719 - val_accuracy: 0.9265
Epoch 37/40
468/468 [==============================] - 44s 94ms/step - loss: 0.4052 - accuracy: 0.9483 - val_loss: 0.4499 - val_accuracy: 0.9343
Epoch 38/40
468/468 [==============================] - 44s 94ms/step - loss: 0.4009 - accuracy: 0.9485 - val_loss: 0.4664 - val_accuracy: 0.9270
Epoch 39/40
468/468 [==============================] - 44s 94ms/step - loss: 0.3951 - accuracy: 0.9495 - val_loss: 0.4685 - val_accuracy: 0.9277
Epoch 40/40
468/468 [==============================] - 44s 95ms/step - loss: 0.3895 - accuracy: 0.9497 - val_loss: 0.4672 - val_accuracy: 0.9254
Figure 3: Phase 1 of training ResNet on the Fashion MNIST dataset with a learning rate of 1e-1 for 40 epochs before we stop via ctrl + c, adjust the learning rate, and resume Keras training.

Here I've started training ResNet on the Fashion MNIST dataset using the SGD optimizer and an initial learning rate of 1e-1.

After every epoch my loss/accuracy plot in Figure 3 updates, enabling me to monitor training in real time.

By epoch 20 we can see training and validation loss starting to diverge, and by epoch 40 I decided to ctrl + c out of the train.py script.

Phase #2: 10 epochs at 1e-2

The next step is to update both:

  1. My learning rate
  2. The number of epochs to train for

For the learning rate, the standard practice is to lower it by an order of magnitude.

Going back to Line 64 of train.py we can see that my initial learning rate is 1e-1:

# if there is no specific model checkpoint supplied, then initialize
# the network (ResNet-56) and compile the model
if args["model"] is None:
	print("[INFO] compiling model...")
	opt = SGD(lr=1e-1)
	model = ResNet.build(32, 32, 1, 10, (9, 9, 9),
		(64, 64, 128, 256), reg=0.0001)
	model.compile(loss="categorical_crossentropy", optimizer=opt,
		metrics=["accuracy"])

I'm now going to update my learning rate to be 1e-2 on Line 79:

# otherwise, we're using a checkpoint model
else:
	# load the checkpoint from disk
	print("[INFO] loading {}...".format(args["model"]))
	model = load_model(args["model"])

	# update the learning rate
	print("[INFO] old learning rate: {}".format(
		K.get_value(model.optimizer.lr)))
	K.set_value(model.optimizer.lr, 1e-2)
	print("[INFO] new learning rate: {}".format(
		K.get_value(model.optimizer.lr)))

So, why am I updating Line 79 and not Line 64?

The reason is due to the if/else statement.

The else statement handles the case when we need to load a specific checkpoint from disk — once we have the checkpoint, we'll resume training, thus the learning rate needs to be updated in the else block.

Secondly, I also update my epochs on Line 101. Initially, the epochs value was 80:

# train the network
print("[INFO] training network...")
model.fit(
	x=aug.flow(trainX, trainY, batch_size=128),
	validation_data=(testX, testY),
	steps_per_epoch=len(trainX) // 128,
	epochs=80,
	callbacks=callbacks,
	verbose=1)

I have decided to lower the number of epochs to train for to 40 epochs:

# train the network
print("[INFO] training network...")
model.fit(
	x=aug.flow(trainX, trainY, batch_size=128),
	validation_data=(testX, testY),
	steps_per_epoch=len(trainX) // 128,
	epochs=40,
	callbacks=callbacks,
	verbose=1)

Typically you'll set the epochs value to be much larger than what you think it should actually be.

The reason for this is that we're using the EpochCheckpoint class to save model snapshots every 5 epochs — if at any point we decide we're unhappy with the training progress we can just ctrl + c out of the script and go back to a previous snapshot.

Thus, there is no harm in training for longer since we can always resume training from a previous model weight file.

After both my learning rate and the number of epochs to train for were updated, I then executed the following command:

$ python train.py --checkpoints output/checkpoints \
	--model output/checkpoints/epoch_40.hdf5 --start-epoch 40
[INFO] loading Fashion MNIST...
[INFO] loading output/checkpoints/epoch_40.hdf5...
[INFO] old learning rate: 0.10000000149011612
[INFO] new learning rate: 0.009999999776482582
[INFO] training network...
Epoch 1/10
468/468 [==============================] - 45s 97ms/step - loss: 0.3606 - accuracy: 0.9599 - val_loss: 0.4173 - val_accuracy: 0.9412
Epoch 2/10
468/468 [==============================] - 44s 94ms/step - loss: 0.3509 - accuracy: 0.9637 - val_loss: 0.4171 - val_accuracy: 0.9416
Epoch 3/10
468/468 [==============================] - 44s 94ms/step - loss: 0.3484 - accuracy: 0.9647 - val_loss: 0.4144 - val_accuracy: 0.9424
Epoch 4/10
468/468 [==============================] - 44s 94ms/step - loss: 0.3454 - accuracy: 0.9657 - val_loss: 0.4151 - val_accuracy: 0.9412
Epoch 5/10
468/468 [==============================] - 46s 98ms/step - loss: 0.3426 - accuracy: 0.9667 - val_loss: 0.4159 - val_accuracy: 0.9416
Epoch 6/10
468/468 [==============================] - 45s 96ms/step - loss: 0.3406 - accuracy: 0.9663 - val_loss: 0.4160 - val_accuracy: 0.9417
Epoch 7/10
468/468 [==============================] - 45s 96ms/step - loss: 0.3409 - accuracy: 0.9663 - val_loss: 0.4150 - val_accuracy: 0.9418
Epoch 8/10
468/468 [==============================] - 44s 94ms/step - loss: 0.3362 - accuracy: 0.9687 - val_loss: 0.4159 - val_accuracy: 0.9428
Epoch 9/10
468/468 [==============================] - 44s 95ms/step - loss: 0.3341 - accuracy: 0.9686 - val_loss: 0.4175 - val_accuracy: 0.9406
Epoch 10/10
468/468 [==============================] - 44s 95ms/step - loss: 0.3336 - accuracy: 0.9687 - val_loss: 0.4164 - val_accuracy: 0.9420
Figure 4: Phase 2 of Keras start/stop/resume training. The learning rate is dropped from 1e-1 to 1e-2 as is evident in the plot at epoch 40. I continued training for 10 more epochs until I noticed validation metrics plateauing, at which point I stopped training via ctrl + c again.

Notice how we've updated our learning rate from 1e-1 to 1e-2 and then resumed training.

We immediately see a drop in both training/validation loss as well as an increase in training/validation accuracy.

The problem here is that our validation metrics have plateaued — there may not be many more gains left without risking overfitting. Because of this, I only allowed training to continue for another 10 epochs before once again ctrl + c'ing out of the script.

Phase #3: 5 epochs at 1e-3

For the final stage of training I decided to:

  1. Lower my learning rate from 1e-2 to 1e-3.
  2. Allow training to continue (but knowing I would likely only be training for a few epochs given the risk of overfitting).

After updating my learning rate, I executed the following command:

$ python train.py --checkpoints output/checkpoints \
	--model output/checkpoints/epoch_50.hdf5 --start-epoch 50
[INFO] loading Fashion MNIST...
[INFO] loading output/checkpoints/epoch_50.hdf5...
[INFO] old learning rate: 0.009999999776482582
[INFO] new learning rate: 0.0010000000474974513
[INFO] training network...
Epoch 1/5
468/468 [==============================] - 45s 97ms/step - loss: 0.3302 - accuracy: 0.9696 - val_loss: 0.4155 - val_accuracy: 0.9414
Epoch 2/5
468/468 [==============================] - 44s 94ms/step - loss: 0.3297 - accuracy: 0.9703 - val_loss: 0.4160 - val_accuracy: 0.9411
Epoch 3/5
468/468 [==============================] - 44s 94ms/step - loss: 0.3302 - accuracy: 0.9694 - val_loss: 0.4157 - val_accuracy: 0.9415
Epoch 4/5
468/468 [==============================] - 44s 94ms/step - loss: 0.3282 - accuracy: 0.9708 - val_loss: 0.4143 - val_accuracy: 0.9421
Epoch 5/5
468/468 [==============================] - 44s 95ms/step - loss: 0.3305 - accuracy: 0.9694 - val_loss: 0.4152 - val_accuracy: 0.9414
Figure 5: Upon resuming Keras training for phase 3, I only let the network train for 5 epochs because there is not significant learning progress being made. Using a start/stop/resume training approach with Keras, we have achieved 94.14% validation accuracy.

At this point the learning rate has become so small that the corresponding weight updates are also very small, implying that the model cannot learn much more.

I only allowed training to continue for 5 epochs before killing the script. However, looking at my final metrics you can see that we are obtaining 96.94% training accuracy along with 94.14% validation accuracy.

We were able to achieve this result by using our start, stop, and resume training method.

At this point, we could either continue to tune our learning rate, apply a learning rate scheduler, use Cyclical Learning Rates, or try a new model architecture altogether.
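
As an example of the scheduler route, once the ctrl + c experiments have told us where the learning rate drops belong, the same schedule could be encoded with Keras' LearningRateScheduler callback. This is only a sketch, reusing the 1e-1 / 1e-2 / 1e-3 breakpoints we just found:

from tensorflow.keras.callbacks import LearningRateScheduler

def step_decay(epoch):
	# reproduce the manual schedule from this tutorial: 1e-1 for the
	# first 40 epochs, 1e-2 until epoch 50, and 1e-3 afterwards
	if epoch < 40:
		return 1e-1
	elif epoch < 50:
		return 1e-2
	return 1e-3

# the scheduler would be passed to .fit alongside the other callbacks:
# model.fit(..., callbacks=callbacks + [LearningRateScheduler(step_decay)])

The trade-off, as discussed above, is that you need to know those breakpoints ahead of time, which is exactly what the ctrl + c experiments gave us.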

What's next? I recommend PyImageSearch University.

Course information:
35+ total classes • 39h 44m video • Last updated: April 2022
★★★★★ 4.84 (128 Ratings) • 13,800+ Students Enrolled

I strongly believe that if you had the right teacher you could master computer vision and deep learning.

Do you think learning computer vision and deep learning has to be time-consuming, overwhelming, and complicated? Or has to involve complex mathematics and equations? Or requires a degree in computer science?

That's not the case.

All you need to master computer vision and deep learning is for someone to explain things to you in simple, intuitive terms. And that's exactly what I do. My mission is to change education and how complex Artificial Intelligence topics are taught.

If you're serious about learning computer vision, your next stop should be PyImageSearch University, the most comprehensive computer vision, deep learning, and OpenCV course online today. Here you'll learn how to successfully and confidently apply computer vision to your work, research, and projects. Join me in computer vision mastery.

Inside PyImageSearch University you'll find:

  • 35+ courses on essential computer vision, deep learning, and OpenCV topics
  • 35+ Certificates of Completion
  • 39+ hours of on-demand video
  • Brand new courses released regularly, ensuring you can keep up with state-of-the-art techniques
  • Pre-configured Jupyter Notebooks in Google Colab
  • ✓ Run all code examples in your web browser — works on Windows, macOS, and Linux (no dev environment configuration required!)
  • ✓ Access to centralized code repos for all 450+ tutorials on PyImageSearch
  • Easy one-click downloads for code, datasets, pre-trained models, etc.
  • Access on mobile, laptop, desktop, etc.

Click here to join PyImageSearch University

Summary

In this tutorial you learned how to start, stop, and resume training using Keras and Deep Learning.

Learning how to resume from where your training left off is a super valuable skill for two reasons:

  1. It ensures that if your training script crashes, you can pick up again from the most recent model checkpoint.
  2. It enables you to adjust your learning rate and improve your model accuracy.

When training your own custom neural networks you'll want to monitor your loss and accuracy — once you start to see validation loss/accuracy plateau, try killing the training script, lowering your learning rate by an order of magnitude, and then resuming training.

You'll often find that this method of training can lead to higher accuracy models.

However, you should be wary of overfitting!

Lowering your learning rate enables your model to descend into lower areas of the loss landscape; however, there is no guarantee that these lower loss areas will still generalize!

You likely will only be able to drop the learning rate 1-3 times before either:

  1. The learning rate becomes too small, making the corresponding weight updates too small, and preventing the model from learning further.
  2. Validation loss stagnates or explodes while training loss continues to drop (implying that the model is overfitting).

If those cases occur and your model is still not satisfactory, you should consider adjusting other hyperparameters of your model, including regularization strength, dropout, etc. You may want to explore other model architectures as well.

For more of my tips, suggestions, and best practices when training your own neural networks on your custom datasets, be sure to refer to Deep Learning for Computer Vision with Python, where I cover my best practices in depth.

To download the source code to this tutorial (and be notified when future tutorials are published on the PyImageSearch blog), just enter your email address in the form below!


Source: https://pyimagesearch.com/2019/09/23/keras-starting-stopping-and-resuming-training/
