Automating Neural Network Design
Published Feb 1, 2023

Outline:
Definition of Reinforcement learning
Categories in Reinforcement learning
Value-based methods
Policy-based methods
Model-based methods
Neural architecture search
Searching the architecture
Searching the weights
Reinforcement learning for neural architecture search
One-Shot NAS
Progressive NAS
Summary
Introduction
Recent developments in the field of deep learning have led to staggering new innovations and techniques in the architecture of deep neural networks. With industries growing and scaling rapidly to keep up with demand, designing a deep neural network within a short timeframe is a daunting and time-consuming task. To address this challenge, we turn our attention to Neural Architecture Search (NAS), an automated way of designing a deep neural network.
This review article covers the fundamentals of NAS and shows how reinforcement learning can be leveraged to build an effective deep neural network.
Neural architecture search (NAS)
Neural architecture search (NAS) is a method for automatically discovering the best neural network architecture for a given task. This is done by training a model, called the controller, to generate new neural network architectures; each generated architecture is then trained and evaluated on the given task. The controller can be trained using various methods, such as reinforcement learning, evolutionary algorithms, or gradient-based optimization; this article focuses specifically on reinforcement learning. Elsken et al. (2019) describe three main components of NAS, namely:
Search space and architecture search: This approach involves training the controller to generate new architectures. The search space is defined by the set of building blocks that the algorithm can use to construct architectures, as well as the constraints and rules that govern the construction process. Different architectures can be sampled from a large search space and then trained to find the best-performing one (Xie et al. 2020).
Searching the weights: This approach builds on top of the previous one. Once an architecture has been selected, its weights can be learned by training the network on a large dataset with gradient-based optimization algorithms. Note that searching the weights can be computationally expensive, especially for large and complex architectures, which is why some NAS algorithms use weight sharing, where the weights of different architectures are shared to reduce the number of parameters to be trained (Xie et al. 2020).
Evaluation strategy: The goal of the performance estimation strategy is to quickly and accurately estimate the quality of a given architecture so that the NAS algorithm can make informed decisions about which architecture to select.
There are several strategies for performance estimation in NAS, including:
Full training: This strategy trains the network with the selected architecture on the full dataset, which provides the most accurate performance estimation but is also the most computationally expensive.
Weight sharing: This strategy shares weights across different architectures to reduce the number of parameters to be trained. This can greatly speed up the performance estimation process, but the accuracy of the estimate may suffer.
One-shot methods: This strategy trains a super-network that contains all possible architectures, and then uses a smaller sub-network for performance estimation. The sub-network is selected based on performance or computational efficiency.
Performance prediction: This strategy uses a pre-trained model or performance prediction algorithm to estimate the performance of a given architecture, which can greatly speed up the performance estimation process, but the accuracy of the performance estimation may be impacted.
The choice of performance estimation strategy will depend on the specific task, the computational resources available, and the trade-off between accuracy and speed. A good performance estimation strategy is essential for the success of the NAS process, as it affects the quality of the architectures that the algorithm will select.
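To make the idea of a search space concrete, here is a minimal Python sketch. The operation names, layer count, and the cheap_estimate placeholder are illustrative assumptions rather than the setup of any cited paper; in a real NAS system the estimate would come from full or proxy training of each candidate.

```python
import random

# Hypothetical search space: each layer picks one operation and one width.
# The names and ranges here are illustrative, not taken from any specific paper.
SEARCH_SPACE = {
    "operation": ["conv3x3", "conv5x5", "depthwise_conv", "max_pool", "identity"],
    "width": [16, 32, 64, 128],
}
NUM_LAYERS = 6

def sample_architecture(rng=random):
    """Sample one candidate architecture as a list of (operation, width) choices."""
    return [
        (rng.choice(SEARCH_SPACE["operation"]), rng.choice(SEARCH_SPACE["width"]))
        for _ in range(NUM_LAYERS)
    ]

def cheap_estimate(architecture):
    """Stand-in for a performance-estimation strategy.

    In a real NAS system this would train the candidate (fully, with shared
    weights, or for a few proxy epochs) and return its validation accuracy.
    Here it is only a placeholder so the sketch runs end to end.
    """
    return random.random()

if __name__ == "__main__":
    candidates = [sample_architecture() for _ in range(10)]
    best = max(candidates, key=cheap_estimate)
    print("best sampled candidate:", best)
```

Random sampling like this is only a baseline; the sections below replace it with a controller that learns where to look in the search space.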

[Figure] Overview of the main components of NAS. Source: https://lilianweng.github.io/posts/2020-08-06-nas/
NAS can be used to discover neural network architectures that are more efficient and perform better than those found through traditional search methods. However, NAS can be computationally expensive because it requires training multiple neural networks, and it can be challenging to design a controller that can effectively search the space of possible architectures.
What is Reinforcement learning?
Reinforcement learning (RL) is a type of machine learning in which an agent learns to make decisions by interacting with its environment and receiving feedback in the form of rewards or penalties. The agent's goal is to learn a policy, a mapping from states to actions, that maximizes the expected cumulative reward over time. Essentially, RL algorithms balance exploration and exploitation (the explore-exploit trade-off) to find the best policy. These algorithms are typically used to train agents to perform tasks such as playing games, controlling robots, or making decisions in uncertain environments. They differ from supervised learning algorithms, which learn from labeled examples, and unsupervised learning algorithms, which learn from unlabeled data.
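As a minimal illustration of the agent-environment loop and the explore-exploit trade-off, the sketch below uses a toy three-armed bandit as the environment (an assumption made purely for illustration) and an epsilon-greedy rule: with probability epsilon the agent explores a random action, otherwise it exploits the action with the highest estimated value.

```python
import random

# Toy environment: three actions with different (unknown to the agent) mean rewards.
TRUE_MEANS = [0.2, 0.5, 0.8]

def step(action):
    """Return a noisy reward for the chosen action."""
    return TRUE_MEANS[action] + random.gauss(0, 0.1)

estimates = [0.0] * len(TRUE_MEANS)   # running estimate of each action's value
counts = [0] * len(TRUE_MEANS)
epsilon = 0.1                          # exploration rate

for t in range(1000):
    if random.random() < epsilon:                     # explore
        action = random.randrange(len(estimates))
    else:                                             # exploit
        action = max(range(len(estimates)), key=lambda a: estimates[a])
    reward = step(action)
    counts[action] += 1
    # incremental average update of the value estimate
    estimates[action] += (reward - estimates[action]) / counts[action]

print("learned value estimates:", [round(v, 2) for v in estimates])
```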
Categories in Reinforcement learning
There are several different types of reinforcement learning (RL), but three main categories are:
Value-based methods: These methods aim to learn a function that estimates the long-term value of being in a particular state or taking a particular action. The agent's policy is then derived from this value function. Q-learning and SARSA are examples of value-based methods.
Policy-based methods: These methods directly learn the policy that maps states to actions, without estimating a value function. REINFORCE and actor-critic methods are examples of policy-based methods.
Model-based methods: These methods learn a model of the environment's dynamics, which can be used to plan future actions. The agent uses this model to simulate the consequences of different actions and to optimize its policy. Dyna-Q and various forms of POMDPs are examples of model-based methods.
It is worth noting that these categories are not mutually exclusive, and many RL algorithms combine elements of multiple categories. There are also other ways to classify RL, such as on-policy versus off-policy methods, or by the scale of the problem (e.g. single-agent or multi-agent).
Value-based methods in Reinforcement learning
Value-based methods in reinforcement learning (RL) aim to learn a function that estimates the long-term value of being in a particular state or taking a particular action. The agent's policy is then derived from this value function. Two main examples of value-based methods are:
Q-learning: Q-learning is a model-free, off-policy RL algorithm. It learns an estimate of the optimal action-value function, Q*, which gives the maximum expected cumulative reward for taking a particular action in a particular state and then following the optimal policy thereafter. The Q-function is updated using the Bellman equation, which expresses the Q-value of a state-action pair in terms of the Q-values of the next state and the immediate reward.
SARSA (State-Action-Reward-State-Action): SARSA is a model-free, on-policy RL algorithm that learns an estimate of the action-value function of a particular policy, Qπ, rather than the optimal action-value function. It is similar to Q-learning, but it bootstraps from the action actually selected by the current policy instead of the action with the highest Q-value.
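The difference between the two algorithms is easiest to see in their update rules. The sketch below writes out both updates for the tabular case; the dictionary-based Q-table, learning rate, and discount factor are generic choices, not tied to any particular implementation.

```python
# Tabular update rules; Q is a dict mapping (state, action) -> value,
# alpha is the learning rate and gamma the discount factor.

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    # off-policy: bootstrap from the best action available in the next state
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    td_target = r + gamma * best_next
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (td_target - Q.get((s, a), 0.0))

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    # on-policy: bootstrap from the action actually selected by the current policy
    td_target = r + gamma * Q.get((s_next, a_next), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (td_target - Q.get((s, a), 0.0))
```

The only change between the two is which next-state value forms the target: the greedy maximum (Q-learning) or the value of the action the policy actually takes next (SARSA).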
Value-based methods have a drawback: when the state space is very large, the value function becomes impractical to represent and store explicitly, a problem known as the curse of dimensionality. To overcome it, function approximation methods such as neural networks can be used to approximate the value function.
Policy-based methods in Reinforcement learning
Policy-based methods in reinforcement learning (RL) directly learn the policy that maps states to actions, without estimating a value function. Two main examples of policy-based methods are:
REINFORCE: REINFORCE is a model-free, on-policy RL algorithm that learns the policy by directly optimizing the expected cumulative reward. It estimates the gradient of the expected cumulative reward with respect to the policy parameters, and then updates the policy to increase the expected cumulative reward.
Actor-Critic methods: Actor-Critic methods are a combination of both policy-based and value-based methods. They use two separate networks: an actor network, which learns the policy, and a critic network, which learns the value function. The critic network provides feedback to the actor network, which adjusts the policy to improve the expected cumulative reward.
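As a concrete illustration of the policy-gradient idea behind REINFORCE, here is a minimal NumPy sketch for a softmax policy over a few discrete actions. The one-step "episodes" and the hypothetical reward table are simplifying assumptions; the point is the update direction, the gradient of the log-probability scaled by the return.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_ACTIONS = 4
theta = np.zeros(NUM_ACTIONS)                   # policy parameters (logits)
TRUE_REWARDS = np.array([0.1, 0.3, 0.9, 0.2])   # hypothetical, unknown to the agent

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

learning_rate = 0.1
for episode in range(2000):
    probs = softmax(theta)
    action = rng.choice(NUM_ACTIONS, p=probs)
    # one-step "episode": the return is just the (noisy) reward of the chosen action
    G = TRUE_REWARDS[action] + rng.normal(0, 0.05)

    # gradient of log pi(action) for a softmax policy: one_hot(action) - probs
    grad_log_pi = -probs
    grad_log_pi[action] += 1.0

    # REINFORCE update: move parameters in the direction of grad log pi * return
    theta += learning_rate * grad_log_pi * G

print("learned action probabilities:", np.round(softmax(theta), 2))
```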
Policy-based methods have an advantage over value-based methods when the policy has a simple functional form, such as a neural network, and it is easy to optimize the policy parameters directly. However, when the policy has a complex functional form, it can be difficult to optimize it directly, and value-based methods may be more appropriate.
Policy-based methods can also be more sample-efficient than value-based methods because they do not require the agent to explore the entire state space, but they can be less stable because the policy keeps changing during learning.
Model-based methods in Reinforcement learning
Model-based methods in reinforcement learning (RL) learn a model of the environment's dynamics, which can be used to plan future actions. The agent uses this model to simulate the consequences of different actions and to optimize its policy. Some examples of model-based methods include:
Dynamic Programming (DP): Dynamic Programming is a general method for solving RL problems. It involves breaking the problem down into smaller subproblems and solving them in a specific order. DP algorithms such as value iteration and policy iteration are used to find the optimal policy and value function for a given problem (see the value-iteration sketch after this list).
Planning: Planning algorithms such as A* search, tree search, and Monte Carlo Tree Search (MCTS) use the learned model of the environment to plan a sequence of actions that will lead to the highest expected cumulative reward. These methods are used for problems where the model of the environment is known and the state space is relatively small.
Model Predictive Control (MPC): Model Predictive Control (MPC) is a control method that uses a model of the environment to predict the future state of the system and to optimize a control policy. It is commonly used in control systems, robotics, and trajectory optimization.
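The dynamic-programming case mentioned above is easiest to see in code. The sketch below runs value iteration on a tiny, fully specified Markov decision process; the transition table is a made-up example, and because the model is known, no interaction with a real environment is needed.

```python
# Tiny known MDP: states 0..2, actions "left"/"right".
# transitions[state][action] = (next_state, reward); state 2 is terminal.
transitions = {
    0: {"left": (0, 0.0), "right": (1, 0.0)},
    1: {"left": (0, 0.0), "right": (2, 1.0)},
    2: {"left": (2, 0.0), "right": (2, 0.0)},
}
gamma = 0.9
V = {s: 0.0 for s in transitions}

# Value iteration: repeatedly apply the Bellman optimality backup.
for _ in range(100):
    for s, actions in transitions.items():
        V[s] = max(r + gamma * V[s_next] for (s_next, r) in actions.values())

# Derive the greedy policy from the converged value function.
policy = {
    s: max(actions, key=lambda a: actions[a][1] + gamma * V[actions[a][0]])
    for s, actions in transitions.items()
}
print("values:", {s: round(v, 2) for s, v in V.items()})
print("greedy policy:", policy)
```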
Model-based methods have an advantage over model-free methods in that they can be more sample-efficient because they use a model of the environment to simulate different actions, rather than having to experience them in the real environment. However, model-based methods can be less robust than model-free methods because they are sensitive to errors in the model of the environment. Additionally, when the model of the environment is not known, it can be difficult or impossible to learn an accurate model, and in these cases, model-free methods may be more appropriate.
With a basic understanding of both NAS and RL in place, we can now discuss how reinforcement learning can be used for NAS.
Reinforcement learning for neural architecture search
Reinforcement learning (RL) can be used to perform neural architecture search (NAS) by training a controller, typically a neural network, to generate new neural network architectures and then evaluate their performance on a given task. The controller learns to generate architectures that perform well on the task by maximizing a reward signal.
In this approach, the controller generates a new architecture, which is then trained and evaluated on the task. The controller receives a reward based on the performance of the architecture and uses this reward to update its policy and generate new architectures. The process is repeated until the controller generates an architecture that performs well on the task.

[Figure] An overview of Neural Architecture Search: a controller proposes architectures, which are trained and evaluated, and the resulting performance is fed back to update the controller. Source: Zoph & Le, Neural Architecture Search with Reinforcement Learning, https://arxiv.org/pdf/1611.01578.pdf
There are several different ways to design the reward function and the controller for RL-based NAS. Some methods use the accuracy on a validation set as the reward, while others use the number of parameters or the computational cost of the architecture. The controller can also be trained using various RL algorithms, such as Q-learning, actor-critic, or policy gradients; in other words, the controller can be optimized using deep RL (Mazyavkina et al. 2021).
RL-based NAS has the advantage of being able to explore the space of possible architectures in a more efficient way than other NAS methods. This is because the controller can learn from the rewards of previously generated architectures and focus on the promising regions of the search space. However, it can be computationally expensive and difficult to design the reward function and the controller.
Generally, in this approach, a controller neural network generates a sequence of actions, which correspond to the operations and connections in a neural network. The controller is trained using RL to generate architectures that perform well on the task.
The training process can be viewed as a sequence of decisions made by the controller, where each decision corresponds to an operation or connection in the architecture.
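A heavily simplified sketch of this loop is shown below. The controller is just a softmax distribution over operations for each layer position, updated with a REINFORCE-style rule, and the expensive "train and evaluate the child network" step is replaced by a toy proxy reward so that the sketch runs on its own. The operation names and the proxy reward are assumptions for illustration, not the method of any specific paper.

```python
import numpy as np

rng = np.random.default_rng(0)
OPS = ["conv3x3", "conv5x5", "max_pool", "identity"]   # hypothetical operation set
NUM_LAYERS = 4

# Controller parameters: one row of logits per layer position.
logits = np.zeros((NUM_LAYERS, len(OPS)))

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def proxy_reward(arch):
    """Placeholder for 'train the child network and return validation accuracy'.

    Here we simply pretend that conv3x3 layers help, to give the controller
    a learnable signal."""
    return sum(op == "conv3x3" for op in arch) / NUM_LAYERS + rng.normal(0, 0.05)

lr, baseline = 0.1, 0.0
for iteration in range(500):
    # 1) the controller samples an architecture as a sequence of decisions
    arch, grads = [], []
    for layer in range(NUM_LAYERS):
        probs = softmax(logits[layer])
        choice = rng.choice(len(OPS), p=probs)
        arch.append(OPS[choice])
        g = -probs
        g[choice] += 1.0          # grad of log-prob for a softmax "policy"
        grads.append(g)
    # 2) the architecture is "trained and evaluated"; a reward comes back
    R = proxy_reward(arch)
    baseline = 0.9 * baseline + 0.1 * R   # moving-average baseline reduces variance
    # 3) REINFORCE-style update of the controller
    for layer in range(NUM_LAYERS):
        logits[layer] += lr * grads[layer] * (R - baseline)

print("most likely architecture:",
      [OPS[int(np.argmax(logits[l]))] for l in range(NUM_LAYERS)])
```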
There are two main ways to use RL for NAS:
One-Shot NAS: In this approach, the controller generates a complete architecture in one go, and the entire architecture is trained and evaluated on the task. The final architecture is chosen based on the performance of the generated architecture.
Progressive NAS: In this approach, the controller generates the architecture incrementally, adding one operation or connection at a time. At each step, the current architecture is trained and evaluated on the task, and the controller receives a reward based on the performance. The final architecture is chosen based on the accumulated reward over all the steps.
Using RL for NAS can be more sample efficient than traditional search methods because the controller can explore the space of possible architectures more effectively. However, it can be computationally expensive because it requires training multiple neural networks and the controller itself. Additionally, it can be difficult to design a reward function that effectively guides the search towards good architectures.
One-Shot NAS
One-Shot NAS is a method of neural architecture search (NAS) that uses reinforcement learning (RL) to generate a complete neural network architecture in one go. The process can be broken down into the following steps:
The controller is trained to generate a sequence of actions that correspond to the operations and connections in a neural network. The controller is trained using RL to generate architectures that perform well on the task.
Once the controller is trained, it generates a complete neural network architecture in one go. This architecture is then trained and evaluated on the task.
The performance of the generated architecture is used as a reward signal for the controller, which is then updated to generate better architectures in the future.
The process is repeated until the generated architecture reaches a satisfactory level of performance on the task.
One-Shot NAS has the advantage of being computationally more efficient than Progressive NAS, since it does not require training a large number of sub-architectures. The drawback is that it is harder to recover from a poor decision made early in the architecture generation process. One-Shot NAS is also sensitive to the initialization of the controller, so it may require a good initialization to work well, and to the number of operations to choose from: if that number is too large, the controller may struggle to generate a good architecture, and if it is too small, the controller may generate overly simple architectures that do not perform well on the task.
The process does not have a specific formula to calculate the final architecture, but the final architecture is chosen based on the performance on a task.
One-Shot NAS is typically formulated as a sequence generation problem, where the goal is to generate a sequence of actions that correspond to the operations and connections in a neural network. The controller, which is trained using RL, generates a complete architecture in one go. The architecture is then trained and evaluated on the task, and the controller receives a reward based on the performance.
The reward function is a crucial component of One-Shot NAS, as it guides the controller towards generating architectures that perform well on the task. The reward function can be defined in various ways, depending on the task and the performance metric. For example, it could be the accuracy on a validation set, the mean squared error on a regression task, or the area under the receiver operating characteristic curve for a binary classification task.
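As a small illustration, a reward function of this kind might be written as follows; the evaluate and count_parameters callables and the penalty weighting are hypothetical placeholders for project-specific code.

```python
def reward(architecture, evaluate, count_parameters,
           accuracy_weight=1.0, size_penalty=1e-7):
    """Combine task performance with a resource cost.

    `evaluate` is assumed to train/evaluate the architecture and return
    validation accuracy in [0, 1]; `count_parameters` returns the number of
    trainable parameters. Both are placeholders for project-specific code.
    """
    accuracy = evaluate(architecture)
    num_params = count_parameters(architecture)
    return accuracy_weight * accuracy - size_penalty * num_params
```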
The controller, in One-Shot NAS, is trained to generate a sequence of actions that will lead to the highest reward on the task. The controller can be trained using various RL algorithms such as Q-learning, SARSA, or Proximal Policy Optimization (PPO).
The final architecture is the one that has the highest reward. However, unlike Progressive NAS, there is no incremental process, so it is harder to recover from poor decisions made early on in the architecture generation process. One-Shot NAS is also sensitive to the initialization of the controller, so it may require a good initialization to work well.
Progressive NAS
Progressive NAS is a method of neural architecture search (NAS) that uses reinforcement learning (RL) to generate a neural network architecture incrementally, one operation or connection at a time. The process can be broken down into the following steps:
Initialize a small and simple architecture, such as a single convolutional layer.
Train the controller using RL to generate a sequence of actions that correspond to the operations and connections to be added to the architecture.
At each step, the controller generates an action, which corresponds to an operation or connection to be added to the current architecture. The current architecture is then trained and evaluated on the task, and the controller receives a reward based on the performance.
The process is repeated until a satisfactory level of performance on the task is reached or a pre-defined number of steps is reached.
The final architecture is the one that accumulated the highest reward over all the steps.
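The sketch below gives a schematic version of these steps. It only illustrates the control flow of growing an architecture one operation at a time and accumulating a reward: the operation set and the proxy score are made up, and the greedy one-step choice here is a stand-in for the learned controller described above.

```python
OPS = ["conv3x3", "conv5x5", "max_pool", "skip_connect"]   # hypothetical choices
MAX_STEPS = 8

def proxy_score(arch):
    """Placeholder for 'train the current architecture and evaluate it'."""
    return sum(1.0 for op in arch if op == "conv3x3") - 0.1 * len(arch)

architecture = ["conv3x3"]      # step 1: start from a small, simple architecture
total_reward = 0.0
for step in range(MAX_STEPS):
    # the "controller" here is greedy over a one-step lookahead, purely for
    # illustration; in Progressive NAS it would be a learned policy updated
    # from the rewards it receives at each step
    candidates = {op: proxy_score(architecture + [op]) for op in OPS}
    best_op = max(candidates, key=candidates.get)
    architecture.append(best_op)
    total_reward += candidates[best_op]   # reward accumulated over all the steps

print("final architecture:", architecture)
print("accumulated reward:", round(total_reward, 2))
```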
Progressive NAS has the advantage of being able to recover from poor decisions made early in the architecture generation process by incorporating more successful decisions later on. However, it is computationally more expensive than One-Shot NAS, as it requires training multiple sub-architectures. Progressive NAS also requires a good initialization of the controller and a good definition of the reward function. Additionally, the number of steps in the progressive NAS process must be chosen carefully: too many steps can lead to overfitting, while too few may not yield a good enough architecture.
Like One-Shot NAS, Progressive NAS does not have a specific formula to calculate the final architecture, but the final architecture is chosen based on the accumulated reward over all the steps.
The reward function is a crucial component of Progressive NAS, as it guides the controller towards generating architectures that perform well on the task. The reward function can be defined in various ways, depending on the task and the performance metric. For example, it could be the accuracy on a validation set, the mean squared error on a regression task, or the area under the receiver operating characteristic curve for a binary classification task.
The controller, in Progressive NAS, is trained to generate actions that will lead to the highest accumulated reward over all the steps. The controller can be trained using various RL algorithms such as Q-learning, SARSA, or Proximal Policy Optimization (PPO).
Progressive NAS can be formalized as an optimization problem, where the goal is to find the architecture that maximizes the expected reward over all the steps. The final architecture is the one that has the highest expected reward.
Summary
Neural architecture search (NAS) is a method for automatically discovering the best neural network architecture for a given task. RL can be used to guide NAS by training an agent to select and modify the architecture of a neural network based on the rewards it receives for the performance of that architecture on a given task. This can lead to the discovery of neural network architectures that are more efficient and perform better than those found through traditional search methods.
Reinforcement learning (RL) is a type of machine learning that is used for neural architecture search (NAS). In RL for NAS, a controller is trained to generate neural network architectures that perform well on a task, by receiving rewards or penalties based on the performance of the generated architectures.
There are two main methods of RL for NAS: One-Shot NAS and Progressive NAS.
One-Shot NAS is a method where the controller generates a complete neural network architecture in one go. The architecture is then trained and evaluated on the task, and the controller receives a reward based on the performance. The final architecture is the one that has the highest reward. However, unlike Progressive NAS, there is no incremental process, so it is harder to recover from poor decisions made early on in the architecture generation process.
Progressive NAS is a method where the controller generates a neural network architecture incrementally, one operation or connection at a time. The process is repeated until a satisfactory level of performance on the task is reached or a pre-defined number of steps is reached. The final architecture is the one that accumulated the highest reward over all the steps. Progressive NAS has the advantage of being able to recover from poor decisions made early on in the architecture generation process, by incorporating more successful decisions later on.
Both methodologies require a well-defined reward function, a good initialization of the controller, and a carefully chosen number of steps. Additionally, both methods are computationally expensive and require substantial computational resources.
References
Reinforcement Learning by Richard S. Sutton
Optimizing the Neural Architecture of Reinforcement Learning Agents
Weight-Sharing Neural Architecture Search: A Battle to Shrink the Optimization Gap