Haeun Yoo, a member of Prof. Jay H. Lee’s group, has proposed a reinforcement learning (RL) based control strategy for batch processes. Batch process control is challenging because the process operates dynamically over a large operating envelope. Nonlinear model predictive control (NMPC), the current standard for optimal control of batch processes, is a model-based algorithm, so its performance can be unsatisfactory in the presence of uncertainties. RL is a viable alternative for such problems. To apply RL effectively to batch process control, Yoo proposed a phase segmentation approach for the reward function design and the value/policy function representation. In addition, the deep deterministic policy gradient (DDPG) algorithm is modified with Monte-Carlo learning to ensure more stable and efficient learning behavior. A case study of a batch polymerization process producing polyols is used to demonstrate the improvement. The results show that RL can be successfully applied to batch process control under uncertainty. The study was published on January 4 in Computers and Chemical Engineering (Reinforcement learning based optimal control of batch processes using Monte-Carlo deep deterministic policy gradient with phase segmentation, Computers and Chemical Engineering 144 (2021) 107133).
The researcher considered three types of rewards, covering path constraints, end-point constraints, and process performance over time. In addition, the overall process was divided into two phases: a feeding-focused phase and a reaction-focused phase. This phase segmentation approach can handle the different operating characteristics of each phase as well as discontinuous action profiles.
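As a rough illustration of how three reward types and two phases might be combined, the sketch below computes a per-step reward. It is not the reward function from the paper: the names `RewardConfig`, `phase_index`, and `step_reward`, the temperature bound, the quality target, and every numeric value are hypothetical placeholders chosen only to make the example runnable.

```python
from dataclasses import dataclass


@dataclass
class RewardConfig:
    # All values are hypothetical placeholders, not taken from the paper.
    switch_step: int = 50         # step at which the feeding-focused phase ends
    final_step: int = 100         # last step of the batch
    temp_max: float = 400.0       # upper bound used as a path constraint
    target_quality: float = 1.0   # desired product quality at the end point
    quality_tol: float = 0.05     # allowed deviation from the target
    time_penalty: float = 0.01    # per-step cost in the reaction-focused phase


def phase_index(step: int, cfg: RewardConfig) -> int:
    """Return 0 for the feeding-focused phase, 1 for the reaction-focused phase."""
    return 0 if step < cfg.switch_step else 1


def step_reward(step: int, temperature: float, feed_amount: float,
                quality: float, cfg: RewardConfig) -> float:
    # Path-constraint term: penalize any excess over the temperature bound.
    path_term = -max(0.0, temperature - cfg.temp_max)

    # Performance term: its definition depends on the current phase.
    if phase_index(step, cfg) == 0:
        performance_term = feed_amount        # reward feeding progress
    else:
        performance_term = -cfg.time_penalty  # encourage a shorter reaction phase

    # End-point term: only evaluated at the final step of the batch.
    end_term = 0.0
    if step == cfg.final_step:
        miss = abs(quality - cfg.target_quality) - cfg.quality_tol
        end_term = -max(0.0, miss)

    return path_term + performance_term + end_term
```

In a phase-segmented setup, the same `phase_index` could also select which value or policy function is evaluated, which is one way to allow an action profile that jumps at the switching point.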
The main idea of the study is the Monte-Carlo Deep Deterministic Policy Gradient (MC-DDPG) algorithm, a modified version of the DDPG algorithm.
“The traditional DDPG algorithm is based on actor and critic neural networks to handle high-dimensional, continuous state and action spaces. This method uses bootstrapping, which means the target value is itself an estimate obtained from the action-value function. However, bootstrapping can cause the actor and critic to get stuck in local minima or even diverge. Moreover, inaccurately estimated critic values can lead to bad samples, and those values can worsen the process. These problems are serious since most chemical processes are irreversible,” Yoo said.
Instead of a bootstrapped estimate, the researcher used the calculated return (the discounted cumulative reward) as the target value. As a result, the target network that the traditional DDPG algorithm requires to update the estimated target value was removed in the MC-DDPG algorithm. The actor neural network was first trained by imitation learning on data from NMPC so that its parameters start from reasonable values.
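The difference between the two kinds of critic target can be sketched in a few lines. The example below is a minimal illustration rather than the paper's implementation: the function names, the discount factor, and the use of plain Python lists in place of neural networks are all assumptions made for clarity.

```python
def discounted_returns(rewards, gamma=0.99):
    """Monte-Carlo target: discounted cumulative reward from each step
    to the end of a finished episode (one batch run)."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each return reuses the one that follows it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def bootstrapped_target(reward, next_q_estimate, gamma=0.99):
    """Standard DDPG-style target: depends on the critic's own estimate
    of the next state-action value (normally from a target network)."""
    return reward + gamma * next_q_estimate


def critic_loss(q_predictions, targets):
    """Mean squared error between critic predictions and targets."""
    errors = [(q - y) ** 2 for q, y in zip(q_predictions, targets)]
    return sum(errors) / len(errors)


# Hypothetical rewards from one completed episode.
episode_rewards = [1.0, 1.0, 1.0]
mc_targets = discounted_returns(episode_rewards, gamma=0.5)
```

Because `discounted_returns` uses only observed rewards, no target network is needed to compute it. The trade-off is that the episode must finish before the targets are available, which fits batch processes, since each batch has a defined end.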
The performance of the RL-based controller developed with the MC-DDPG algorithm was compared with NMPC and with an agent trained only by imitation learning. The RL-based controller showed an enhanced ability to satisfy the path and end-point constraints in the presence of model uncertainties. The impact of the choice of phase segmentation point on optimality was also examined by training the agent with different phase durations. The importance of choosing proper hyperparameters was examined through a sensitivity analysis of the reward value.
“The proposed approach and algorithm can be applied to other batch or semi-batch process (e.g. bio-reactor) problems, especially those that have significant uncertainties and irreversible and nonlinear dynamics,” Yoo said.