Long-Short-Term-Memory Recurrent Neural Network (LSTM RNN) is a state-of-the-art (SOTA) model for analyzing sequential data. Implementations of LSTM RNN in machine learning frameworks usually either lack performance or flexibility. For example, the default implementations in Tensorflow and MXNet invoke many tiny GPU kernels, leading to excessive kernel-launch overhead. Although cuDNN, NVIDIA's deep learning library, can accelerate performance by around 2x, it is closed-source and inflexible, hampering further research and performance improvements in frameworks, such as PyTorch, that use cuDNN as their backend. We present an LSTM RNN implementation called EcoRNN that is significantly faster than the SOTA open-source implementation in MXNet and is competitive with the closed-source cuDNN. We show that (1) fusing tiny GPU kernels and (2) applying data layout optimization give a maximum performance boost of 3x over the MXNet default and 1.5x over the cuDNN implementation. Our optimizations also apply to other RNN cell types such as LSTM variants and Gated Recurrent Units (GRUs). We integrate EcoRNN into the MXNet Python library and open-source it to benefit the machine learning community.

LSTM RNN is one of the most important machine learning models for analyzing sequential data today, with applications in language modeling, machine translation, and speech recognition. Despite its importance, LSTM RNN training has been shown to have much lower throughput on GPUs than other types of networks such as Convolutional Neural Networks (CNNs). Prior work also suggests that LSTM RNN has low compute utilization and is limited by GPU memory capacity. The reason for this low utilization of LSTM RNNs compared with CNNs lies in the difference between their high-level structure and the types of dominant layers (operators) used in their computation. From a high-level perspective, the computation graph of an LSTM RNN exhibits a recurrent structure that processes one input at a time, limiting the amount of model parallelism. From a low-level perspective, LSTM RNNs and CNNs use different sets of layers (operators): while CNNs extensively use convolutions, relu activations, and poolings, LSTM RNNs mostly use fully-connected layers, tanh/sigmoid activations, and element-wise operations. These fundamental differences limit the applicability of prior work on reducing the memory footprint of CNN training (e.g., vDNN, Gist) in the context of LSTM RNNs.
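To make this operator mix concrete, the following is a minimal NumPy sketch of the standard LSTM cell equations (an illustration only, not the EcoRNN or MXNet source; all sizes are arbitrary). Each timestep performs two fully-connected products plus roughly a dozen small sigmoid/tanh/element-wise operations, and every step depends on the hidden and cell states produced by the previous step:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell(x_t, h_prev, c_prev, W_x, W_h, b):
    """One LSTM timestep: two fully-connected products, then many small element-wise ops."""
    gates = x_t @ W_x + h_prev @ W_h + b          # fully-connected layers (GEMMs)
    i, f, g, o = np.split(gates, 4, axis=1)       # input, forget, candidate, output gates
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)  # three sigmoid activations
    g = np.tanh(g)                                # one tanh activation
    c_t = f * c_prev + i * g                      # element-wise multiplies and add
    h_t = o * np.tanh(c_t)                        # element-wise multiply + tanh
    return h_t, c_t

# Toy sizes (hypothetical): batch of 4, 8 input features, 16 hidden units, 10 timesteps.
batch, input_size, hidden, seq_len = 4, 8, 16, 10
rng = np.random.default_rng(0)
W_x = 0.1 * rng.standard_normal((input_size, 4 * hidden))
W_h = 0.1 * rng.standard_normal((hidden, 4 * hidden))
b = np.zeros(4 * hidden)
xs = rng.standard_normal((seq_len, batch, input_size))

h = np.zeros((batch, hidden))
c = np.zeros((batch, hidden))
for t in range(seq_len):                          # recurrent structure: one input at a time;
    h, c = lstm_cell(xs[t], h, c, W_x, W_h, b)    # step t depends on h and c from step t-1
```

When each of the small activation and element-wise lines above is dispatched as its own GPU kernel, as in the default framework implementations, the per-timestep launch overhead adds up quickly; the kernel-fusion optimization described in the abstract corresponds to collapsing those small operations into one or a few kernels per timestep.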
In this work, we focus on LSTM RNN training, which differs from inference in a few major ways. (1) Compute: in training, there is an extra backward pass that propagates gradients back from the loss layer. (2) Performance metric: the performance of training is characterized by throughput (the number of training samples processed per second) instead of the latency that is usually used for inference; therefore, data samples are usually batched together before being fed into the neural network.
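As a purely illustrative picture of this difference, the sketch below measures training throughput for a stock gluon.rnn.LSTM layer using MXNet's Gluon API; the layer sizes, L2 loss, and SGD settings are arbitrary placeholders rather than EcoRNN's configuration. The autograd.record()/backward() pair and the optimizer step are the extra work that inference does not perform, and performance is reported in samples per second over whole batches rather than per-sample latency:

```python
import time
import mxnet as mx
from mxnet import autograd, gluon

# Hypothetical sizes for illustration only.
batch_size, seq_len, input_size, hidden_size = 64, 35, 512, 512

net = gluon.rnn.LSTM(hidden_size, num_layers=1, layout='TNC', input_size=input_size)
net.initialize(mx.init.Xavier())
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.1})
loss_fn = gluon.loss.L2Loss()

# Batched random data: (seq_len, batch, feature) for the 'TNC' layout.
x = mx.nd.random.uniform(shape=(seq_len, batch_size, input_size))
y = mx.nd.random.uniform(shape=(seq_len, batch_size, hidden_size))

num_iters = 10
start = time.time()
for _ in range(num_iters):
    with autograd.record():      # forward pass, recorded for autodiff
        out = net(x)             # (seq_len, batch, hidden)
        loss = loss_fn(out, y)
    loss.backward()              # extra backward pass unique to training
    trainer.step(batch_size)     # parameter update
mx.nd.waitall()                  # block until asynchronous work finishes
elapsed = time.time() - start

# Training performance is reported as throughput, not per-sample latency.
print('throughput: %.1f samples/sec' % (num_iters * batch_size / elapsed))
```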