THE IMPACT OF EACH DEEP NEURAL NETWORK LAYER ON THE PERFORMANCE OF END-TO-END VIETNAMESE SPEECH RECOGNITION
In this paper, we analyze the impact of each deep neural network (DNN) layer on the performance of end-to-end Vietnamese speech recognition using 1D convolution layers and bi-directional gated recurrent unit (GRU) layers. In the first experiment, we use spectrogram, fully connected (FC), and connectionist temporal classification (CTC) layers to test recognition of spoken Vietnamese digits. In the next two experiments, we extend these three layers with 1D convolution layers and GRU layers. The results of the three experiments show that, for Vietnamese speech recognition, 1D convolution and bi-directional GRU layers are the most effective choice among the DNN layers examined.
Keywords: end-to-end speech recognition, deep neural network.
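
As an illustration of how the output of a CTC layer can be turned into a digit sequence, the following is a minimal sketch of greedy (best-path) decoding. The abstract does not state which decoding method is used, so greedy decoding, the function name ctc_greedy_decode, the blank index 0, and the toy scores are all assumptions made for this example only.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank symbol (not specified in the paper)


def ctc_greedy_decode(frame_scores, blank=BLANK):
    """Greedy (best-path) CTC decoding of per-frame class scores.

    frame_scores: one list of class scores per time frame.
    Returns the decoded label indices as a list.
    """
    # Step 1: pick the highest-scoring class in every frame.
    best_path = [
        max(range(len(scores)), key=scores.__getitem__)
        for scores in frame_scores
    ]
    # Step 2: merge runs of identical consecutive labels.
    collapsed = [label for label, _ in groupby(best_path)]
    # Step 3: drop the blank symbol.
    return [label for label in collapsed if label != blank]


if __name__ == "__main__":
    # Toy scores for 6 frames over 3 classes: blank, label 1, label 2.
    frames = [
        [0.10, 0.80, 0.10],  # label 1
        [0.20, 0.70, 0.10],  # label 1 (repeat, merged with the frame above)
        [0.90, 0.05, 0.05],  # blank (separates the two occurrences of label 1)
        [0.10, 0.60, 0.30],  # label 1
        [0.10, 0.20, 0.70],  # label 2
        [0.10, 0.20, 0.70],  # label 2 (repeat, merged)
    ]
    print(ctc_greedy_decode(frames))  # [1, 1, 2]
```

The blank frame is what allows the same label to appear twice in a row in the output: without it, the two runs of label 1 would be merged into one. In a digit recognizer, each non-blank index would map to one symbol of the output vocabulary.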