**DigiCosme scientific axis & task:** ComEx –

**Coordinators:** Giuseppe Valenzise – Marco Cagnazzo

**Candidate's last & first name:** Li WANG

**Laboratories:** L2S, LTCI

**Managing laboratory:** L2S

**Attached to DigiCosme action:** GT Théorie de l’information

**Duration & dates of the appointment:** 1 year – 1 September 2018 / 31 August 2019

**Context:**

Prediction and entropy coding are two fundamental tools of modern image and video compression, and both have been continuously improved and optimized over the last 20 years. The effectiveness of these tools depends on their ability to model the (generally unknown) distribution or process that generated the data. Data modeling in video coding has traditionally relied on ensemble approaches, in which multiple simple models are computed separately and put in competition to obtain the most likely representation of the observed data. This is the case, for example, of spatial prediction, which employs a number of different bilinear filters to predict the next most likely pixels in a block [2]. Increasing the size of the ensemble can yield a better approximation of the data-generating distribution, but it entails a higher computational cost and is ineffective if the complexity of the signal to predict exceeds the representational capacity of each model (i.e., if the bilinearity assumption does not hold).

In the case of entropy coding, an accurate estimate of the probability distribution of a symbol is key to reducing the bitrate: the extra bit cost due to the approximation grows with the Kullback-Leibler (KL) divergence between the real and the estimated distribution [11]. To exploit the high-order dependencies among symbols, modern video codecs employ context-adaptive binary arithmetic coding (CABAC), which infers the probability distributions from an ensemble of contexts of previously decoded symbols. A higher number of contexts makes it possible to take better advantage of symbol dependencies; however, when too many contexts are used to capture long-term dependencies, probability estimation becomes unreliable because of the limited amount of data available during encoding, a problem known as “context dilution” [10].

In the last few years, it has been shown that deep generative models (from restricted Boltzmann machines to variational auto-encoders and, more recently, generative adversarial networks [6]) can learn effective representations for very complex signals such as natural images. There is a tight relationship between learning deep generative models and information theory: when deep models are trained to maximize the likelihood of the data, they learn a distribution that minimizes the KL divergence from the empirical distribution of the observed samples, i.e., they find a code for the data with the minimum description length. Interestingly, the advantage of using deep neural network (DNN) architectures over the conventional ensemble approaches used in video coding is that DNN models have a much larger representational capacity and can, in principle, approximate a very broad class of functions and distributions.

Very recently, deep generative models have been applied to image compression [3, 4, 8, 9]. The basic architecture of these methods relies on auto-encoders, which are DNNs trained to reproduce their input: in doing so, they learn a latent representation of the input data distribution. Compression-oriented auto-encoders are generally trained end-to-end, i.e., all the coding operations (transform, quantization and entropy coding) are learned concurrently. While this is feasible for image coding, in the case of video the chain of operations is much more complex, and existing codec architectures are highly integrated and optimized. This suggests that, before inventing a completely new video codec, it is worth first exploring how to optimize individual components of existing architectures, such as spatial and temporal prediction, or probability estimation for arithmetic coding. Recent work on spatial prediction for image compression [5] and inpainting [7], as well as on estimating conditional probabilities for arithmetic coding [8], suggests high potential gains compared to traditional approaches. To date, the application of these techniques to video coding has not been explored.
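
As a small numerical illustration of the link between estimation error and bitrate discussed above, the following Python sketch computes the ideal average code length obtained when symbols drawn from a distribution `p` are coded with a model `q`. The function names and the example distributions are illustrative assumptions, not part of the project; the identity used is the standard one, cross-entropy = entropy + KL divergence.

```python
import math


def entropy(p):
    """Shannon entropy of distribution p, in bits per symbol."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)


def cross_entropy(p, q):
    """Ideal average code length (bits/symbol) when the source follows p
    but the entropy coder assumes q. Requires q[i] > 0 wherever p[i] > 0."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)


def kl_divergence(p, q):
    """KL divergence D(p || q) in bits: the extra cost of coding with q."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


if __name__ == "__main__":
    p_true = [0.5, 0.25, 0.25]       # hypothetical source distribution
    q_model = [1 / 3, 1 / 3, 1 / 3]  # hypothetical (mismatched) estimate
    print("entropy      :", entropy(p_true))
    print("coding cost  :", cross_entropy(p_true, q_model))
    print("KL overhead  :", kl_divergence(p_true, q_model))
```

The overhead vanishes only when the estimate matches the source, which is why a better probability model translates directly into a lower bitrate.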

**Objective:**

The goal of the V-CODE project is to study how deep generative models can be employed to optimize the performance of existing video codecs. In particular, we will focus on two typical coding tools, which account for a large part of the efficiency of current video coding techniques: spatial/temporal prediction and entropy coding. In both cases, we will leverage the possibility, offered by deep generative models, of efficiently handling large contexts of samples, compared to existing ensemble techniques. The project is thus articulated along these two objectives:

**Objective 1**: optimizing spatial and temporal prediction with a deep generative model. Current models for spatial and temporal prediction are based on simple signal representations, which nevertheless work remarkably well thanks to the large number of available models. These include prediction directions and block partitioning [2]. A drawback of this approach is that signaling the prediction mode and the associated parameters becomes very expensive, as does the computation needed to optimize the mode decision. Would it be possible instead to replace the numerous prediction modes with a single, “deep” predictor? Recent work on image inpainting [7] and temporal frame interpolation [12] shows that deep generative models have the potential to provide better predictions than state-of-the-art approaches. Nevertheless, extending these models to video requires some precautions:

– Unlike in other fields such as computer vision, where the reconstruction error is the only cost criterion, in video coding one needs to consider the weighted contribution of the distortion and of the bitrate of the associated representation. In the case of spatial and temporal prediction, codecs are highly optimized to represent the existing prediction modes; this should therefore be taken into account when formulating loss functions for these algorithms;

– The spatial or temporal samples used for prediction are the reconstructed pixels after quantization. So far, deep-learning-based methods for image compression have used simple approximations of quantization, because quantization is not differentiable. However, this aspect needs to be taken into consideration in a video codec, which is essentially based on a DPCM (differential pulse code modulation) architecture.

In V-CODE we will target these differences and adapt the deep generative models accordingly. Furthermore, we will embed the resulting predictors into an HEVC encoder to test the advantages over the current standard architecture.
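
The two precautions above can be made concrete with a toy Python sketch. It assumes a Lagrangian rate-distortion cost of the usual form J = D + λR and a one-dimensional DPCM loop with a uniform quantizer and a trivial "previous sample" predictor; all function names, the predictor and the test signal are hypothetical simplifications and do not describe the HEVC design or the V-CODE models.

```python
def rd_cost(distortion, rate_bits, lam):
    """Lagrangian rate-distortion cost J = D + lambda * R (assumed form)."""
    return distortion + lam * rate_bits


def quantize(x, step):
    """Uniform scalar quantizer: nearest multiple of `step`."""
    return step * round(x / step)


def dpcm_closed_loop(signal, step):
    """Predict each sample from the previous RECONSTRUCTED sample,
    as the decoder would, then quantize the prediction residual."""
    recon, prev_rec = [], 0.0
    for x in signal:
        prev_rec = prev_rec + quantize(x - prev_rec, step)
        recon.append(prev_rec)
    return recon


def dpcm_open_loop(signal, step):
    """Encoder predicts from the previous ORIGINAL sample, which the
    decoder does not have: quantization errors can accumulate."""
    recon, prev_orig, prev_rec = [], 0.0, 0.0
    for x in signal:
        prev_rec = prev_rec + quantize(x - prev_orig, step)
        recon.append(prev_rec)
        prev_orig = x
    return recon


def max_abs_error(signal, recon):
    """Largest absolute reconstruction error over the signal."""
    return max(abs(x - r) for x, r in zip(signal, recon))


if __name__ == "__main__":
    ramp = [0.4 * i for i in range(1, 21)]  # slowly increasing toy signal
    step = 1.0
    print("closed-loop max error:", max_abs_error(ramp, dpcm_closed_loop(ramp, step)))
    print("open-loop max error  :", max_abs_error(ramp, dpcm_open_loop(ramp, step)))
```

In the closed-loop variant the reconstruction error stays within half a quantization step, whereas on this ramp the open-loop variant drifts away from the signal. This is the reason a learned predictor intended for a codec would need to be trained and evaluated on reconstructed, quantized samples rather than on original ones.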

**Objective 2**: estimating symbol probabilities based on previously decoded symbols using deep generative models.

Arithmetic coding makes it possible to code symbols at a rate approaching the entropy rate of the source, provided the data distribution is known. When only an approximation of the data distribution is available, there is a loss of coding efficiency, which is quantified by the KL divergence between the actual distribution and the one used for coding. The context-adaptive binary arithmetic coder used in modern video codecs employs a set of predefined conditional probability models, which are updated on the fly during encoding depending on the observed symbols. The problem of context dilution, described above, limits the number of different contexts that can be used. In V-CODE, we will consider a different approach: instead of using several conditional probability tables, which in practice forces the context to be small, we will use a deep generative model with a large context, which could potentially involve all the previously coded symbols. The authors of [8] employ a 3D convolutional auto-encoder to learn the latent distribution of the features of another auto-encoder used for image compression. Their model can accurately estimate the conditional entropy (which is a lower bound on the bitrate) as the value of the loss function at the optimum. In our case, however, we are interested in obtaining conditional probabilities to use with binary arithmetic coding, which also calls for optimizing the binarization in the process; moreover, we want to do so with symbols that might come from very different sources, e.g., motion vectors or prediction residuals. To this end, we will consider recurrent neural networks to model sequences of symbols; this approach has been successfully used, e.g., in image generation [13].
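
To illustrate context-based adaptive probability estimation, here is a minimal Python sketch that computes the ideal arithmetic-coding cost of a binary sequence when each bit's probability is estimated from counts kept per context, the context being the previous `order` bits. It is a count-based toy with a Laplace (add-one) prior, assumed for illustration only: it is neither the CABAC state machine nor the recurrent model envisaged in the project, and the function name is hypothetical.

```python
import math
from collections import defaultdict


def adaptive_code_length(bits, order):
    """Ideal code length (in bits) of a binary sequence under an adaptive
    model that keeps one pair of counts per context of `order` past bits."""
    counts = defaultdict(lambda: [1, 1])  # Laplace prior: one virtual 0 and 1
    total_bits = 0.0
    for i, b in enumerate(bits):
        # Context = up to `order` previously coded bits (shorter at the start).
        ctx = tuple(bits[max(0, i - order):i])
        c0, c1 = counts[ctx]
        p = (c1 if b else c0) / (c0 + c1)  # estimated probability of bit b
        total_bits += -math.log2(p)        # ideal arithmetic-coding cost
        counts[ctx][b] += 1                # update the model after coding
    return total_bits


if __name__ == "__main__":
    alternating = [0, 1] * 100  # strong first-order dependency
    for order in (0, 1, 8):
        print(order, adaptive_code_length(alternating, order))
```

On the alternating sequence, a one-bit context captures the dependency and costs far fewer bits than a context-free model. Each additional context, however, has to be learned from scratch, so on short or weakly dependent sequences a long context would typically be expected to cost more; this is the count-based picture of context dilution that motivates sharing statistics across contexts with a learned model.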

**Scientific output:**

- L. Wang, A. Fiandrotti, A. Purica, G. Valenzise and M. Cagnazzo, “Enhancing HEVC Spatial Prediction by Context-based Learning,” ICASSP 2019 – 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, United Kingdom, 2019, pp. 4035-4039.