Video Super-Resolution With Convolutional Neural Networks


IEEE TRANSACTIONS ON COMPUTATIONAL IMAGING, VOL. 2, NO. 2, JUNE 2016, p. 109

Armin Kappeler, Seunghwan Yoo, Qiqin Dai, and Aggelos K. Katsaggelos, Fellow, IEEE

Abstract—Convolutional neural networks (CNN) are a special type of deep neural networks (DNN). They have so far been successfully applied to image super-resolution (SR) as well as other image restoration tasks. In this paper, we consider the problem of video super-resolution. We propose a CNN that is trained on both the spatial and the temporal dimensions of videos to enhance their spatial resolution. Consecutive frames are motion compensated and used as input to a CNN that provides super-resolved video frames as output. We investigate different options of combining the video frames within one CNN architecture. While large image databases are available to train deep neural networks, it is more challenging to create a large video database of sufficient quality to train neural nets for video restoration. We show that by using images to pretrain our model, a relatively small video database is sufficient for the training of our model to achieve and even improve upon the current state-of-the-art. We compare our proposed approach to current video as well as image SR algorithms.

Index Terms—Deep Learning, Deep Neural Networks, Convolutional Neural Networks, Video Super-Resolution.

I. INTRODUCTION

Image and video or multiframe super-resolution is the process of estimating a high resolution version of a low resolution image or video sequence. It has been studied for a long time, but has become more prevalent with the new generation of Ultra High Definition (UHD) TVs (3,840 × 2,048). Most video content is not available in UHD resolution. Therefore SR algorithms are needed to generate UHD content from Full HD (FHD) (1,920 × 1,080) or lower resolutions.

SR algorithms can be divided into two categories, model-based and learning-based algorithms.
Model-based approaches [1]–[5] model the Low Resolution (LR) image as a blurred, subsampled version of the High Resolution (HR) image with additive noise. The reconstruction of the HR image from the LR image is an ill-posed problem and therefore needs to be regularized. In a Bayesian framework, priors controlling the smoothness or the total variation of the image are introduced in order to obtain the reconstructed HR image. For example, Babacan et al. [1] utilize the Bayesian framework to reconstruct an HR image from multiple LR observations, subject to rotation and translation amongst them. Belekos et al. [2] and later Liu and Sun [3] also use the Bayesian framework to derive an algorithm that is able to deal with complex motion and real world video sequences. With all these algorithms, the motion field and the HR reconstructed image, along with additionally required model parameters, are estimated simultaneously from the observed data. Ma et al. [5] presented an algorithm that extended the same idea to handle motion blur.

Manuscript received August 13, 2015; revised February 03, 2016; accepted February 10, 2016. Date of publication March 30, 2016; date of current version May 03, 2016. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Alessandro Foi.
The authors are with the Department of Electrical Engineering and Computer Science, Northwestern University, Evanston, IL 60208 USA.
Color versions of one or more of the figures in this paper are available online.
Digital Object Identifier 10.1109/TCI.2016.2532323

Learning-based algorithms learn representations from large training databases of HR and LR image pairs [6]–[11] or exploit self-similarities within an image [10]–[13]. Dictionary-based approaches utilize the assumption that natural image patches can be sparsely represented as a linear combination of learned dictionary patches or atoms. Yang et al.
[6] were among the first to use two coupled dictionaries to learn a nonlinear mapping between the LR and the HR images. Improvements and variations of [6] were presented in [7]–[10], [13]. Song et al. [14] propose a dictionary approach to video super-resolution where the dictionary is learned on the fly. However, the authors assumed that sparsely existing keyframes in HR are available. Learning-based methods generally learn representations of patches and therefore also reconstruct an image patch by patch. In order to avoid artifacts along the patch edges, overlapping patches are used, which leads to a considerable computational overhead.

Inspired by the recent successes achieved with CNNs [15], [16], a new generation of image SR algorithms based on deep neural nets emerged [17]–[21], with very promising performances. The training of CNNs can be done efficiently by parallelization using GPU-accelerated computing. Neural networks are capable of processing and learning from large training databases such as ImageNet [22], while training a dictionary on a dataset this size can be challenging. Moreover, once a CNN is trained, super-resolving an image is a purely feedforward process, which makes CNN-based algorithms much faster than traditional approaches. In this paper, we introduce a CNN framework for video SR.

In the classification and retrieval domains, CNNs have been successfully trained on video data [23], [24]. Training for recovery purposes remains a challenging problem because the video quality requirements for the training database are high, since the output of the CNN is the actual video rather than just a label. Suitable videos for the SR task are uncompressed, feature-rich and should be separated by shots/scenes. We show that by pretraining the CNN with images we can bypass the creation of a large video database. Our proposed algorithm requires only a small video database for training to achieve very promising performance.

2333-9403 © 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See standards/publications/rights/index.html for more information.
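
The observation model that the model-based approaches in the Introduction rely on (an LR image as a blurred, subsampled version of the HR image with additive noise) can be made concrete with a small NumPy sketch. The Gaussian blur kernel, the scale factor, and the noise level below are illustrative assumptions and are not taken from the paper; the function names are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def gaussian_kernel(size=5, sigma=1.0):
    """Normalized 2-D Gaussian kernel (an assumed blur; the text does not fix one)."""
    ax = np.arange(size) - (size - 1) / 2.0
    g = np.exp(-(ax ** 2) / (2.0 * sigma ** 2))
    k = np.outer(g, g)
    return k / k.sum()


def blur(img, kernel):
    """Blur a 2-D image; edges are replicated so the output keeps the input size."""
    pad = kernel.shape[0] // 2
    padded = np.pad(img, pad, mode="edge")
    windows = sliding_window_view(padded, kernel.shape)  # shape (M, N, k, k)
    return np.einsum("mnij,ij->mn", windows, kernel)


def degrade(hr, scale=2, sigma=1.0, noise_std=0.01, seed=0):
    """LR observation = subsample(blur(HR)) + additive Gaussian noise."""
    rng = np.random.default_rng(seed)
    blurred = blur(hr, gaussian_kernel(5, sigma))
    lr = blurred[::scale, ::scale]  # keep every `scale`-th pixel
    return lr + rng.normal(0.0, noise_std, lr.shape)


hr = np.random.default_rng(1).random((32, 48))  # stand-in for an HR image
lr = degrade(hr, scale=2)
print(hr.shape, "->", lr.shape)
```

Recovering `hr` from `lr` alone is ill-posed because blurring and subsampling discard information, which is why the model-based methods add priors.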

The proposed CNN uses multiple LR frames as input to reconstruct one HR output frame. There are several ways of extracting the temporal information inside the CNN architecture. We investigated different variations of combining the frames and demonstrate the advantages and disadvantages of these variations. Our main contributions can be summarized in the following aspects:

• We introduce a video SR framework based on a CNN.
• We propose three different architectures, by modifying each time a different layer of a reference SR CNN architecture.
• We propose a pretraining procedure whereby we train the reference SR architecture on images and utilize the resulting filter coefficients to initialize the training of the video SR architectures. This improves the performance of the video SR architecture both in terms of accuracy and speed.
• We introduce Filter Symmetry Enforcement, which reduces the training time of VSRnet by almost 20% without sacrificing the quality of the reconstructed video.
• We apply an adaptive motion compensation scheme to handle fast moving objects and motion blur in videos.

The Caffe [41] model as well as the training and testing protocols are available online. The rest of the paper is organised as follows. We briefly introduce deep learning and review existing deep learning based image SR techniques in Section II. In Section III we explain our proposed framework. Section IV contains our results and their evaluation and Section V concludes the paper.

II. RELATED WORK

A. Super-Resolution

Most of the state-of-the-art image SR algorithms are learning-based algorithms that learn a nonlinear mapping between LR and HR patches using coupled dictionaries [6]–[9]. Overcomplete HR and LR dictionaries are jointly trained on HR and LR image patches. Each LR image patch can be represented as a sparse linear combination of atoms from the LR dictionary. The dictionaries are coupled via common coefficients, a.k.a. representation weights.
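
The role of the shared coefficients can be illustrated with a small sketch: a sparse code is computed for an LR patch against the LR dictionary, and the same code is then applied to the HR dictionary. Everything below is an assumption made for illustration. The dictionaries are random rather than trained, the patch sizes are arbitrary, and a plain orthogonal matching pursuit (the hypothetical `omp` helper) stands in for whatever sparse solver a given method actually uses.

```python
import numpy as np


def omp(D, y, k):
    """Orthogonal matching pursuit: approximate y using at most k columns (atoms) of D."""
    support = []
    residual = y.copy()
    coeffs = np.zeros(0)
    for _ in range(k):
        # Pick the atom most correlated with what is still unexplained.
        idx = int(np.argmax(np.abs(D.T @ residual)))
        if idx not in support:
            support.append(idx)
        # Refit all selected atoms jointly by least squares.
        coeffs, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coeffs
    alpha = np.zeros(D.shape[1])
    alpha[support] = coeffs
    return alpha


rng = np.random.default_rng(0)
n_atoms = 64
D_lr = rng.standard_normal((9, n_atoms))    # LR dictionary for 3x3 patches (random stand-in)
D_hr = rng.standard_normal((36, n_atoms))   # HR dictionary for 6x6 patches (random stand-in)
D_lr /= np.linalg.norm(D_lr, axis=0)        # unit-norm atoms

lr_patch = rng.standard_normal(9)           # observed LR patch, flattened
alpha = omp(D_lr, lr_patch, k=3)            # sparse code found on the LR side
hr_patch = (D_hr @ alpha).reshape(6, 6)     # the same code applied to the HR dictionary
print(np.count_nonzero(alpha), hr_patch.shape)
```

With trained, coupled dictionaries the last step would yield a plausible HR patch; with the random stand-ins used here it only demonstrates the mechanics.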
The dictionaries and the coefficients can be found with standard sparse coding techniques such as K-SVD [25]. An HR patch can then be recovered by finding the sparse coefficients for an observed LR patch and applying them to the HR dictionary. Timofte et al. [10] considered replacing the single large overcomplete dictionary with several smaller complete dictionaries to remove the computationally expensive sparse coding step. This led to a significantly faster algorithm while maintaining the reconstruction accuracy. A variation of [10] was recently proposed by Schulter et al. [11]. A random forest model was trained instead of the coupled dictionaries for the LR to HR patch mapping. Glasner et al. [13] did not learn a dictionary from sample images. Instead they created a set of downscaled versions of the LR image with different scaling factors. Then patches from the LR image were matched to the downscaled version of itself and its HR ‘parent’ patch was used to construct the HR image. Learning-based algorithms, although popular for image SR, are not very well explored for video SR. In [14] a dictionary-based algorithm is applied to video SR. The video is assumed to contain sparsely recurring HR keyframes. The dictionary is learned on the fly from these keyframes while recovering HR video frames.

Many of the early works in multiframe SR have focussed on reconstructing one HR image from a series of LR images using a Bayesian framework [1]–[4]. The LR images were obtained by blurring and subsampling the HR image and then applying different motions to each LR image, such as translation and rotation.
These algorithms generally solve two problems: registration estimation, where the motion between the LR images is estimated, and image recovery, where the HR image is estimated using the information recovered in the first step. Bayesian video SR methods [2], [3] followed the same concept but used a more sophisticated optical flow algorithm [3] or a hierarchical block matching method [2] to find the motion field, in order to be able to deal with real world videos with more complex motion schemes. Ma et al. [5] extended the previously mentioned work in order to handle videos with motion blur. They introduced a temporal relative sharpness prior, which excludes pixels that are severely blurred. Because the image recovery process is an ill-posed problem, image priors such as constraints on the total variation [26] are introduced and then a Bayesian framework is used to recover the HR image.

An alternative method to the conventional motion estimation and image restoration scheme is presented in [27]. Instead of explicit motion estimation, a 3-D Iterative Steering Kernel Regression is proposed. The video is divided and processed in overlapping 3-D cubes (time and space). The method then recovers the HR image by approximating the pixels in the cubes with a 3-D Taylor series.

Most video SR algorithms depend on an accurate motion estimation between the LR frames. There is a plethora of techniques in the literature for estimating a dense motion field [28]–[30]. Optical flow techniques assume that the optical flow is preserved over time. This information is utilized to form the optical flow equation connecting spatial and temporal gradients. Assuming local constancy of the optical flow, an over-determined system of equations is solved for determining the translational motion components per pixel with sub-pixel accuracy.

B. Deep Learning-Based Image Reconstruction

DNNs have achieved state-of-the-art performance on a number of image classification and recognition benchmarks, including the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC-2012) [15], [16]. However, they are not very widely used yet for image reconstruction, much less for video reconstruction tasks. Research on image reconstruction using DNNs includes denoising [31]–[33], inpainting [31], deblurring [34] and rain drop removal [35]. In [17]–[21], deep learning is applied to the image SR task. Dong et al. [17] pointed out that each step of the dictionary-based SR algorithm can be reinterpreted as a layer of a deep neural network. Representing an image patch of size f × f with a dictionary with n atoms can be interpreted as applying n filters with kernel size f × f on

the input image, which in turn can be implemented as a convolutional layer in a CNN. Accordingly, they created a CNN that directly learns the nonlinear mapping from the LR to the HR image by using a purely convolutional neural network with two hidden layers and one output layer. Their approach is described in more detail in Section III-A.

Wang et al. [19] introduced a patch-based method where a convolutional autoencoder [36] is used to pretrain an SR model on 33 × 33 pixel, mean-subtracted, normalized LR/HR patch pairs. Then the training patches are clustered according to their similarity and one sub-model is fine-tuned on self-similar patch pairs for each cluster. As opposed to [18], which uses standard fully connected autoencoders, they used convolution-based autoencoders which exploit the 2-dimensional data structure of an image. The training data was augmented with translation, rotation, and different zoom factors in order to allow the model to learn more visually meaningful features. Although this measure does increase the size of the training dataset, these augmentations do not occur in real image super-resolution tasks. Moreover, although a convolutional architecture is used, the images have to be processed patch by patch due to the sub-models, whereas this is not necessary for our proposed algorithm.

Cui et al. [18] proposed an algorithm that gradually increases the resolution of the LR image up to the desired resolution. It consists of a cascade of stacked collaborative local autoencoders (CLA). First, a non-local self-similarity search (NLSS) is performed in each layer of the cascade to reconstruct high frequency details and textures of the image. The resulting image is then processed by an autoencoder to remove structure distortions and errors introduced by the NLSS step. The algorithm works with 7 × 7 pixel overlapping patches, which leads to an overhead in computation.
Besides, as opposed to [17] and our proposed algorithm, this method is not designed to be an end-to-end solution, since the CLA and NLSS of each layer of the cascade have to be optimized independently.

Cheng et al. [20] introduced a patch-based video SR algorithm using fully connected layers. The network has two layers, one hidden and one output layer, and uses 5 consecutive LR frames to reconstruct one center HR frame. The video is processed patchwise, where the input to the network is a 5 × 5 × 5 volume and the output a reconstructed 3 × 3 patch from the HR image. The 5 × 5 patches of the neighboring frames were found by applying block matching using the reference patch and the neighboring frames. As opposed to our proposed SR method, [18] and [20] do not use convolutional layers and therefore do not exploit the two-dimensional data structure of images.

Liao et al. [21] apply a similar approach which involves motion compensation on multiple frames and combining frames using a convolutional neural network. Their algorithm works in two stages. In the first stage, two motion compensation algorithms with 9 different parameter settings were utilized to calculate SR drafts in order to deal with motion compensation errors. In the second stage, all drafts are combined using a CNN. However, calculating several motion compensations per frame is computationally very expensive. Our proposed adaptive motion compensation only requires one compensation and is still able to deal with strong motion blur (see Figure 10).

Fig. 1. Reference architecture for image super-resolution consisting of three convolutional layers.

III. VIDEO SUPER-RESOLUTION WITH CONVOLUTIONAL NEURAL NETWORK

A. Single Frame/Image Super-Resolution

Before we start the training of the video SR model, we pretrain the model weights on images. For the image pretraining, we use a model for image SR, henceforth referred to as a reference model, with the network architecture parameters proposed in [17].
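
A three-layer, purely convolutional reference model of the kind shown in Figure 1 can be sketched in NumPy, using this subsection's filter-shape notation (inputs × f × f × kernels). This is a minimal forward pass under stated assumptions, not the authors' Caffe model. The kernel sizes and kernel counts are placeholder values chosen for illustration, and the zero padding, the bias terms, and the random initialization are assumptions. All function names are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def conv_layer(x, w, b):
    """Size-preserving convolutional layer (odd kernel sizes only).

    x: (C_in, M, N) input maps; w: (C_in, f, f, C_out) filters; b: (C_out,) biases.
    """
    f = w.shape[1]
    pad = f // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))     # zero padding (assumption)
    win = sliding_window_view(xp, (f, f), axis=(1, 2))   # (C_in, M, N, f, f)
    return np.einsum("cmnij,cijk->kmn", win, w) + b[:, None, None]


def relu(x):
    return np.maximum(x, 0.0)


def init_params(f1=9, f2=1, f3=5, C1=64, C2=32, seed=0):
    """Random filters with shapes 1 x f1 x f1 x C1, C1 x f2 x f2 x C2, C2 x f3 x f3 x 1."""
    rng = np.random.default_rng(seed)
    shapes = [(1, f1, f1, C1), (C1, f2, f2, C2), (C2, f3, f3, 1)]
    return [(0.01 * rng.standard_normal(s), np.zeros(s[-1])) for s in shapes]


def reference_sr(y_up, params):
    """y_up: already upsampled LR image of shape (M, N); returns an (M, N) estimate."""
    (w1, b1), (w2, b2), (w3, b3) = params
    h1 = relu(conv_layer(y_up[None], w1, b1))   # hidden layer H1
    h2 = relu(conv_layer(h1, w2, b2))           # hidden layer H2
    return conv_layer(h2, w3, b3)[0]            # single output kernel, no ReLU


params = init_params()
x = reference_sr(np.random.default_rng(1).random((24, 32)), params)
print(x.shape)
```

Because every layer is convolutional, the same `params` can be applied to an input of any size without cutting it into patches.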
The reference model has only convolutional layers, which has the advantage that the input images can be of any size and the algorithm is not patch-based. The setup is shown in Figure 1. In it, Y represents the input LR image and X the output HR image. It consists of three convolutional layers, where the two hidden layers H1 and H2 are followed by a Rectified Linear Unit (ReLU) [37]. The first convolutional layer consists of 1 × f1 × f1 × C1 filter coefficients, where f1 × f1 is the kernel size and C1 the number of kernels in the first layer. We use this notation to indicate that the first dimension is defined by the number of input images, which is 1 for the image SR case. The filter dimensions of the second and third layers are C1 × f2 × f2 × C2 and C2 × f3 × f3 × 1, respectively. The last layer can only have one kernel in order to obtain an image as output. Otherwise, an additional layer with one kernel or a postprocessing or aggregation step is required. The input image Y is bicubically upsampled so that the input (LR) and output (HR) images have the same resolution. This is necessary because upsampling with standard convolutional layers is not possible. A typical image classification architecture often contains pooling and normalization layers, which help to create compressed layer outputs that are invariant to small shifts and distortions of the input image. In the SR task, we are interested in creating more image details rather than compressing them. Hence the introduction of pooling and normalization layers would be counterproductive. The model is trained on patches extracted from images from the ImageNet detection dataset [38], which consists of around 400,000 images.

B. Video Super-Resolution Architectures

It has been shown for model-based approaches that including neighboring frames into the recovery process is beneficial for video SR [2]–[4]. The motion between frames is modeled and estimated during the recovery process and additional information is gained due to the subpixel motions among frames.
The additional information conveyed by these small

frame differences can also be captured by a learning-based approach, if multiple frames are included in the training procedure.

Fig. 2. Video SR architectures: In (a), the three input frames are concatenated (Concat Layer) before layer 1 is applied. Architecture (b) concatenates the data between layers 1 and 2, and (c) between layers 2 and 3.

For the video SR architecture, we include the neighboring frames into the process. Figure 2 shows three options for incorporating the previous and next frames into the process. For simplicity, we only show the architecture for three input frames, namely the previous (t − 1), current (t), and next (t + 1) frames. Clearly, any number of past and future frames can be accommodated (for example, we use five input frames in the experimental section). In order to use more than one forward- and backward-frame, the architectures in Figure 2 can be extended with more branches. A single input frame has dimensions 1 × M × N, where M and N are the width and height of the input image, respectively. For the architecture in (a), the three input frames are concatenated along the first dimension before the first convolutional layer is applied. The new input data for layer 1 is 3-dimensional with size 3 × M × N. In a similar fashion we can combine the frames after the first layer, which is shown in architecture (b). The output data of layer 1 is again concatenated along the first dimension and then used as input to layer 2. In architecture (c), layers 1 and 2 are applied separately and the data is concatenated between layers 2 and 3. Not only the
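
The effect of the concatenation point on the filter dimensions can be traced with a short bookkeeping sketch. It is inferred from the description of Figure 2 (concatenation along the first dimension) combined with the reference model's filter-shape notation, and is not quoted from the paper. The helper `filter_shapes`, the default kernel sizes, and the kernel counts are hypothetical placeholders. Layers before the Concat Layer are assumed to run once per frame.

```python
import numpy as np


def filter_shapes(arch, n_frames=3, f1=9, f2=1, f3=5, C1=64, C2=32):
    """Filter shapes (inputs, f, f, kernels) of layers 1-3 for architecture 'a', 'b' or 'c'.

    The layer that directly follows the Concat Layer sees n_frames times as many
    input maps; layers before it are applied to each frame separately.
    """
    if arch not in ("a", "b", "c"):
        raise ValueError("arch must be 'a', 'b' or 'c'")
    in1 = n_frames if arch == "a" else 1          # (a): frames stacked before layer 1
    in2 = n_frames * C1 if arch == "b" else C1    # (b): layer-1 outputs stacked
    in3 = n_frames * C2 if arch == "c" else C2    # (c): layer-2 outputs stacked
    return [(in1, f1, f1, C1), (in2, f2, f2, C2), (in3, f3, f3, 1)]


# Architecture (a): three 1 x M x N frames become one 3 x M x N input.
M, N = 24, 32
frames = [np.zeros((1, M, N)) for _ in range(3)]  # previous, current, next
stacked = np.concatenate(frames, axis=0)
print(stacked.shape)

for arch in "abc":
    print(arch, filter_shapes(arch))
```

In each variant only one layer changes shape relative to the reference model, which is the layer whose input is the concatenated data.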
