Rich feature hierarchies for accurate object detection and semantic segmentation

Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik
UC Berkeley

Abstract

Object detection performance, as measured on the canonical PASCAL VOC dataset, has plateaued in the last few years. The best-performing methods are complex ensemble systems that typically combine multiple low-level image features with high-level context. In this paper, we propose a simple and scalable detection algorithm that improves mean average precision (mAP) by more than 30% relative to the previous best result on VOC 2012, achieving a mAP of 53.3%. Our approach combines two key insights: (1) one can apply high-capacity convolutional neural networks (CNNs) to bottom-up region proposals in order to localize and segment objects and (2) when labeled training data is scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, yields a significant performance boost. Since we combine region proposals with CNNs, we call our method R-CNN: Regions with CNN features. Source code for the complete system is available at http://www.cs.berkeley.edu/~rbg/rcnn.

Figure 1: Object detection system overview. Our system (1) takes an input image, (2) extracts around 2000 bottom-up region proposals, (3) computes features for each proposal using a large convolutional neural network (CNN), and then (4) classifies each region using class-specific linear SVMs. R-CNN achieves a mean average precision (mAP) of 53.7% on PASCAL VOC 2010. For comparison, [34] reports 35.1% mAP using the same region proposals, but with a spatial pyramid and bag-of-visual-words approach. The popular deformable part models perform at 33.4%.

1. Introduction

Features matter. The last decade of progress on various visual recognition tasks has been based considerably on the use of SIFT [27] and HOG [7]. But if we look at performance on the canonical visual recognition task, PASCAL VOC object detection [13], it is generally acknowledged that progress has been slow during 2010-2012, with small gains obtained by building ensemble systems and employing minor variants of successful methods.

SIFT and HOG are blockwise orientation histograms, a representation we could associate roughly with complex cells in V1, the first cortical area in the primate visual pathway. But we also know that recognition occurs several stages downstream, which suggests that there might be hierarchical, multi-stage processes for computing features that are even more informative for visual recognition.

Fukushima's "neocognitron" [17], a biologically-inspired hierarchical and shift-invariant model for pattern recognition, was an early attempt at just such a process. The neocognitron, however, lacked a supervised training algorithm. Building on Rumelhart et al. [30], LeCun et al. [24] showed that stochastic gradient descent via backpropagation was effective for training convolutional neural networks (CNNs), a class of models that extend the neocognitron.

CNNs saw heavy use in the 1990s (e.g., [25]), but then fell out of fashion with the rise of support vector machines. In 2012, Krizhevsky et al. [23] rekindled interest in CNNs by showing substantially higher image classification accuracy on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [9, 10]. Their success resulted from training a large CNN on 1.2 million labeled images, together with a few twists on LeCun's CNN (e.g., max(x, 0) rectifying non-linearities and "dropout" regularization).

The significance of the ImageNet result was vigorously debated during the ILSVRC 2012 workshop. The central issue can be distilled to the following: To what extent do the CNN classification results on ImageNet generalize to object detection results on the PASCAL VOC Challenge?

We answer this question by bridging the gap between image classification and object detection. This paper is the first to show that a CNN can lead to dramatically higher object detection performance on PASCAL VOC as compared to systems based on simpler HOG-like features. To achieve this result, we focused on two problems: localizing objects
with a deep network and training a high-capacity model with only a small quantity of annotated detection data.

Unlike image classification, detection requires localizing (likely many) objects within an image. One approach frames localization as a regression problem. However, work from Szegedy et al. [33], concurrent with our own, indicates that this strategy may not fare well in practice (they report a mAP of 30.5% on VOC 2007 compared to the 58.5% achieved by our method). An alternative is to build a sliding-window detector. CNNs have been used in this way for at least two decades, typically on constrained object categories, such as faces [29, 35] and pedestrians [31]. In order to maintain high spatial resolution, these CNNs typically only have two convolutional and pooling layers. We also considered adopting a sliding-window approach. However, units high up in our network, which has five convolutional layers, have very large receptive fields (195 × 195 pixels) and strides (32 × 32 pixels) in the input image, which makes precise localization within the sliding-window paradigm an open technical challenge.

Instead, we solve the CNN localization problem by operating within the "recognition using regions" paradigm [19], which has been successful for both object detection [34] and semantic segmentation [5]. At test time, our method generates around 2000 category-independent region proposals for the input image, extracts a fixed-length feature vector from each proposal using a CNN, and then classifies each region with category-specific linear SVMs. We use a simple technique (affine image warping) to compute a fixed-size CNN input from each region proposal, regardless of the region's shape. Figure 1 presents an overview of our method and highlights some of our results. Since our system combines region proposals with CNNs, we dub the method R-CNN: Regions with CNN features.
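To make the data flow concrete, the following minimal Python sketch mirrors the four-step pipeline of Figure 1. The propose_regions and cnn_features functions are hypothetical stubs standing in for selective search and the Krizhevsky-style CNN; only the shapes and the flow of data reflect the system described above.

```python
import numpy as np

def propose_regions(image, n=2000):
    """Hypothetical stand-in for selective search: return n candidate
    boxes as (x1, y1, x2, y2) rows. Only the interface matters here."""
    h, w = image.shape[:2]
    rng = np.random.default_rng(0)
    x1 = rng.integers(0, w - 32, n); y1 = rng.integers(0, h - 32, n)
    x2 = x1 + rng.integers(16, w // 2, n); y2 = y1 + rng.integers(16, h // 2, n)
    return np.stack([x1, y1, np.minimum(x2, w - 1), np.minimum(y2, h - 1)], 1)

def warp_to_fixed_size(image, box, size=227):
    """Nearest-neighbor stand-in for the affine warp detailed in Section 2.1."""
    x1, y1, x2, y2 = box
    ys = np.linspace(y1, y2 - 1, size).astype(int)
    xs = np.linspace(x1, x2 - 1, size).astype(int)
    return image[np.ix_(ys, xs)]

def cnn_features(crops):
    """Hypothetical stand-in for the CNN forward pass: one 4096-d feature
    vector (fc7 in the paper) per warped 227x227 crop."""
    return np.random.default_rng(1).standard_normal((len(crops), 4096))

def detect(image, svm_W, svm_b):
    """R-CNN test-time flow: proposals -> warped crops -> features -> scores."""
    boxes = propose_regions(image)
    crops = [warp_to_fixed_size(image, b) for b in boxes]
    feats = cnn_features(crops)          # (2000, 4096)
    return boxes, feats @ svm_W + svm_b  # per-class scores, (2000, N)

# Toy usage: a blank image and random SVM weights for 20 classes.
img = np.zeros((480, 640, 3), dtype=np.uint8)
boxes, scores = detect(img, np.random.default_rng(2).standard_normal((4096, 20)), np.zeros(20))
```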
A second challenge faced in detection is that labeled data is scarce and the amount currently available is insufficient for training a large CNN. The conventional solution to this problem is to use unsupervised pre-training, followed by supervised fine-tuning (e.g., [31]). The second principal contribution of this paper is to show that supervised pre-training on a large auxiliary dataset (ILSVRC), followed by domain-specific fine-tuning on a small dataset (PASCAL), is an effective paradigm for learning high-capacity CNNs when data is scarce. In our experiments, fine-tuning for detection improves mAP performance by 8 percentage points. After fine-tuning, our system achieves a mAP of 54% on VOC 2010 compared to 33% for the highly-tuned, HOG-based deformable part model (DPM) [15, 18]. We also point readers to contemporaneous work by Donahue et al. [11], who show that Krizhevsky's CNN can be used (without fine-tuning) as a blackbox feature extractor, yielding excellent performance on several recognition tasks including scene classification, fine-grained sub-categorization, and domain adaptation.

Our system is also quite efficient. The only class-specific computations are a reasonably small matrix-vector product and greedy non-maximum suppression. This computational property follows from features that are shared across all categories and that are also two orders of magnitude lower-dimensional than previously used region features (cf. [34]).

Understanding the failure modes of our approach is also critical for improving it, and so we report results from the detection analysis tool of Hoiem et al. [21]. As an immediate consequence of this analysis, we demonstrate that a simple bounding box regression method significantly reduces mislocalizations, which are the dominant error mode.

Before developing technical details, we note that because R-CNN operates on regions it is natural to extend it to the task of semantic segmentation. With minor modifications, we also achieve competitive results on the PASCAL VOC segmentation task, with an average segmentation accuracy of 47.9% on the VOC 2011 test set.

2. Object detection with R-CNN

Our object detection system consists of three modules. The first generates category-independent region proposals. These proposals define the set of candidate detections available to our detector. The second module is a large convolutional neural network that extracts a fixed-length feature vector from each region. The third module is a set of class-specific linear SVMs. In this section, we present our design decisions for each module, describe their test-time usage, detail how their parameters are learned, and show results on PASCAL VOC 2010-12.

2.1. Module design

Region proposals. A variety of recent papers offer methods for generating category-independent region proposals. Examples include: objectness [1], selective search [34], category-independent object proposals [12], constrained parametric min-cuts (CPMC) [5], multi-scale combinatorial grouping [3], and Cireşan et al. [6], who detect mitotic cells by applying a CNN to regularly-spaced square crops, which are a special case of region proposals. While R-CNN is agnostic to the particular region proposal method, we use selective search to enable a controlled comparison with prior detection work (e.g., [34, 36]).

Feature extraction. We extract a 4096-dimensional feature vector from each region proposal using the Caffe [22] implementation of the CNN described by Krizhevsky et al. [23]. Features are computed by forward propagating a mean-subtracted 227 × 227 RGB image through five convolutional layers and two fully connected layers. We refer readers to [22, 23] for more network architecture details.

Figure 2: Warped training samples from VOC 2007 train.

In order to compute features for a region proposal, we must first convert the image data in that region into a form that is compatible with the CNN (its architecture requires inputs of a fixed 227 × 227 pixel size). Of the many possible transformations of our arbitrary-shaped regions, we opt for the simplest. Regardless of the size or aspect ratio of the candidate region, we warp all pixels in a tight bounding box around it to the required size. Prior to warping, we dilate the tight bounding box so that at the warped size there are exactly p pixels of warped image context around the original box (we use p = 16). Figure 2 shows a random sampling of warped training regions. The supplementary material discusses alternatives to warping.
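A rough sketch of this preprocessing step, assuming NumPy and Pillow. The padding arithmetic (pad each side by box_size · p / (227 − 2p), so that roughly p pixels of context survive the warp) is our reading of the procedure above, not code from the paper.

```python
import numpy as np
from PIL import Image

def warp_region(image, box, out_size=227, p=16):
    """Dilate the tight box so that, after warping to out_size x out_size,
    roughly p pixels of context surround the original box, then warp.

    image: HxWx3 uint8 array; box: (x1, y1, x2, y2) in pixel coordinates.
    """
    h, w = image.shape[:2]
    x1, y1, x2, y2 = box
    # Context that maps to p pixels at the warped scale.
    pad_x = (x2 - x1) * p / (out_size - 2 * p)
    pad_y = (y2 - y1) * p / (out_size - 2 * p)
    x1 = int(max(0, x1 - pad_x)); y1 = int(max(0, y1 - pad_y))
    x2 = int(min(w, x2 + pad_x)); y2 = int(min(h, y2 + pad_y))
    crop = Image.fromarray(image[y1:y2, x1:x2])
    return np.asarray(crop.resize((out_size, out_size), Image.BILINEAR))
```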

2.2. Test-time detection

At test time, we run selective search on the test image to extract around 2000 region proposals (we use selective search's "fast mode" in all experiments). We warp each proposal and forward propagate it through the CNN in order to read off features from the desired layer. Then, for each class, we score each extracted feature vector using the SVM trained for that class. Given all scored regions in an image, we apply a greedy non-maximum suppression (for each class independently) that rejects a region if it has an intersection-over-union (IoU) overlap with a higher scoring selected region larger than a learned threshold.

Run-time analysis. Two properties make detection efficient. First, all CNN parameters are shared across all categories. Second, the feature vectors computed by the CNN are low-dimensional when compared to other common approaches, such as spatial pyramids with bag-of-visual-word encodings. The features used in the UVA detection system [34], for example, are two orders of magnitude larger than ours (360k vs. 4k-dimensional).

The result of such sharing is that the time spent computing region proposals and features (13s/image on a GPU or 53s/image on a CPU) is amortized over all classes. The only class-specific computations are dot products between features and SVM weights and non-maximum suppression. In practice, all dot products for an image are batched into a single matrix-matrix product. The feature matrix is typically 2000 × 4096 and the SVM weight matrix is 4096 × N, where N is the number of classes.

This analysis shows that R-CNN can scale to thousands of object classes without resorting to approximate techniques, such as hashing. Even if there were 100k classes, the resulting matrix multiplication takes only 10 seconds on a modern multi-core CPU. This efficiency is not merely the result of using region proposals and shared features. The UVA system, due to its high-dimensional features, would be two orders of magnitude slower while requiring 134GB of memory just to store 100k linear predictors, compared to just 1.5GB for our lower-dimensional features.

It is also interesting to contrast R-CNN with the recent work from Dean et al. on scalable detection using DPMs and hashing [8]. They report a mAP of around 16% on VOC 2007 at a run-time of 5 minutes per image when introducing 10k distractor classes. With our approach, 10k detectors can run in about a minute on a CPU, and because no approximations are made mAP would remain at 59% (Section 3.2).
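The class-specific steps reduce to one matrix product plus greedy NMS. A NumPy sketch follows, with random data standing in for real features and SVM weights; the 0.3 rejection threshold is illustrative, since the paper learns the threshold.

```python
import numpy as np

def iou(box, boxes):
    """IoU overlap between one (x1, y1, x2, y2) box and an array of boxes."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def greedy_nms(boxes, scores, thresh=0.3):
    """Keep the top-scoring box, reject boxes overlapping it by > thresh, repeat."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        order = rest[iou(boxes[i], boxes[rest]) <= thresh]
    return keep

# Batched class-specific scoring: one matrix-matrix product covers all classes.
rng = np.random.default_rng(0)
feats = rng.standard_normal((2000, 4096))   # one row per region proposal
W, b = rng.standard_normal((4096, 20)), np.zeros(20)
scores = feats @ W + b                      # (2000, 20) proposal-by-class scores
boxes = rng.random((2000, 4)) * 500
boxes[:, 2:] = boxes[:, :2] + 1 + rng.random((2000, 2)) * 100
detections = {c: greedy_nms(boxes, scores[:, c]) for c in range(20)}
```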
2.3. Training

Supervised pre-training. We discriminatively pre-trained the CNN on a large auxiliary dataset (ILSVRC 2012) with image-level annotations (i.e., no bounding box labels). Pre-training was performed using the open source Caffe CNN library [22]. In brief, our CNN nearly matches the performance of Krizhevsky et al. [23], obtaining a top-1 error rate 2.2 percentage points higher on the ILSVRC 2012 validation set. This discrepancy is due to simplifications in the training process.

Domain-specific fine-tuning. To adapt our CNN to the new task (detection) and the new domain (warped VOC windows), we continue stochastic gradient descent (SGD) training of the CNN parameters using only warped region proposals from VOC. Aside from replacing the CNN's ImageNet-specific 1000-way classification layer with a randomly initialized 21-way classification layer (for the 20 VOC classes plus background), the CNN architecture is unchanged. We treat all region proposals with ≥ 0.5 IoU overlap with a ground-truth box as positives for that box's class and the rest as negatives. We start SGD at a learning rate of 0.001 (1/10th of the initial pre-training rate), which allows fine-tuning to make progress while not clobbering the initialization. In each SGD iteration, we uniformly sample 32 positive windows (over all classes) and 96 background windows to construct a mini-batch of size 128. We bias the sampling towards positive windows because they are extremely rare compared to background.
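A sketch of the biased mini-batch sampling just described, assuming proposal labels (0 = background, 1-20 = VOC class) have already been assigned by the 0.5-IoU rule; all names are illustrative.

```python
import numpy as np

def sample_minibatch(labels, rng, n_pos=32, n_bg=96):
    """Build one fine-tuning mini-batch of proposal indices: 32 positive
    windows (any class) and 96 background windows, 128 in total."""
    pos = np.flatnonzero(labels > 0)
    bg = np.flatnonzero(labels == 0)
    batch = np.concatenate([
        rng.choice(pos, size=n_pos, replace=len(pos) < n_pos),
        rng.choice(bg, size=n_bg, replace=False),
    ])
    rng.shuffle(batch)
    return batch  # feed these 128 warped windows to SGD at lr = 0.001

rng = np.random.default_rng(0)
# Toy labels: positives (classes 1..20) are rare, as in real proposal sets.
labels = np.where(rng.random(10_000) < 0.02, rng.integers(1, 21, 10_000), 0)
batch = sample_minibatch(labels, rng)
```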

Object category classifiers. Consider training a binary classifier to detect cars. It's clear that an image region tightly enclosing a car should be a positive example. Similarly, it's clear that a background region, which has nothing to do with cars, should be a negative example. Less clear is how to label a region that partially overlaps a car. We resolve this issue with an IoU overlap threshold, below which regions are defined as negatives. The overlap threshold, 0.3, was selected by a grid search over {0, 0.1, ..., 0.5} on a validation set. We found that selecting this threshold carefully is important. Setting it to 0.5, as in [34], decreased mAP by 5 points. Similarly, setting it to 0 decreased mAP by 4 points. Positive examples are defined simply to be the ground-truth bounding boxes for each class.

Once features are extracted and training labels are applied, we optimize one linear SVM per class. Since the training data is too large to fit in memory, we adopt the standard hard negative mining method [15, 32]. Hard negative mining converges quickly and in practice mAP stops increasing after only a single pass over all images.

In supplementary material we discuss why the positive and negative examples are defined differently in fine-tuning versus SVM training. We also discuss why it's necessary to train detection classifiers rather than simply use outputs from the final layer (fc8) of the fine-tuned CNN.
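A sketch of this label assignment for one class in one image, under the stated 0.3 IoU threshold, plus one round of standard hard negative mining (assuming a scikit-learn style linear SVM); everything here is illustrative.

```python
import numpy as np

def iou_pair(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) form."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def svm_examples(proposals, gt_boxes, neg_thresh=0.3):
    """Positives are the ground-truth boxes themselves; proposals whose best
    IoU with every ground-truth box of the class falls below neg_thresh are
    negatives; proposals in between are ignored for SVM training."""
    positives = list(gt_boxes)
    negatives = [p for p in proposals
                 if all(iou_pair(p, g) < neg_thresh for g in gt_boxes)]
    return positives, negatives

def mine_hard_negatives(svm, feats_neg, max_keep=5000):
    """One mining round: keep the negatives the current SVM scores highest
    (i.e., gets most wrong), to be added to the next training set."""
    scores = svm.decision_function(feats_neg)
    return np.argsort(scores)[::-1][:max_keep]
```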
2.4. Results on PASCAL VOC 2010-12

Following the PASCAL VOC best practices [13], we validated all design decisions and hyperparameters on the VOC 2007 dataset (Section 3.2). For final results on the VOC 2010-12 datasets, we fine-tuned the CNN on VOC 2012 train and optimized our detection SVMs on VOC 2012 trainval. We submitted test results to the evaluation server only once for each of the two major algorithm variants (with and without bounding box regression).

Table 1 shows complete results on VOC 2010. We compare our method against four strong baselines, including SegDPM [16], which combines DPM detectors with the output of a semantic segmentation system [4] and uses additional inter-detector context and image-classifier rescoring. The most germane comparison is to the UVA system from Uijlings et al. [34], since our systems use the same region proposal algorithm. To classify regions, their method builds a four-level spatial pyramid and populates it with densely sampled SIFT, Extended OpponentSIFT, and RGB-SIFT descriptors, each vector quantized with 4000-word codebooks. Classification is performed with a histogram intersection kernel SVM. Compared to their multi-feature, non-linear kernel SVM approach, we achieve a large improvement in mAP, from 35.1% to 53.7% mAP, while also being much faster (Section 2.2). Our method achieves similar performance (53.3% mAP) on VOC 2011/12 test.

VOC 2010 test      mAP
DPM v5 [18]†       33.4
UVA [34]           35.1
Regionlets [36]    39.7
SegDPM [16]†       40.4
R-CNN              50.2
R-CNN BB           53.7

Table 1: Detection average precision (%) on VOC 2010 test (mAP column; the per-class columns are not legible in this transcription). R-CNN is most directly comparable to UVA and Regionlets since all methods use selective search region proposals. Bounding box regression (BB) is described in Section 3.4. At publication time, SegDPM was the top-performer on the PASCAL VOC leaderboard. † DPM and SegDPM use context rescoring not used by the other methods.

3. Visualization, ablation, and modes of error

3.1. Visualizing learned features

First-layer filters can be visualized directly and are easy to understand [23]. They capture oriented edges and opponent colors. Understanding the subsequent layers is more challenging. Zeiler and Fergus present a visually attractive deconvolutional approach in [37]. We propose a simple (and complementary) non-parametric method that directly shows what the network learned.

The idea is to single out a particular unit (feature) in the network and use it as if it were an object detector in its own right. That is, we compute the unit's activations on a large set of held-out region proposals (about 10 million), sort the proposals from highest to lowest activation, perform non-maximum suppression, and then display the top-scoring regions. Our method lets the selected unit "speak for itself" by showing exactly which inputs it fires on. We avoid averaging in order to see different visual modes and gain insight into the invariances computed by the unit.

We visualize units from layer pool5, which is the max-pooled output of the network's fifth and final convolutional layer. The pool5 feature map is 6 × 6 × 256 = 9216-dimensional. Ignoring boundary effects, each pool5 unit has a receptive field of 195 × 195 pixels in the original 227 × 227 pixel input. A central pool5 unit has a nearly global view, while one near the edge has a smaller, clipped support.
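A sketch of this visualization procedure; pool5_feats is assumed to hold the 9216-d pool5 activations for the held-out proposals, and the NMS is the same greedy scheme as in Section 2.2.

```python
import numpy as np

def iou_one_vs_many(box, boxes):
    """IoU between one (x1, y1, x2, y2) box and an array of boxes."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def top_regions_for_unit(pool5_feats, boxes, unit, k=16, nms_thresh=0.3):
    """Treat one pool5 unit as a detector: sort held-out proposals by its
    activation, greedily drop near-duplicate boxes, return the top k."""
    order = np.argsort(pool5_feats[:, unit])[::-1]
    boxes, acts = boxes[order], pool5_feats[order, unit]
    keep = []
    for i in range(len(boxes)):
        if not keep or iou_one_vs_many(boxes[i], boxes[keep]).max() <= nms_thresh:
            keep.append(i)
        if len(keep) == k:
            break
    return boxes[keep], acts[keep]
```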

Each row in Figure 3 displays the top 16 activations for a pool5 unit from a CNN that we fine-tuned on VOC 2007 trainval. Six of the 256 functionally unique units are visualized (the supplementary material includes more). These units were selected to show a representative sample of what the network learns. In the second row, we see a unit that fires on dog faces and dot arrays. The unit corresponding to the third row is a red blob detector. There are also detectors for human faces and more abstract patterns such as text and triangular structures with windows. The network appears to learn a representation that combines a small number of class-tuned features together with a distributed representation of shape, texture, color, and material properties. The subsequent fully connected layer fc6 has the ability to model a large set of compositions of these rich features.

Figure 3: Top regions for six pool5 units. Receptive fields and activation values are drawn in white. Some units are aligned to concepts, such as people (row 1) or text (4). Other units capture texture and material properties, such as dot arrays (2) and specular reflections (6).

Table 2: Detection average precision (%) on VOC 2007 test. Rows 1-3 show R-CNN performance without fine-tuning. Rows 4-6 show results for the CNN pre-trained on ILSVRC 2012 and then fine-tuned (FT) on VOC 2007 trainval. Row 7 includes a simple bounding box regression (BB) stage that reduces localization errors (Section 3.4). Rows 8-10 present DPM methods as a strong baseline. The first uses only HOG, while the next two use different feature learning approaches to augment or replace HOG.

3.2. Ablation studies

Performance layer-by-layer, without fine-tuning. To understand which layers are critical for detection performance, we analyzed results on the VOC 2007 dataset for each of the CNN's last three layers. Layer pool5 was briefly described in Section 3.1. The final two layers are summarized below.

Layer fc6 is fully connected to pool5. To compute features, it multiplies a 4096 × 9216 weight matrix by the pool5 feature map (reshaped as a 9216-dimensional vector) and then adds a vector of biases. This intermediate vector is component-wise half-wave rectified (x → max(0, x)).

Layer fc7 is the final layer of the network. It is implemented by multiplying the features computed by fc6 by a 4096 × 4096 weight matrix, and similarly adding a vector of biases and applying half-wave rectification.

We start by looking at results from the CNN without fine-tuning on PASCAL, i.e. all CNN parameters were pre-trained on ILSVRC 2012 only. Analyzing performance layer-by-layer (Table 2 rows 1-3) reveals that features from fc7 generalize worse than features from fc6. This means that 29%, or about 16.8 million, of the CNN's parameters can be removed without degrading mAP. More surprising is that removing both fc7 and fc6 produces quite good results even though pool5 features are computed using only 6% of the CNN's parameters. Much of the CNN's representational power comes from its convolutional layers, rather than from the much larger densely connected layers. This finding suggests potential utility in computing a dense feature map, in the sense of HOG, of an arbitrary-sized image by using only the convolutional layers of the CNN. This representation would enable experimentation with sliding-window detectors, including DPM, on top of pool5 features.
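In code, fc6 and fc7 are just matrix-vector products followed by rectification; a NumPy sketch with random matrices standing in for the learned weights:

```python
import numpy as np

rng = np.random.default_rng(0)
pool5 = rng.standard_normal(6 * 6 * 256)   # flattened 9216-d pool5 map

W6, b6 = rng.standard_normal((4096, 9216)), rng.standard_normal(4096)
W7, b7 = rng.standard_normal((4096, 4096)), rng.standard_normal(4096)
# W7 alone holds about 16.8M parameters: the removable 29% noted above.

relu = lambda x: np.maximum(0, x)          # component-wise half-wave rectification
fc6 = relu(W6 @ pool5 + b6)                # 4096-d
fc7 = relu(W7 @ fc6 + b7)                  # 4096-d, the default R-CNN feature
```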

Performance layer-by-layer, with fine-tuning. We now look at results from our CNN after having fine-tuned its parameters on VOC 2007 trainval. The improvement is striking (Table 2 rows 4-6): fine-tuning increases mAP by 8.0 percentage points to 54.2%. The boost from fine-tuning is much larger for fc6 and fc7 than for pool5, which suggests that the pool5 features learned from ImageNet are general and that most of the improvement is gained from learning domain-specific non-linear classifiers on top of them.

Comparison to recent feature learning methods. Relatively few feature learning methods have been tried on PASCAL VOC detection. We look at two recent approaches that build on deformable part models. For reference, we also include results for the standard HOG-based DPM [18].

The first DPM feature learning method, DPM ST [26], augments HOG features with histograms of "sketch token" probabilities. Intuitively, a sketch token is a tight distribution of contours passing through the center of an image patch. Sketch token probabilities are computed at each pixel by a random forest that was trained to classify 35 × 35 pixel patches into one of 150 sketch tokens or background.

The second method, DPM HSC [28], replaces HOG with histograms of sparse codes (HSC). To compute an HSC, sparse code activations are solved for at each pixel using a learned dictionary of 100 7 × 7 pixel (grayscale) atoms. The resulting activations are rectified in three ways (full and both half-waves), spatially pooled, unit ℓ2 normalized, and then power transformed (x → sign(x)|x|^α).

All R-CNN variants strongly outperform the three DPM baselines (Table 2 rows 8-10), including the two that use feature learning. Compared to the latest version of DPM, which uses only HOG features, our mAP is more than 20 percentage points higher: 54.2% vs. 33.7%, a 61% relative improvement. The combination of HOG and sketch tokens yields 2.5 mAP points over HOG alone, while HSC improves over HOG by 4 mAP points (when compared internally to their private DPM baselines; both use non-public implementations of DPM that underperform the open source version [18]). These methods achieve mAPs of 29.1% and 34.3%, respectively.

3.3. Detection error analysis

We applied the excellent detection analysis tool from Hoiem et al. [21] in order to reveal our method's error modes, understand how fine-tuning changes them, and to see how our error types compare with DPM. A full summary of the analysis tool is beyond the scope of this paper and we encourage readers to consult [21] to understand some finer details (such as "normalized AP"). Since the analysis is best absorbed in the context of the associated plots, we present the discussion within the captions of Figure 4 and Figure 5.

Figure 4: Distribution of top-ranked false positive (FP) types. Each plot shows the evolving distribution of FP types as more FPs are considered in order of decreasing score. Each FP is categorized into 1 of 4 types: Loc: poor localization (a detection with an IoU overlap with the correct class between 0.1 and 0.5, or a duplicate); Sim: confusion with a similar category; Oth: confusion with a dissimilar object category; BG: a FP that fired on background. Compared with DPM (see [21]), significantly more of our errors result from poor localization, rather than confusion with background or other object classes, indicating that the CNN features are much more discriminative than HOG. Loose localization likely results from our use of bottom-up region proposals and the positional invariance learned from pre-training the CNN for whole-image classification. Column three shows how our simple bounding box regression method fixes many localization errors.
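A rough sketch of this four-way FP taxonomy. The Loc rule is taken from the caption above; the precedence among Sim, Oth, and BG and the 0.1 overlap cutoff for Oth are our assumptions rather than details given here (see [21] for the exact definitions).

```python
def fp_type(iou_same_class, max_iou_any_object, similar_class_hit):
    """Categorize one false positive following the Figure 4 taxonomy.

    iou_same_class: best IoU with a ground-truth box of the predicted class.
    max_iou_any_object: best IoU with any ground-truth object in the image.
    similar_class_hit: True if the detection overlaps a ground-truth box of
        a category deemed similar to the predicted one (groupings per [21]).
    """
    if 0.1 <= iou_same_class < 0.5:
        return "Loc"   # poor localization (duplicates also fall here)
    if similar_class_hit:
        return "Sim"   # confusion with a similar category
    if max_iou_any_object >= 0.1:  # assumed cutoff, not stated in the caption
        return "Oth"   # confusion with a dissimilar object category
    return "BG"        # fired on background
```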
3.4. Bounding box regression

Based on the error analysis, we implemented a simple method to reduce localization errors. Inspired by the bounding box regression employed in DPM [15], we train a linear regression model to predict a new detection window given the pool5 features for a selective search region proposal. Full details are given in the supplementary material. Results in Table 1, Table 2, and Figure 4 show that this simple approach fixes a large number of mislocalized detections, boosting mAP by 3 to 4 points.
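As a hedged sketch of what such a regressor can look like: ridge regression from pool5 features to box deltas, using the center/log-size parameterization described in the R-CNN supplementary material (the regularization constant here is illustrative).

```python
import numpy as np

def box_deltas(p, g):
    """Regression targets mapping proposal boxes p to ground-truth boxes g.
    Boxes are rows of (cx, cy, w, h)."""
    return np.stack([(g[:, 0] - p[:, 0]) / p[:, 2],
                     (g[:, 1] - p[:, 1]) / p[:, 3],
                     np.log(g[:, 2] / p[:, 2]),
                     np.log(g[:, 3] / p[:, 3])], axis=1)

def fit_ridge(feats, targets, lam=1000.0):
    """Per-class ridge regression W: (d, 4), from pool5 features to deltas."""
    d = feats.shape[1]
    A = feats.T @ feats + lam * np.eye(d)
    return np.linalg.solve(A, feats.T @ targets)

def apply_deltas(p, deltas):
    """Refine proposal boxes with predicted deltas; inverse of box_deltas."""
    cx = p[:, 0] + p[:, 2] * deltas[:, 0]
    cy = p[:, 1] + p[:, 3] * deltas[:, 1]
    w = p[:, 2] * np.exp(deltas[:, 2])
    h = p[:, 3] * np.exp(deltas[:, 3])
    return np.stack([cx, cy, w, h], axis=1)
```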

4. Semantic segmentation

Region classification is a standard technique for semantic segmentation, allowing us to easily apply R-CNN to the PASCAL VOC segmentation challenge. To facilitate a direct comparison with the current leading semantic segmentation system (called O2P for "second-order pooling") [4], we work within their open source framework. O2P uses CPMC to generate 150 region proposals per image and then predicts the quality
