Object Detection and Semantic Segmentation

Object Detection and Semantic Segmentation
Danna Gurari
University of Colorado Boulder
Spring 2022
https://home.cs.colorado.edu/ rse.html

Review of last lecture:
- Representation learning: pretrained features, fine-tuning
- Training neural networks: hardware & software
- Programming tutorial
Assignments (Canvas):
- Lab assignment 2 was due earlier today
- Problem set 3 is due next week
Questions?

Today’s Topics
- Problems
- Applications
- PASCAL VOC detection challenge: R-CNNs
- PASCAL VOC semantic segmentation challenge: fully convolutional networks

Problem Definition: Localize content of interest
- Object detection
- Semantic segmentation
Lin et al. Microsoft COCO: Common Objects in Context. ECCV 2014.

Problem Definition: Localize content of interest
Object detection: use a bounding box to locate every instance of an object from pre-specified categories.
Lin et al. Microsoft COCO: Common Objects in Context. ECCV 2014.

Problem Definition: Localize content of interest
Semantic segmentation: locate all pixels that belong to pre-specified categories.
Lin et al. Microsoft COCO: Common Objects in Context. ECCV 2014.

Problem Definition: Localize content of interest
Semantic segmentation: locate all pixels that belong to pre-specified categories. Note: instances of the same class are NOT separated.
Image Source: Jeong, Yoon, and Park. Sensors 2018.

Object Segmentation vs. Detection
Why choose object “segmentation” vs. “detection”?
http://mmcheng.net/msra10k/

Today’s Topics
- Problems
- Applications
- PASCAL VOC detection challenge: R-CNNs
- PASCAL VOC semantic segmentation challenge: fully convolutional networks

Social Media: Face detection (e.g., Facebook)

Banking: Mobile check deposit (e.g., Bank of America)

Transportation: License plate detection (e.g., AllGoVision)

Construction Safety: Pedestrian detection (e.g., rochure/3435/brochure.pdf)

Counting: Counting fish (e.g., SalmonSoft), http://www.wecountfish.com/?page_id=143
Business traffic analytics

Remodeling Inspiration (Bell et al., SIGGRAPH 2013)

Rotoscoping (many examples on creening

Disease Diagnosis
Figure Source: sification

Face Makeover
Demo: ools

Self-Driving Vehicles
Figure Source: -powered-by-people-playing-games-mighty-ai.html

Can you think of any other applications?

How to solve problems like these with neural networks?

Today’s Topics
- Problems
- Applications
- PASCAL VOC detection challenge: R-CNNs
- PASCAL VOC semantic segmentation challenge: fully convolutional networks

VOC Challenge
- Goal: locate all instances of 20 object categories with bounding boxes (BBs)
- Dataset: 11,530 images collected from Flickr and annotated by annotators at the University of Leeds
Mark Everingham, Luc Van Gool, Christopher K. I. Williams, John Winn, and Andrew Zisserman. The PASCAL Visual Object Classes (VOC) Challenge. IJCV 2010.

MSCOCO Challenge: Evaluation Metric (IoU)
The algorithm’s box is scored against the ground-truth box by their overlap (IoU).

MSCOCO Challenge: Evaluation Metric (IoU)
Ground truth vs. algorithm: intersection / union = 28 / 47 (60%)
Then threshold the score: e.g., 50% or greater means a correct detection!
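To make the IoU score concrete, here is a minimal sketch (not from the lecture; boxes are assumed to be given as corner coordinates) of scoring one predicted box against one ground-truth box:

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A detection counts as correct when IoU >= 0.5
print(iou([10, 10, 60, 60], [15, 15, 65, 65]) >= 0.5)  # True
```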

MSCOCO Challenge: Evaluation Metric (mAP)
For each object class (e.g., cat, dog, ...), compute:
- Precision: the fraction of correct detections out of all detections, when using a 0.5 IoU threshold
- Each algorithm bounding box comes with a confidence score
[Russakovsky et al.; IJCV] age-precision-for-object-detection-45c121a31173
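As a rough sketch of the precision computation (illustrative only; it reuses the iou() helper above and greedily matches each detection to a still-unmatched ground-truth box of the same class):

```python
def precision_at_iou(detections, gt_boxes, iou_thresh=0.5):
    """detections: list of (box, confidence); gt_boxes: list of ground-truth boxes.
    Returns the fraction of detections that correctly match a ground-truth box."""
    matched = set()
    true_positives = 0
    for box, _conf in sorted(detections, key=lambda d: d[1], reverse=True):
        # Find the best still-unmatched ground-truth box for this detection
        best_iou, best_idx = 0.0, None
        for i, gt in enumerate(gt_boxes):
            if i in matched:
                continue
            overlap = iou(box, gt)
            if overlap > best_iou:
                best_iou, best_idx = overlap, i
        if best_iou >= iou_thresh:
            true_positives += 1
            matched.add(best_idx)
    return true_positives / max(len(detections), 1)
```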

Naïve Solution: Sliding Window Approach
Slide a classifier window over every image position and ask at each location: person? car?
Would a single window position detect the person? Would it detect the car?
Image Source: own/

Naïve Solution: Sliding Window Approach
Need to test windows of different scales: would this scale detect the person? Would it detect the car?
Image Source: own/

Naïve Solution: Sliding Window Approach
Need to test windows of different aspect ratios: would this aspect ratio detect the person?
Image Source: own/

Naïve Solution: Sliding Window Approach
Sliding window approach: must test different locations at
- different scales
- different aspect ratios (e.g., person vs. car, or a car photographed from different angles)
Number of regions to test (e.g., for a 1920 x 1080 image)? It easily explodes to hundreds of thousands or millions of windows.
Key limitation: very slow!
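To see why this explodes, here is a back-of-the-envelope sketch (the stride, scales, and aspect ratios are illustrative choices, not values from the lecture) counting candidate windows for a 1920 x 1080 image:

```python
def count_windows(width=1920, height=1080, stride=8,
                  scales=(64, 128, 256), aspect_ratios=(0.5, 1.0, 2.0)):
    """Count sliding-window positions over all scales and aspect ratios."""
    total = 0
    for scale in scales:
        for ratio in aspect_ratios:
            win_w = int(scale * ratio ** 0.5)   # window width for this shape
            win_h = int(scale / ratio ** 0.5)   # window height for this shape
            cols = max(0, (width - win_w) // stride + 1)
            rows = max(0, (height - win_h) // stride + 1)
            total += cols * rows
    return total

print(count_windows())  # already on the order of a few hundred thousand windows
```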

Historical Context: R-CNN Methods
(Timeline figure: gradient descent (1847), the first programmable machines and the Turing test (1945-1950), the perceptron and early machine learning (1950s), neural networks that learn with backpropagation (1980s), CNNs such as LeNet on MNIST (1990s), AlexNet on ImageNet (2012), and R-CNN, Fast R-CNN, and Faster R-CNN (2014-15).)

R-CNN
- First CNN to outperform hand-crafted features on detection challenges
- Named after its technique: Region proposals with CNN features
Figure Source: on-algorithms-part-1/

R-CNN
Locate “object”-like regions using objectness methods:
- Considerably fewer regions than the sliding window approach
- Regions likely contain objects of interest (i.e., high recall)
Figure Source: on-algorithms-part-1/

R-CNN
Figure Source: on-algorithms-part-1/

Describe Each Region with a Fixed-length Vector
Fine-tune: replace the final layer (FC8) of a pretrained AlexNet (trained on ImageNet) so that it contains the number of categories in the detection dataset, and train for image classification (use the max-IoU class if IoU ≥ 0.5).
How many classes would be predicted in fine-tuning?
Input: 227 x 227 x 3 image
Image Source: onvolutional-layers fig2 312303454

Describe Each Region with a Fixed-length Vector
Use the FC7 layer from a fine-tuned AlexNet model.
Input: 227 x 227 x 3 image
Image Source: onvolutional-layers fig2 312303454

Describe Each Region with a Fixed-length Vector
Challenge: how to resize a proposed region to the required size?
Image Source: onvolutional-layers fig2 312303454

Describe Each Region with a Fixed-length Vector
The region is anisotropically scaled (warped) to fit the required resolution.
Image Source: onvolutional-layers fig2 312303454

Describe Each Region with a Fixed-length Vector
Input: 227 x 227 x 3 image
Image Source: onvolutional-layers fig2 312303454

R-CNN
1. A classifier is trained to use a region’s CNN feature to assign a category from the pre-defined set
2. A regressor is trained to refine each region’s position, width, and height
Figure Source: on-algorithms-part-1/

R-CNN: Region Refinement
The original region proposal, with center (px, py), width (pw), and height (ph), is refined using model parameters (dx, dy, dw, dh).
Image Source: ox-regression
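The refinement follows the standard R-CNN box-regression transform: the center is shifted proportionally to the proposal size, and the width and height are rescaled by exponentiated offsets. A minimal sketch (function and variable names are mine):

```python
import math

def refine_box(px, py, pw, ph, dx, dy, dw, dh):
    """Apply box-regression offsets to a proposal with center (px, py),
    width pw, and height ph, returning the refined box."""
    gx = pw * dx + px          # shift the center, scaled by the proposal size
    gy = ph * dy + py
    gw = pw * math.exp(dw)     # rescale width and height (log-space offsets)
    gh = ph * math.exp(dh)
    return gx, gy, gw, gh
```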

R-CNN Limitations
- Slow training procedure
- Must train three models
- Slow at test time (~1 minute per image)
Figure Source: on-algorithms-part-1/

Fast R-CNN: Single-Stage Training (rather than 3)
- For each region, assign it to a class and refine it
- Extract a feature description per proposed region from the section of the feature map corresponding to that region
Figure Source: on-algorithms-part-1/

Fast R-CNN Training: Multi-task Loss
The loss for each region proposal is the sum of a classification loss and a localization loss:
- Classification: softmax loss between the softmax scores and the true label
- Localization: regression loss between the predicted box coordinates (x, y, w, h) and the true location (x’, y’, w’, h’)
Total loss = classification loss + localization loss

Fast R-CNN Training: Classification Loss (Recap)
Greater penalty when the predicted probability of the true class is confidently wrong; lesser penalty otherwise.
Figure source: 3/softmax-and-the-negative-log-likelihood/

Fast R-CNN Training: Measuring the Localization Loss
A robust (smooth L1) regression loss between the true location for the true class “u” and the predicted location for class u; it is less sensitive to outliers than SSE.
Image Source: ox-regression
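A minimal PyTorch-style sketch of the multi-task loss for one region proposal (function and variable names are mine; the classification term is the softmax cross-entropy loss and the localization term is the smooth L1 loss, weighted by lam):

```python
import torch
import torch.nn.functional as F

def fast_rcnn_loss(class_scores, true_label, pred_box, true_box, lam=1.0):
    """Multi-task loss for one region proposal: classification + localization."""
    cls_loss = F.cross_entropy(class_scores.unsqueeze(0),
                               true_label.unsqueeze(0))   # softmax loss
    loc_loss = F.smooth_l1_loss(pred_box, true_box)       # robust box loss
    return cls_loss + lam * loc_loss

# Example with made-up tensors: 21 classes, boxes as (x, y, w, h)
scores, label = torch.randn(21), torch.tensor(3)
loss = fast_rcnn_loss(scores, label, torch.randn(4), torch.randn(4))
```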

Fast R-CNN: Limitation
Still requires the slow, initial step of generating region proposals.
Figure Source: on-algorithms-part-1/

Faster R-CNN
- Adds region proposal generation to the network, so that all parts of the model are learned in an end-to-end fashion
- Convolutional layers are shared between region proposal and detection
Shaoqing Ren et al. “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks.” NeurIPS 2015.

Faster R-CNN: Region Proposal Network
Outputs per anchor: the probability of object / not object, and parameters to refine the anchor box to match the ground-truth box (center, width, and height).
- Based on convolution, so it uses a sliding window
- At each sliding window position, region proposals are predicted with respect to an anchor point (i.e., the center of the sliding window position)
- At each anchor point, k = 9 anchors are used to represent 3 scales and 3 aspect ratios
Shaoqing Ren et al. “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks.” NeurIPS 2015.
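A small sketch (my own variable names; the scales and ratios roughly follow the paper) of how the k = 9 anchors at a single anchor point can be generated from 3 scales and 3 aspect ratios:

```python
def make_anchors(cx, cy, scales=(128, 256, 512), aspect_ratios=(0.5, 1.0, 2.0)):
    """Return k = 9 anchor boxes (x1, y1, x2, y2) centered at (cx, cy)."""
    anchors = []
    for scale in scales:
        for ratio in aspect_ratios:
            w = scale * ratio ** 0.5   # keep the box area near scale**2
            h = scale / ratio ** 0.5
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

print(len(make_anchors(100, 100)))  # 9 anchors per anchor point
```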

Faster R-CNN: Region Proposal Network
At training time, the loss for each region proposal is the sum of a classification loss and a localization loss.
Shaoqing Ren et al. “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks.” NeurIPS 2015.

Faster R-CNN Training
1. Train the RPN
2. Train Fast R-CNN using proposals from the pretrained RPN
3. Fine-tune the layers unique to the RPN
4. Fine-tune the fully connected layers of Fast R-CNN
Shaoqing Ren et al. “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks.” NeurIPS 2015.

Today’s Topics
- Problems
- Applications
- PASCAL VOC detection challenge: R-CNNs
- PASCAL VOC semantic segmentation challenge: fully convolutional networks

VOC Challenge
- Goal: locate all pixels belonging to 20 categories (e.g., person, cat, bus, motorbike, potted plant, bottle) plus background
- Dataset: 11,530 images collected from Flickr and annotated by annotators at the University of Leeds
Mark Everingham, Luc Van Gool, Christopher K. I. Williams, John Winn, and Andrew Zisserman. The PASCAL Visual Object Classes (VOC) Challenge. IJCV 2010.

VOC Challenge: Evaluation Metric (IoU)
Ground-truth pixels vs. algorithm pixels are scored by their overlap (IoU), e.g., intersection / union = 19 / 27.
Mean IoU: the IoU between predicted and ground-truth pixels, averaged over all 21 categories.
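A minimal NumPy sketch of mean IoU (assuming the prediction and ground truth are integer label maps of the same shape; classes absent from both maps are skipped):

```python
import numpy as np

def mean_iou(pred, gt, num_classes=21):
    """Mean IoU between two integer label maps of shape (H, W)."""
    ious = []
    for c in range(num_classes):
        pred_c, gt_c = (pred == c), (gt == c)
        union = np.logical_or(pred_c, gt_c).sum()
        if union == 0:                 # class absent in both maps; skip it
            continue
        inter = np.logical_and(pred_c, gt_c).sum()
        ious.append(inter / union)
    return float(np.mean(ious))
```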

Architecture
Input: RGB image of ANY size. Output: image of the same size as the input.
For each image pixel, the probability of each class is predicted.
Long, Shelhamer, and Darrell. Fully Convolutional Networks for Semantic Segmentation. CVPR 2015.

Architecture: Output Layer
e.g., assume a 5-class classifier
Source: https://www.jeremyjordan.me/semantic-segmentation/

Architecture: Output Layer
e.g., assume a 5-class classifier; the one-hot, per-class output is collapsed into a single mask image
Source: https://www.jeremyjordan.me/semantic-segmentation/
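A minimal sketch of that collapse (assuming the network outputs a num_classes x H x W array of per-pixel class probabilities): take the argmax over the class dimension at every pixel.

```python
import numpy as np

def collapse_to_mask(class_probs):
    """class_probs: array of shape (num_classes, H, W).
    Returns an (H, W) mask holding each pixel's most likely class index."""
    return np.argmax(class_probs, axis=0)

# Example: 5 classes on a 4 x 6 image
mask = collapse_to_mask(np.random.rand(5, 4, 6))
```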

Architecture
Input: RGB image of ANY size. Output: image of the same size as the input.
How many classes are there? 21. Why 21? 20 object classes plus background.
Long, Shelhamer, and Darrell. Fully Convolutional Networks for Semantic Segmentation. CVPR 2015.

Architecture
Do you recognize this architecture?
Long, Shelhamer, and Darrell. Fully Convolutional Networks for Semantic Segmentation. CVPR 2015.

Architecture
Can use your favorite pretrained ImageNet classifier: AlexNet, VGG, GoogLeNet.
Long, Shelhamer, and Darrell. Fully Convolutional Networks for Semantic Segmentation. CVPR 2015.

Architecture
To make the architecture fully convolutional, the fully connected layers are converted to convolutional layers. In the absence of fully connected layers, there are no constraints on the number of input nodes (and so any input image size can be supported).
Long, Shelhamer, and Darrell. Fully Convolutional Networks for Semantic Segmentation. CVPR 2015.
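As a sketch of this "convolutionalization" in PyTorch (layer sizes follow VGG16's fc6/fc7 head; this is an illustration, not the authors' code): a fully connected layer that acted on a 7 x 7 x 512 feature map becomes a convolution whose kernel covers the full 7 x 7 window, and later fully connected layers become 1 x 1 convolutions.

```python
import torch.nn as nn

# VGG16's classifier written as fully connected layers:
#   fc6: Linear(512 * 7 * 7 -> 4096), fc7: Linear(4096 -> 4096)
# The equivalent fully convolutional head (accepts any sufficiently large input):
fully_conv_head = nn.Sequential(
    nn.Conv2d(512, 4096, kernel_size=7),   # fc6 as a 7x7 convolution
    nn.ReLU(inplace=True),
    nn.Conv2d(4096, 4096, kernel_size=1),  # fc7 as a 1x1 convolution
    nn.ReLU(inplace=True),
    nn.Conv2d(4096, 21, kernel_size=1),    # per-location scores for 21 classes
)
```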

Architecture
Another result of this change is that, unlike for classification, a class can be assigned to each “coarse region.”
Long, Shelhamer, and Darrell. Fully Convolutional Networks for Semantic Segmentation. CVPR 2015.

Architecture: Coarse Region Classification (Recall Intuition)
Using VGG16 instead:
Long, Shelhamer, and Darrell. Fully Convolutional Networks for Semantic Segmentation. CVPR 2015.

Architecture: Coarse Region Classification (Recall Intuition)
Using VGG16 instead: each line represents a convolutional layer, and the grids reflect the relative spatial coarseness at each layer.
Long, Shelhamer, and Darrell. Fully Convolutional Networks for Semantic Segmentation. CVPR 2015.

Architecture: Coarse Region Classification (Recall Intuition)
Stacking many convolutional layers leads to learning patterns in increasingly larger regions of the input (e.g., pixel) space.
nvnets.html

Architecture: Fully Connected vs. Convolutional Layers
Each slice indicates the likelihood that each pixel in the coarse region belongs to the class identified by the filter.
Long, Shelhamer, and Darrell. Fully Convolutional Networks for Semantic Segmentation. CVPR 2015.

Architecture: Fully Connected vs. Convolutional Layers
If convolutionalizing ImageNet-trained classifiers, how many classes would be predicted for each coarse region?
Long, Shelhamer, and Darrell. Fully Convolutional Networks for Semantic Segmentation. CVPR 2015.

Architecture: Coarse Region Classification
Locates the 20 object classes plus background for VOC.
Long, Shelhamer, and Darrell. Fully Convolutional Networks for Semantic Segmentation. CVPR 2015.

Architecture
Challenge: how to decode from coarse region classifications to a per-pixel classification?
Long, Shelhamer, and Darrell. Fully Convolutional Networks for Semantic Segmentation. CVPR 2015.

Architecture: Upsampling (Many Approaches)
Source: http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture11.pdf

Architecture: Upsampling (Transposed Convolutional Layer)
- Also called a “fractional convolutional layer”, “backward convolution”, and, incorrectly, a “deconvolution layer”
- Idea: learn filters with a fractionally sized stride to upsample the coarse image while refining it based on the filter values (padding and stride determine how the intermediate values are computed)
l-reconstructing-the-original-input
https://d2l.ai/chapter_computer-vision/transposed-conv.html
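A minimal PyTorch sketch of learned upsampling with a transposed convolution (the sizes are illustrative): a stride-32 coarse score map is brought back to the input resolution, as in FCN-32s.

```python
import torch
import torch.nn as nn

# Upsample 21-channel coarse scores by a factor of 32.
upsample = nn.ConvTranspose2d(21, 21, kernel_size=64, stride=32, padding=16)

coarse = torch.randn(1, 21, 7, 7)   # coarse per-class scores
fine = upsample(coarse)             # shape: (1, 21, 224, 224)
```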

Architecture
Next challenge: how to decode a highly detailed per-pixel classification from the coarse region classifications?
Long, Shelhamer, and Darrell. Fully Convolutional Networks for Semantic Segmentation. CVPR 2015.

Architecture: Results
Next challenge: how to decode a highly detailed per-pixel classification from the coarse region classifications?
Figure source: https://www.jeremyjordan.me/semantic-segmentation/

Architecture: Update to Use Skip Connections
Trained 1 more day to update the FCN-32 model.
FCN-16: sums the predictions from lower-level, more fine-grained features (pool4) with the predictions from the coarser features.
Long, Shelhamer, and Darrell. Fully Convolutional Networks for Semantic Segmentation. CVPR 2015.
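A rough sketch of the FCN-16 skip connection (tensor shapes are illustrative and the layer names are mine): the stride-32 predictions are upsampled by 2x and summed with predictions made from the finer pool4 features, and the fused map is then upsampled by 16x.

```python
import torch
import torch.nn as nn

num_classes = 21
score_pool4 = nn.Conv2d(512, num_classes, kernel_size=1)   # predict from pool4
upsample_2x = nn.ConvTranspose2d(num_classes, num_classes,
                                 kernel_size=4, stride=2, padding=1)
upsample_16x = nn.ConvTranspose2d(num_classes, num_classes,
                                  kernel_size=32, stride=16, padding=8)

pool4 = torch.randn(1, 512, 14, 14)                 # stride-16 features
coarse_scores = torch.randn(1, num_classes, 7, 7)   # stride-32 predictions

fused = score_pool4(pool4) + upsample_2x(coarse_scores)  # skip-connection sum
output = upsample_16x(fused)                             # (1, 21, 224, 224)
```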

Architecture: Results
Skip connections support capturing finer-grained detail while retaining the correct semantic information!
Figure source: https://www.jeremyjordan.me/semantic-segmentation/

Architecture: Upsampling and Skip Connections
This seems complicated; why not instead preserve the image size throughout the network and solve for the per-pixel classification directly?
- It would result in an unreasonable computational burden due to many model parameters.
Long, Shelhamer, and Darrell. Fully Convolutional Networks for Semantic Segmentation. CVPR 2015.

Architecture: Encoder-Decoder Architecture
For efficiency, the image is encoded (downsampled) into a lower-resolution feature map that effectively discriminates between classes. Then, the feature map is decoded (upsampled) into a full-resolution segmentation map.
Long, Shelhamer, and Darrell. Fully Convolutional Networks for Semantic Segmentation. CVPR 2015.

Training: Took 3 Days on 1 GPU
Repeat until a stopping criterion is met:
1. Forward pass: propagate training data through the model to make a prediction
2. Quantify the dissatisfaction with the model’s results on the training data
3. Backward pass: using the predicted output, calculate gradients backward to assign blame to each model parameter
4. Update each parameter using the calculated gradients
Figure from: Atilim Gunes Baydin, Barak A. Pearlmutter, Alexey Andreyevich Radul, Jeffrey Mark Siskind; Automatic Differentiation in Machine Learning: a Survey; 2018

Training: How Neural Networks Learn
- Loss: sum over all pixels of the distance between the predicted and true distributions, using the cross-entropy loss
- Gradients: the sum of gradients over all pixels (acts like a minibatch)
The training loop (forward pass, loss, backward pass, parameter update) is as on the previous slide.
Figure from: Atilim Gunes Baydin, Barak A. Pearlmutter, Alexey Andreyevich Radul, Jeffrey Mark Siskind; Automatic Differentiation in Machine Learning: a Survey; 2018

Training: Cross-Entropy Loss (Multinomial Logistic Loss)
e.g., assume a 5-class classifier: measure the distance between the predicted and true distributions per pixel with the cross-entropy loss.
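A minimal PyTorch sketch of the per-pixel cross-entropy loss (shapes are illustrative): cross_entropy applies a softmax at every pixel and averages the per-pixel losses, so each image acts like a minibatch of pixel classifications.

```python
import torch
import torch.nn.functional as F

num_classes = 5
logits = torch.randn(1, num_classes, 4, 6)          # per-pixel class scores
target = torch.randint(0, num_classes, (1, 4, 6))   # per-pixel true class labels

loss = F.cross_entropy(logits, target)              # averaged over all pixels
```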

Architecture: Algorithm Training
Training updates the weights of the pretrained network (aka fine-tuning).
Long, Shelhamer, and Darrell. Fully Convolutional Networks for Semantic Segmentation. CVPR 2015.

Results
Compared to existing methods, it produces better results at a faster speed!
Long, Shelhamer, and Darrell. Fully Convolutional Networks for Semantic Segmentation. CVPR 2015.

Today’s Topics
- Problems
- Applications
- PASCAL VOC detection challenge: R-CNNs
- PASCAL VOC semantic segmentation challenge: fully convolutional networks

