Toward Dietary Assessment Via Mobile Phone Video Cameras

Toward Dietary Assessment via Mobile Phone Video Cameras

Nicholas Chen, MS1, Yun Young Lee, BS1, Maurice Rabb, MS1, Bruce Schatz, PhD2
1Department of Computer Science, University of Illinois, Urbana, IL 61801 USA; 2Department of Medical Information Science, University of Illinois, Urbana, IL 61801 USA

Abstract

Reliable dietary assessment is a challenging yet essential task for determining general health. Existing efforts are manual, require considerable effort, and are prone to underestimation and misrepresentation of food intake. We propose leveraging mobile phones to make this process faster, easier and automatic. Using mobile phones with built-in video cameras, individuals capture short videos of their meals; our software then automatically analyzes the videos to recognize dishes and estimate calories. Preliminary experiments on 20 typical dishes from a local cafeteria show promising results. Our approach complements existing dietary assessment methods to help individuals better manage their diet to prevent obesity and other diet-related diseases.

Introduction

A dietary assessment is a comprehensive evaluation of a person's food intake. It is a continuous process that measures an individual's food and nutrient consumption history. An accurate dietary assessment provides valuable insight into an individual's potential health problems such as malnourishment and, more common in this modern age, obesity. Complete dietary data are essential for individuals to construct personalized diet regimens that improve their eating habits and prevent such health issues.

Various techniques exist to aid in dietary assessment. Photo diaries are popular among individuals trying to lose weight. Individuals photograph their meals and make notes about each dish before eating. Unfortunately, a photo diary only shows what was eaten, not its nutritional value.

Calorie counting software applications are also popular.
Individuals look up particular dishes in the software database to get an estimate of their nutritional contents. Though an extensive database provides greater accuracy, navigating through it places a mental burden on the user. Did we eat fresh tomatoes or canned tomatoes? Such a meticulous approach quickly becomes tedious and demotivates all but the most determined users, since such choices make little caloric difference.

Performing dietary assessment using digital photos of food is becoming popular. The ubiquity of mobile phones with cameras makes photography easy and accessible. As of December 2009, there were more than 285 million mobile phones in the US alone[1]. Leveraging this, in Japan, Metaboinfo's Virtual Wife[2] has a team of nutritionists manually analyzing mobile phone photos of dishes to provide instant calorie estimation for its users. Existing research[3,4] attempts to use automated computer vision techniques to recognize food from photos to assist in calorie estimation. Although promising, the performance has been limited, reaching at best 25-58% accuracy. Variations caused by many factors (e.g. distance and lighting conditions) make single photos of meals poor candidates for reliable image processing. Our initial attempts at food recognition confirm this limitation, performing at less than 25% accuracy with photos taken from a bird's-eye view.

A natural next step is to use mobile phone video cameras to acquire better images for automatic image processing. Videos provide a multi-perspective view of the food, enabling us to more reliably determine what is on the plate. Shooting video of a dish is easier and no more time-consuming than shooting a single photo because the user is relieved from needing to compose the "perfect" shot.
The user simply shoots a panoramic video of the dish, and our software then selects a number of candidate frames from the video. Furthermore, video is more robust against many environmental factors that can negatively affect the quality of photos. Switching to a video-based approach improved our accuracy to up to 95%. The growing ubiquity of mobile phones with high-quality video cameras makes our approach easily deployable.

Our goal is a multistage approach toward technology-driven, accurate and reliable dietary assessment. Such an approach will complement techniques like calorie counting software by pre-filtering irrelevant items that the user would otherwise have to look up. The first stage of such an approach is to reliably use videos to identify the foods that are consumed. Future stages would derive a series of image features that are salient indicators of the nutritional characteristics of a meal: a meal with much "image texture" (spatial variation in pixel intensities) may indicate fiber; a meal that "glistens" may indicate high fat content and, thus, more calories than a leaner meal that "glistens" less. Such features could enable direct caloric estimation.
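The candidate-frame selection step above can be sketched in a few lines. The paper does not specify its sampling policy, so the helper below is a hypothetical illustration that assumes frames are drawn at evenly spaced offsets across the clip:

```python
def candidate_frames(total_frames, n_candidates):
    """Return n_candidates frame indices spread evenly across the clip,
    sampling the midpoint of each equal-length segment."""
    if n_candidates >= total_frames:
        return list(range(total_frames))
    step = total_frames / n_candidates
    return [int(i * step + step / 2) for i in range(n_candidates)]

# A ~20-second clip at 30 fps has about 600 frames; picking 5 candidates
# yields one frame roughly every 72 degrees of a full turntable rotation.
indices = candidate_frames(600, 5)
```

For 600 frames and 5 candidates this selects indices 60, 180, 300, 420 and 540, i.e. one representative frame per fifth of the rotation.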

In this paper, we demonstrate the feasibility of the first stage: identifying dishes from videos to assist in calorie estimation.

Methodology

Our approach is a pattern-matching technique. First we build a database of training images of dishes annotated with their calories. When presented with an unknown image, our system finds the best-matching images from its training set. The annotated calories from those best-matching images are then used as estimates of the calories of that image.

Capturing Videos

We captured videos of 20 different dishes at Bevier Café, a campus cafeteria managed by the Department of Food Science and Human Nutrition at the University of Illinois. Bevier provided an ideal environment for our work. The dishes are comparable to many home-cooked meals and those served at family restaurants. Existing work[3] uses computer vision techniques on fast food meals but, to our knowledge, our attempt is the first at analyzing typical restaurant dishes, which tend to have more variety than fast food.

We were given access to all of Bevier's recipes, which enabled us to calculate, by ingredient, the nutritional value of each dish. The video of the dishes was captured at 640 x 480 pixels, a typical resolution available on most mobile phone video cameras such as those of the Apple iPhone and Google Nexus.

In our evaluations, the dishes were placed on a horizontal turntable with a black tablecloth. The video camera was mounted on a tripod and was slanted at an angle to capture the entire dish. In practice, we envision that a user would rotate the plate manually while sitting at the table.

Food items look very different from different angles. For example, a top-down view of a panini fails to reveal the contents sandwiched in between. On the other hand, a 360 degree off-axis view captures a more representative view of the dish. We rotated the turntable manually and captured a 360 degree view of the dish. Each video is about 20 seconds long. Figure 1 shows a sample.

Figure 1. Multiple video frames of a panini dish.

Extracting Information From Videos

Once we have a video of a dish, we extract video frames from it. A video frame is a particular snapshot of that dish in time. Video frames were extracted at regular intervals to represent the dish from different angular viewpoints. Multiple still-shot photos could replace these video frames, but the additional photographic information obtainable from a video clip makes it easier to automatically determine the region of interest (ROI) of the dish.

Our ROI is an elliptical subsection of an image that includes as much of the food as possible but excludes most of the background and plate. Only the ROI is considered for image processing. We extract two kinds of information from the region: image features and a color histogram.

Figure 2. Elliptical ROI (highlighted) and the extracted features.

We evaluated three separate computer vision algorithms for extracting image features: MSER[5], SURF[6] and STAR[7]. MSER is an algorithm for blob detection in images. Blobs are points and/or regions in the image that are either brighter or darker than their surroundings. SURF and STAR are algorithms for detecting interesting keypoints in images. Interesting keypoints are distinctive locations in the image such as corners, blobs and T-junctions. The red circles in Figure 2 show an example of image features that the SURF algorithm automatically detects for the panini dish. According to their respective authors, MSER, SURF and STAR are robust algorithms: the image features detected are scale-invariant, rotation-invariant and partially invariant to changes in illumination and geometric distortion.
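The elliptical ROI described above reduces to a simple membership test on pixel coordinates. The following is a minimal pure-Python sketch of that idea; the function names and the mask-style formulation are our own illustration, not the authors' implementation:

```python
def in_roi(x, y, cx, cy, rx, ry):
    """True if pixel (x, y) lies inside the ellipse centred at (cx, cy)
    with semi-axes rx (horizontal) and ry (vertical)."""
    return ((x - cx) / rx) ** 2 + ((y - cy) / ry) ** 2 <= 1.0

def roi_pixels(width, height, cx, cy, rx, ry):
    """Coordinates of every pixel inside the elliptical ROI; only these
    would be passed on to feature extraction and histogramming."""
    return [(x, y) for y in range(height) for x in range(width)
            if in_roi(x, y, cx, cy, rx, ry)]
```

For a 640 x 480 frame, an ellipse centred at (320, 240) covering most of the plate would keep the food pixels while dropping the corners of the frame, where tablecloth and background dominate.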
The robustness of these algorithms is essential for our technique: because they are rotation-invariant, the algorithms locate the same image features on a piece of food even if it has been rotated on the plate; because they are scale-invariant, the algorithms locate the same image features even if the video is zoomed in or out.

Our chosen image feature detectors only work on the monochrome channel of a video frame. Foods, however, are naturally rich in colors, and this information is crucial for proper recognition. To take advantage of color, we encode the different colors within the ROI using a color histogram employing the HSV color model. The HSV color model is more perceptually relevant to humans than the default RGB color model used in most electronic devices.

Building a Vocabulary from Image Features

We use a natural language processing model known as bag-of-words to automatically locate relevant image features. First, our system aggregates all the image features from our collection of video frames. Then it performs k-means clustering on that data to extract 10,000 relevant features for our set of video frames; these relevant features are the cluster centers from the k-means algorithm. These 10,000 features become words in our vocabulary. Though conceptually equivalent to typical words in a language such as English, the words here are represented by 128-dimensional vectors. All video frames in our collection are then described in terms of these words. As in natural languages, the more words we have, the more descriptive we can be about a particular video frame, at the expense of more computational resources.

Existing work in computer vision shows that this bag-of-words technique scales easily to a million images[8]. Thus, it is possible to build up a database of food items for different restaurants and to provide each mobile phone with such a database. Once the system has identified the 10,000-word vocabulary, it uses the Fast Library for Approximate Nearest Neighbors (FLANN)[9] to "fit" the features in each video frame into our vocabulary. After this step, each video frame is encoded in a common vocabulary, i.e. a 10,000-word bag-of-words vector. Figure 3 illustrates this encoding.

Quantifying Similarities Between Different Video Frames

Ultimately, the goal of our technique is to take a video frame of an unknown dish and determine which dish in our database matches it best. More generally, given video frames frame1 and frame2, we want to determine their similarity.
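The encoding and scoring pipeline can be sketched end to end in a few lines. This is a toy illustration with a tiny vocabulary rather than the 10,000-word, 128-dimensional one used in the paper, and it substitutes brute-force nearest-centre assignment for FLANN; the function names are ours:

```python
import math

def bow_vector(descriptors, vocabulary):
    """Assign each feature descriptor to its nearest vocabulary word
    (k-means cluster centre) and count occurrences per word."""
    counts = [0] * len(vocabulary)
    for d in descriptors:
        nearest = min(range(len(vocabulary)),
                      key=lambda w: sum((a - b) ** 2
                                        for a, b in zip(d, vocabulary[w])))
        counts[nearest] += 1
    return counts

def tf_idf(counts, doc_freq, n_frames):
    """tf = log10(1 + word frequency); idf = log10(total frames /
    frames containing the word); the score is tf * idf."""
    return [math.log10(1 + c) * math.log10(n_frames / df) if df else 0.0
            for c, df in zip(counts, doc_freq)]

def unit(v):
    """Scale a vector to unit length (zero vectors are left as-is)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def similarity(features1, features2, hist_corr):
    """80/20 weighted blend of the image-feature score (dot product of
    unit tf-idf vectors) and the colour-histogram correlation."""
    feature_score = sum(a * b for a, b in zip(unit(features1), unit(features2)))
    return 0.8 * feature_score + 0.2 * hist_corr
```

With identical feature vectors and perfectly correlated histograms the score is 1.0; orthogonal feature vectors with uncorrelated histograms score 0.0, so dishes rank between those extremes.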
Recall that we have two kinds of information: image features and color histograms.

Scoring Image Features

Describing each video frame as a 10,000-word bag-of-words vector allows us to use the term frequency-inverse document frequency (tf-idf)[10] scoring technique from natural language processing. Term frequency counts the number of times a term (word) appears in a video frame, i.e. the number in the box in Figure 3, and is calculated as log10(1 + word frequency). Inverse document frequency counts the number of documents (video frames) that contain that particular word and is calculated as log10(Total video frames / Video frames with word). The tf-idf score for a word is the product of its tf score and idf score using the formulas described.

The similarity between two frames is determined by the dot product of their unit bag-of-words vectors after tf-idf scoring.

Scoring Color Histograms

We normalize each video frame's histogram using the L1 norm to account for the different sizes of their elliptical ROIs. Then the correlation coefficient between the two color histograms is calculated.

Weighted Score of Both Image Features and Color Histogram

Similarity(frame1, frame2) = 80% x ImageFeatures score(frame1, frame2) + 20% x Histogram score(frame1, frame2)

We place greater emphasis on the image features as they are more robust and less affected by environmental variations. The 80%/20% heuristic was found through performance tuning on an earlier testing set (not the one in the evaluation section).

Figure 3. The original video frame encoded as a bag-of-words vector. Each box corresponds to a particular word determined from k-means clustering. The number in the box shows the frequency of that word.

Experimental Results

We evaluated our technique on 20 different dishes (one salad, ten entrées, five side dishes and four desserts) covering the gamut of typical foods at a restaurant. The dishes and their calories are shown in Figure 4. Calories were calculated from the recipes using a commercial food-ingredient reference[11] and the USDA SR22 Nutrient Database[12]. Our evaluation seeks to answer two research questions:

1) How well does our technique perform in recognizing dishes that we train it on?

We train our system on four different video frames of a dish (e.g. caesar salad) and test it on five video frames of that type of dish (e.g. another caesar salad). These five video frames represent a single dish taken from multiple angles. Our training set is available from Calorie Guru.

We use the Similarity(frame1, frame2) function defined previously. A video frame is considered correctly identified if the similarity function returns one of the four training images that correspond to that dish as the top result. Otherwise, it is considered wrongly identified.

Because we have five video frames of a dish, we can use a voting scheme. When our system correctly identifies three out of five of the video frames, it votes that the dish must indeed be that of the majority. This is a reasonable and effective method since it is not uncommon for a few video frames to be inconclusive while the other video frames all agree on the same dish.

Figure 4 shows our results for the three computer vision algorithms (MSER, SURF and STAR) that we evaluated.

Figure 4. Results of identifying 20 dishes using MSER, SURF and STAR algorithms for feature detection.

Our voting scheme comes into play for dishes such as chicken on rice, pizza, and portobello burger. For the chicken on rice dish, one video frame was wrongly identified by the MSER and SURF algorithms, but the other four video frames agreed on the same dish. Therefore, our system picks the dish that the majority agrees on. Overall, our accuracy is promising. The three algorithms performed comparably: MSER (19/20 = 95%), SURF (18/20 = 90%) and STAR (18/20 = 90%).

All three algorithms wrongly identified regular fries. They confused regular fries with steak fries since those two dishes are almost identical. On the other hand, the algorithms correctly identified steak fries. This is because steak fries tend to have more texture, so our system extracted more image features that could be used to match for similarity. Even though the system confused these two dishes, the caloric contents of both are very similar. This is acceptable since our ultimate goal is to estimate the caloric content of meals as opposed to recognizing a specific dish.

2) Is our system capable of predicting a suitable match for dishes that we did not train it on?

Our system was trained only on the 20 dishes shown in Figure 4. We had two other salad dishes that the system was not trained on. We tested those two dishes on our system, and it matched both as being most similar to the caesar salad in our training set. We also tested our system on a chipotle chicken on ciabatta dish. Our system thought it was most similar to the chicken on rice dish. While this match was not exact, the caloric values of both dishes are similar, i.e. around 800 calories. Figure 5 shows the dishes.

Our preliminary results suggest that, given a large enough training set, it might be possible to correctly match food items that the system has never seen before. More importantly, they also suggest that, given an extensive vocabulary, it might be possible to match foods based on image features that are salient indicators of caloric content and, possibly, other nutritional attributes. Work in determining if there

exists a canonical set of bag-of-words that can be used to describe major dishes still remains.

Figure 5. Matching unknown dishes to dishes in the training set.

Conclusions and Future Work

We described a novel technique using mobile phone video cameras to correctly identify different foods for calorie estimation. We evaluated our technique on a variety of foods that are representative of typical meals. Using the voting scheme, our system is able to correctly identify many different dishes. Even for a dish that it fails to identify, it is able to match it to a relevant dish, i.e. regular fries to steak fries, which has similar caloric content. Moreover, our evaluation suggests that given a large enough training set and a richer vocabulary, we would be able to match different kinds of food and make reasonable estimations of the calories of dishes that our system has not been trained on.

The methodology we presented is a first step toward the bigger goal of reliable and accurate dietary assessment. Work remains to more thoroughly evaluate this approach in the presence of illumination inconsistencies and variations in the videos captured by users. Nonetheless, our current result serves as a baseline against which to compare future approaches.

Our goal is to develop a system to aid dietary assessment for general health. Computer vision techniques are just one small part of that system. We intend to supplement our system with location-awareness and speech recognition techniques. Location-awareness, based on GPS information, identifies which restaurant the user is at, pruning the choice of dishes that our system has to recognize based on the restaurant's current menu. Speech recognition will use any additional cues that users may provide to refine calorie estimates.

The ubiquity of mobile phones and the scalability of automated techniques allow our approach to be deployed to the general population to aid dietary assessment: a user can estimate her calories consumed through her mobile phone and relate it to her calories burned from her daily activities.

Acknowledgements

Funding for this project was provided by the CIMIT Prize for Primary Healthcare. Our thanks to Richard Berlin, MD for his patient support from our project's genesis in his (and Schatz's) Healthcare Infrastructure course[13]. Special thanks to Jean-Louis Ledent and Jill North Craft from Bevier Café for their invaluable input, and for making their staff and the Food Service Laboratory available to us. Additional thanks to Serena Schatz, Brett Daniel, Lucas Cook and Audrey Petty.

References

1. CTIA: International Association for the Wireless Telecommunications Industry. Wireless Quick Facts; [cited July 14, 2010]. Available from: http://www.ctia.org/advocacy/research/index.cfm/AID/10323.
2. Virtual Wife from Metaboinfo Japan; [cited July 14, 2010]. Available from: http://www.metaboinfo.com/okusama/.
3. Chen M, Dhingra K, Wu W, Yang L, Sukthankar R, Yang J. PFID: Pittsburgh Fast-food Image Dataset. In: Proceedings of IEEE ICIP; 2009.
4. Mariappan A, Bosch M, Zhu F, Boushey CJ, Kerr DA, Ebert DS, et al. Personal dietary assessment using mobile devices. vol. 7246. SPIE; 2009. p. 72460Z.
5. Matas J, Chum O, Urban M, Pajdla T. Robust wide-baseline stereo from maximally stable extremal regions. Image and Vision Computing. 2004;22(10):761-767. British Machine Vision Computing 2002.
6. Bay H, Ess A, Tuytelaars T, Van Gool L. Speeded-Up Robust Features (SURF). Comput. Vis. Image Underst. 2008;110(3):346-359.
7. Star Detector; [cited July 14, 2010]. Available from: http://pr.willowgarage.com/wiki/Star Detector.
8. Nister D, Stewenius H. Scalable Recognition with a Vocabulary Tree. In: CVPR '06; 2006. p. 2161-2168.
9. Muja M, Lowe DG. Fast Approximate Nearest Neighbors with Automatic Algorithm Configuration. In: VISSAPP (1). INSTICC Press; 2009. p. 331-340.
10. Manning CD, Raghavan P, Schutze H. Introduction to Information Retrieval. Cambridge University Press; 2008.
11. Natow AB, Heslin JA. The Most Complete Food Counter. Pocket; 2006.
12. USDA National Nutrient Database for Standard Reference, SR22 dataset; [cited July 14, 2010].
13. Schatz BR, Berlin RB. Healthcare Infrastructure: Health Systems for Populations and Individuals. Springer; Forthcoming 2011.
