Jason Bell - Lagout


Machine Learning
Hands-On for Developers and Technical Professionals
Jason Bell

Machine Learning: Hands-On for Developers and Technical Professionals

Published by
John Wiley & Sons, Inc.
10475 Crosspoint Boulevard
Indianapolis, IN 46256
www.wiley.com

Copyright 2015 by John Wiley & Sons, Inc., Indianapolis, Indiana

Published simultaneously in Canada

ISBN: 978-1-118-88906-0
ISBN: 978-1-118-88939-8 (ebk)
ISBN: 978-1-118-88949-7 (ebk)

Manufactured in the United States of America

10 9 8 7 6 5 4 3 2 1

No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 646-8600. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permissions.

Limit of Liability/Disclaimer of Warranty: The publisher and the author make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation warranties of fitness for a particular purpose. No warranty may be created or extended by sales or promotional materials. The advice and strategies contained herein may not be suitable for every situation. This work is sold with the understanding that the publisher is not engaged in rendering legal, accounting, or other professional services. If professional assistance is required, the services of a competent professional person should be sought. Neither the publisher nor the author shall be liable for damages arising herefrom. The fact that an organization or Web site is referred to in this work as a citation and/or a potential source of further information does not mean that the author or the publisher endorses the information the organization or website may provide or recommendations it may make. Further, readers should be aware that Internet websites listed in this work may have changed or disappeared between when this work was written and when it is read.

For general information on our other products and services please contact our Customer Care Department within the United States at (877) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.

Wiley publishes in a variety of print and electronic formats and by print-on-demand. Some material included with standard print versions of this book may not be included in e-books or in print-on-demand. If this book refers to media such as a CD or DVD that is not included in the version you purchased, you may download this material at http://booksupport.wiley.com. For more information about Wiley products, visit www.wiley.com.

Library of Congress Control Number: 2014946682

Trademarks: Wiley and the Wiley logo are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or its affiliates, in the United States and other countries, and may not be used without written permission. All other trademarks are the property of their respective owners. John Wiley & Sons, Inc. is not associated with any product or vendor mentioned in this book.

To Wendy and Clarissa.

Credits

Executive Editor: Carol Long
Business Manager: Amy Knies
Project Editor: Charlotte Kughen
Professional Technology & Strategy Director: Barry Pruett
Technical Editor: Mitchell Wyle
Associate Publisher: Jim Minatel
Production Editor: Christine Mugnolo
Project Coordinator, Cover: Patrick Redmond
Copy Editor: Katherine Burt
Proofreader: Nancy Carrasco
Production Manager: Kathleen Wisor
Manager of Content Development and Assembly: Mary Beth Wakefield
Director of Community Marketing: David Mayhew
Marketing Manager: Carrie Sherrill
Indexer: Johnna Dinse
Cover Designer: Wiley
Cover Image: iStock.com/VLADGRIN

About the Author

Jason Bell has been working with point-of-sale and customer-loyalty data since 2002, and he has been involved in software development for more than 25 years. He is founder of Datasentiment, a UK business that helps companies worldwide with data acquisition, processing, and insight.

Acknowledgments

During the autumn of 2013, I was presented with some interesting options: either do a research-based PhD or co-author a book on machine learning. One would take six years and the other would take seven to eight months. Because of the speed the data industry was, and still is, progressing, the idea of the book was more appealing because I would be able to get something out while it was still fresh and relevant, and that was more important to me.

I say “co-author” because the original plan was to write a machine learning book with Aidan Rogers. Due to circumstances beyond his control he had to pull out. With Aidan’s blessing, I continued under my own steam, and for that opportunity I can’t thank him enough for his grace, encouragement, and support in that decision.

Many thanks go to Wiley, especially Executive Editor, Carol Long, for letting me tweak things here and there with the original concept and bring it to a more practical level than a theoretical one; Project Editor, Charlotte Kughen, who kept me on the straight and narrow when there were times I didn’t make sense; and Mitchell Wyle for reviewing the technical side of things. Also big thanks to the Wiley family as a whole for looking after me with this project.

Over the years I’ve met and worked with some incredible people, so in no particular order here goes: Garrett Murphy, Clare Conway, Colin Mitchell, David Crozier, Edd Dumbill, Matt Biddulph, Jim Weber, Tara Simpson, Marty Neill, John Girvin, Greg O’Hanlon, Clare Rowland, Tim Spear, Ronan Cunningham, Tom Grey, Stevie Morrow, Steve Orr, Kevin Parker, John Reid, James Blundell, Mary McKenna, Mark Nagurski, Alan Hook, Jon Brookes, Conal Loughrey, Paul Graham, Frankie Colclough, and countless others (whom I will be kicking myself that I’ve forgotten) for all the meetings, the chats, the ideas, and the collaborations.

Thanks to Tim Brundle, Matt Johnson, and Alan Thorburn for their support and for introducing me to the people who would inspire thoughts that would spur me on to bigger challenges with data. An enormous thank you to Thomas Spinks for having faith in me; without him there wouldn’t have been a career in computing.

In relation to the challenge of writing a book I have to thank Ben Hammersley, Alistair Croll, Alasdair Allan, and John Foreman for their advice and support throughout the whole process.

I also must thank my dear friend, Colin McHale, who, on one late evening while waiting for the soccer data to refresh, taught me Perl on the back of a KitKat wrapper, thus kick-starting a journey of software development.

Finally, to my wife, Wendy, and my daughter, Clarissa, for absolutely everything and encouraging me to do this book to the best of my nerdy ability. I couldn’t have done it without you both. And to the Bell family—George, Maggie and my sister Fern—who have encouraged my computing journey from a very early age.

During the course of writing this book, musical enlightenment was brought to me by St. Vincent, Trey Gunn, Suzanne Vega, Tackhead, Peter Gabriel, Doug Wimbish, King Crimson, and Level 42.

Contents

Introduction

Chapter 1 What Is Machine Learning?
  History of Machine Learning
  Alan Turing
  Arthur Samuel
  Tom M. Mitchell
  Summary Definition
  Algorithm Types for Machine Learning
  Supervised Learning
  Unsupervised Learning
  The Human Touch
  Uses for Machine Learning
  Software
  Stock Trading
  Robotics
  Medicine and Healthcare
  Advertising
  Retail and E-Commerce
  Gaming Analytics
  The Internet of Things
  Languages for Machine [...]
  Software Used in This Book
  Checking the Java Version
  Weka Toolkit
  Mahout
  SpringXD
  Hadoop
  Using an IDE
  Data Repositories
  UC Irvine Machine Learning Repository
  Infochimps
  Kaggle
  Summary

Chapter 2 Planning for Machine Learning
  The Machine Learning Cycle
  It All Starts with a Question
  I Don’t Have Data!
  Starting Local
  Competitions
  One Solution Fits All?
  Defining the [...]
  Building a Data Team
  Mathematics and Statistics
  Programming
  Graphic Design
  Domain Knowledge
  Data Processing
  Using Your Computer
  A Cluster of Machines
  Cloud-Based Services
  Data [...] Discs
  Cloud-Based Storage
  Data Privacy
  Cultural Norms
  Generational Expectations
  The Anonymity of User Data
  Don’t Cross “The Creepy Line”
  Data Quality and Cleaning
  Presence Checks
  Type Checks
  Length Checks
  Range Checks
  Format Checks
  The Britney Dilemma
  What’s in a Country Name?
  Dates and Times
  Final Thoughts on Data Cleaning
  Thinking about Input Data
  Raw Text
  Comma Separated Variables
  JSON
  YAML
  XML
  Spreadsheets
  Databases
  Thinking about Output Data
  Don’t Be Afraid to Experiment
  Summary

Chapter 3 Working with Decision Trees
  The Basics of Decision Trees
  Uses for Decision Trees
  Advantages of Decision Trees
  Limitations of Decision Trees
  Different Algorithm Types
  How Decision Trees Work
  Decision Trees in Weka
  The Requirement
  Training Data
  Using Weka to Create a Decision Tree
  Creating Java Code from the Classification
  Testing the Classifier Code
  Thinking about Future Iterations
  Summary

Chapter 4 Bayesian Networks
  Pilots to Paperclips
  A Little Graph Theory
  A Little Probability Theory
  Coin Flips
  Conditional Probability
  Winning the Lottery
  Bayes’ Theorem
  How Bayesian Networks Work
  Assigning Probabilities
  Calculating Results
  Node Counts
  Using Domain Experts
  A Bayesian Network Walkthrough
  Java APIs for Bayesian Networks
  Planning the Network
  Coding Up the Network
  Summary

Chapter 5 Artificial Neural Networks
  What Is a Neural Network?
  Artificial Neural Network Uses
  High-Frequency Trading
  Credit Applications
  Data Center Management
  Robotics
  Medical Monitoring
  Breaking Down the Artificial Neural Network
  Perceptrons
  Activation Functions
  Multilayer Perceptrons
  Back Propagation
  Data Preparation for Artificial Neural Networks
  Artificial Neural Networks with Weka
  Generating a Dataset
  Loading the Data into Weka
  Configuring the Multilayer Perceptron
  Training the Network
  Altering the Network
  Increasing the Test Data Size
  Implementing a Neural Network in Java
  Create the Project
  The Code
  Converting from CSV to Arff
  Running the Neural Network
  Summary

Chapter 6 Association Rules Learning
  Where Is Association Rules Learning Used?
  Web Usage Mining
  Beer and Diapers
  How Association Rules Learning Works
  Support
  Confidence
  Lift
  Conviction
  Defining the Process
  Algorithms
  Apriori
  FP-Growth
  Mining the Baskets—A Walkthrough
  Downloading the Raw Data
  Setting Up the Project in Eclipse
  Setting Up the Items Data File
  Setting Up the Data
  Running Mahout
  Inspecting the Results
  Putting It All Together
  Further [...]

Chapter 7 Support Vector Machines
  What Is a Support Vector Machine?
  Where Are Support Vector Machines Used?
  The Basic Classification Principles
  Binary and Multiclass Classification
  Linear Classifiers
  Confidence
  Maximizing and Minimizing to Find the Line
  How Support Vector Machines Approach Classification
  Using Linear Classification
  Using Non-Linear Classification
  Using Support Vector Machines in Weka
  Installing LibSVM
  A Classification Walkthrough
  Implementing LibSVM with Java
  Summary

Chapter 8 Clustering
  What Is Clustering?
  Where Is Clustering Used?
  The Internet
  Business and Retail
  Law Enforcement
  Computing
  Clustering Models
  How the K-Means Works
  Calculating the Number of Clusters in a Dataset
  K-Means Clustering with Weka
  Preparing the Data
  The Workbench Method
  The Command-Line Method
  The Coded Method
  Summary

Chapter 9 Machine Learning in Real Time with Spring XD
  Capturing the Firehose of Data
  Considerations of Using Data in Real Time
  Potential Uses for a Real-Time System
  Using Spring XD
  Spring XD Streams
  Input Sources, Sinks, and Processors
  Learning from Twitter Data
  The Development Plan
  Configuring the Twitter API Developer Application
  Configuring Spring XD
  Starting the Spring XD Server
  Creating Sample Data
  The Spring XD Shell
  Streams 101
  Spring XD and Twitter
  Setting the Twitter Credentials
  Creating Your First Twitter Stream
  Where to Go from Here
  Introducing Processors
  How Processors Work within a Stream
  Creating Your Own Processor
  Real-Time Sentiment Analysis
  How the Basic Analysis Works
  Creating a Sentiment Processor
  Spring XD Taps
  Summary

Chapter 10 Machine Learning as a Batch Process
  Is It Big Data?
  Considerations for Batch Processing Data
  Volume and Frequency
  How Much Data?
  Which Process Method?
  Practical Examples of Batch Processes
  Hadoop
  Sqoop
  Pig
  Mahout
  Cloud-Based Elastic Map Reduce
  A Note about the Walkthroughs
  Using the Hadoop Framework
  The Hadoop Architecture
  Setting Up a Single-Node Cluster
  How MapReduce Works
  Mining the Hashtags
  Hadoop Support in Spring XD
  Objectives for This Walkthrough
  What’s a Hashtag?
  Creating the MapReduce Classes
  Performing ETL on Existing Data
  Product Recommendation with Mahout
  Mining Sales Data
  Welcome to My Coffee Shop!
  Going Small Scale
  Writing the Core Methods
  Using Hadoop and MapReduce
  Using Pig to Mine Sales Data
  Scheduling Batch Jobs
  Summary

Chapter 11 Apache Spark
  Spark: A Hadoop Replacement?
  Java, Scala, or Python?
  Scala Crash Course
  Installing Scala
  Packages
  Data Types
  Classes
  Calling Functions
  Operators
  Control Structures
  Downloading and Installing Spark
  A Quick Intro to Spark
  Starting the Shell
  Data Sources
  Testing Spark
  Spark Monitor
  Comparing Hadoop MapReduce to Spark
  Writing Standalone Programs with Spark
  Spark Programs in Scala
  Installing SBT
  Spark Programs in Java
  Spark Program Summary
  Spark SQL
  Basic Concepts
  Using SparkSQL with RDDs
  Spark Streaming
  Basic Concepts
  Creating Your First Stream with Scala
  Creating Your First Stream with Java
  MLib: The Machine Learning Library
  Dependencies
  Decision Trees
  Clustering
  Summary

Chapter 12 Machine Learning with R
  Installing R
  Mac OSX
  Windows
  Linux
  Your First Run
  Installing R-Studio
  The R Basics
  Variables and Vectors
  Matrices
  Lists
  Data Frames
  Installing Packages
  Loading in Data
  Plotting Data
  Simple Statistics
  Simple Linear Regression
  Creating the Data
  The Initial Graph
  Regression with the Linear Model
  Making a Prediction
  Basic Sentiment Analysis
  Functions to Load in Word Lists
  Writing a Function to Score Sentiment
  Testing the Function
  Apriori Association Rules
  Installing the ARules Package
  The Training Data
  Importing the Transaction Data
  Running the Apriori Algorithm
  Inspecting the Results
  Accessing R from Java
  Installing the rJava Package
  Your First Java Code in R
  Calling R from Java Programs
  Setting Up an Eclipse Project
  Creating the Java/R Class
  Running the Example
  Extending Your R Implementations
  R and Hadoop
  The RHadoop Project
  A Sample Map Reduce Job in RHadoop
  Connecting to Social Media with R
  Summary

Appendix A SpringXD Quick Start
  Installing Manually
  Starting SpringXD
  Creating a Stream
  Adding a Twitter Application Key

Appendix B Hadoop 1.x Quick Start
  Downloading and Installing Hadoop
  Formatting the HDFS Filesystem
  Starting and Stopping Hadoop
  Process List of a Basic Job

Appendix C Useful Unix Commands
  Using Sample Data
  Showing the Contents: cat, more, and less
  Example Command
  Expected Output
  Filtering Content: grep
  Example Command for Finding Text
  Example Output
  Sorting Data: sort
  Example Command for Basic Sorting
  Example Output
  Finding Unique Occurrences: uniq
  Showing the Top of a File: head
  Counting Words: wc
  Locating Anything: find
  Combining Commands and Redirecting Output
  Picking a Text Editor
  Colon Frenzy: Vi and [...]

Appendix D Further Reading
  Machine Learning
  Statistics
  Big Data and Data Science
  Hadoop
  Visualization
  Making Decisions
  Datasets
  Blogs
  Useful Websites
  The Tools of the [...]

Introduction

Data, data, data. You can’t have escaped the headlines, reports, white papers, and even television coverage on the rise of Big Data and data science. The push is to learn, synthesize, and act upon all the data that comes out of social media, our phones, our hardware devices (otherwise known as “The Internet of Things”), sensors, and basically anything that can generate data.

The emphasis of most of this marketing is about data volumes and the velocity at which it arrives. Prophets of the data flood tell us we can’t process this data fast enough, and the marketing machine will continue to hawk the services we need to buy to achieve all such speed. To some degree they are right, but it’s worth stopping for a second and having a proper think about the task at hand.

Data mining and machine learning have been around for a number of years already, and the huge media push surrounding Big Data has to do with data volume. When you look at it closely, the machine learning algorithms that are being applied aren’t any different from what they were years ago; what is new is how they are applied at scale. When you look at the number of organizations that are creating the data, it’s really, in my opinion, the minority. Google, Facebook, Twitter, Netflix, and a small handful of others are the ones getting the majority of mentions in the headlines with a mixture of algorithmic learning and tools that enable them to scale. So, the real question you should ask is, “How does all this apply to the rest of us?”

I admit there will be times in this book when I look at the Big Data side of machine learning—it’s a subject I can’t ignore—but it’s only a small factor in the overall picture of how to get insight from the available data. It is important to remember that I am talking about tools, and the key is figuring out which tools are right for the job you are trying to complete. Although the “tech press” might want Hadoop stories, Hadoop is not always the right tool to use for the task you are trying to complete.

Aims of This Book

This book is about machine learning and not about Big Data. It’s about the various techniques used to gain insight from your data. By the end of the book, you will have seen how various methods of machine learning work, and you will also have had some practical explanations on how the code is put together, leaving you with a good idea of how you could apply the right machine learning techniques to your own problems.

There’s no right or wrong way to use this book. You can start at the beginning and work your way through, or you can just dip in and out of the parts you need to know at the time you

