Mark Needham And Amy E. Hodler - Neo4j Graph Platform

3y ago
61 Views
2 Downloads
9.79 MB
257 Pages
Last View : 23d ago
Last Download : 3m ago
Upload by : Julia Hutchens
Transcription

Graph AlgorithmsPractical Examples inApache Spark and Neo4jMark Needham and Amy E. HodlerBeijingBoston Farnham SebastopolTokyo

Graph Algorithmsby Mark Needham and Amy E. HodlerCopyright 2019 Amy Hodler and Mark Needham. All rights reserved.Printed in the United States of America.Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions arealso available for most titles (http://oreilly.com). For more information, contact our corporate/institutionalsales department: 800-998-9938 or corporate@oreilly.com.Acquisition Editor: Jonathan HassellEditor: Jeff BleielProduction Editor: Deborah BakerCopy Editor: Tracy BrownProofreader: Rachel HeadMay 2019:Indexer: Judy McConvilleInterior Designer: David FutatoCover Designer: Karen MontgomeryIllustrator: Rebecca DemarestFirst EditionRevision History for the First Edition2019-04-15: First ReleaseSee http://oreilly.com/catalog/errata.csp?isbn 9781492047681 for release details.The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Graph Algorithms, the cover image of aEuropean garden spider, and related trade dress are trademarks of O’Reilly Media, Inc.While the publisher and the authors have used good faith efforts to ensure that the information andinstructions contained in this work are accurate, the publisher and the authors disclaim all responsibilityfor errors or omissions, including without limitation responsibility for damages resulting from the use ofor reliance on this work. Use of the information and instructions contained in this work is at your ownrisk. If any code samples or other technology this work contains or describes is subject to open sourcelicenses or the intellectual property rights of others, it is your responsibility to ensure that your usethereof complies with such licenses and/or rights.This work is part of a collaboration between O’Reilly and Neo4j. See our statement of editorial independ‐ence.978-1-492-05781-9[LSI]

Table of ContentsPreface. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ixForeword. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii1. Introduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1What Are Graphs?What Are Graph Analytics and Algorithms?Graph Processing, Databases, Queries, and AlgorithmsOLTP and OLAPWhy Should We Care About Graph Algorithms?Graph Analytics Use CasesConclusion2367812142. Graph Theory and Concepts. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15TerminologyGraph Types and StructuresRandom, Small-World, Scale-Free StructuresFlavors of GraphsConnected Versus Disconnected GraphsUnweighted Graphs Versus Weighted GraphsUndirected Graphs Versus Directed GraphsAcyclic Graphs Versus Cyclic GraphsSparse Graphs Versus Dense GraphsMonopartite, Bipartite, and k-Partite GraphsTypes of Graph AlgorithmsPathfindingCentralityCommunity Detection1516171819192122232427272727iii

Summary283. Graph Platforms and Processing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29Graph Platform and Processing ConsiderationsPlatform ConsiderationsProcessing ConsiderationsRepresentative PlatformsSelecting Our PlatformApache SparkNeo4j Graph PlatformSummary29293031313234374. Pathfinding and Graph Search Algorithms. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39Example Data: The Transport GraphImporting the Data into Apache SparkImporting the Data into Neo4jBreadth First SearchBreadth First Search with Apache SparkDepth First SearchShortest PathWhen Should I Use Shortest Path?Shortest Path with Neo4jShortest Path (Weighted) with Neo4jShortest Path (Weighted) with Apache SparkShortest Path Variation: A*Shortest Path Variation: Yen’s k-Shortest PathsAll Pairs Shortest PathA Closer Look at All Pairs Shortest PathWhen Should I Use All Pairs Shortest Path?All Pairs Shortest Path with Apache SparkAll Pairs Shortest Path with Neo4jSingle Source Shortest PathWhen Should I Use Single Source Shortest Path?Single Source Shortest Path with Apache SparkSingle Source Shortest Path with Neo4jMinimum Spanning TreeWhen Should I Use Minimum Spanning Tree?Minimum Spanning Tree with Neo4jRandom WalkWhen Should I Use Random Walk?Random Walk with Neo4jSummaryiv Table of 6970717273747475

5. Centrality Algorithms. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77Example Graph Data: The Social GraphImporting the Data into Apache SparkImporting the Data into Neo4jDegree CentralityReachWhen Should I Use Degree Centrality?Degree Centrality with Apache SparkCloseness CentralityWhen Should I Use Closeness Centrality?Closeness Centrality with Apache SparkCloseness Centrality with Neo4jCloseness Centrality Variation: Wasserman and FaustCloseness Centrality Variation: Harmonic CentralityBetweenness CentralityWhen Should I Use Betweenness Centrality?Betweenness Centrality with Neo4jBetweenness Centrality Variation: Randomized-Approximate BrandesPageRankInfluenceThe PageRank FormulaIteration, Random Surfers, and Rank SinksWhen Should I Use PageRank?PageRank with Apache SparkPageRank with Neo4jPageRank Variation: Personalized 9991001021031031051071086. Community Detection Algorithms. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109Example Graph Data: The Software Dependency GraphImporting the Data into Apache SparkImporting the Data into Neo4jTriangle Count and Clustering CoefficientLocal Clustering CoefficientGlobal Clustering CoefficientWhen Should I Use Triangle Count and Clustering Coefficient?Triangle Count with Apache SparkTriangles with Neo4jLocal Clustering Coefficient with Neo4jStrongly Connected ComponentsWhen Should I Use Strongly Connected Components?Strongly Connected Components with Apache SparkTable of Contents112114114114115116116117117118119120120 v

Strongly Connected Components with Neo4jConnected ComponentsWhen Should I Use Connected Components?Connected Components with Apache SparkConnected Components with Neo4jLabel PropagationSemi-Supervised Learning and Seed LabelsWhen Should I Use Label Propagation?Label Propagation with Apache SparkLabel Propagation with Neo4jLouvain ModularityWhen Should I Use Louvain?Louvain with Neo4jValidating 31371381431437. Graph Algorithms in Practice. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145Analyzing Yelp Data with Neo4jYelp Social NetworkData ImportGraph ModelA Quick Overview of the Yelp DataTrip Planning AppTravel Business ConsultingFinding Similar CategoriesAnalyzing Airline Flight Data with Apache SparkExploratory AnalysisPopular AirportsDelays from ORDBad Day at SFOInterconnected Airports by 1721741818. Using Graph Algorithms to Enhance Machine Learning. . . . . . . . . . . . . . . . . . . . . . . . . . 183Machine Learning and the Importance of ContextGraphs, Context, and AccuracyConnected Feature Extraction and SelectionGraphy FeaturesGraph Algorithm FeaturesGraphs and Machine Learning in Practice: Link PredictionTools and DataImporting the Data into Neo4jvi Table of Contents183184185187188190190192

The Coauthorship GraphCreating Balanced Training and Testing DatasetsHow We Predict Missing LinksCreating a Machine Learning PipelinePredicting Links: Basic Graph FeaturesPredicting Links: Triangles and the Clustering CoefficientPredicting Links: Community DetectionSummaryWrapping Things Up193194199200201214218224224A. Additional Information and Resources. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225Index. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231Table of Contents vii

PrefaceThe world is driven by connections—from financial and communication systems tosocial and biological processes. Revealing the meaning behind these connectionsdrives breakthroughs across industries in areas such as identifying fraud rings andoptimizing recommendations to evaluating the strength of a group and predictingcascading failures.As connectedness continues to accelerate, it’s not surprising that interest in graphalgorithms has exploded because they are based on mathematics explicitly developedto gain insights from the relationships between data. Graph analytics can uncover theworkings of intricate systems and networks at massive scales—for any organization.We are passionate about the utility and importance of graph analytics as well as thejoy of uncovering the inner workings of complex scenarios. Until recently, adoptinggraph analytics required significant expertise and determination, because tools andintegrations were difficult and few knew how to apply graph algorithms to theirquandaries. It is our goal to help change this. We wrote this book to help organiza‐tions better leverage graph analytics so that they can make new discoveries anddevelop intelligent solutions faster.What’s in This BookThis book is a practical guide to getting started with graph algorithms for developersand data scientists who have experience using Apache Spark or Neo4j. Although ouralgorithm examples utilize the Spark and Neo4j platforms, this book will also be help‐ful for understanding more general graph concepts, regardless of your choice ofgraph technologies.The first two chapters provide an introduction to graph analytics, algorithms, andtheory. The third chapter briefly covers the platforms used in this book before wedive into three chapters focusing on classic graph algorithms: pathfinding, centrality,and community detection. We wrap up the book with two chapters showing howix

graph algorithms are used within workflows: one for general analysis and one formachine learning.At the beginning of each category of algorithms, there is a reference table to help youquickly jump to the relevant algorithm. For each algorithm, you’ll find: An explanation of what the algorithm does Use cases for the algorithm and references to where you can learn more Example code providing concrete ways to use the algorithm in Spark, Neo4j, orbothConventions Used in This BookThe following typographical conventions are used in this book:ItalicIndicates new terms, URLs, email addresses, filenames, and file extensions.Constant widthUsed for program listings, as well as within paragraphs to refer to program ele‐ments such as variable or function names, databases, data types, environmentvariables, statements, and keywords.Constant width boldShows commands or other text that should be typed literally by the user.Constant width italicShows text that should be replaced with user-supplied values or by values deter‐mined by context.This element signifies a tip or suggestion.This element signifies a general note.x Preface

This element indicates a warning or caution.Using Code ExamplesSupplemental material (code examples, exercises, etc.) is available for download athttps://bit.ly/2FPgGVV.This book is here to help you get your job done. In general, if example code is offeredwith this book, you may use it in your programs and documentation. You do notneed to contact us for permission unless you’re reproducing a significant portion ofthe code. For example, writing a program that uses several chunks of code from thisbook does not require permission. Selling or distributing a CD-ROM of examplesfrom O’Reilly books does require permission. Answering a question by citing thisbook and quoting example code does not require permission. Incorporating a signifi‐cant amount of example code from this book into your product’s documentation doesrequire permission.We appreciate, but do not require, attribution. An attribution usually includes thetitle, author, publisher, and ISBN. For example: “Graph Algorithms by Amy E. Hodlerand Mark Needham (O’Reilly). Copyright 2019 Amy E. Hodler and Mark Needham,978-1-492-05781-9.”If you feel your use of code examples falls outside fair use or the permission givenabove, feel free to contact us at permissions@oreilly.com.O’Reilly Online LearningFor almost 40 years, O’Reilly has provided technology andbusiness training, knowledge, and insight to help companiessucceed.Our unique network of experts and innovators share their knowledge and expertisethrough books, articles, conferences, and our online learning platform. O’Reilly’sonline learning platform gives you on-demand access to live training courses, indepth learning paths, interactive coding environments, and a vast collection of textand video from O’Reilly and 200 other publishers. For more information, pleasevisit http://oreilly.com.Preface xi

How to Contact UsPlease address comments and questions concerning this book to the publisher:O’Reilly Media, Inc.1005 Gravenstein Highway NorthSebastopol, CA 95472800-998-9938 (in the United States or Canada)707-829-0515 (international or local)707-829-0104 (fax)We have a web page for this book, where we list errata, examples, and any additionalinformation. You can access this page at http://bit.ly/graph-algorithms.To comment or ask technical questions about this book, send email to bookques‐tions@oreilly.com.For more information about our books, courses, conferences, and news, see our web‐site at http://www.oreilly.com.Find us on Facebook: http://facebook.com/oreillyFollow us on Twitter: http://twitter.com/oreillymediaWatch us on YouTube: We’ve thoroughly enjoyed putting together the material for this book and thank allthose who assisted. We’d especially like to thank Michael Hunger for his guidance, JimWebber for his invaluable edits, and Tomaz Bratanic for his keen research. Finally, wegreatly appreciate Yelp permitting us to use its rich dataset for powerful examples.xii Preface

ForewordWhat do the following things all have in common: marketing attribution analysis,anti-money laundering (AML) analysis, customer journey modeling, safety incidentcausal factor analysis, literature-based discovery, fraud network detection, internetsearch node analysis, map application creation, disease cluster analysis, and analyzingthe performance of a William Shakespeare play. As you might have guessed, whatthese all have in common is the use of graphs, proving that Shakespeare was rightwhen he declared, “All the world’s a graph!”Okay, the Bard of Avon did not actually write graph in that sentence, he wrote stage.However, notice that the examples listed above all involve entities and the relation‐ships between them, including both direct and indirect (transitive) relationships.Entities are the nodes in the graph—these can be people, events, objects, concepts, orplaces. The relationships between the nodes are the edges in the graph. Therefore,isn’t the very essence of a Shakespearean play the active portrayal of entities (thenodes) and their relationships (the edges)? Consequently, maybe Shakespeare couldhave written graph in his famous declaration.What makes graph algorithms and graph databases so interesting and powerful isn’tthe simple relationship between two entities, with A being related to B. After all, thestandard relational model of databases instantiated these types of relationships in itsfoundation decades ago, in the entity relationship diagram (ERD). What makesgraphs so remarkably important are directional relationships and transitive relation‐ships. In directional relationships, A may cause B, but not the opposite. In transitiverelationships, A can be directly related to B and B can be directly related to C, while Ais not directly related to C, so that consequently A is transitively related to C.With these transitivity relationships—particularly when they are numerous anddiverse, with many possible relationship/network patterns and degrees of separationbetween the entities—the graph model uncovers relationships between entities thatotherwise may seem disconnected or unrelated, and are undetected by a relationalxiii

database. Hence, the graph model can be applied productively and effectively in manynetwork analysis use cases.Consider this marketing attribution use case: person A sees the marketing campaign;person A talks about it on social media; person B is connected to person A and seesthe comment; and, subsequently, person B buys the product. From the marketingcampaign manager’s perspective, the standard relational model fails to identify theattribution, since B did not see the campaign and A did not respond to the campaign.The campaign looks like a failure, but its actual success (and positive ROI) is discov‐ered by the graph analytics algorithm through the transitive relationship between themarketing campaign and the final customer purchase, through an intermediary(entity in the middle).Next, consider an anti-money laundering (AML) analysis case: persons A and C aresuspected of illicit trafficking. Any interaction between the two (e.g., a financial trans‐action in a financial database) would be flagged by the authorities, and heavily scruti‐nized. However, if A and C never transact business together, but instead conductfinancial dealings through safe, respected, and unflagged financial authority B, whatcould pick up on the transaction? The graph analytics algorithm! The graph enginewould discover the transitive relationship between A and C through intermediary B.In internet searches, major search engines use a hyperlinked network (graph-based)algorithm to find the central authoritative node across the entire internet for anygiven set of search words. The directionality of the edge is vital in this case, since theauthoritative node in the network is the one that many other nodes point at.With literature-based discovery (LBD)—a knowledge network (graph-based) applica‐tion enabling significant discoveries across the knowledge base of thousands (or evenmillions) of research journal articles—“hidden knowledge” is discovered onlythrough the connection between published research results that may have manydegrees of separation (transitive relationships) between them. LBD is being applied tocancer research studies, where the massive semantic medical knowledge base ofsymptoms, diagnoses, treatments, drug interactions, genetic markers, short-termresults, and long-term consequences could be “hiding” previously unknown cures orbeneficial treatments for the most impenetrable cases. The knowledge could alreadybe in the network, but we need to connect the dots to find it.Similar descriptions of the power of graphing can be given for the other use cases lis‐ted earlier, all examples of network analysis through graph algorithms. Each casedeeply involves entities (people, objects, events, actions, concepts, and places) andtheir relationships (touch points, both causal and simple associations).When considering the power of graphing, we should keep in mind that perhaps themost powerful node in a graph model for real-world use cases might be “context.”Context may include time, location, related events, nearby entities, and more. Incor‐xiv Foreword

porating context into the graph (as nodes and as edges) can thus yield impressive pre‐dictive analytics and prescriptive analytics capabilities.Mark Needham and Amy Hodler’s Graph Algorithms aims to broaden our knowledgeand capabilities around these important types of graph analyses, including algo‐rithms, concepts, and practical machine lea

This book is a practical guide to getting started with graph algorithms for developers and data scientists who have experience using Apache Spark or Neo4j. Although our algorithm examples utilize the Spark and Neo4j platforms, this book will also be help‐ ful for understanding more general graph concepts, regardless of your choice of

Related Documents:

Matthew 27 Matthew 28 Mark 1 Mark 2 Mark 3 Mark 4 Mark 5 Mark 6 Mark 7 Mark 8 Mark 9 Mark 10 Mark 11 Mark 12 Mark 13 Mark 14 Mark 15 Mark 16 Catch-up Day CORAMDEOBIBLE.CHURCH/TOGETHER PAGE 1 OF 1 MAY 16 . Proverbs 2—3 Psalms 13—15 Psalms 16—17 Psalm 18 Psalms 19—21 Psalms

Amy represented clients in a variety Amy Ahrens Earns Kansas Association for Justice Raymond Spring Award Amy Ahrens of domestic and criminal cases. In one case, Amy worked long hours on a post-conviction petition that remained pending beyond her clinic semester. Amy wanted to see the case through, so she volunteered her time in the clinic during

Amy Moore 304 Amy Moore 533 amy sheppard 955 AMY BRADSHAW 480 AMY BRADSHAW 149 Amy Travell 1100 . Benjamin Lewis 327 Benjamin Lewis 908 . Benjamin Lewis 1869 Beth Morgan 1975 . Edward Barrington 315 . Edward Barrington 330 Edward Barri

Amy's new study guides Amy has been busy writing new guides! Wench by JP Designs 18m 9" x 13.25" 10100 Stitch guide by Amy Bunger 8750 Thread Kit 18350 Beach Birds by Melissa Shirley 18m 17" x 9.5" 24150 Stitch guide by Amy Bunger 10500 Thread Kit 19680 Home is Not a Place. by Zecca 18m 25" x 7" 20400 Stitch guide by Amy .

4 November 2015 (SARS) 13 Attachment C Email from Needham to Financial Accountant dated 5 November 2015 (revenue systems) 14-17 Attachment D Email from Needham to Financial Accountant dated 12 November 2015 (day clinics) 18-19 Attachment E Email from Malcolm to Needham dated 13 November 2015 (attorney's letter) 20-22

4 kkk Amy Foundation Outreach 5 k Amy Internet Syndicate 6 kkk Amy Writing Award Winners 8 kkk The United States—A Discipled Nation A Newsletter of the Amy Foundation We obey, God advances When you work in God's mission fields, at home or abroad, know that His Word, ". does not come back void, but accomplishes the purpose for which it .

GE Power NETWORK SECURITY TIL FOR MARK V, VI AND VIE CONTROLLER PLATFORMS APPLICATION All control panels using Mark V or Mark Ve and Mark VI or Mark VIe control platforms. Includes the following Mark VI, VIe, Mark V, Ve generation controllers: EX2100, EX2100e, LS2100, LS2100e, and Mark VIeS systems. All "Windows" based HMI

Mark V Premium, Mark V Max, Mark VII Max, Mark X Premium, and Mark X Max Electric Airless Sprayers For Portable Airless Spraying of Architectural Coatings and Paints. For professional use only. Not approved for use in European explosive atmosphere locations. 3300 psi (227 bar, 22.7 MPa) Maximum Working Pressure IMPORTANT SAFETY INSTRUCTIONS