NSF 14-43054, started October 1, 2014. Datanet: CIF21 DIBBs.


NSF 14-43054, started October 1, 2014
Datanet: CIF21 DIBBs: Middleware and High Performance Analytics Libraries for Scalable Data Science
Indiana University (Fox, Qiu, Crandall, von Laszewski), Rutgers (Jha), Virginia Tech (Marathe), Kansas (Paden), Stony Brook (Wang), Arizona State (Beckstein), Utah (Cheatham)
Overview by Geoffrey Fox, May 16. www.nsf.gov/awardsearch/showAward?AWD_ID=1443054

Some Important Components of SPIDAL DIBBs
- NIST Big Data Application Analysis: features of data-intensive applications.
- HPC-ABDS: Cloud-HPC interoperable software combining the performance of HPC (High Performance Computing) with the rich functionality of the commodity Apache Big Data Stack.
  – This is a reservoir of software subsystems, nearly all from outside the project, drawn from a mix of the HPC and Big Data communities.
  – Leads to Big Data / Simulation / HPC convergence.
- MIDAS: integrating middleware, from the project.
- Applications: biomolecular simulations, network and computational social science, epidemiology, computer vision, spatial geographical information systems, remote sensing for polar science, and pathology informatics.
- SPIDAL (Scalable Parallel Interoperable Data Analytics Library): scalable analytics for:
  – Domain-specific data analytics libraries, mainly from the project.
  – Core machine learning libraries, mainly from the community.
  – Performance of Java and MIDAS, inter- and intra-node.
- Benchmarks: the project adds to the community (WBDB2015 Benchmarking Workshop).
- Implementations: XSEDE and Blue Waters as well as clouds (OpenStack, Docker).

Big Data - Big Simulation (Exascale) Convergence
- Discuss Data and Model together, as they are built around problems that combine them; but separating them gives insight and allows better understanding of Big Data - Big Simulation "convergence".
- Big Data implies Data is large, but Model varies:
  – e.g. LDA with many topics, or deep learning, has a large model.
  – Clustering or dimension reduction can be quite small.
- Simulations can also be considered as Data and Model:
  – Model is solving particle dynamics or partial differential equations.
  – Data could be small when just boundary conditions.
  – Data is large with data assimilation (weather forecasting) or when data visualizations are produced by the simulation.
- Data is often static between iterations (unless streaming); Model varies between iterations.
- Take 51 NIST and other use cases and derive multiple specific features.
- Generalize and systematize with features termed "facets".
- 50 facets (Big Data) or 64 facets (Big Simulation and Data), divided into 4 sets or views where each view has "similar" facets.
  – Allows one to study coverage of benchmark sets and architectures.

64 Features in 4 Views for Unified Classification of Big Data and Simulation Applications
[Figure: the 64 "convergence diamond" facets arranged in four views. Problem Architecture View (nearly all Data+Model): Pleasingly Parallel, Classic MapReduce, Map-Collective, Map Point-to-Point, Map Streaming, Shared Memory, Single Program Multiple Data, Bulk Synchronous Parallel, Fusion, Dataflow, Agents, Workflow. Execution View (mix of Data and Model): performance metrics, flops per byte / memory IO / flops per watt, execution environment and core libraries, data volume, model size, data/model variety, data velocity, veracity, communication structure, dynamic/static, regular/irregular, iterative/simple, data and model abstractions, metric/non-metric data, O(N^2) vs O(N) algorithms. Data Source and Style View: SQL/NoSQL/NewSQL, enterprise data model, files/objects, HDFS/Lustre/GPFS, archived/batched/streaming (S1-S5), shared/dedicated/transient/permanent, metadata/provenance, Internet of Things, HPC simulations, geospatial information systems. Processing View (all Model): core libraries, visualization, graph algorithms, linear algebra kernels, global/local analytics, micro-benchmarks, streaming data algorithms, optimization methodology, learning, classification, search/query/index, recommender engines, base data statistics, and simulation facets such as iterative PDE solvers, multiscale method, spectral methods, N-body methods, particles and fields, evolution of discrete systems, and nature of mesh if used.]

6 Forms of MapReduce
- Cover "all" circumstances.
- Describes: problem (model reflecting data), machine, and software architecture.

HPC-ABDS
Kaleidoscope of (Apache) Big Data Stack (ABDS) and HPC Technologies: 21 layers, over 350 software packages (January 29, 2016)

Cross-cutting functions:
1) Message and Data Protocols: Avro, Thrift, Protobuf
2) Distributed Coordination: Google Chubby, Zookeeper, Giraffe, JGroups
3) Security & Privacy: InCommon, Eduroam, OpenStack Keystone, LDAP, Sentry, Sqrrl, OpenID, SAML, OAuth
4) Monitoring: Ambari, Ganglia, Nagios, Inca

Layers:
17) Workflow-Orchestration: ODE, ActiveBPEL, Airavata, Pegasus, Kepler, Swift, Taverna, Triana, Trident, BioKepler, Galaxy, IPython, Dryad, Naiad, Oozie, Tez, Google FlumeJava, Crunch, Cascading, Scalding, e-Science Central, Azure Data Factory, Google Cloud Dataflow, NiFi (NSA), Jitterbit, Talend, Pentaho, Apatar, Docker Compose, KeystoneML
16) Application and Analytics: Mahout, MLlib, MLbase, DataFu, R, pbdR, Bioconductor, ImageJ, OpenCV, Scalapack, PetSc, PLASMA, MAGMA, Azure Machine Learning, Google Prediction API & Translation API, mlpy, scikit-learn, PyBrain, CompLearn, DAAL (Intel), Caffe, Torch, Theano, DL4j, H2O, IBM Watson, Oracle PGX, GraphLab, GraphX, IBM System G, GraphBuilder (Intel), TinkerPop, Parasol, Dream:Lab, Google Fusion Tables, CINET, NWB, Elasticsearch, Kibana, Logstash, Graylog, Splunk, Tableau, D3.js, three.js, Potree, DC.js, TensorFlow, CNTK
15B) Application Hosting Frameworks: Google App Engine, AppScale, Red Hat OpenShift, Heroku, Aerobatic, AWS Elastic Beanstalk, Azure, Cloud Foundry, Pivotal, IBM BlueMix, Ninefold, Jelastic, Stackato, appfog, CloudBees, Engine Yard, CloudControl, dotCloud, Dokku, OSGi, HUBzero, OODT, Agave, Atmosphere
15A) High-level Programming: Kite, Hive, HCatalog, Tajo, Shark, Phoenix, Impala, MRQL, SAP HANA, HadoopDB, PolyBase, Pivotal HD/Hawq, Presto, Google Dremel, Google BigQuery, Amazon Redshift, Drill, Kyoto Cabinet, Pig, Sawzall, Google Cloud DataFlow, Summingbird
14B) Streams: Storm, S4, Samza, Granules, Neptune, Google MillWheel, Amazon Kinesis, LinkedIn Databus, Twitter Heron, Facebook Puma/Ptail/Scribe/ODS, Azure Stream Analytics, Floe, Spark Streaming, Flink Streaming, DataTurbine
14A) Basic Programming Model and Runtime (SPMD, MapReduce): Hadoop, Spark, Twister, MR-MPI, Stratosphere (Apache Flink), Reef, Disco, Hama, Giraph, Pregel, Pegasus, Ligra, GraphChi, Galois, Medusa-GPU, MapGraph, Totem
13) Inter-process Communication (collectives, point-to-point, publish-subscribe): MPI, HPX-5, Argo BEAST, PULSAR, Harp, Netty, ZeroMQ, ActiveMQ, RabbitMQ, NaradaBrokering, QPid, Kafka, Kestrel, JMS, AMQP, Stomp, MQTT, Marionette Collective; Public Cloud: Amazon SNS, Lambda, Google Pub Sub, Azure Queues, Event Hubs
12) In-memory Databases/Caches: Gora (general object from NoSQL), Memcached, Redis, LMDB (key value), Hazelcast, Ehcache, Infinispan, VoltDB, H-Store
12) Object-Relational Mapping: Hibernate, OpenJPA, EclipseLink, DataNucleus, ODBC/JDBC
12) Extraction Tools: UIMA, Tika
11C) SQL (NewSQL): Oracle, DB2, SQL Server, SQLite, MySQL, PostgreSQL, CUBRID, Galera Cluster, SciDB, Rasdaman, Apache Derby, Pivotal Greenplum, Google Cloud SQL, Azure SQL, Amazon RDS, Google F1, IBM dashDB, N1QL, BlinkDB, Spark SQL
11B) NoSQL: Lucene, Solr, Solandra, Voldemort, Riak, ZHT, Berkeley DB, Kyoto/Tokyo Cabinet, Tycoon, Tyrant, MongoDB, Espresso, CouchDB, Couchbase, IBM Cloudant, Pivotal Gemfire, HBase, Google Bigtable, LevelDB, Megastore and Spanner, Accumulo, Cassandra, RYA, Sqrrl, Neo4J, graphdb, Yarcdata, AllegroGraph, Blazegraph, Facebook Tao, Titan:db, Jena, Sesame; Public Cloud: Azure Table, Amazon Dynamo, Google DataStore
11A) File Management: iRODS, NetCDF, CDF, HDF, OPeNDAP, FITS, RCFile, ORC, Parquet
10) Data Transport: BitTorrent, HTTP, FTP, SSH, Globus Online (GridFTP), Flume, Sqoop, Pivotal GPLOAD/GPFDIST
9) Cluster Resource Management: Mesos, Yarn, Helix, Llama, Google Omega, Facebook Corona, Celery, HTCondor, SGE, OpenPBS, Moab, Slurm, Torque, Globus Tools, Pilot Jobs
8) File Systems: HDFS, Swift, Haystack, f4, Cinder, Ceph, FUSE, Gluster, Lustre, GPFS, GFFS; Public Cloud: Amazon S3, Azure Blob, Google Cloud Storage
7) Interoperability: Libvirt, Libcloud, JClouds, TOSCA, OCCI, CDMI, Whirr, Saga, Genesis
6) DevOps: Docker (Machine, Swarm), Puppet, Chef, Ansible, SaltStack, Boto, Cobbler, Xcat, Razor, CloudMesh, Juju, Foreman, OpenStack Heat, Sahara, Rocks, Cisco Intelligent Automation for Cloud, Ubuntu MaaS, Facebook Tupperware, AWS OpsWorks, OpenStack Ironic, Google Kubernetes, Buildstep, Gitreceive, OpenTOSCA, Winery, CloudML, Blueprints, Terraform, DevOpSlang, Any2Api
5) IaaS Management from HPC to Hypervisors: Xen, KVM, QEMU, Hyper-V, VirtualBox, OpenVZ, LXC, Linux-Vserver, OpenStack, OpenNebula, Eucalyptus, Nimbus, CloudStack, CoreOS, rkt, VMware ESXi, vSphere and vCloud, Amazon, Azure, Google and other public clouds; Networking: Google Cloud DNS, Amazon Route 53

HPC-ABDS Mapping of Activities (green is MIDAS)
- Level 17, Orchestration: Apache Beam (Google Cloud Dataflow) integrated with Cloudmesh on an HPC cluster.
- Level 16, Applications: data mining for molecular dynamics; image processing for remote sensing and pathology; graphs, streaming, bioinformatics, social media, financial informatics, text mining.
- Level 16, Algorithms: generic and custom for applications (SPIDAL).
- Level 14, Programming: Storm, Heron (Twitter's replacement for Storm), Hadoop, Spark, Flink; improve inter- and intra-node performance.
- Level 13, Communication: enhanced Storm and Hadoop using HPC runtime technologies; Harp.
- Level 11, Data management: HBase and MongoDB integrated via Beam and other Apache tools; enhance HBase.
- Level 9, Cluster management: integrate Pilot Jobs with Yarn, Mesos, Spark, Hadoop; integrate Storm and Heron with Slurm.
- Level 6, DevOps: Python Cloudmesh virtual cluster interoperability.


MIDAS: Software Activities in DIBBs
- Developing the HPC-ABDS concept to integrate HPC and Apache technologies.
- Java: revisit Java Grande to make performance "best simply possible".
- DevOps: Cloudmesh provides interoperability between HPC and cloud (OpenStack, AWS, Docker) platforms, based on virtual clusters with software-defined systems using Ansible (Chef).
- Scheduling: integrate Slurm and Pilot Jobs with Yarn & Mesos (ABDS schedulers) and the programming layer (Hadoop, Spark, Flink, Heron/Storm).
- Communication and scientific data abstractions: the Harp plug-in to Hadoop outperforms ABDS programming layers.
- Data management: use HBase and MongoDB with customization.
- Workflow: use Apache Crunch and Beam (Google Cloud Dataflow), as they link to other ABDS technologies.
- Starting to integrate MIDAS components and move into the algorithms of the SPIDAL library.

Java MPI Performs Better than Threads
HPC into the Java runtime and programming model.
[Figure: SPIDAL DA-MDS code on 128 24-core Haswell nodes, comparing shared-memory-optimized intra-node MPI ("best MPI, inter- and intra-node") against best threads intra-node with MPI inter-node.]

Cloudmesh Interoperability DevOps Tool
- Model: define software configuration with tools like Ansible; instantiate on a virtual cluster.
- An easy-to-use command-line program/shell and portal to interface with heterogeneous infrastructures:
  – Supports OpenStack, AWS, Azure, SDSC Comet, VirtualBox, and libcloud-supported clouds, as well as classic HPC and Docker infrastructures.
  – Has an abstraction layer that makes it possible to integrate other IaaS frameworks.
  – Uses defaults that ease interaction with various clouds.
  – Makes managing VMs across different IaaS providers easy.
  – The client saves state between consecutive calls.
- Demonstrated interaction with various cloud providers: FutureSystems, Chameleon Cloud, Jetstream, CloudLab, Cybera, AWS, Azure, VirtualBox.
- Status: AWS, Azure, VirtualBox, and Docker need improvements; we currently focus on Comet and NSF resources that use OpenStack.
- Currently evaluating 40 team projects from the "Big Data Open Source Software Projects" class, which used this approach running on VirtualBox, Chameleon, and FutureSystems.


Cloudmesh Client - Architecture
[Figure: component view, layered view, and systems configuration of the Cloudmesh client.]

Cloudmesh Client – OSG Management
- OSG: Open Science Grid.
- LIGO data analysis was conducted on Comet, supported by the Cloudmesh client; funded by NSF Comet.
- While using Comet it is possible to use the same images that are used on the internal OSG cluster, which reduces overall management effort.
- The client is used to manage the VMs.

Cloudmesh Client – In Support of Experiment Workflow
- Manage VMs and virtual clusters with scripts, variables, and shared state.
- Workflow: choose general infrastructure (HPC to clouds), create IaaS, deploy PaaS, deploy data, execute scripts, evaluate, repeat.
- Integrates with Ansible, the shell, Python, IPython, and Apache Beam; built-in copy and rsync commands; Cloudmesh scripts.

Pilot-Hadoop/Spark Architecture
HPC into the scheduling layer. http://arxiv.org/abs/1602.00345

Pilot-Hadoop Example
[Figure: example Pilot-Hadoop usage.]

Pilot-Data/Memory for Iterative Processing
- Scalable K-means.
- Provides a common API for distributed cluster memory.
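The iterative pattern that benefits from keeping data in distributed cluster memory can be seen in plain K-means: every iteration re-reads the full data set, so caching it beats re-loading from disk. A minimal serial sketch of that loop (illustrative only, not the SPIDAL/Pilot-Data implementation):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain K-means on 2-D points.  Each iteration is a map step
    (assign every point to its nearest center) followed by a reduce
    step (average assigned points into new centers) -- the iterative
    compute pattern Pilot-Data/Memory keeps in cluster memory."""
    rng = random.Random(seed)
    centers = [tuple(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        sums = [[0.0, 0.0] for _ in range(k)]
        counts = [0] * k
        for x, y in points:
            # nearest center by squared Euclidean distance
            j = min(range(k),
                    key=lambda c: (x - centers[c][0]) ** 2 + (y - centers[c][1]) ** 2)
            sums[j][0] += x
            sums[j][1] += y
            counts[j] += 1
        # empty clusters keep their previous center
        centers = [(s[0] / n, s[1] / n) if n else centers[j]
                   for j, (s, n) in enumerate(zip(sums, counts))]
    return centers
```

In a distributed setting the inner assignment loop runs per partition and only the small (sums, counts) arrays are exchanged each iteration.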

Harp Implementations
- Basic Harp: iterative communication and scientific data abstractions.
- Careful support of distributed data AND distributed model.
- Avoids the parameter-server approach; instead distributes the model over worker nodes and supports collective communication to bring the global model to each node.
- Applied first to Latent Dirichlet Allocation (LDA) with large model and data.
HPC into the programming/communication layer.
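The collective step that replaces a central parameter server can be illustrated with a toy allreduce over per-worker partial models; this is only a sketch of the communication pattern, not Harp's Java API:

```python
def allreduce(worker_models):
    """Toy collective communication: merge per-worker partial models
    (dicts of parameter name -> value) by summation and hand every
    worker an identical copy of the global model.  This mimics the
    collective step Harp uses instead of a parameter server."""
    global_model = {}
    for model in worker_models:
        for key, value in model.items():
            global_model[key] = global_model.get(key, 0.0) + value
    # every worker receives the same synchronized model
    return [dict(global_model) for _ in worker_models]
```

In Harp the reduction runs over the interconnect (e.g. ring or tree collectives) rather than in one process, but the invariant is the same: after the collective, all workers hold the same global model.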

Latent Dirichlet Allocation on 100 Haswell nodes: red is Harp (lgs and rtt)
[Figure: LDA performance on the Clueweb, enwiki, and bi-gram datasets.]

Harp LDA Scaling Tests
[Figure: execution time (hours) and parallel efficiency versus node count for Harp LDA on the Big Red II supercomputer (Cray) and on Juliet (Intel Haswell).]
- Corpus: 3,775,554 Wikipedia documents; vocabulary: 1 million words; topics: 10k; alpha: 0.01; beta: 0.01; iterations: 200.
- Big Red II: tested on 25, 50, 75, 100, and 125 nodes; each node uses 32 parallel threads; Gemini interconnect.
- Juliet: tested on 10, 15, 20, 25, and 30 nodes; each node uses 64 parallel threads on 36-core Intel Haswell nodes (each with 2 chips); Infiniband interconnect.

SPIDAL Algorithms – Subgraph Mining
- Finding patterns in graphs is very important:
  – Counting the number of embeddings of a given labeled/unlabeled template subgraph.
  – Finding the most frequent subgraphs/motifs efficiently from a given set of candidate templates.
  – Computing the graphlet frequency distribution.
- Reworking the existing parallel Virginia Tech algorithm Sahad with MIDAS middleware gives HarpSahad, which runs 5 (Google) to 9 (Miami) times faster than the original Hadoop version.
- Work in progress.

Test graphs:
  Graph    Nodes (million)   Edges (million)   Size (MB)
  Google   0.9               4.3               65
  Miami    2.1               51.2              2274

SPIDAL Algorithms – Random Graph Generation
- Random graphs are important, and are needed with particular degree distributions and clustering coefficients:
  – Preferential attachment (PA) model, Chung-Lu (CL), stochastic Kronecker, stochastic block model (SBM), and block two-level Erdos-Renyi (BTER).
  – Generative algorithms for these models are mostly sequential and take a prohibitively long time to generate large-scale graphs.
- SPIDAL is developing efficient parallel algorithms for generating random graphs under different models, with a new DG method offering low memory, high performance, almost optimal load balancing, and excellent scaling:
  – Algorithms are about 3-4 times faster than the previous ones.
  – Generate a network with 250 billion edges in 12 seconds using 1024 processors.
- Needs to be packaged for SPIDAL using MIDAS (currently MPI).
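For concreteness, the Chung-Lu model mentioned above assigns each vertex a weight and includes each edge independently with probability proportional to the product of the endpoint weights. A naive O(n^2) sketch of the model (the SPIDAL work parallelizes far faster variants; this only illustrates the definition):

```python
import random

def chung_lu(weights, seed=0):
    """Naive Chung-Lu generator: edge (u, v) appears with probability
    min(1, w_u * w_v / sum(w)), so expected degrees roughly follow the
    weights.  O(n^2) loop over all vertex pairs -- exactly the cost the
    parallel generators avoid."""
    rng = random.Random(seed)
    total = float(sum(weights))
    n = len(weights)
    edges = []
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < min(1.0, weights[u] * weights[v] / total):
                edges.append((u, v))
    return edges
```

Feeding in a heavy-tailed weight sequence yields a heavy-tailed expected degree distribution, which is why CL is a common null model for real networks.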

SPIDAL Algorithms – Triangle Counting
- Triangle counting is an important special case of subgraph mining, and specialized programs can outperform a general program.
- Previous work used Hadoop, but the MPI-based PATRIC is much faster.
- The SPIDAL version uses a much more efficient (non-overlapping) graph decomposition, with a factor of 25 lower memory than PATRIC.
- Next graph problem: community detection.
- MPI version complete; needs packaging for SPIDAL and adding MIDAS (Harp).
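The sequential kernel underneath such codes is neighbor-set intersection per edge; distributed versions in the PATRIC style partition this work across ranks. A minimal sketch (not the SPIDAL decomposition itself):

```python
from collections import defaultdict

def count_triangles(edges):
    """Count triangles in a simple undirected graph (no duplicate edges
    or self-loops) by intersecting the neighbor sets of each edge's
    endpoints.  Each triangle is seen once per edge, hence the final
    division by 3."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    total = sum(len(adj[u] & adj[v]) for u, v in edges)
    return total // 3
```

The non-overlapping decomposition in the slide amounts to splitting the edge loop so each triangle is owned by exactly one partition, avoiding the replicated neighbor lists that inflate PATRIC's memory.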

SPIDAL Algorithms – Core I
- Several parallel core machine learning algorithms; need to add SPIDAL Java optimizations to complete parallel codes (except MPI MDS).
  – e-learning-with-dsc-spidal/details
- O(N^2) distance matrix calculation with Hadoop parallelism and various options (storage: MongoDB vs. distributed files), normalization, packing to save memory usage, exploiting symmetry.
- WDA-SMACOF: multidimensional scaling (MDS) is optimal nonlinear dimension reduction, enhanced by SMACOF, deterministic annealing, and conjugate gradient for non-uniform weights. Used in many applications.
  – MPI (shared memory) and MIDAS (Harp) versions.
- MDS alignment to optimally align related point sets, as in MDS time series.
- WebPlotViz: data management (MongoDB) and browser visualization for 3D point sets, including time series. Available as source or SaaS.
- MDS as chi^2 using Manxcat: an alternative, more general but less reliable solution of MDS. The latest version of WDA-SMACOF is usually preferable.
- Other dimension reduction: SVD, PCA, GTM to do.
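The O(N^2) distance matrix with symmetry exploitation reduces to computing only the upper triangle and mirroring it, which halves both arithmetic and what must be stored or packed. A serial stand-in for the Hadoop-parallel version:

```python
import math

def distance_matrix(points):
    """O(N^2) pairwise Euclidean distance matrix computing only the
    upper triangle and mirroring it into the lower -- the symmetry
    trick the slide mentions for saving compute and memory."""
    n = len(points)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist = math.dist(points[i], points[j])
            d[i][j] = d[j][i] = dist
    return d
```

In the packed variant only the n*(n-1)/2 upper-triangle entries would be kept in a flat array; the full square form here is for clarity.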

SPIDAL Algorithms – Core II
- Latent Dirichlet Allocation (LDA) for topic finding in text collections; new algorithm with the MIDAS runtime outperforms current best practice.
- DA-PWC: Deterministic Annealing Pairwise Clustering for cases where points aren't in a vector space; used extensively to cluster DNA and proteomic sequences; improved algorithm over others published. Parallelism is good but needs SPIDAL Java.
- DAVS: Deterministic Annealing Clustering for vectors; includes specification of errors and a limit on cluster sizes. Gives very accurate answers for cases where distinct clustering exists. Being upgraded for new LC-MS proteomics data with one million clusters in a 27-million-point data set.
- K-means: basic vector clustering; fast and adequate where clusters aren't needed accurately.
- Elkan's improved K-means vector clustering: for high-dimensional spaces; uses the triangle inequality to avoid expensive distance calculations.
- Future work: classification (logistic regression, random forest, SVM, deep learning), collaborative filtering, TF-IDF search, and Spark MLlib algorithms.
- Harp-DAAL extends Intel DAAL's local batch mode to multi-node distributed modes, leveraging Harp's communication benefits for iterative compute models.

SPIDAL Algorithms – Optimization I
- Manxcat: Levenberg-Marquardt algorithm for nonlinear chi^2 optimization with a sophisticated version of Newton's method calculating the value and derivatives of the objective function. Parallelism in the calculation of the objective function and in the parameters to be determined. Complete; needs SPIDAL Java optimization.
- Viterbi algorithm, for finding the maximum a posteriori (MAP) solution for a Hidden Markov Model (HMM). The running time is O(n*s^2), where n is the number of variables and s is the number of possible states each variable can take. We will provide an "embarrassingly parallel" version that processes multiple problems (e.g. many images) independently; parallelizing within one problem is not needed in our application space. Needs packaging in SPIDAL.
- Forward-backward algorithm, for computing marginal distributions over HMM variables. Similar characteristics to Viterbi above. Needs packaging in SPIDAL.
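The O(n*s^2) cost of Viterbi is visible directly in the textbook dynamic program: for each of the n observations, every state maximizes over all s predecessor states. A compact sketch (standard algorithm, not the SPIDAL packaging):

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Textbook Viterbi MAP decoding for an HMM in O(n * s^2) time:
    n observations, s states.  Probabilities are nested dicts keyed by
    state; V[t][s] stores (best probability, predecessor state)."""
    V = [{s: (start_p[s] * emit_p[s][obs[0]], None) for s in states}]
    for t in range(1, len(obs)):
        V.append({})
        for s in states:
            # maximize over all s predecessors -> the s^2 factor
            prob, prev = max(
                (V[t - 1][p][0] * trans_p[p][s] * emit_p[s][obs[t]], p)
                for p in states)
            V[t][s] = (prob, prev)
    # backtrack from the most probable final state
    last = max(states, key=lambda s: V[-1][s][0])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(V[t][path[-1]][1])
    return list(reversed(path))
```

The "embarrassingly parallel" version in the slide simply runs this routine independently per problem instance (e.g. per image column), with no communication between them.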

SPIDAL Algorithms – Optimization II
- Loopy belief propagation (LBP) for approximately finding the maximum a posteriori (MAP) solution for a Markov Random Field (MRF). Here the running time is O(n^2*s^2*i) in the worst case, where n is the number of variables, s is the number of states per variable, and i is the number of iterations required (usually a function of n, e.g. log(n) or sqrt(n)). Various parallelization strategies apply, depending on the values of s and n for a given problem.
  – We will provide two parallel versions: an embarrassingly parallel version for when s and n are relatively modest, and one parallelizing each iteration of the same problem for the common situation when s and n are large enough that each iteration takes a long time relative to the number of iterations required.
  – Needs packaging in SPIDAL.
- Markov Chain Monte Carlo (MCMC) for approximately computing marginal distributions and sampling over MRF variables. Similar to LBP, with the same two parallelization strategies. Needs packaging in SPIDAL.

Imaging Applications: Remote Sensing, Pathology, Spatial Systems
- Both pathology and remote sensing are working on 3D images.
- Each pathology image could have 10 billion pixels, and we may extract a million spatial objects and 100 million features (dozens to 100 features per object) per image. We often tile the image into 4K x 4K tiles for processing, and we develop buffering-based tiling to handle boundary-crossing objects. A typical research study may have hundreds to thousands of pathology images.
- Remote sensing is aimed at radar images of ice and snow sheets.
- 2D problems need modest parallelism "intra-image" but often need parallelism over images.
- 3D problems need parallelism within an individual image.
- Use optimization algorithms to support applications (e.g. Markov Chain, integer programming, Bayesian maximum a posteriori, variational level set, Euler-Lagrange equation).
- Classification (deep learning convolutional neural networks, SVM, random forest, etc.) will be important.

2D Radar Polar Remote Sensing
- Need to estimate the structure of earth (ice, snow, rock) from radar signals from a plane, in 2 or 3 dimensions.
- The original 2D analysis [11] used Hidden Markov Methods; our MCMC solution gives better results.
- Extending to snow radar layers.

3D Radar Polar Remote Sensing
- Uses LBP to analyze 3D radar images.
- Radar gives a cross-section view of the ice structure, parameterized by angle and range, which yields a set of 2D tomographic slices along the flight path.
- Each image represents a 3D depth map, with along-track and cross-track dimensions on the x-axis and y-axis respectively, and depth coded as colors.
[Figure: bedrock reconstructed in 3D for (left) ground truth, (center) an existing algorithm based on maximum likelihood estimators, and (right) our technique based on a Markov Random Field formulation.]

Algorithms – Nuclei Segmentation for Pathology Images
- Segment boundaries of nuclei from pathology images and extract features for each nucleus.
- Consists of tiling, segmentation, vectorization, and boundary object aggregation.
- Can be executed on MapReduce (MIDAS Harp).
[Figure: nuclear segmentation algorithm (step 1: preliminary region partition; step 2: background normalization; step 3: nuclear boundary refinement; step 4: nuclei separation) and its execution pipeline on MapReduce: whole-slide images are tiled from a distributed file system; the map phase produces segmentation mask images, boundary vectorization, raw polygons, and boundary normalization; the reduce phase removes isolated buffer polygons and aggregates boundary and non-boundary polygons across the whole slide into the final segmented polygons.]
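The buffering-based tiling step above can be sketched as enumerating 4K x 4K windows, each padded by a margin so nuclei crossing tile borders appear whole in at least one tile; the buffer width here is an illustrative assumption, not the project's actual parameter:

```python
def tile_windows(width, height, tile=4096, buffer=128):
    """Enumerate tile windows (x0, y0, x1, y1) covering a whole-slide
    image.  Each window is expanded by `buffer` pixels on every side
    (clamped to the image) so boundary-crossing objects are captured
    intact in at least one tile; duplicates in the overlap are removed
    later in the reduce/aggregation phase."""
    windows = []
    for y in range(0, height, tile):
        for x in range(0, width, tile):
            windows.append((max(0, x - buffer),
                            max(0, y - buffer),
                            min(width, x + tile + buffer),
                            min(height, y + tile + buffer)))
    return windows
```

Each window becomes one map task; the overlap is why the pipeline needs the later "isolated buffer polygon removal" step.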

Algorithms – Spatial Querying Methods
- Hadoop-GIS is a general framework supporting high-performance spatial queries and analytics for spatial big data on MapReduce.
- It supports multiple types of spatial queries on MapReduce through spatial partitioning, a customizable spatial query engine, and on-demand indexing.
- SparkGIS is a variation of Hadoop-GIS that runs on Spark to take advantage of in-memory processing.
- Will extend Hadoop/Spark to the Harp MIDAS runtime.
- 2D complete; 3D in progress.
[Figure: spatial queries and the architecture of the spatial query engine.]

Some Applications Enabled
- KU/IU: remote sensing in polar regions.
- SB: digital pathology imaging.
- SB: large-scale GIS applications, including public health.
- VT: graph analysis in studies of networks in many areas.
- UT, ASU, Rutgers: analysis of biomolecular simulations.
Applications not part of the DIBBs project but using its algorithms/software:
- IU: bioinformatics and financial modeling with MDS.
- Integration with network science infrastructure:
  – VT/IU: CINET: SPIDAL algorithms will be made available.
  – IU: Osome observatory on social media, currently Twitter (https://peerj.com/preprints/2008/), using enhanced HBase.
  – IU: topic analysis of text data.
- IU/Rutgers: streaming with HPC-enhanced Storm/Heron.

Enabled Applications – Digital Pathology
- Pipeline: glass slides are scanned into whole-slide images, which feed image analysis.
- Digital pathology images scanned from human tissue specimens provide rich information about the morphological and functional characteristics of biological systems.
- Pathology image analysis has high potential to provide diagnostic assistance, identify therapeutic targets, and predict patient outcomes and therapeutic responses.
- It relies on both pathology image analysis algorithms and spatial querying methods.
- Extremely large image scale.

Applications – Public Health
- GIS-oriented public health research has a strong focus on the locations of patients and the agents of disease, and studies spatial patterns and variations.
- Integrating multiple spatial big data sources at fine spatial resolutions allows public health researchers and health officials to adequately identify, analyze, and monitor health problems at the community level.
- This will rely on high-performance spatial querying methods for data integration.
- Note the synergy between GIS and large-image processing, as in pathology.

Biomolecular Simulation Data Analysis
- Utah (CPPTraj), Arizona State (MDAnalysis), Rutgers.
- Parallelize key algorithms, including O(N^2) distance computations between trajectories.
- Integrate SPIDAL O(N^2) distance and clustering libraries.
- Path Similarity Analysis (PSA) with Hausdorff distance.

RADICAL-Pilot Hausdorff Distance: All-Pairs Problem
[Figure: clustered distances for two methods of sampling macromolecular transitions (200 trajectories each), showing that the two methods produce distinctly different pathways; RADICAL-Pilot benchmark runs for three different test sets of trajectories, using 12x12 "blocks" per task.]
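The per-pair kernel behind this all-pairs benchmark is the Hausdorff distance between two trajectories treated as point sets; the brute-force scan below is what gets grouped into blocks of trajectory pairs per task. A sketch of the metric only, not the PSA code:

```python
import math

def hausdorff(a, b):
    """Symmetric Hausdorff distance between two point sets: the largest
    distance from any point in one set to its nearest point in the
    other.  Brute force O(|a| * |b|) -- the kernel the RADICAL-Pilot
    runs distribute in 12x12 blocks of trajectory pairs."""
    def directed(p_set, q_set):
        return max(min(math.dist(p, q) for q in q_set) for p in p_set)
    return max(directed(a, b), directed(b, a))
```

For n trajectories there are n*(n-1)/2 such pair evaluations, which is why the problem parallelizes naturally over blocks.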

Classification of Lipids in Membranes
- Biological membranes are lipid bilayers with distinct inner and outer surfaces formed by lipid monolayers (leaflets).
- Movement of lipids between leaflets, or change of topology (merging of leaflets during fusion events), is difficult to detect in simulations.
[Figure: lipids colored by leaflet; same color indicates a continuous leaflet.]

LeafletFinder
- LeafletFinder is a graph-based algorithm to detect continuous lipid membrane leaflets in an MD simulation*. The current implementation is slow and does not work well for large systems (~100,000 lipids).
- Pipeline: phosphate atom coordinates -> build nearest-neighbor adjacency matrix -> find largest connected subgraphs.
* N. Michaud-Agrawal, E. J. Denning, T. B. Woolf, and O. Beckstein. MDAnalysis: A toolkit for the analysis of molecular dynamics simulations. J Comp Chem, 32:2319-2327, 2011.
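The pipeline above can be sketched as a cutoff-distance graph plus connected components; the naive O(N^2) neighbor search is exactly the scaling bottleneck noted for ~100,000 lipids (this is an illustrative reimplementation, not MDAnalysis's LeafletFinder):

```python
import math

def leaflets(coords, cutoff=1.5):
    """LeafletFinder-style sketch: link atoms (e.g. phosphate head
    groups) closer than `cutoff`, then return connected components of
    the resulting graph, largest first -- each component is a candidate
    leaflet.  Neighbor search is brute-force O(N^2)."""
    n = len(coords)
    adj = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) < cutoff:
                adj[i].append(j)
                adj[j].append(i)
    seen, components = set(), []
    for i in range(n):
        if i in seen:
            continue
        stack, comp = [i], []          # iterative depth-first search
        seen.add(i)
        while stack:
            u = stack.pop()
            comp.append(u)
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        components.append(sorted(comp))
    return sorted(components, key=len, reverse=True)
```

Replacing the double loop with a cell list or k-d tree drops the neighbor search to roughly O(N), which is the kind of improvement a scalable version needs.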

Time Series of Stock Values Projected to 3D
- Uses one-day stock values measured from January 2004, starting after one year in January 2005 (filled circles are final values).
[Figure: 3D projection; labels include Origin (0% change), Finance, Energy, Dow Jones, S&P Mid Cap, Apple, and 10%/20% change markers.]

