Azure Synapse Analytics

2y ago
96 Views
22 Downloads
9.29 MB
229 Pages
Last View : 8d ago
Last Download : 2m ago
Upload by : Averie Goad
Transcription

Azure Synapse AnalyticsJames SerraData & AI ArchitectMicrosoft, NYC MTCJamesSerra3@gmail.comBlog: JamesSerra.com

About Me Microsoft, Big Data Evangelist In IT for 30 years, worked on many BI and DW projects Worked as desktop/web/database developer, DBA, BI and DW architect and developer, MDMarchitect, PDW/APS developer Been perm employee, contractor, consultant, business owner Presenter at PASS Business Analytics Conference, PASS Summit, Enterprise Data World conference Certifications: MCSE: Data Platform, Business Intelligence; MS: Architecting Microsoft AzureSolutions, Design and Implement Big Data Analytics Solutions, Design and Implement Cloud DataPlatform Solutions Blog at JamesSerra.com Former SQL Server MVP Author of book “Reporting with Microsoft SQL Server 2012”

Agenda IntroductionStudioData IntegrationSQL AnalyticsData Storage and Performance OptimizationsSQL On-DemandSparkSecurityConnected Services

Azure Synapse Analytics is a limitless analytics service, that brings togetherenterprise data warehousing and Big Data analytics. It gives you the freedomto query data on your terms, using either serverless on-demand or provisionedresources, at scale. Azure Synapse brings these two worlds together with aunified experience to ingest, prepare, manage, and serve data for immediatebusiness intelligence and machine learning needs.

Azure Synapse – SQL Analyticsfocus areasBest in class priceper performanceIndustry-leadingsecurityWorkload awarequery executionData flexibilityDeveloperproductivityUp to 94% less expensivethan competitorsDefense-in-depthsecurity and 99.9%financially backedavailability SLAManage heterogenousworkloads throughworkload priorities andisolationIngest variety of datasources to derive themaximum benefit.Use preferred tooling forSQL data warehousedevelopmentQuery all data.

Leveraging ISV partners with Azure Synapse AnalyticsAzure Data ShareEcosystem many moreAzure Synapse AnalyticsPower BIAzure Machine Learning

What workloads are NOT suitable?Operational workloads (OLTP) High frequency reads and writes.SQL Large numbers of singletonselects. High volumes of single rowinserts.Data Preparations Row by row processing needs. Incompatible formats (XML).SQL

What Workloads are Suitable?AnalyticsStore large volumes of data.Consolidate disparate data into a single location.Shape, model, transform and aggregate data.Batch/Micro-batch loads.Perform query analysis across large datasets.Ad-hoc reporting across large data volumes.All using simple SQL constructs.

Azure Synapse AnalyticsIntegrated data platform for BI, AI and continuous intelligenceArtificial Intelligence / Machine Learning / Internet of ThingsIntelligent Apps / Business IntelligenceSynapse Analytics ureData Lake StorageLanguagesSQLPython.NETJavaScalaRForm FactorsPROVISIONEDON-DEMANDAnalytics RuntimesDATA INTEGRATIONCommon Data ModelEnterprise SecurityOptimized for Analytics

Integrated data platform for BI, AI and continuous intelligenceArtificial Intelligence / Machine Learning / Internet of ThingsIntelligent Apps / Business IntelligenceSynapse Analytics StudioExperienceConnected gesSQLPython.NETJavaScalaRForm FactorsPROVISIONEDON-DEMANDAzure Data CatalogAzure Data Lake StorageAzure Data ShareAzure DatabricksAzure HDInsightAnalytics RuntimesAzure Machine LearningPower BIMETASTOREAzureData Lake StorageDATA INTEGRATION3rd Party IntegrationCommon Data ModelEnterprise SecurityOptimized for Analytics

Provisioning Synapse workspaceProviding Synapse is easySubscriptionResource GroupWorkspace NameRegionData Lake Storage Account

Synapse workspace

SQL pools

Apache Spark pools

Azure Synapse AnalyticsStudio

Azure Synapse AnalyticsIntegrated data platform for BI, AI and continuous intelligenceArtificial Intelligence / Machine Learning / Internet of ThingsIntelligent Apps / Business IntelligenceSynapse Analytics ureData Lake StorageLanguagesSQLPython.NETJavaScalaRForm FactorsPROVISIONEDON-DEMANDAnalytics RuntimesDATA INTEGRATIONCommon Data ModelEnterprise SecurityOptimized for Analytics

Studiohttps://web.azuresynapse.netA single place for Data Engineers, Data Scientists, and IT Pros to collaborate on enterprise analytics

Synapse StudioSynapse Studio divided into Activity hubs.These organize the tasks needed for building analytics s to commongestures, most-recently useditems, and links to tutorialsand documentation.Explore structured andunstructured dataWrite code and the definebusiness logic of the pipelinevia notebooks, SQL scripts,Data flows, etc.Design pipelines that thatmove and transform data.MonitorManageCentralized view of all resourceusage and activities in theworkspace.Configure the workspace, pool,access to artifacts

Synapse StudioOverview hub

Overview HubIt is a starting point for the activities with key links to tasks, artifacts and documentation

Overview HubOverviewNew dropdown – offers quickly start workitemRecent & Pinned – Lists recently openedcode artifacts. Pin selected ones for quickaccess

Synapse StudioData hub

Data HubExplore data inside the workspace and in linked storage accounts

Data Hub – Storage accountsBrowse Azure Data Lake Storage Gen2 accounts and filesystems – navigate through folders to see dataFilepathADLS Gen2 AccountContainer (filesystem)

Data Hub – Storage accountsPreview a sample of your data

Data Hub – Storage accountsSee basic file properties

Data Hub – Storage accountsManage Access - Configure standard POSIX ACLs on files and folders

Data Hub – Storage accountsTwo simple gestures to start analyzing with SQL scripts or with notebooks.T-SQL or PySpark auto-generated.

Data Hub – Storage accountsSQL Script from Multiple filesMulti-select of files generates a SQL script that analyzes all those files together

Data Hub – DatabasesExplore the different kinds of databases that exist in a workspace.SQL poolSQL on-demandSpark

Data Hub – DatabasesFamiliar gesture to generate T-SQL scripts from SQLmetadata objects such as tables.Starting from a table, auto-generate a single line of PySpark codethat makes it easy to load a SQL table into a Spark dataframe

Data Hub – DatasetsOrchestration datasets describe data that is persisted. Once a dataset is defined, it can be used in pipelines andsources of data or as sinks of data.

Synapse StudioDevelop hub

Develop HubOverviewIt provides development experience toquery, analyze, model dataBenefitsMultiple languages to analyze dataunder one umbrellaSwitch over notebooks and scriptswithout loosing contentCode intellisense offers reliable codedevelopmentCreate insightful visualizations

Develop Hub - SQL scriptsSQL ScriptAuthoring SQL ScriptsExecute SQL script on provisioned SQL Pool or SQLOn-demandPublish individual SQL script or multiple SQLscripts through Publish all featureLanguage support and intellisense

Develop Hub - SQL scriptsSQL ScriptView results in Table or Chart form and export results inseveral popular formats

Develop Hub - NotebooksNotebooksAllows to write multiple languages in onenotebook%% Name of language Offers use of temporary tables acrosslanguagesLanguage support for Syntax highlight, syntaxerror, syntax code completion, smart indent,code foldingExport results

Develop Hub - NotebooksConfigure session allows developers to control how many resourcesare devoted to running their notebook.

Develop Hub - NotebooksAs notebook cells run, the underlyingSpark application status is shown.Providing immediate feedback andprogress tracking.

Dataflow CapabilitiesHandle upserts, updates,deletes on sql sinksAdd new partition methodsAdd schema drift supportAdd file handling (move filesafter read, write files to filenames described in rows etc)New inventory of functions(for e.g Hash functions forrow comparison)Commonly used /SCD )Data lineage – Capturing sinkcolumn lineage & impactanalysis(invaluable if this isfor enterprise deployment)Implement commonly usedETL patterns astemplates(SCD Type1, Type2,Data Vault)

Develop Hub - Data FlowsData flows are a visual way of specifying how to transform data.Provides a code-free experience.

Develop Hub – Power BIOverviewCreate Power BI reports in the workspaceProvides access to published reports in theworkspaceUpdate reports real time from Synapseworkspace to get it reflected on Power BIserviceVisually explore and analyze data

Develop Hub – Power BIView published reports in Power BI workspace

Develop Hub – Power BIEdit reports in Synapse workspace

Develop Hub – Power BIPublish edited reports in Synapse workspace to Power BI workspacePublish changes by simple savereport in workspace

Real-time publish on save

Synapse StudioOrchestrate hub

Orchestrate HubIt provides ability to create pipelines to ingest, transform and load data with 90 inbuilt connectors.Offers a wide range of activities that a pipeline can perform.

Synapse StudioMonitor hub

Monitor HubOverviewThis feature provides ability to monitor orchestration, activities and compute resources.

Monitoring Hub - OrchestrationOverviewMonitor orchestration in the Synapse workspace for theprogress and status of pipelineBenefitsTrack all/specific pipelinesMonitor pipeline run and activity run detailsFind the root cause of pipeline failure or activity failure

Monitoring Hub - Spark applicationsOverviewMonitor Spark pools, Spark applications for the progress andstatus of activitiesBenefitsMonitor Spark pools for the status as paused, active,resume, scaling and upgradingTrack the usage of resources

Synapse StudioManage hub

Manage HubOverviewThis feature provides ability to manage Linked Services, Orchestration and Security.

Manage – Linked servicesOverviewIt defines the connection information needed toconnect to external resources.BenefitsOffers pre-build 90 connectorsEasy cross platform data migrationRepresents data store or compute resources

Manage – Access ControlOverviewIt provides access control management to workspaceresources and artifacts for admin and usersBenefitsShare workspace with the teamIncreases productivityManage permissions on code artifacts and Sparkpools

Manage – TriggersOverviewIt defines a unit of processing that determines when apipeline execution needs to be kicked off.BenefitsCreate and manage Schedule trigger Tumbling window trigger Event triggerControl pipeline execution

Manage – Integration runtimesOverviewIntegration runtimes are the compute infrastructure used byPipelines to provide the data integration capabilities acrossdifferent network environments. An integration runtimeprovides the bridge between the activity and linked services.BenefitsOffers Azure Integration Runtime or Self-Hosted IntegrationRuntimeAzure Integration Runtime – provides fully managed,serverless compute in AzureSelf-Hosted Integration Runtime – use compute resources inon-premises machine or a VM inside private network

Azure Synapse AnalyticsData Integration

Azure Synapse AnalyticsIntegrated data platform for BI, AI and continuous intelligenceArtificial Intelligence / Machine Learning / Internet of ThingsIntelligent Apps / Business IntelligenceSynapse Analytics ureData Lake StorageLanguagesSQLPython.NETJavaScalaRForm FactorsPROVISIONEDON-DEMANDAnalytics RuntimesDATA INTEGRATIONCommon Data ModelEnterprise SecurityOptimized for Analytics

Orchestration @ tyActivitySelf-hostedIntegration RuntimeLinkedServiceAzureIntegration RuntimeLEGENDCommand and ControlData

Data MovementScalableper job elasticityUp to 4 GB/sSimpleVisually author or via code (Python, .Net, etc.)Serverless, no infrastructure to manageAccess all your data90 connectors provided and growing (cloud, on premises, SaaS)Data Movement as a Service: 25 points of presence worldwideSelf-hostable Integration Runtime for hybrid movement

90 Connectors out of the boxAzure (15)Database & DW (26)File Storage (6)FileFormats(6)NoSQL (3)Services and App (28)Generic (4)Blob storageAmazon RedshiftOracleAmazon S3AVROCassandraAmazon MWSOracle Service CloudGeneric HTTPCosmos DB - SQL APIDB2PhoenixFile systemBinaryCouchbaseCDS for AppsPayPalGeneric ODataCosmos DB - MongoDBAPIDrillPostgreSQLFTPDelimited TextMongoDBConcurQuickBooksGeneric ODBCData ExplorerGoogleBigQueryPrestoJSONDynamics 365SalesforceGeneric RESTData Lake Storage Gen1GreenplumSAP BW OpenHubGoogle CloudStorageHDFSORCDynamics AXSF Service CloudData Lake Storage Gen2HBaseSAP BW via MDXSFTPParquetDynamics CRMSF Marketing CloudDatabase for MariaDBHiveSAP HANAGoogle AdWordsSAP C4CDatabase for MySQLApache ImpalaSAP tableHubSpotSAP ECCDatabase for PostgreSQLInformixSparkJiraServiceNowFile StorageMariaDBSQL ServerMagentoShopifySQL DatabaseMicrosoft AccessSybaseMarketoSquareSQL Database MIMySQLTeradataOffice 365Web tableSQL Data WarehouseNetezzaVerticaOracle EloquaXeroOracle ResponsysZohoSearch indexTable storage

PipelinesOverviewIt provides ability to load data from storageaccount to desired linked service. Load data bymanual execution of pipeline or byorchestrationBenefitsSupports common loading patternsFully parallel loading into data lake or SQLtablesGraphical development experience

Prep & Transform DataMapping DataflowWrangling DataflowCode free data transformation @scaleCode free data preparation @scale

TriggersOverviewTriggers represent a unit of processing thatdetermines when a pipeline execution needs to bekicked off.Data Integration offers 3 trigger types as –1.Schedule – gets fired at a schedule withinformation of start date, recurrence, end date2.Event – gets fired on specified event3.Tumbling window – gets fired at a periodic timeinterval from a specified start date, whileretaining stateIt also provides ability to monitor pipeline runs andcontrol trigger execution.

Manage – Linked ServicesOverviewIt defines the connection information needed forPipeline to connect to external resources.BenefitsOffers pre-build 85 connectorsEasy cross platform data migrationRepresents data store or compute resourcesNOTE: Linked Services are all for Data Integrationexcept for Power BI (eventually ADC, Databricks)

Manage – Integration runtimesOverviewIt is the compute infrastructure used by Pipelines to providethe data integration capabilities across different networkenvironments. An integration runtime provides the bridgebetween the activity and linked Services.BenefitsOffers Azure Integration Runtime or Self-Hosted IntegrationRuntimeAzure Integration Runtime – provides fully managed,serverless compute in AzureSelf-Hosted Integration Runtime – use compute resources inon-premises machine or a VM inside private network

Azure Synapse AnalyticsSQL Analytics

Azure Synapse AnalyticsIntegrated data platform for BI, AI and continuous intelligenceArtificial Intelligence / Machine Learning / Internet of ThingsIntelligent Apps / Business IntelligenceSynapse Analytics ureData Lake StorageLanguagesSQLPython.NETJavaScalaRForm FactorsPROVISIONEDON-DEMANDAnalytics RuntimesDATA INTEGRATIONCommon Data ModelEnterprise SecurityOptimized for Analytics

Platform: PerformanceOverviewSQL Data Warehouse’s industry leading price-performancecomes from leveraging the Azure ecosystem and core SQLServer engine improvements to produce massive gains inperformance.These benefits require no customer configuration and areprovided out-of-the-box for every data warehouse Gen2 adaptive caching – using non-volatile memory solidstate drives (NVMe) to increase the I/O bandwidthavailable to queries. Azure FPGA-accelerated networking enhancements – tomove data at rates of up to 1GB/sec per node to improvequeries Instant data movement – leverages multi-core parallelismin underlying SQL Servers to move data efficiently betweencompute nodes. Query Optimization – ongoing investments in distributedquery optimization

TPC-H 1 Petabyte query timesThe first and onlyanalytics system to haverun all TPC-H queriesat petabyte-scale12345678910 11 12 13 14 15 16 17 18 19 20 21 22TPC-H queries

TPC-H 1 Petabyte Query ExecutionAzure Synapse is the firstand only analyticssystem to have run allTPC-H queries at 1petabyte-scale12345678910 11 12 13 14 15 16 17 18 19 20 21 22TPC-H queries

Azure Synapse Analytics SQL Comprehensive SQL functionalityAdvanced storage systemT-SQL QueryingComplete SQL object model Columnstore Indexes Windowing aggregates Tables Table partitions Views Distributed tablesApproximate execution(Hyperloglog) Stored procedures Isolation modes Functions Materialized Views Nonclustered Indexes Result-set caching JSON data support

Azure Synapse Analytics SQL Windowing functionsOVER clauseDefines a window or specified set of rows within a queryresult setComputes a value for each row in the windowAggregate functionsCOUNT, MAX, AVG, SUM, APPROX COUNT DISTINCT,MIN, STDEV, STDEVP, STRING AGG, VAR, VARP,GROUPING, GROUPING ID, COUNT BIG, CHECKSUM AGGSELECTROW NUMBER() OVER(PARTITION BY PostalCode ORDER BY SalesYTD DESC) AS "Row Number",LastName,SalesYTD,PostalCodeFROM SalesWHERE SalesYTD 0ORDER BY PostalCode;Row 80552Varkey 055ROWS RANGE4Ito2458535.616998055PRECEDING, UNBOUNDING PRECEDING, CURRENT ROW,BETWEEN, FOLLOWING, UNBOUNDED 1810.924298055Ranking functionsRANK, NTILE, DENSE RANK, ROW NUMBERAnalytical functionsLAG, LEAD, FIRST VALUE, LAST VALUE, CUME DIST,PERCENTILE CONT, PERCENTILE DISC, PERCENT RANK

Azure Synapse Analytics SQL Windowing Functions (continued)Analytical functionsLAG, LEAD, FIRST VALUE, LAST VALUE, CUME DIST,PERCENTILE CONT, PERCENTILE DISC, PERCENT RANK-- PERCENTILE CONT, PERCENTILE DISC--LAG FunctionSELECT DISTINCT Name AS DepartmentNameSELECT BusinessEntityID,,PERCENTILE CONT(0.5) WITHIN GROUP (ORDER BY ph.Rate)YEAR(QuotaDate) AS SalesYear,OVER (PARTITION BY Name) AS MedianCont,PERCENTILE DISC(0.5) WITHIN GROUP (ORDER BY ph.Rate)OVER (PARTITION BY Name) AS MedianDiscFROM HumanResources.Department AS dSalesQuota AS CurrentQuota,LAG(SalesQuota, 1,0) OVER (ORDER BY YEAR(QuotaDate)) AS PreviousQuotaFROM Sales.SalesPersonQuotaHistoryWHERE BusinessEntityID 275 and YEAR(QuotaDate) IN ('2005','2006');INNER JOIN HumanResources.EmployeeDepartmentHistory AS dhON dh.DepartmentID d.DepartmentIDINNER JOIN HumanResources.EmployeePayHistory AS phON ph.BusinessEntityID dh.BusinessEntityIDWHERE dh.EndDate IS NULL;DepartmentName-------------------Document ControlEngineeringExecutiveHuman 557716.5865BusinessEntityID SalesYear CurrentQuotaPreviousQuota---------------- ----------- --------------------- 24000.001429000.00

Azure Synapse Analytics SQL Windowing Functions (continued)ROWS RANGEPRECEDING, UNBOUNDING PRECEDING, CURRENT ROW,BETWEEN, FOLLOWING, UNBOUNDED FOLLOWING-- Fi

Certifications: MCSE: Data Platform, Business Intelligence; MS: Architecting Microsoft Azure Solutions, Design and Implement Big Data Analytics Solutions, Design and Implement Cloud Data Platform Solutions Blog at JamesSerra.com Former SQL Server MVP Author of book “Reporting with Microsoft SQL Server 2012”

Related Documents:

SQL Server 20XX: On-Premises or Azure Virtual Machine (VM) Azure SQL Database: Feature-rich relational database service . Serverless SQL Pools is a SQL-based query service built into Azure Synapse Analytics that allows reading and writing CSV, Parquet and JSON data stored within Azure Data Lake Gen2.

You need to collect and automatically analyze security events from Azure Active Directory (Azure AD). What should you use? A. Azure Sentinel B. Azure Key Vault C. Azure Synapse Analytics D. Azure AD Connect Answer: A Question: 93 HOTSPOT For each of the following statements, select Yes if the statement is true. Otherwise, select No. NOTE: Each .

build a staging area for data warehouse, or build machine learning models on the Analytical data by replicating data from Dataverse into Azure SQL Database/ Azure SQL VM. . Implement integrations to sync the data from Azure Synapse Analytics to your pre-existing data warehousing solution. Pilot the implementation with a

Azure Synapse Analytics Azure Synapse helps you to ingest, explore, prepare, transform, manage, and serve data for BI, and machine learning needs. It gives you the freedom to query data on your terms, using either server less or dedicated options—at scale. Azure Data Lake Our certified and experienced consultants let organizations

AZURE TAGGING BEST PRACTICES Adding tags to your Azure resources is very simple and can be done using Azure Portal, Azure PowerShell, CLI, or ARM JSON templates. You can tag any resources in Azure, and using this service is free. The tagging is done on the Azure platform level and does not impact the performance of the resource in any way.

DE LAS UNIDADES PROGRAMA CURRICULAR UNIDAD 2 - Introduccion a los servicios de azure - Los servicios de Azure - Cómo crear un App Service en Azure - Administrar App Service con Azure Cloud Shell Azure UNIDAD 3 - Introduccion al Modulo - Regiones y centros de datos en azure - Zonas Geograficas en

Resource Manager and the Azure portal through Azure Arc to facilitate resource management at a global level. This also means a single vendor for support and billing. Save time and resources with regular and consistent feature and security updates. Access Azure hybrid services such as Azure Security Center, Azure Backup, and Azure site recovery.

students solve a variety of challenges faced in education through Microsoft Azure and the cloud. Azure for research staff Azure for teaching staff Azure for students Azure for academic institutions Azure is a powerful tool for research and education, and Microsoft provides a number of programs to meet the needs of academic institutions.