Building a Data Warehouse Dimensional Model using Azure Synapse Analytics


Building a Data Warehouse Dimensional Model using Azure Synapse Analytics Serverless SQL Pool
Data Toboggan – Saturday 30th January 2021
Andy Cutler
Lightning Session

Andy Cutler
Independent Consultant & Contractor
Azure Data Platform & Power BI
https://www.linkedin.com/in/andycutler/

Session Overview
- Question
- What is the Synapse Serverless SQL Pool?
- What is the Dimensional Model?
- Initial Load: Dimensions
- Initial Load: Facts
- Incremental Loading
- Reading the Dimensional Data
- Considerations

Can I...? (Question)
Can I build a Data Warehouse using the Dimensional Modelling technique and use Azure Synapse Serverless SQL Pool as the processing engine?

Why do I want to do this?
- I would like to leverage my existing SQL skills
- I'm comfortable with Dimensional modelling theory
- I'd like to leverage the flexibility of the Data Lake
- I'm interested in exploring the Data Lakehouse concept

What would I usually do?
- SQL Server 20XX: On-Premises or Azure Virtual Machine (VM)
- Azure SQL Database: Feature-rich relational database service
- Synapse Analytics Dedicated SQL Pools (AKA SQL Data Warehouse)

What is Synapse Serverless SQL Pool?
Serverless SQL Pools is a SQL-based query service built into Azure Synapse Analytics that allows reading and writing CSV, Parquet and JSON data stored within Azure Data Lake Gen2.
- The cost model is based on the amount of data processed (currently 3.727 per 1 TB).
- Part of a suite of services within the overall Azure Synapse service, which also includes Dedicated SQL Pools, Pipelines (Data Factory) & Power BI.
- Serverless SQL Pools support familiar SQL objects:
  - Create databases to store objects
  - SQL syntax to write data transformations
  - Stored procedures to encapsulate logic
- No data is stored within the Serverless SQL Pool itself; it reads from and writes to the Data Lake.
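As a quick illustration of the read side (not from the original deck), a serverless query can target files in the Data Lake directly with OPENROWSET. The storage account, container and folder below are placeholder names.

    -- Minimal sketch: query a CSV folder in the Data Lake directly.
    -- The storage URL and folder path are hypothetical.
    SELECT TOP 10 *
    FROM OPENROWSET(
        BULK 'https://mydatalake.dfs.core.windows.net/sales/source/customer/*.csv',
        FORMAT = 'CSV',
        PARSER_VERSION = '2.0',
        HEADER_ROW = TRUE
    ) AS customers;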

What is the Dimensional Model?
The Dimensional model is a Data Warehouse modelling process that has existed for many years (30+?!!). It allows the modelling of data as either a "measurement" or a "label" that describes a measurement.
- There are 2 basic types of data:
  - Dimensions: the business reference data, e.g. Dates and Products
  - Facts: measurements of a business process, e.g. Sales
- Also known as a Star Schema
- A well-known and popular Data Warehouse methodology
[Diagram: a star schema with a central fact table (OrderDateKey, CustomerKey, ShipToAddressKey) joined to DimDate, DimCustomer and DimAddress and their attributes]
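To make the fact/dimension split concrete, a typical star-schema query joins the fact table's keys to the dimensions and aggregates the measurements. The table and column names below are illustrative, not taken from the deck.

    -- Hypothetical star-schema query: measurements live in the fact table,
    -- descriptive attributes live in the dimension tables.
    SELECT d.CalendarYear,
           c.CustomerName,
           SUM(f.SalesAmount) AS TotalSales
    FROM   FactSales   AS f
    JOIN   DimDate     AS d ON d.DateKey     = f.OrderDateKey
    JOIN   DimCustomer AS c ON c.CustomerKey = f.CustomerKey
    GROUP BY d.CalendarYear, c.CustomerName;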

Setup the Azure Synapse Environment
We'll set up the environment by creating a Serverless SQL database, schemas, security, external data sources and file formats.
- Parquet is a columnar file format which stores the data, schema and statistics
- CSV format to match the source data
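The setup statements could look roughly like the following sketch; the database, schema, credential, storage URL and format names are assumptions for illustration (the deck's screenshots show the actual statements).

    -- Database and staging schemas (names are placeholders)
    CREATE DATABASE SalesWarehouse;
    GO
    USE SalesWarehouse;
    GO
    CREATE SCHEMA StagingDim;
    GO
    CREATE SCHEMA StagingFact;
    GO
    -- Security: a credential the pool uses to reach the Data Lake
    CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<StrongPassword123!>';
    CREATE DATABASE SCOPED CREDENTIAL DataLakeIdentity
    WITH IDENTITY = 'Managed Identity';
    GO
    -- External data source pointing at the Data Lake Gen2 container
    CREATE EXTERNAL DATA SOURCE SalesDataLake
    WITH (
        LOCATION   = 'https://mydatalake.dfs.core.windows.net/sales',
        CREDENTIAL = DataLakeIdentity
    );
    GO
    -- File formats: CSV to match the source data, Parquet for the warehouse output
    CREATE EXTERNAL FILE FORMAT CsvFormat
    WITH (
        FORMAT_TYPE = DELIMITEDTEXT,
        FORMAT_OPTIONS (FIELD_TERMINATOR = ',', STRING_DELIMITER = '"', FIRST_ROW = 2)
    );
    GO
    CREATE EXTERNAL FILE FORMAT ParquetFormat
    WITH (FORMAT_TYPE = PARQUET);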

Create External Tables
Creating External Tables that point to the source CSV initial load location.
- For each table used for a Dimension: create an External Table in the StagingDim schema pointing to the source CSV file in the "initial" folder.
- For each table used in a Fact: create an External Table in the StagingFact schema pointing to the source CSV file in the "initial" folder.
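A minimal sketch of one staging table per area, assuming the data source and CSV file format created above; column lists and folder paths are placeholders.

    -- External table over the dimension's initial-load CSV
    CREATE EXTERNAL TABLE StagingDim.Customer
    (
        CustomerID   INT,
        CustomerName VARCHAR(100),
        EmailAddress VARCHAR(100)
    )
    WITH (
        LOCATION    = 'initial/customer/',
        DATA_SOURCE = SalesDataLake,
        FILE_FORMAT = CsvFormat
    );
    GO
    -- External table over the fact's initial-load CSV
    CREATE EXTERNAL TABLE StagingFact.Sales
    (
        OrderDate   DATE,
        CustomerID  INT,
        SalesAmount DECIMAL(18,2)
    )
    WITH (
        LOCATION    = 'initial/sales/',
        DATA_SOURCE = SalesDataLake,
        FILE_FORMAT = CsvFormat
    );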

Initial Dimension Loading
Extracting source data from the CSV file and loading it to the Data Lake.
- We use the CREATE EXTERNAL TABLE AS SELECT (CETAS) syntax to select the data from the source CSV and write the transformed data into the Data Lake as a Parquet file.
- The initial load contains all of the current customer rows from the source database.
- We can use a ROW_NUMBER() function to generate a surrogate key for each dimension row, which will be used in the Fact table.
- The dimension data is loaded into a sequence-number folder structure (see the sketch below).
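A minimal sketch of the dimension CETAS, assuming the data source, file format and staging table from the earlier steps; the /1/ sequence folder and column names are illustrative.

    -- CETAS: write the SELECT result to the lake as Parquet and register an
    -- external table over it.
    CREATE EXTERNAL TABLE dbo.DimCustomer
    WITH (
        LOCATION    = 'warehouse/dimcustomer/1/',
        DATA_SOURCE = SalesDataLake,
        FILE_FORMAT = ParquetFormat
    )
    AS
    SELECT
        ROW_NUMBER() OVER (ORDER BY CustomerID) AS CustomerKey,  -- surrogate key
        CustomerID,
        CustomerName,
        EmailAddress
    FROM StagingDim.Customer;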

Initial Fact Loading
Extracting source data from the CSV file and loading it to the Data Lake.
- As with the Dimension load, the CREATE EXTERNAL TABLE AS SELECT (CETAS) syntax is used to select the data from the source CSV and write the transformed data into the Data Lake as a Parquet file.
- The initial load contains 3 days of Sales data in a single CSV file.
- Use a CETAS statement to load the data to the Data Lake in a Year-Month-Day "partitioned" folder structure.
- To extract each day, the whole CSV will need to be read by Serverless SQL.
- The "Date" filter can be parameterised using Dynamic SQL (see the sketch below).
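One day of the fact load might look like the following sketch. Dynamic SQL is needed because CETAS does not accept a variable in LOCATION; paths, table names and the surrogate-key lookup are illustrative assumptions.

    DECLARE @LoadDate DATE = '2021-01-30';

    DECLARE @Sql NVARCHAR(MAX) =
        N'CREATE EXTERNAL TABLE StagingFact.Sales_' + FORMAT(@LoadDate, 'yyyyMMdd') + N'
          WITH (
              LOCATION    = ''warehouse/factsales/' + FORMAT(@LoadDate, 'yyyy/MM/dd') + N'/'',
              DATA_SOURCE = SalesDataLake,
              FILE_FORMAT = ParquetFormat
          )
          AS
          SELECT s.OrderDate, c.CustomerKey, s.SalesAmount
          FROM   StagingFact.Sales AS s
          JOIN   dbo.DimCustomer   AS c ON c.CustomerID = s.CustomerID
          WHERE  s.OrderDate = ''' + CONVERT(CHAR(10), @LoadDate, 120) + N''';';

    -- Run once per day in the initial CSV, changing @LoadDate each time.
    EXEC sp_executesql @Sql;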

Incremental Loading
We can incrementally load data by using the "partition" folder structure.
- We are treating the CETAS statement as a "staging" process to write the data to the Data Lake.
- We can drop the External Table and the data will persist in the Data Lake.
- The process is contained within a Stored Procedure in the Serverless SQL Pool database.
- The location to write the Parquet data to is parameterised, along with the filter to SELECT the appropriate data (see the sketch below).
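A sketch of such a staging procedure, under the assumption that StagingFact.SalesIncremental is an external table over the incremental source CSVs; all names are placeholders.

    CREATE OR ALTER PROCEDURE dbo.LoadFactSalesIncrement
        @LoadDate DATE
    AS
    BEGIN
        DECLARE @TableName SYSNAME = N'Sales_' + FORMAT(@LoadDate, 'yyyyMMdd');

        -- CETAS as a "staging" step: write the day's rows to its partition folder.
        DECLARE @Sql NVARCHAR(MAX) =
            N'CREATE EXTERNAL TABLE StagingFact.' + @TableName + N'
              WITH (
                  LOCATION    = ''warehouse/factsales/' + FORMAT(@LoadDate, 'yyyy/MM/dd') + N'/'',
                  DATA_SOURCE = SalesDataLake,
                  FILE_FORMAT = ParquetFormat
              )
              AS
              SELECT OrderDate, CustomerID, SalesAmount
              FROM   StagingFact.SalesIncremental
              WHERE  OrderDate = ''' + CONVERT(CHAR(10), @LoadDate, 120) + N''';';

        EXEC sp_executesql @Sql;

        -- Dropping the external table removes only its metadata;
        -- the Parquet files written by CETAS persist in the Data Lake.
        EXEC (N'DROP EXTERNAL TABLE StagingFact.' + @TableName);
    END;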

Reading the Dimensional Data
We can SELECT data from the "partitions" that were created when running the CETAS process.
- Selecting all the data available: use wildcards in the location to recursively select data from all the sub-folders.
- Selecting a specific "partition" of data: use the "filepath" function to select a specific folder of data.
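Both patterns are sketched below with OPENROWSET; the storage path is an assumed placeholder. filepath(n) returns the folder name matched by the nth wildcard in the BULK path.

    -- All data: recursive wildcard over every sub-folder
    SELECT *
    FROM OPENROWSET(
        BULK 'https://mydatalake.dfs.core.windows.net/sales/warehouse/factsales/**',
        FORMAT = 'PARQUET'
    ) AS sales;

    -- A specific year/month/day "partition", pruned with filepath()
    SELECT *
    FROM OPENROWSET(
        BULK 'https://mydatalake.dfs.core.windows.net/sales/warehouse/factsales/*/*/*/*.parquet',
        FORMAT = 'PARQUET'
    ) AS sales
    WHERE sales.filepath(1) = '2021'
      AND sales.filepath(2) = '01'
      AND sales.filepath(3) = '30';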

Considerations
- The cost model for both reading and writing data is based on the amount of data processed, not time or processing power.
- There is no caching of retrieved data; the same query touching the same data will incur cost each time it runs.
- Currently data is immutable: it cannot be UPDATEd, so your processes must take this into consideration.

References & Further Reading
- Synapse Analytics SQL Serverless: ...-external-table-as-select
- Data Lakehouse: ...ata-lakehouse.html
- Dimensional Modelling: ...sional-modeling-techniques/
- Parquet File Format: https://parquet.apache.org/documentation/latest/
