Best Practices For Managing Unstructured Data

Upload by : Rafael Ruffin

BEST PRACTICES FOR MANAGING UNSTRUCTURED DATA

Executive Summary
The Evolution of Data Management
Challenges for Unstructured Data Management and Analytics
What an Effective Solution Looks Like
A Global, Unified System for Managing Unstructured Data
About Snowflake

CHAMPION GUIDES

EXECUTIVE SUMMARY

Unstructured data accounts for a vast and rapidly growing amount of information. According to Computer Weekly, four-fifths of all business-relevant information originates in unstructured data. Most of it is text (for example, emails, reports, articles, customer reviews, client notes, and social media posts), but it also includes audio, video, and remote system monitoring data.¹

However, unstructured data poses a number of challenges for organizations attempting to extract value from it using legacy data management tools. It is not easy to search, analyze, or query, especially on the fly. Its complexity creates processing problems for extracting analytical insights. Poor visibility and control create other issues with regard to governance and data security.

A modern data management platform that can effectively incorporate unstructured data (along with structured and semi-structured files) offers valuable advantages such as more complete data analysis and better insights for decision-making. An effective solution must include three core capabilities: It should eliminate data silos; provide fast and flexible data processing; and ensure easy, secure access.

THE EVOLUTION OF DATA MANAGEMENT

The ability to analyze data is why businesses, governments, and other organizations invest in computers. Extracting insights to gain tactical and strategic advantages has always been the goal. The first computers were essentially solving long, hard math problems on the small amounts of raw data available at that time.

But today, data arrives from diverse sources in massive amounts and can appear in any form: structured, semi-structured, or unstructured. Traditional data management technologies are unable to consistently support multiple data formats, causing organizations to seek out new methods for getting maximum value from all of their data.

Each form of data is important, and all must be used to form a full analytical picture.

STRUCTURED DATA

Conventional data management systems were designed decades ago, when data arrived in very predictable, structured formats. Relational data with fixed schemas was the norm because data sources were limited and didn’t change very often. Table-based data warehouses offered highly controlled environments for storing and managing this kind of data. At this time, most data analysis was limited to structured data, because the data was well organized and could be easily read by analytics algorithms.

SEMI-STRUCTURED DATA

The rapid decrease in the cost of storing data and the growth in distributed systems led to an explosion of machine-generated data. Semi-structured data formats such as JSON, Avro, and others became the de facto form in which this data is sent and stored. This data was always intended to be more machine friendly, both in how it’s generated and how it would later be processed programmatically.

As generally defined, semi-structured data does not obey the tabular structure of table-based data management systems developed for and by humans, but it does contain tags or other markers to separate semantic elements and enforce hierarchies.²
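To make the difference between semi-structured and table-based data concrete, here is a minimal Python sketch. It assumes a hypothetical machine-generated JSON event (the field names `device_id`, `ts`, and `readings` are invented for illustration) and shows how the tags and nesting in such a record can be flattened into fixed-column rows of the kind a table-based system expects. It is one possible approach, not a description of any particular product.

```python
import json

# Hypothetical machine-generated event; all field names are illustrative.
raw_event = """
{
  "device_id": "sensor-17",
  "ts": "2021-06-01T12:00:00Z",
  "readings": [
    {"metric": "temp_c", "value": 21.5},
    {"metric": "humidity", "value": 40}
  ]
}
"""


def flatten_event(event_json):
    """Turn one nested event into flat rows, one row per reading."""
    event = json.loads(event_json)
    rows = []
    for reading in event.get("readings", []):
        rows.append({
            "device_id": event["device_id"],  # repeated on every row
            "ts": event["ts"],
            "metric": reading["metric"],
            "value": reading["value"],
        })
    return rows


rows = flatten_event(raw_event)
for row in rows:
    print(row)
```

The keys act as the "tags" that separate semantic elements, and the nested `readings` list is the hierarchy; flattening repeats the parent fields on every row so the result fits a fixed schema.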

Data lakes emerged over the last decade and made it easier to manage semi-structured data. More recently, some organizations have relied on a mix of table-based and file-based management systems.

UNSTRUCTURED DATA

While data lakes expand management and analytics to more kinds of data, these architectures don’t work well for the rapidly expanding quantities of unstructured data that businesses are now collecting.

There has been a rapid increase in the amount of unstructured data that needs to be analyzed. According to IDC projections reported by Analytics Insight, 80% of the world’s data will be unstructured by 2025, and just 0.5% of these resources are being analyzed and used today.³

Humans natively create unstructured data. In the same way that machines interacting with the world create huge volumes of semi-structured data, humans interacting with organizations create a huge volume of unstructured data. Unstructured data is defined by the fact that it is not organized in a predefined manner, which results in irregularities and ambiguities that make it difficult to manage, secure, govern, and process using traditional approaches, according to Wikipedia.⁴

Examples of unstructured data include digital files that contain complex data such as images, videos, audio, and .pdf documents. It also includes many industry-specific file formats: DICOM (medical imaging); .vcf (genomics); .kdf (semiconductors); and .hdf5 (aerospace).

Unstructured data is widely regarded as an untapped resource for feeding customer analytics and marketing intelligence applications. While there’s vast potential for extracting value from unstructured data, its complexity and the sheer volume of information being generated require a new evolutionary step in how this kind of data is managed. Organizations need an easy way to access, process, and govern their stores of unstructured files.

HOW CAN UNSTRUCTURED DATA HELP YOU?

When done well, incorporating unstructured data into your analytics and decision-making can open up a new perspective for your organization, as well as new opportunities. Here are a few examples of what you can do with unstructured data:

• Analyzing customer behavior on social media to inform targeted marketing campaigns by identifying specific regions or the demographics of customers who are talking about a specific product.
• Expediting automobile insurance claims processing by automatically applying machine learning (ML) to image files for pattern recognition.
• Analyzing call center audio recordings to derive marketing insights, for example through sentiment analysis.
• Scanning doctors’ handwritten notes for terms that could indicate good clinical trial candidates and joining that information with structured data to identify and register trial candidates faster.
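The clinical-trial use case above, scanning notes for terms and joining the matches with structured data, can be sketched in a few lines of Python. Everything here is hypothetical: the patient table, the notes, and the search terms are invented, and the notes are assumed to have been transcribed to text already (handwriting recognition is out of scope for this sketch).

```python
import re

# Hypothetical structured table, keyed by patient id.
patients = {
    "p1": {"age": 54, "region": "north"},
    "p2": {"age": 37, "region": "south"},
    "p3": {"age": 61, "region": "north"},
}

# Hypothetical notes, assumed to be already transcribed to plain text.
notes = {
    "p1": "Patient reports chronic migraine, no prior treatment.",
    "p2": "Routine check-up. No complaints.",
    "p3": "History of MIGRAINE and hypertension.",
}

TRIAL_TERMS = ["migraine", "hypertension"]  # illustrative search terms


def find_terms(text, terms):
    """Return the sorted terms that occur as whole words, ignoring case."""
    found = set()
    for term in terms:
        pattern = r"\b" + re.escape(term) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.add(term)
    return sorted(found)


def candidate_rows(patients, notes, terms):
    """Join note matches with the structured patient records."""
    rows = []
    for pid in sorted(notes):
        matched = find_terms(notes[pid], terms)
        if matched and pid in patients:
            rows.append({"patient_id": pid, **patients[pid],
                         "matched_terms": matched})
    return rows


candidates = candidate_rows(patients, notes, TRIAL_TERMS)
for row in candidates:
    print(row)
```

Simple keyword matching is used here only to keep the example short; a real pipeline would likely need more robust text processing.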

CHALLENGES FOR UNSTRUCTURED DATA MANAGEMENT AND ANALYTICS

Traditional data management systems (that is, data warehouses and data lakes) aren’t able to support all of the workload demands for today’s data volume, velocity, and variety of formats. As a result, these systems have to add different tools to support the various types of data (structured, semi-structured, and unstructured).

According to DCIG, “Many organizations now need to manage multiple petabytes of data. At petabyte scale, storing, protecting, backing it up, and recovering it all is problematic using legacy solutions.”⁵

Blob storage services provided by the public cloud providers (such as Amazon S3 and Azure Blob Containers) have become the default storage for unstructured data files. However, these have many limitations for analytics use cases. For example, listing files in blob storage can be challenging and limited to only prefix-based searches. Without the formal table-based or file-based organizational system to help guide data storage, consistently accessing, managing, controlling, searching for, and securing unstructured data with these services becomes much more difficult.

UNSTRUCTURED DATA COMPLEXITY

Unstructured data itself is complex and hard to analyze. The different file formats that make up a stored body of unstructured data are also separate, and it can be difficult to make cohesive sense of the assembled set of information.

Joining unstructured data sources with other data formats or data sets is particularly challenging, especially if the unstructured data involves audio and video media files. These issues lead to siloed and unused data. When data is stuck in silos, organizations experience limited query performance due to poor visibility, and some data is entirely inaccessible.

DATA PROCESSING ISSUES

Reliance on disparate data management tools and systems also creates complex data pipelines that degrade analytics performance. Converting unstructured data to structured data by extracting text from PDF files or using image recognition software can be cumbersome, compute-intensive, and time-consuming. Relying on legacy solutions for managing unstructured data leads to processing problems such as broken data pipelines and error-prone data movement due to frequent copying of data from one place to another. It also slows digital transformation efforts, preventing you from seeing the intended business impact of data operations and fulfilling the organization’s goals.
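The prefix-only listing limitation mentioned earlier in this section can be illustrated with a small in-memory simulation. The sketch below makes no cloud SDK calls; the object keys and helper names are hypothetical. It contrasts a prefix listing with a simple catalog that records per-file metadata so files can be found by attributes other than their path prefix.

```python
# Hypothetical object keys, as they might appear in a blob store.
object_keys = [
    "claims/2021/01/photo_001.jpg",
    "claims/2021/01/report_001.pdf",
    "claims/2021/02/photo_002.jpg",
    "calls/2021/01/call_001.wav",
]


def list_by_prefix(keys, prefix):
    """Simulate a prefix-based listing: only the key's start can be matched."""
    return [key for key in keys if key.startswith(prefix)]


def build_catalog(keys):
    """Build a minimal catalog with one metadata record per file."""
    catalog = []
    for key in keys:
        folder, _, filename = key.rpartition("/")
        _, _, extension = filename.rpartition(".")
        catalog.append({"key": key, "folder": folder, "extension": extension})
    return catalog


def search_catalog(catalog, extension):
    """Find files by extension, regardless of where they are stored."""
    return [entry["key"] for entry in catalog if entry["extension"] == extension]


january_claims = list_by_prefix(object_keys, "claims/2021/01/")
catalog = build_catalog(object_keys)
all_photos = search_catalog(catalog, "jpg")

print(january_claims)
print(all_photos)
```

A prefix listing answers "what is under this path?" but not "where are all the images?"; the catalog answers the second question because the metadata is stored separately from the key layout.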

GOVERNANCE AND SECURITY UNCERTAINTIES

When high volumes of complex unstructured data combine with the rigid architectures of traditional data systems, managing data access becomes very difficult. This is especially true when it comes to limiting access based on the specific type of data and the user’s role (necessary for implementing “zero trust” security controls).

According to Security Weekly, government cybersecurity experts have clearly settled on moving to the cloud and implementing a zero-trust architecture as being the two most immediate and practical methods to improve the nation’s cybersecurity posture.⁶

Data privacy laws, such as the EU’s General Data Protection Regulation (GDPR), don’t distinguish between structured and unstructured data. Regardless of its form, data that contains private information must remain under the control and protection of an organization at all times. According to CPO Magazine, GDPR fines jumped by 39% in 2020, and total fines for European Union member states as of January 2021 amount to about 332.4 million USD.⁷

Gartner predicted that, “By 2023, 65% of the world’s population will have its personal data covered under modern privacy regulations, up from 10% in 2020.”⁸

Specific governance and security issues surrounding unstructured data include:

• Migrating existing permissions. Unstructured data is often sourced from other platforms where the files already have complex permissions related to those systems. Understanding those permissions is complex, and then mapping them to a new platform is extremely challenging.
• Data sharing. According to Verizon, 61% of data breaches last year involved credentials, and 25% specifically used stolen credentials.⁹ How can an organization give users access to data without giving them credentials?
• Risks with data movement. Siloed data that is copied and then resides in multiple places creates a lot of unnecessary risk exposure.
• Right to be forgotten. Data that’s inaccessible or that has been copied across disparate management architectures can be difficult to fully expunge to maintain compliance with different regional data privacy laws. This introduces the risk of regulatory fines and potential litigation costs.

WHAT AN EFFECTIVE SOLUTION LOOKS LIKE

Storing and governing unstructured data is one of the most important tasks for data architecture administrators. An effective solution for managing unstructured data should include built-in capabilities to store, access, process, govern, secure, and share an ever-expanding volume of this data. As such, the system must specifically deliver sufficient performance, concurrency, and scale while solving the critical shortcomings of the legacy approaches in place today.

NO DATA SILOS

Modern data management needs to be based on a single, cloud-based platform that supports all data formats (structured, semi-structured, and unstructured) to easily store, access, process, share, and analyze files. Data engineers should be able to store and retrieve files in a cloud-agnostic way, so data is accessible across clouds and regions, while still enforcing unified policies.

The solution should use a simplified architecture to help reduce maintenance and management overhead. It should also offer flexibility to store unstructured data files in either an internal or external stage.

FAST, FLEXIBLE PROCESSING

A modern management solution depends on ample processing capabilities that can transform, prepare, and enrich unstructured data to extract more-complete insights using complex analytics, data science, and interactive applications. The solution must provide fast, reliable performance without needing manual tuning or causing workload contention. It should offer elastic workload concurrency via cost-efficient scalability across any volume of users, jobs, or data.

In addition to compute performance, data scientists need to be free to work with their tools of choice to process unstructured data, maximizing their productivity. Also, to ensure a continuous data pipeline, the solution needs to make its outputs easily available and transparent for others to use.

EASY, SECURE ACCESS

Finally, the solution must enable users to conveniently search and share their unstructured data. It should include a built-in file catalog for quickly locating files in their stages. The solution should also support scoped access: the ability to create secure views on catalogs and share those secure views with other accounts without making physical copies or sharing credentials for access to physical files.

Organizations need governance at scale with flexible policies that follow the data for consistent enforcement across users and workloads. In support of zero-trust requirements, a data management solution must help control access to sensitive data as appropriate for a user’s defined role. To achieve this, governance for unstructured files should use cloud-agnostic role-based access control (RBAC) commands (such as simple GRANT and REVOKE statements). This avoids the potential complexities of security or governance policies in each cloud provider’s identity and access management (IAM) system.
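The role-based access control idea described in this section can be modeled with a short Python sketch. This is a toy illustration of grant and revoke semantics over a file catalog, not the SQL GRANT and REVOKE statements of any real platform; the class, role, user, and stage names are all hypothetical.

```python
class FileAccessControl:
    """Toy role-based access control over a file catalog (illustrative only)."""

    def __init__(self):
        self._role_grants = {}  # role -> set of stage names it may read
        self._user_roles = {}   # user -> set of roles assigned to the user

    def assign_role(self, user, role):
        self._user_roles.setdefault(user, set()).add(role)

    def grant(self, role, stage):
        """Allow a role to read files in a stage."""
        self._role_grants.setdefault(role, set()).add(stage)

    def revoke(self, role, stage):
        """Remove a role's read access to a stage, if it had any."""
        self._role_grants.get(role, set()).discard(stage)

    def visible_files(self, user, catalog):
        """Return catalog paths the user may see through any assigned role."""
        allowed = set()
        for role in self._user_roles.get(user, set()):
            allowed |= self._role_grants.get(role, set())
        return [entry["path"] for entry in catalog if entry["stage"] in allowed]


# Hypothetical file catalog: each entry records the stage a file lives in.
catalog = [
    {"stage": "claims_stage", "path": "claims/photo_001.jpg"},
    {"stage": "claims_stage", "path": "claims/report_001.pdf"},
    {"stage": "calls_stage", "path": "calls/call_001.wav"},
]

acl = FileAccessControl()
acl.assign_role("dana", "claims_analyst")
acl.grant("claims_analyst", "claims_stage")
before_revoke = acl.visible_files("dana", catalog)

acl.revoke("claims_analyst", "claims_stage")
after_revoke = acl.visible_files("dana", catalog)

print(before_revoke)
print(after_revoke)
```

Access is attached to the role rather than to the user or to storage credentials, so the user sees a scoped view of the catalog and loses it as soon as the grant is revoked, without any files being copied.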

A GLOBAL, UNIFIED SYSTEM FOR MANAGING UNSTRUCTURED DATA

The next evolutionary step in data management should be defined by how all forms of modern data can be shared and consumed (not just by internal teams, but by customers and partners as well) to extract maximum value.

To achieve this, organizations need a global, unified system for connecting companies and data providers to the most-relevant data for their business. An effective solution must combine structured, semi-structured, and unstructured data and provide a single and seamless experience for storing, processing, and analyzing data across public clouds.

The best practices in this ebook will help you start maximizing the value of all your data today. To learn more about how you can store, access, process, govern, and share unstructured data in a single data platform, watch our 7 Ways to Start Using Unstructured Data in Snowflake webinar (support for unstructured data currently in preview).

ABOUT SNOWFLAKE

Snowflake delivers the Data Cloud, a global network where thousands of organizations mobilize data with near-unlimited scale, concurrency, and performance. Inside the Data Cloud, organizations unite their siloed data, easily discover and securely share governed data, and execute diverse analytic workloads. Wherever data or users live, Snowflake delivers a single and seamless experience across multiple public clouds. Snowflake’s platform is the engine that powers and provides access to the Data Cloud, creating a solution for data warehousing, data lakes, data engineering, data science, data application development, and data sharing. Join Snowflake customers, partners, and data providers already taking their businesses to new frontiers in the Data Cloud. snowflake.com

© 2021 Snowflake Inc. All rights reserved. Snowflake, the Snowflake logo, and all other Snowflake product, feature and service names mentioned herein are registered trademarks or trademarks of Snowflake Inc. in the United States and other countries. All other brand names or logos mentioned or used herein are for identification purposes only and may be the trademarks of their respective holder(s). Snowflake may not be associated with, or be sponsored or endorsed by, any such holder(s).

CITATIONS

1. bit.ly/3if52aK
2. wikipedia.org/wiki/Semi-structured_data
3. bit.ly/2XQufkz
4. wikipedia.org/wiki/Unstructured_data
5. bit.ly/3CIqnkS
6. bit.ly/2WlxxvU
7. bit.ly/3if6V7A
8. gtnr.it/2XSRzOC
9. z.to/3zLjyx8
