Relational Database Design And Implementation For Biodiversity . - Athro

1y ago
28 Views
2 Downloads
1.96 MB
66 Pages
Last View : 1d ago
Last Download : 3m ago
Upload by : Evelyn Loftin
Transcription

PhyloInformatics 7: 1-66 - 2005Relational Database Design andImplementation for BiodiversityInformaticsPaul J. MorrisThe Academy of Natural Sciences1900 Ben Franklin Parkway, Philadelphia, PA 19103 USAReceived: 28 October 2004 - Accepted: 12 January 2005AbstractThe complexity of natural history collection information and similar information within the scopeof biodiversity informatics poses significant challenges for effective long term stewardship of thatinformation in electronic form. This paper discusses the principles of good relational databasedesign, how to apply those principles in the practical implementation of databases, andexamines how good database design is essential for long term stewardship of biodiversityinformation. Good design and implementation principles are illustrated with examples from therealm of biodiversity information, including an examination of the costs and benefits of differentways of storing hierarchical information in relational databases. This paper also discussestypical problems present in legacy data, how they are characteristic of efforts to handle complexinformation in simple databases, and methods for handling those data during data migration.IntroductionThe data associated with natural historycollection materials are inherently complex.Management of these data in paper formhas produced a variety of documents suchas catalogs, specimen labels, accessionbooks, stations books, map files, field notefiles, and card indices. The simpleappearance of the data found in any one ofthese documents (such as the columns foridentification, collection locality, datecollected, and donor in a handwrittencatalog ledger book) mask the inherentcomplexity of the information. Theappearance of simplicity overlying highlycomplex information provides significantchallenges for the management of naturalhistory collection information (and othersystematic and biodiversity information) inelectronic form. These challenges includemanagement of legacy data producedduring the history of capture of naturalhistory collection information into databasemanagement systems of increasingsophistication and complexity.In this document, I discuss some of theissues involved in handling complexbiodiversity information, approaches to thestewardship of such information in electronicform, and some of the tradeoffs betweendifferent approaches. I focus on the verywell understood concepts of relationaldatabase design and implementation.Relational1 databases have a strong(mathematical) theoretical foundation1Object theory offers the possibility of handling muchof the complexity of biodiversity information in objectoriented databases in a much more effective mannerthan in relational databases, but object oriented andobject-relational database software is much lessmature and much less standard than relationaldatabase software. Data stored in a relational DBMSare currently much less likely to become trapped in adead end with no possibility of support than data in anobject oriented DBMS.1

PhyloInformatics 7: 2-66 - 2005(Codd, 1970; Chen, 1976), and a widerange of database software productsavailable for implementing relationaldatabases.Figure 1. Typical paths followed by biodiversityinformation. The cylinder represents storage ofinformation in electronic form in a database.The effective management of biodiversityinformation involves many competingpriorities (Figure 1). The most importantpriorities include long term datastewardship, efficient data capture (e.g.Beccaloni et al., 2003), creating high qualityinformation, and effective use of limitedresources. Biodiversity information storagesystems are usually created and maintainedin a setting of limited resources. The mostappropriate design for a database to supportlong term stewardship of biodiversityinformation may not be a complex highlynormalized database well fitted to thecomplexity of the information, but rathermay be a simpler design that focuses on themost important information. This is not tosay that database design is not important.Good database design is vitally importantfor stewardship of biodiversity information.In the context of limited resources, gooddesign includes a careful focus on whatinformation is most important, allowingprogramming and database administrationto best support that information.Database Life CycleAs natural history collections data havebeen captured from paper sources (such ascentury old handwritten ledgers) and haveaccumulated in electronic databases, thenatural history museum community hasobserved that electronic data need muchmore upkeep than paper records (e.g.National Research Council, 2002 p.62-63).Every few years we find that we need tomove our electronic data to some newdatabase system. These migrations areusually driven by changes imposed upon usby the rapidly changing landscape ofoperating systems and software.Maintaining a long obsolete computerrunning a long unsupported operatingsystem as the only means we have to workwith data that reside in a long unsupporteddatabase program with a custom front endwritten in a language that nobody writescode for anymore is not a desirablesituation. Rewriting an entire collectionsdatabase system from scratch every fewyears is also not a desirable situation. Thecomputer science folks who think aboutdatabases have developed a conceptualapproach to avoiding getting stuck in suchunpleasant situations – the database lifecycle (Elmasri and Navathe, 1994). Thedatabase life cycle recognizes that databasemanagement systems change over time andthat accumulated data and user interfacesfor accessing those data need to bemigrated into new systems over time.Inherent in the database life cycle is theinsight that steps taken in the process ofdeveloping a database substantially impactthe ease of future migrations.A textbook list (e.g. Connoly et al., 1996) ofstages in the database life cycle runssomething like this: Plan, design,implement, load legacy data, test,operational maintenance, repeat. In slightlymore detail, these steps are:1. Plan (planning, analysis, requirementscollection).2. Design (Conceptual database design,leading to information model, physicaldatabase design [including systemarchitecture], user interface design).3. Implement (Database implementation,user interface implementation).4. Load legacy data (Clean legacy data,transform legacy data, load legacydata).5. Test (test implementation).6. Put the database into production useand perform operational maintenance.7. Repeat this cycle (probably every tenyears or so).Being a visual animal, I have drawn adiagram to represent the database life cycle(Figure 2). Our expectation of databasesshould not be that we capture a largequantity of data and are done, but ratherthat we will need to cycle those data2

PhyloInformatics 7: 3-66 - 2005Figure 2. The Database Life Cyclethrough the stages of the database life cyclemany times.In this paper, I will focus on a few parts ofthe database life cycle: the conceptual andlogical design of a database, physicaldesign, implementation of the databasedesign, implementation of the user interfacefor the database, and some issues for themigration of data from an existing legacydatabase to a new design. I will provideexamples from the context of natural historycollections information. Plan ahead. Gooddesign involves not just solving the task athand, but planning for long termstewardship of your data.Levels and architectureA requirements analysis for a databasesystem often considers the networkarchitecture of the system. The differencebetween software that runs on a singleworkstation and software that runs on aserver and is accessed by clients across anetwork is a familiar concept to most usersof collections information. In some cases, adatabase for a collection running on a singleworkstation accessed by a single userprovides a perfectly adequate solution forthe needs of a collection, provided that theworkstation is treated as a server with anuninterruptible power supply, backupdevices and other means to maintain theintegrity of the database. Any computerrunning a database should be treated as aserver, with all the supporting infrastructurenot needed for the average workstation. Inother cases, multiple users are capturingand retrieving data at once (either locally orglobally), and a database system capable ofrunning on a server and being accessed bymultiple clients over a network is necessaryto support the needs of a collection orproject.It is, however, more helpful for anunderstanding of database design to thinkabout the software architecture. That is, tothink of the functional layers involved in adatabase system. At the bottom level is theDBMS (database management system [see3

PhyloInformatics 7: 4-66 - 2005glossary, p.64]), the software that runs thedatabase and stores the data (layeredbelow this is the operating system and itsfilesystem, but we can ignore these fornow). Layered above the DBMS is youractual database table or schema layer.Above this may be various code andnetwork transport layers, and finally, at thetop, the user interface through which peopleenter and retrieve data (Figure 29). Somedatabase software packages allow easyseparation of these layers, others aremonolithic, containing database, code, andfront end into a single file. A databasesystem that can be separated into layerscan have advantages, such as multiple userinterfaces in multiple languages over asingle data source. Even for monolithicdatabase systems, however, it is helpful tothink conceptually of the table structuresyou will use to store the data, code that youwill use to help maintain the integrity of thedata (or to enforce business rules), and theuser interface as distinct components,distinct components that have their ownplaces in the design and implementationphases of the database life cycle.Relational Database DesignWhy spend time on design? The answer issimple:Poor Design Time GarbageAs more and more data are entered into apoorly designed database over time, and asexisting data are edited, more and moreerrors and inconsistencies will accumulatein the database. This may result in bothentirely false and misleading dataaccumulating in the database, or it mayresult in the accumulation of vast numbersof inconsistencies that will need to becleaned up before the data can be usefullymigrated into another database or linked toother datasets. A single extremely carefuluser working with a dataset for just a fewyears may be capable of maintaining cleandata, but as soon as multiple users or morethan a couple of years are involved, errorsand inconsistencies will begin to creep intoa poorly designed database.Thinking about database design is useful forboth building better database systems andfor understanding some of the problems thatexist in legacy data, especially thoseentered into older database systems.Museum databases that begandevelopment in the 1970s and early 1980sprior to the proliferation of effective softwarefor building relational databases were oftenwritten with single table (flat file) designs.These legacy databases retain artifacts ofseveral characteristic field structures thatwere the result of careful design efforts toboth reduce the storage space needed bythe database and to handle one to manyrelationships between collection objects andconcepts such as identifications.Information modelingThe heart of conceptual database design isinformation modeling. Information modelinghas its basis in set algebra, and can beapproached in an extremely complex andmathematical fashion. Underlying thiscomplexity, however, are two core concepts:atomization and reduction of redundantinformation. Atomization means placingonly one instance of a single concept in asingle field in the database. Reduction ofredundant information means organizing adatabase so that a single text stringrepresenting a single piece of information(such as the place name DemocraticRepublic of the Congo) occurs in only asingle row of the database. This one row isthen related to other information (such aslocalities within the DRC) rather than eachrow containing a redundant copy of thecountry name.As information modeling has a firm basis inset theory and a rich technical literature, it isusually introduced using technical terms.This technical vocabulary include terms thatdescribe how well a database designapplies the core concepts of atomizationand reduction of redundant information (firstnormal form, second normal form, thirdnormal form, etc.) I agree with Hernandez(2003) that this vocabulary does not makethe best introduction to informationmodeling2 and, for the beginner, masks theimportant underlying concepts. I will thus2I do, however, disagree with Hernandez'entirely free form approach to databasedesign.4

PhyloInformatics 7: 5-66 - 2005describe some of this vocabulary only afterexamining the underlying principles.Atomization1) Place only one concept in eachfield.Legacy data often contain a single field fortaxon name, sometimes with the author andyear also included in this field. Considerthe taxon name Palaeozygopleurahamiltoniae (HALL, 1868). If this name isplaced as a string in a single field“Palaeozygopleura hamiltoniae (Hall,1868)”, it becomes extremely difficult to pullthe components of the name apart to, say,display the species name in italics and theauthor in small caps in an html document: em Palaeozygopleura hamiltoniae /em (H font size -2 ALL /font , 1868), or toassociate them with the appropriate tags inan XML document. It likewise is muchharder to match the search criteriaGenus Loxonema and Trivial hamiltoniaeto this string than if the components of thename are separated into different fields. Ataxon name table containing fields forGeneric name, Subgeneric name, TrivialEpithet, Authorship, Publication year, andParentheses is capable of handling mostidentifications better than a single text field.However, there are lots more complexities –subspecies, varieties, forms, cf., near,questionable generic placements,questionable identifications, hybrids, and soforth, each of which may need its own fieldto effectively handle the wide range ofdifferent variations of taxon names that canbe used as identifications of collectionobjects. If a primary purpose of the data setis nomenclatural, then substantial thoughtneeds to be placed into this complexity. Ifthe primary purpose of the data set is torecord information associated with collectionobjects, then recording the name used andindicators of uncertainty of identification arethe most important concepts.2) Avoid lists of items in a field.Legacy data often contain lists of items in asingle field. For example, a remarks fieldmay contain multiple remarks made atdifferent times by different people, or ageographic distribution field may contain alist of geographic place names. Forexample, a geographic distribution fieldmight contain the list of values “New York;New Jersey; Virginia; North Carolina”. Ifonly one person has maintained the data setfor only a few years, and they have beenvery careful, the delimiter “;” will separate allinstances of geographic regions in eachstring. However, you are quite likely to findthat variant delimiters such as “,” or “ ” or“:” or “'” or “l” have crept into the data.Lists of data in a single field are a commonlegacy solution to the basic informationmodeling concept that one instance of onesort of data (say a species name) can berelated to many other instances of anothersort of data. A species can be distributed inmany geographic regions, or a collectionobject can have many identifications, or alocality can have many collections madefrom it. If the system you have for storingdata is restricted to a single table (as inmany early database systems used in theNatural History Museum community), thenyou have two options for capturing suchinformation. You can repeat fields in thetable (a field for current identification andanother field for previous identification), oryou can list repeated values in a single field(hopefully separated by a consistentdelimiter).Reducing RedundantInformationThe most serious enemy of clean data inlong -lived database systems is redundantcopies of information. Consider a localitytable containing fields for country, primarydivision (province/state), secondary division(county/parish), and named place(municipality/city). The table will containmultiple rows with the same value for eachof these fields, since multiple localities canoccur in the vicinity of one named place.The problem is that multiple different textstrings represent the same concept anddifferent strings may be entered in differentrows to record the same information. Forexample, Philadelphia, Phil., City ofPhiladelphia, Philladelphia, and Philly are allvariations on the name of a particularnamed place. Each makes sense whenwritten on a specimen label in the context ofother information (such as country andstate), as when viewed as a single locality5

PhyloInformatics 7: 6-66 - 2005record. However, finding all the specimensthat come from this place in a database thatcontains all of these variations is not aneasy task. The Academy ichthyologycollection uses a legacy Muse databasewith this structure (a single table for localityinformation), and it contains some 16different forms of “Philadelphia, PA, USA”stored in atomized named place, state, andcountry fields. It is not a trivial task tosearch this database on locality informationand be sure you have located all relevantrecords. Likewise, migration of these datainto a more normal database requiresextensive cleanup of the data and is notsimply a matter of moving the data into newtables and fields.redundant information in a single localitytable), the other normalizing them intocountry and primary division tables andcreating a relationship between countriesand states.The core problem is that simple flat tablescan easily have more than one rowcontaining the same value. The goal ofnormalization is to design tables that enableusers to link to an existing row rather than toenter a new row containing a duplicate ofinformation already in the database.Table 1. A flat locality table.Rows in the single flat table, given time, willaccumulate discrepancies between thename of a country used in one row and adifferent text string used to represent thesame country in other rows. The problemarises from the redundant entry of theCountry name when users are unaware ofexisting values when they enter data andare freely able to enter any text string in therelevant field. Data in a flat file locality tablemight look something like those in Table 1:Locality id300301302303CountryUSAUSAUSAUnitedStatesPrimary DivisionMontanaPennsylvaniaNew YorkMassachusettsExamination of the values in individual rows,such as, “USA, Montana”, or “United States,Massachusetts” makes sense and is easilyintelligible. Trying to ask questions of thistable, however, is a problem. How manystates are there in the “USA”? The tablecan't provide a correct answer to thisquestion unless we know that “USA” and“United States” both occur in the table andthat they both mean the same thing.The same information stored cleanly in tworelated tables might look something likethose in Table 2:Table 2. Separating Table 1 into two relatedtables, one for country, the other for primarydivision (state/province/etc.).Figure 3. Design of a flat locality table (top) withfields for country and primary division comparedwith a pair of related tables that are able to linkmultiple states to one country without creatingredundant entries for the name of that country.The notation and concepts involved in theseEntity-Relationship diagrams are explained below.Country idName300USA301UgandaPrimary fk c country id Primary 02300New York303300MassachusettsContemplate two designs (Figure 3) forholding a country and a primary division (astate, province, or other immediatesubdivision of a country): one holdingcountry and primary division fields (withHere there is a table for countries that holdsone row for USA, together with a numericCountry id, which is a behind the scenesdatabase way for us to find the row in thetable containing “USA' (a surrogate numeric6

PhyloInformatics 7: 7-66 - 2005primary key, of which I will say more later).The database can follow the country id fieldover to a primary division table, where it isrecorded in the fk c country id field (aforeign key, of which I will also say morelater). To find the primary divisions withinUSA, the database can look at theCountry id for USA (300), and then find allthe rows in the primary division table thathave a fk c country id of 300. Likewise,the database can follow these keys in theopposite direction, and find the country forMassachusetts by looking up itsfk c country id in the country id field in thecountry table.Moving country out to a separate table alsoallows storage of a just one copy of otherpieces of information associated with acountry (its northernmost and southernmostbounds or its start and end dates, forexample). Countries have attributes(names, dates, geographic areas, etc) thatshouldn't need to be repeated each time acountry is mentioned. This is a central ideain relational database design – avoidrepeating the same information in more thanone row of a table.It is possible to code a variety of userinterfaces over either of these designs,including, for example, one with a picklist forcountry and a text box for state (as in Figure4). Over either design it is possible toenforce, in the user interface, a rule thatdata entry personnel may only pick anexisting country from the list. It is possibleto use code in the user interface to enforcea rule that prevents users from enteringPennsylvania as a state in the USA andthen separately entering Pennsylvania as astate in the United States. Likewise, witheither design it is possible to code a userinterface to enforce other rules such asconstraining primary divisions to thoseknown to be subdivisions of the selectedcountry (so that Pennsylvania is notFigure 4. Example data entry form using a picklistcontrol together with a text entry control.recorded as a subdivision of Albania).By designing the database with two relatedtables, it is possible to enforce these rules atthe database level. Normal data entrypersonnel may be granted (at the databaselevel) rights to select information from thecountry table, but not to change it. Higherlevel curatorial personnel may be grantedrights to alter the list of countries in thecountry table. By separating out the countryinto a separate table and restricting accessrights to that table in the database, thestructure of the database can be used toturn the country table into an authority fileand enforce a controlled vocabulary forentry of country names. Regardless of theuser interface, normal data entry personnelmay only link Pennsylvania as a state inUSA. Note that there is nothing inherent inthe normalized country/primary divisiontables themselves that prevents users whoare able to edit the controlled vocabulary inthe Country Table from entering redundantrows such as those below in Table 3.Fundamentally, the users of a database areresponsible for the quality of the data in thatdatabase. Good design can only assistthem in maintaining data quality. Gooddesign alone cannot ensure data quality.Table 3. Country and primary division tablesshowing a pair of redundant Country values.Country idName500USA501United StatesPrimaryDivision id300301302303fk c country id Primary Division500500500501MontanaPennsylvaniaNew YorkMassachusettsIt is possible to enforce the rules above atthe user interface level in a flat file. Thisenforcement could use existing values in thecountry field to populate a pick list of countrynames from which the normal data entryuser may only select a value and may notenter new values. Since this rule is onlyenforced by the programing in the userinterface it could be circumvented by users.More importantly, such a business ruleembedded in the user interface alone caneasily be forgotten and omitted when dataare migrated from one database system toanother.7

PhyloInformatics 7: 8-66 - 2005Normalized tables allow you to more easilyembed rules in the database (such asrestricting access to the country table tohighly competent users with a large stake inthe quality of the data) that make it harderfor users to degrade the quality of the dataover time. While poor design ensures lowquality data, good design alone does notensure high quality data.Good design thus involves carefulconsideration of conceptual and logicaldesign, physical implementation of thatconceptual design in a database, and gooduser interface design, with all else followingfrom good conceptual design.Entity-Relationship modelingUnderstanding the concepts to be stored inthe database is at the heart of gooddatabase design (Teorey, 1994; Elmasriand Navathe, 1994). The conceptual designphase of the database life cycle shouldproduce a result known as an informationmodel (Bruce, 1992). An information modelconsists of written documentation ofconcepts to be stored in the database, theirrelationships to each other, and a diagramshowing those concepts and theirrelationships (an Entity-Relationship or E-Rdiagram, ). A number of information modelsfor the biodiversity informatics communityexist (e.g. Blum, 1996a; 1996b; Berendsohnet al., 1999; Morris, 2000; Pyle 2004), mostare derived at least in part from theconcepts in ASC model (ASC, 1992).Information models define entities, listattributes for those entities, and relateentities to each other. Entities andattributes can be loosely thought of astables and fields. Figure 5 is a diagram of alocality entity with attributes for a mysteriouslocalityid, and attributes for country andprimary division. As in the example above,this entity can be implemented as a tablewith localityid, country, and primary divisionfields (Table 4).Table 4. Example locality data.Locality idCountry300USA301USAPrimary DivisionMontanaPennsylvaniaEntity-relationship diagrams come in avariety of flavors (e.g. Teorey, 1994). TheChen (1976) format for drawing E-Rdiagrams uses little rectangles for entitiesand hangs oval balloons off of them forattributes. This format (as in the distributionregion entity shown on the right in Figure 6below) is very useful for scribbling out draftsof E-R diagrams on paper or blackboard.Most CASE (Computer Aided SoftwareEngineering) tools for working withdatabases, however, use variants of theIDEF1X format, as in the locality entityabove (produced with the open source toolDruid [Carboni et al, 2004]) and thecollection object entity on the left in Figure 6(produced with the proprietary tool xCase[Resolution Ltd., 1998]), or the relationshipdiagram tool in MS Access. Variants of theIDEF1X format (see Bruce, 1992) drawentities as rectangles and list attributes forthe entity within the rectangle.Not all attributes are created equal. Thediagrams in Figures 5 and 6 list attributesthat have “ID” appended to the end of theirnames (localityid, countryid, collectionobjectid, intDistributionRegionID). Theseare primary keys. The form of this notationvaryies from one E-R diagram format toanother, being the letters PK, or anunderline, or bold font for the name of theprimary key attribute. A primary key can bethought of as a field that contains uniquevalues that let you identify a particular row ina table. A country name field could be theprimary key for a country table, or, as in theexamples here, a surrogate numeric fieldcould be used as the primary key.To give one more example of therelationship between entities as abstractconcepts in an E-R model and tables in adatabase, the tblDistributionRegion entityshown in Chen notation in Figure 6 could beimplemented as a table, as in Table 5, witha field for its primary key attribute,intDistributionRegionID, and a second fieldfor the region name attributevchrRegionName. This example is a portionof the structure of the table that holdsgeographic distribution area names in aBioLink database (additional fields hold therelationship between regions, allowingPennsylvania to be nested as a geographicregion within the United States nested withinNorth America, and so on).8

PhyloInformatics 7: 9-66 - 2005Figure 5. Part of a flat locality entity. Animplementation with example data is shown belowin Table 4.Figure 6. Comparison between entity and attributes as depicted in a typical CASE tool E-R diagram in avariant of the IDEF1X format (left) and in the Chen format (right, which is more useful for pencil and papermodeling). The E-R diagrams found in this paper have variously been drawn with the CASE tools xCaseand Druid or the diagram editor DiA.Table 5. A portion of a BioLink (CSIRO, 2001)tblDistributionRegion table.database applying a unique index to theprimary key.intDistributionRegionID15161718A primary key need not be a single attribute.A primary key can be a single attributecontaining real data (generic name), a groupof several attributes (generic name, trivialepithet, authorship), or a single attributecontaining a surrogate key (name id). Ingeneral, I recommend the use of surrogatenumeric primary keys for biodiversityinformatics information, because we are tooseldom able to be certain that otherpotential primary keys (candidate keys) willactually have unique values in real lvaniaThe key point to think about when designingdatabases is that things in the real worldcan be thought of in general terms asentities with attributes, and that informationabout these concepts can be stored in thetables and fields of a relational database. Ina further step, things in the real world canbe thought of as objects with properties thatcan do things (methods), and theseconcepts can be mapped in an object model(using an object modeling framework suchas UML) that can be implemented with anobject oriented language su

design, implementation of the database design, implementation of the user interface for the database, and some issues for the migration of data from an existing legacy database to a new design. I will provide examples from the context of natural history collections information. Plan ahead. Good design involves not just solving the task at

Related Documents:

The Relational Algebra A procedural query language Comprised of relational algebra operations Relational operations: Take one or two relations as input Produce a relation as output Relational operations can be composed together Each operation produces a relation A query is simply a relational algebra expression Six "fundamental" relational operations

Keywords: database, query, relational algebra, programming, SQL 1. INTRODUCTION Most commercial database systems are based on the relational data model. Recent editions of database textbooks focus primarily on the relational model. In this dual context, the relational model for data

The Relational Database Model 12 Retrieving Data 15 Advantages of a Relational Database 16 Relational Database Management Systems 18 Beyond the Relational Model 19 What the Future Holds 21 A Final Note 22 Summary 22 Review Questions 24 Chapter

relational database on Amazon EC2 is the ideal scenario for users whose application requires a specific, traditional relational database, or for those users who require a maximum level of control and configurability. Relational Database Management Systems (RDBMS) are some of the most w

relational DBMS (RDBMS) software packages Jukić, Vrbsky, Nestorov – Database Systems Chapter 3 – Slide 2 . Once database requirements are collected and visualized as an ER diagram, the next step in creating a relational database is t\൯ map \ 挀漀渀瘀攀爀琀尩 the ER diagram into a relational schema.\

Relational Database Design 15 RELATIONAL DATABASE DESIGN In order to eliminate the nested relation, pull out the nested relation and form a new table Be sure to include the old key in the ne

of the New database wizard deals with the setup of the relational database. 6. In the New database wizard page, check Use existing (since the database has already been created in the DBMS; see3) and press Next . 7. In the Locate database wizard page, press the Build button to create a connection to the relational database.

RDBMS stands for Relational Database Management System. RDBMS is the basis for SQL and for all modern database systems like MS SQL Server, IBM DB2, Oracle, My SQL, and Microsoft Access. A Relational database management system (RDBMS) is a database management system (DBMS) that is based on the relational model as introduced by E. F. Codd.