Proc SQL, The Data Step Killer

2y ago
14 Views
2 Downloads
677.76 KB
55 Pages
Last View : 5m ago
Last Download : 3m ago
Upload by : Harley Spears
Transcription

Proc SQL, the Data Step KillerMike AtkinsonAcko Systems Consulting IncData StepProc SQL

The Data Step It’s a complete programming language Easy to combine (or merge) datasets It allows you to perform sequential algorithm steps Some algorithms require data to be in sorted order first Can use values of variables from different observations, using retain It has the “where” statement (stolen from SQL) that allows efficient use of filtering criteria prior to other processingThere are some things the data step can do that can’t be done inProc SQL, e.g. Can create multiple datasets in one step Can easily join multiple datasets – each left, right, or full outer join in aProc SQL query can join only two datasets at a time (although innerjoins without the join keyword can bring together any number)

Proc SQL SQL is the de facto standard query language, widely used (beyond SAS even!) for retrieving andsummarizing dataProc SQL can summarize results in the same step asperforming row level calculations without Proc SQL, summarizing requires a separate procsummary step, and often a pre-sortProc SQL can sort its results in the same stepIt can perform distinct countsIt can use “like” expressionsWhile SQL isn’t a complete programming language, ithas “case” expressions, which can be very powerful

Some stuff SAS Proc SQL can do Sending (pass-through) queries to Oracle (or another DBMS)for processing, and receiving the results into a SAS dataset Administration tasks, such as managing SAS datasets andindexes Using the SQL language against SAS datasets as analternative to the Data Step Setting values of macro variables As an alternative to Proc Print

Data stepBasic syntax:data new SAS dataset;set some existing dataset;/* set some existing dataset (keep column 1column 2); to subset variables */* do stuff;run;

Proc SQL Create TableBasic syntax:proc sql;create table new SAS dataset as/* select * for all columns/variables */select column 1,column 2from some existing dataset;quit; Although it says create table, it is actually creating a SAS dataset. PROC SQL terminates with a quit; statement (not run;).

WHERE clauseCan be used in data step statement and within Proc SQL,including within a “case” expressionComparison operators , , , , , or lt, gt, eq, le, ge, neLogic operators, bracketsand, or, note.g. ((a b) and (not (c d)))IN operatorin (values, separated, by, commas)

WHERE clause “like” & “contains”While the “in (list)” has made its way to the IF statement, “like”(and “contains”) have not; they are only in the WHEREThe syntax for CONTAINS is straightforward, e.g.where name contains 'JONES'But I prefer LIKE, which is more powerfulpercent sign (%) – match zero or more charactersunderscore ( ) – match any one charactere.g.where name like 'JONES%'

ORDER BY clauseNot only does PROC SQL not require data to be sortedbeforehand, but you can ask it to sort its resulting outputsimply by adding an ORDER BY clauseThe ORDER BY clause appears last, after the GROUP BY clauseand the HAVING clause, if those are presentThe ORDER BY clause can be used on his own, without groupingThe syntax of the ORDER BY clause is slightly different than the Data Step(and other Procs’) BY statements; the BY statement separates variablesby spaces, while the ORDER BY separates them using commas.

GROUP BY clauseThe GROUP BY clause in Proc SQL lets you summarise data (similar to ProcSummary) but without requiring the data to be sorted beforehand.The GROUP BY clause (if present) follows the WHERE clauseVariables in the GROUP BY are separated by commasUsing a GROUP BY statement does not guarantee the sort order of theresults (although SAS is more likely to put the data into that order thanOracle is). Use an ORDER BY with a GROUP BY if you need to ensure thesort order of the data.Note: the variables in the ORDER BY clause need not match the variables inthe GROUP BY clause.

Summary functionsIf you have a GROUP BY in your query, then every variable youselect should either be listed in the GROUP BY, or besummarised in some way. If you select a variable that is notsummarised and not listed in the GROUP BY clause, you willalmost certainly not get the summarized results you expect.Here are some sample summary functions:sum(services)as services,max(service date)as max servdt,mean(paid amount)as avg paidamt,count(distinct PHN) as patient count

HAVING ClausThe HAVING clause applies after the GROUP BY, WHEREas the WHEREclause applies before groupingThe HAVING clause looks at summarised values, and cannot be usedwithout a GROUP BY clausee.g.proc sql;create table three or more asselect service date,count(*)as record countgroup by service datehaving count(*) 3;quit;

CASE expressionThis is PROC SQL’s closest equivalent to the IF statement. ACASE expression, however, can only return a single value.(an IF statement can use a do/end to to perform multiple actions)The CASE expression consists of a series of WHEN conditions(that use the same syntax as WHERE conditions), followedby ELSE. So it’s really more like an IF THEN/ELSE.Each WHEN condition is accompanied by a THEN expressionthat evaluates to a value.The CASE expression will use the THEN expression of the firstWHEN condition that is found to be True. If none of theWHEN conditions are true, the ELSE expression will be used.It’s good practice to always have an ELSE.

CASE expression exampleproc sql;select case when age 0when age between 1 and 5when age between 6 and 10when age between 11 and 15 else '?' endcount(distinct recipient id)from health servicesgroup by calculated age group;quit;thenthenthenthen' 0'' 1- 5'' 6-10''11-15'as age group,as person cnt

AliasesWhen joining two or more tables, it is useful to use an alias foreach table.The alias can be used as a prefix to variable names to indicatewhich table the variable comes from, which is handier thanusing the whole table name as a prefix.When a variable of the same name appears in more than onetable (being joined using a Proc SQL select statement), youmust specify which table you want to refer to each time yourefer to the variable name. Prefixing variables with the tablealias is the usual way to do this.

LEFT JOIN, RIGHT JOINThe default SQL join is an Inner Join, meaning that only rows that matchacross both tables are includedLEFT JOIN and RIGHT JOIN in Proc SQL always operate on exactly two tables,and the order the tables are listed is very significant. Imagine writingthem on the same line – the first dataset listed is the Left one, and thesecond is the Right dataset.When you use LEFT JOIN or RIGHT JOIN, you use the ON keyword (instead ofthe WHERE keyword) to indicate the join criteria.If you use the INNER JOIN syntax to perform an inner join, you will also needto use the ON keyword

Comparing Inner, Left, and Right joinsHere’s some sample data in two datasets.StudentsStudent ID345678NameGray, JaneAdams, GiselleKeppel, LenGradesStudent ID34345699SubjectMathEnglishMathFrenchGradeABC F

Inner Join (usual, without JOIN keyword)proc sql;create table after inner asselect a.*,b.*aliasfrom studentsa,gradesbwhere a.student id b.student idorder by a.student id;quit;Note: This will give a note in the log that student id already exists in thedataset. Because student id is the same in both datasets (guaranteedby the WHERE condition), this note can be safely ignored.

Okay, here’s how you could rid of the note(without listing all the columns you want)proc sql;create table after inner (drop student id2) asselect a.*,b.*from studentsa,grades(rename (student id student id2)) bwhere a.student id b.student id2order by a.student id;quit;It’s probably easier just to ignore the note in the log.

Results of (default) Inner JoinGradesStudentsStudent ID345678NameGray, JaneAdams, GiselleKeppel, LenStudent ID34345699SubjectMathEnglishMathFrenchDefaultInner Joinon student idAfter InnerStudent ID343456NameGray, JaneGray, JaneAdams, GiselleSubjectMathEnglishMathGradeABC GradeABC F

LEFT Joinproc sql;create table after left asselect a.*,b.*from studentsa left joingradesbon a.student id b.student idorder by a.student id;quit;

Results of Left JoinGradesStudentsStudent ID345678NameGray, JaneAdams, GiselleKeppel, LenStudent ID34345699SubjectMathEnglishMathFrenchLeft Joinon student idAfter LeftStudent ID34345678NameGray, JaneGray, JaneAdams, GiselleKeppel, LenSubjectMathEnglishMathGradeABC GradeABC F

RIGHT joinproc sql;create table after right asselect a.*,b.*from studentsa right joingradesbon a.student id b.student idorder by a.student id;quit;

Results of Right JoinGradesStudentsStudent ID345678NameGray, JaneAdams, GiselleKeppel, LenStudent ID34345699SubjectMathEnglishMathFrenchRight Joinon student idAfter RightStudent ID343456NameGray, JaneGray, JaneAdams, GiselleSubjectMathEnglishMathFrenchGradeABC FGradeABC F

FULL (Outer) joinproc sql;create table after full asselect coalesce(a.student id, b.student id) asstudent id,a.name,b.subject,b.gradefrom studentsa full joingradesbon a.student id b.student idorder by a.student id;quit;

Results of Full (Outer) JoinGradesStudentsStudent ID345678NameGray, JaneAdams, GiselleKeppel, LenStudent ID34345699SubjectMathEnglishMathFrenchFull Joinon student idAfter FullStudent ID3434567899NameGray, JaneGray, JaneAdams, GiselleKeppel, LenSubjectMathEnglishMathGradeABC FrenchFGradeABC F

Traditional SAS Code(Data Step needs helpers!)proc sort data prac info;by prac lha;run;proc summary data prac info;by prac lha;output out prac lha counts(drop type rename ( freq prac cnt));run;27

Proc SQL doing a “summary”proc sql;create table prac lha counts asselect prac lha,count(*) as prac cntfrom prac infogroup by prac lhaorder by prac lha;quit;28

Calculated keyword in Proc SQLThe keyword “calculated” can be used to refer to acolumn being created within a Proc SQL query byname, in a reference later within the same query.It can be used to reference a calculated column withinthe GROUP BY expression, or even in expressions tocreate other columns.There is no abbreviation for “calculated”.

Traditional SAS Codesummarize and lookup a descriptionproc sort data fitm servcd;by servcd;run;proc summary data fitm servcd;by servcd;output out servcd fitm cnts 0(drop type rename ( freq fitm cnt));run;data servcd fitm cnts;set servcd fitm cnts 0;servcd descrip put(servcd, svcd2ds.);run;30

Proc SQL Codeproc sql;create table servcd fitm cnts asselect servcd,put(servcd, svcd2ds.) as servcd descrip,count(*)as fitm cntfrom fitm servcdgroup by servcd, calculated servcd descriporder by servcd;quit;31

Partial results

Proc SQL Codewith joinproc sql;create table servcd fitm cnts asselect a.servcd,b.servcd descrip,count(*)as fitm cntfrom fitm servcdaleft joinservice codesbon a.servcd b.servcdgroup by a.servcd, b.servcd descriporder by 1, 2;quit;33

Select desired observationsusing a Data Step%let startdt sas '01apr2012'd;%let enddt sas '31mar2013'd;data data centres 2;set data centres;where efctvdt &enddt sasand cncldt &startdt sasand dt cntr status in ('D', 'P')and dt cntr typein ('C', 'P')and not ( ' ' dt cntr nm ' ' like '% HOLDINGS %'' ' dt cntr nm ' ' like '% HOSP%'' ' dt cntr nm ' ' like '%SYS%' );run;oror

Anything you can do (well, not anything, but this thing )proc sql;create table data centres with flags asselect efctvdt,cncldt,dt cntr status,dt cntr type,case when (efctvdt &enddt sas) or(cncldt &startdt sas)then '1. Outside date range'when (dt cntr status not in ('D', 'P'))then '2. Status not D or P'when (dt cntr type not in ('C', 'P'))then '3. Type not C or P'when (' ' dt cntr nm ' ' like '% HOLDINGS %'or ' ' dt cntr nm ' ' like '% HOSP%'or ' ' dt cntr nm ' ' like '%SYS%' )then '4. Computing type'else ' 'endas error typefrom data centres;quit;

Informative reportproc freq data data centres with flags;tables error type / missing;run;

Getting the goods, either wayproc sql;create table data centres 2 asselect *from data centres with flagswhere error type is null;quit;ordata data centres 2;set data centres with flags;where error type is null;run;

Distinct keywordIf “distinct” appears as in “select distinct”, it applies to allselected columns, and is basically the same as using PROCSORT with NODUP. e.g.select distinct provider, patient,service dateDistinct can also appear within a count summary function. e.g.count(distinct provider) as uniq prac cnt,count(*)as record cnt,count(provider)as cnt recs w provider

Demonstration of calculatedproc sql;create table attached w age range asselect *,floor(yrdif(datepart(birth date),'31mar2013'd, 'AGE'))as age,5 * (floor(calculated age/5)) as age temp,case when (calculated age) 0 then '000'else put(calculated age temp, z3.) '-' put(calculated age temp 4, z3.)end as age rangefrom attached 2012 2013order by res at yr end;quit;

Some sample data data specialty claims;infile cards4;input specialty clntage paidamt;cards4;00 5500000 10 1000000 20 1000000 30 1000000 40 1500000 50 2500000 60 3500000 70 5500000 80 7500000 90 8500001 10 1500001 20 15000;;;;run;

Get percent costs for patients (clients)aged 65 or overproc sql;create table pct over 65 asselect specialty,sum(paidamt)as paidamt,sum(case when clntage 65 then paidamtelse 0 end) as paidamt ge65,(calculated paidamt ge65) / (calculated paidamt)as pct paid over 65format percent7.1from specialty claimsgroup by specialty;quit;

Results – percent costs for patients aged 65 or over

Practitioner, service code datadata prac servcd;infile cards4;input pracnum servcd paidamt;cards4;00001 01 5000000001 12 1000000001 91 2000000002 01 4500000002 90 800000003 01 6000000003 12 500000003 92 30000;;;;run;

Flag practitioners with all threeproc sql;create table prac 3 asselect pracnum,sum(case when servcd 01 then paidamt else 0 end)as paidamt 01,sum(case when servcd 12 then paidamt else 0 end)as paidamt 12,sum(case when servcd 89 then paidamt else 0 end)as paidamt 89 plus,case when (calculated paidamt 01) 0 and(calculated paidamt 12) 0 and(calculated paidamt 89 plus) 0then 1else 0 endas all 3from prac servcdgroup by pracnum;quit;

Practitioners with flag for all 3

Subset to those with all 3proc sql;create table prac all 3 asselect *from(select pracnum,sum(case when servcd 01 then paidamt else 0 end)as paidamt 01,sum(case when servcd 12 then paidamt else 0 end)as paidamt 12,sum(case when servcd 89 then paidamt else 0 end)as paidamt 89 plus,case when (calculated paidamt 01) 0 and(calculated paidamt 12) 0 and(calculated paidamt 89 plus) 0then 1else 0 endas all 3from prac servcdgroup by pracnum)where all 3;quit;

Subset to Practitioners with flag for all 3

Select Values into Macro Variableproc sql noprint;select count(distinct pracnum) into :prac cntfrom pracds;quit;%put prac cnt &prac cnt;Results (In Log):prac cnt 2479348

Select Values into Macro Variablesproc sql noprint;select sum(popn), count(*)into :pop lha61, :rec cnt lha61from people datawhere lfsclyr 20042005and clntlha 61;quit;%put pop lha61 &pop lha61;%put rec cnt lha61 &rec cnt lha61;Results (In Log):pop lha61 208372rec cnt lha61 4049

Aside: Formatting Macro Variables forFootnotes%let pop fmtd %sysfunc(putn(&pop lha61, comma9.));footnote1 j l “Note: LHA 61 had population of&pop fmtd in 2004/2005”;Resulting footnote:Note: LHA 61 had population of 208,372 in 2004/200550

Select list into macro variableproc sql noprint;select distinct servcd into :servcd listseparated by ", "from prac servcd;quit;%put &servcd list;Results (In Log):1, 12, 90, 91, 92

Many to Many JoinsAre Possible with Proc SQLproc sql stimer;create table uniq pracs asselect distinct pracnum from prac servcd;create table uniq servcd asselect distinct servcd from prac servcd;create table prac servcd combos asselect a.pracnum,b.servcdfrom uniq pracsa,uniq servcdb;quit;52

All possible combinations of practitioner andservice code

Proc SQL instead of Proc PrintProc SQL is a handy alternative to Proc Print when youwant to get a quick report.Simply leave out the bit “create tableresult table as”, and start with select You might find it handy to use a TITLE statement before the SELECT statement.proc sql;title "List of all subjects";select distinct subjectfrom gradesorder by subject;quit;

Proc SQL, e.g. Can create multiple datasets in one step Can easily join multiple datasets –each left, right, or full outer join in a Proc SQL query can join only two datasets at a time (although inner joins without the join keyword can bring together any number)File Size: 677KBPage Count: 55

Related Documents:

Agenda PROC SQL VS. DATA STEP PROCESSING Comparison of DATA Step and PROC SQL capabilities Joining SAS data using the DATA Step and PROC SQL Data Step and SQL Output Additional Comparisons Additional Resources SQL Data Step VS

May 02, 2018 · D. Program Evaluation ͟The organization has provided a description of the framework for how each program will be evaluated. The framework should include all the elements below: ͟The evaluation methods are cost-effective for the organization ͟Quantitative and qualitative data is being collected (at Basics tier, data collection must have begun)

Silat is a combative art of self-defense and survival rooted from Matay archipelago. It was traced at thé early of Langkasuka Kingdom (2nd century CE) till thé reign of Melaka (Malaysia) Sultanate era (13th century). Silat has now evolved to become part of social culture and tradition with thé appearance of a fine physical and spiritual .

2. proc sql statement 1 ; 3. proc sql statement 2 ; 4. quit; /* required */ Lines 2, 3: all of the examples in the e-Guide are expected to be embedded between lines 1 and 4. SAS accepts one or more PROC SQL statements within each PROC SQL block of code. Eight common benefits for using PROC SQL:

USING PROC SQL As stated above, PROC SQL is SAS's implementation of the Structured Query Language. This procedure takes on the following general form: PROC SQL; sql-statement; QUIT; The sql-statement referenced above can be any SQL clause (e.g., ALTER, CREATE, DELETE, DESCRIBE, DROP, INSERT, SELECT, UPDATE, or VALIDATE).

On an exceptional basis, Member States may request UNESCO to provide thé candidates with access to thé platform so they can complète thé form by themselves. Thèse requests must be addressed to esd rize unesco. or by 15 A ril 2021 UNESCO will provide thé nomineewith accessto thé platform via their émail address.

̶The leading indicator of employee engagement is based on the quality of the relationship between employee and supervisor Empower your managers! ̶Help them understand the impact on the organization ̶Share important changes, plan options, tasks, and deadlines ̶Provide key messages and talking points ̶Prepare them to answer employee questions

Dr. Sunita Bharatwal** Dr. Pawan Garga*** Abstract Customer satisfaction is derived from thè functionalities and values, a product or Service can provide. The current study aims to segregate thè dimensions of ordine Service quality and gather insights on its impact on web shopping. The trends of purchases have