Bayesian Inductive Logic Programming

Stephen Muggleton
Oxford University Computing Laboratory
Wolfson Building, Parks Road
Oxford, OX1 3QD

Abstract

Inductive Logic Programming (ILP) involves the construction of first-order definite clause theories from examples and background knowledge. Unlike both traditional Machine Learning and Computational Learning Theory, ILP is based on lockstep development of Theory, Implementations and Applications. ILP systems have successful applications in the learning of structure-activity rules for drug design, semantic grammar rules, finite-element mesh design rules and rules for prediction of protein structure and mutagenic molecules. The strong applications in ILP can be contrasted with relatively weak PAC-learning results (even highly restricted forms of logic programs are known to be prediction-hard). It has recently been argued that the mismatch is due to distributional assumptions made in application domains. These assumptions can be modelled as a Bayesian prior probability representing subjective degrees of belief. Other authors have argued for the use of Bayesian prior distributions for reasons different to those here, though this has not led to a new model of polynomial-time learnability. The use of prior distributions over time-bounded hypotheses in PAC leads to a new model called U-learnability. It is argued that U-learnability is more appropriate than PAC for Universal (Turing-computable) languages. Time-bounded logic programs have been shown to be polynomially U-learnable under certain distributions. The use of time-bounded hypotheses enforces decidability and allows a unified characterisation of speed-up learning and inductive learning. U-learnability has as special cases PAC and Natarajan's model of speed-up learning.

1 INTRODUCTION

This paper is written for two separate audiences: 1) a Machine Learning audience, interested in implementations and applications, and 2) a Computational Learning Theory audience, interested in models and results in learnability. The paper should thus be taken as two loosely connected papers with a common theme involving the use of distributional information in Inductive Logic Programming (ILP).

In the last few years ILP [23, 24, 20, 31] has developed from a theoretical backwater to a rapidly growing area in Machine Learning. This is in part due to the fact that pure logic programs provide Machine Learning with a representation which is general purpose (Turing-computable) and has a simple, rigorously defined semantics. In addition, logic programs tend to be easy to comprehend as hypotheses in scientific domains.

ILP is based on interaction between Theory, Implementations and Applications. The author has argued [25] that the development of a formal semantics of ILP (see Section 2) allows the direct derivation of ILP algorithms. Such derivational techniques have been at the centre of specific-to-general ILP techniques such as Plotkin's least general generalisation [37, 36], Muggleton and Buntine's inverse resolution [27, 23] and Idestam-Almquist's inverse implication [15]. Two contending semantics of ILP have been developed [31]. These are the so-called 'open-world' semantics [31] and 'closed-world' semantics [13].

A number of efficient ILP systems have been developed, including FOIL [38], Golem [28], LINUS [21], CLAUDIEN [39] and Progol [26]. The use of a relational logic formalism has allowed successful application of ILP systems in a number of domains in which the concepts to be learned cannot easily be described in an attribute-value, propositional-level language. These applications (see Section 3) include structure-activity prediction for drug design [19, 44, 43], protein secondary-structure prediction [29], finite-element mesh design [9] and learning semantic grammars [45].

A variety of positive and negative PAC-learnability results exist for subsets of definite clause logic [11, 34, 10, 18, 7, 40]. However, in contrast to the experimentally demonstrated abilities of ILP systems in applications, the positive PAC results are rather weak, and even highly restricted forms of logic programs have been shown to be prediction-hard [7].

Like many other results in PAC-learning, positive results are only achieved for ILP by setting language parameters to constant values (e.g. k clauses and l literals). This provides 'hard' boundaries to the hypothesis space, which often reduces the representation to a propositional level. An alternative model, U-learnability [30], is better suited to Universal (Turing-computable) representations, such as logic programs. U-learnability provides a 'soft' boundary to hypothesis spaces in the form of a prior probability distribution over the complete representation (e.g. time-bounded logic programs). Other authors have argued for the use of Bayesian prior distributions for reasons different to those here, though this has not led to a new model of polynomial-time learnability. U-learnability allows standard distributional assumptions, such as Occam's razor, to be used in a natural manner, without parameter settings. Although Occam algorithms have been studied in PAC-learning [3], this has not led to algorithms capable of learning in universal representations, as it has in Inductive Logic Programming.

In the early 1980s a distributional assumption very different to Occam's razor was implicitly implemented by Arbab and Michie [1] and by Bratko [4] in an algorithm for constructing decision trees. The algorithm constructs a decision list (linear decision tree) where one exists, and otherwise returns the most linear decision tree which can be constructed from the data. This distributional assumption is based on the fact that decision lists are generally easier to comprehend than arbitrary decision trees. Other interesting kinds of distributions along these lines can be imagined: assigning higher prior probabilities to grammars that are regular or almost so, logic programs that are deterministic or almost so, and logic programs that run in linear time or almost so. One can imagine very different kinds of prior distributions on hypotheses. For example, when learning concepts which mimic human performance in skill tasks [41], predictive accuracy depends on hypotheses being evaluable in time similar to that of human reaction, and so such hypotheses should be preferred a priori.

Bayesian approaches to Machine Learning have been discussed previously in the literature. For instance, Buntine [6] develops a Bayesian framework for learning, which he applies to the problem of learning class probability trees. Haussler et al. [12] analyse the sample complexity of two Bayesian algorithms; this analysis focuses on average-case, rather than worst-case, accuracy. U-learnability, on the other hand, uses average-case time complexity and worst-case accuracy. Also, unlike U-learnability (Section 4), neither Buntine nor Haussler et al. develop a new learning model which allows significantly larger polynomially-learnable representations, such as logic programs. The applications in the following section show that ILP systems are effective at learning logic programs in significant real-world domains. This is consistent with the fact that time-bounded logic programs are U-learnable under certain distributions (see Section 4).

This paper is organised as follows. Section 2 gives a formal definition of ILP in terms of Bayesian inference. Section 3 provides an overview of ILP applications in molecular biology and discusses some of the distributional assumptions used in these applications. Section 4 defines U-learnability and describes some general results on U-learnable distributions.

2 BAYES' ILP DEFINITIONS

Familiarity with standard definitions from Logic Programming [22, 14] is assumed in the following. A Bayesian version of the usual (open-world semantics) setting for ILP is as follows. Suppose P, F and V are sets of predicate symbols, function symbols and variables, and Cd and Ch are the classes of definite and Horn clauses constructed from P, F and V. The symbol ⊢n denotes SLDNF derivation bounded by n resolution steps. An ILP learner is provided with background knowledge B ⊆ Cd and examples E = (E+, E−), in which E+ ⊆ Cd are positive examples and E− ⊆ (Ch − Cd) are negative examples. Each hypothesis is a pair Hn = (H, n), where H ⊆ Cd, H ⊇ B and n is a natural number. Hn is said to hold for examples E, or holds(Hn, E), when both the following are true¹:

  Sufficiency:    H ⊢n E+
  Satisfiability: H ∧ E− ⊬n □  (the empty clause is not derivable from H ∧ E− within n resolution steps)

¹Though most ILP systems can deal with noise, this is ignored in the above definition for the sake of simplicity.

The prior probability p(Hn) of an hypothesis Hn is defined by a given distribution D. According to Bayes' theorem, the posterior probability is

  p(Hn | E) = p(E | Hn) p(Hn) / p(E)

where p(E | Hn) is 1 when holds(Hn, E) and 0 otherwise, and p(E) is the sum of p(H'n) over all hypotheses H'n for which holds(H'n, E).

Well-known strategies for making use of distributional information include: a) maximisation of the posterior probability; b) class prediction based on the posterior-weighted sum of the predictions of all hypotheses (the Bayes-optimal strategy); and c) randomly sampling an hypothesis from the posterior probability (the Gibbs algorithm). The sample complexities of strategies b) and c) are analysed in [12].

Although no existing ILP system takes an hypothesis distribution D as an input parameter, such distributions are implicit in FOIL's [38] information-gain measure and Golem's [32] compression measure. Additionally, search strategies, such as general-to-specific or specific-to-general, use an implicit prior distribution which favours more general hypotheses or, conversely, more specific ones.
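The setting above can be made concrete with a small sketch (illustrative only, not from the paper). Hypotheses are treated as opaque values carrying a prior; holds(H, E) is supplied by the caller (in ILP it would test sufficiency and satisfiability with a bounded prover), and classify(H, x) returns the class H assigns to an example; all names are assumptions of the sketch. Strategies a)-c) are then direct transcriptions of the definitions.

```python
import random

def posterior(hypotheses, prior, holds, examples):
    """Posterior p(H|E) over a finite hypothesis set, using the 0/1
    likelihood of Section 2: p(E|H) = 1 if holds(H, E), else 0."""
    weights = {h: (prior[h] if holds(h, examples) else 0.0) for h in hypotheses}
    z = sum(weights.values())  # p(E): total prior mass of hypotheses that hold
    if z == 0.0:
        raise ValueError("no hypothesis in the set holds for the examples")
    return {h: w / z for h, w in weights.items()}

def map_hypothesis(post):
    """Strategy a): maximisation of posterior probability."""
    return max(post, key=post.get)

def bayes_optimal_predict(post, classify, x):
    """Strategy b): posterior-weighted vote of all hypotheses on example x."""
    p_pos = sum(p for h, p in post.items() if classify(h, x) == 1)
    return 1 if p_pos >= 0.5 else 0

def gibbs_predict(post, classify, x, rng=random):
    """Strategy c): sample a single hypothesis from the posterior and use it."""
    hs, ps = zip(*post.items())
    return classify(rng.choices(hs, weights=ps, k=1)[0], x)
```

In a real ILP system the hypotheses would range over (clause set, depth bound) pairs and the space would be searched rather than enumerated; the sketch only fixes how the prior and the 0/1 likelihood combine.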

3 APPLICATIONS IN MOLECULAR BIOLOGY

ILP applications in Molecular Biology extend a long tradition in approaches to scientific discovery that was initiated by Meta-Dendral's [5] application in mass spectrometry. Molecular biology domains are particularly appropriate for ILP due to the rich relational structure of the data. Such relationships make it hard to code feature-based representations for these domains. ILP's applications to protein secondary structure prediction [29], structure-activity prediction of drugs [19] and mutagenicity of molecules [42] have produced new knowledge, published in respected scientific journals in their area.

3.1 PROTEIN SECONDARY STRUCTURE PREDICTION

When a protein (amino acid sequence) shears from the RNA that codes it, the molecule convolves in a complex though deterministic manner. However, it is largely the final shape, known as secondary and tertiary structure, which determines the protein's function. Determination of the shape is a long and labour-intensive procedure involving X-ray crystallography. On the other hand, the complete amino acid sequence is relatively easy to determine and is known for a large number of naturally occurring proteins. It is therefore of great interest to be able to predict a protein's secondary and tertiary structure directly from its amino acid sequence.

Many attempts at secondary structure prediction have been made using hand-coded rules, statistical and pattern-matching algorithms and neural networks. However, the best results for the overall prediction rate are still disappointing (65-70%). In [29] the ILP system Golem was applied to learning secondary structure prediction rules for the sub-domain of α/α proteins. The input to the program consisted of 12 non-homologous proteins (1612 amino acid residues) of known secondary structure, together with background knowledge describing the chemical and physical properties of the naturally occurring amino acids. Golem learned a set of 21 rules that predict which amino acids are part of the α-helix, based on their positional relationships and chemical and physical properties. The rules were tested on four independent non-homologous proteins (416 amino acid residues), giving an accuracy of 81% (with 2% expected error). The best previously reported result in the literature was 76%, achieved using a neural network approach. Although higher accuracy is possible on sub-domains such as α/α than in the general domain, ILP has a clear advantage in terms of comprehensibility over neural network and statistical techniques.

However, according to the domain expert, Mike Sternberg, Golem's secondary-structure prediction rules were hard to comprehend, since they tended to be overly specific and highly complex. In many scientific domains, comprehensibility of new knowledge has higher priority than accuracy. Such a priori preferences can be modelled by a prior distribution on hypotheses, as sketched below.
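One toy way to write such a preference down (an illustrative sketch, not taken from the paper) is a normalised prior in which shorter theories receive exponentially more mass and theories with a preferred structural property, such as determinacy, receive a further constant factor. The size and determinacy measures below are hypothetical placeholders for whatever syntactic measures a particular ILP system uses.

```python
import math

def comprehensibility_prior(theories, size, is_determinate, bonus=2.0):
    """Normalised prior assigning mass proportional to 2**(-size(t)),
    multiplied by `bonus` when the theory is determinate (an Occam-style
    preference for short, easily comprehended theories)."""
    mass = {t: (bonus if is_determinate(t) else 1.0) * math.pow(2.0, -size(t))
            for t in theories}
    z = sum(mass.values())
    return {t: m / z for t, m in mass.items()}

# Toy usage: theories described only by (number of literals, determinate?) pairs.
toy = {"t1": (3, True), "t2": (7, True), "t3": (3, False)}
prior = comprehensibility_prior(toy, size=lambda t: toy[t][0],
                                is_determinate=lambda t: toy[t][1])
```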
3.2 DRUG STRUCTURE-ACTIVITY PREDICTION

Proteins are large macro-molecules involving thousands of atoms. Drugs tend to be small, rigid molecules involving tens of atoms. Drugs work by stimulating or inhibiting the activity of proteins. They do so by binding to specific sites on the protein, often in competition with natural regulatory molecules such as hormones. The 3-dimensional structure of the protein binding site is not known in over 90% of all drug development projects. However, the in vitro activity of members of a family of drugs can be determined experimentally in the laboratory. This data is usually analysed by drug designers using statistical techniques, such as regression, to predict the activity of untested members of the family of drugs. This in turn leads to directed testing of the drugs predicted to have high activity.

An application of ILP to structure-activity prediction was reported in [19]. The ILP program Golem [28] was applied to the problem of modelling the structure-activity relationships of trimethoprim analogues binding to dihydrofolate reductase. The training data consisted of 44 trimethoprim analogues and their observed inhibition of Escherichia coli dihydrofolate reductase. A further 11 compounds were used as unseen test data. Golem obtained rules that were statistically more accurate on the training data, and also better on the test data, than a previously published linear regression model. Figure 1 illustrates the family of analogues used in the study.

Figure 1: The family of analogues in the first ILP study. A) Template of 2,4-diamino-5-(substituted-benzyl)pyrimidines; R3, R4 and R5 are the three possible substitution positions. B) An example compound with substituents at positions 3, 4 and 5.

Owing to the lack of variation in the molecules, a highly deterministic subset of the class of logic programs was sufficient to learn in this domain. Unlike the protein domain, the domain expert found the rules to be highly compact and thus easy to comprehend.

3.3 MUTAGENESIS PREDICTION

In a third molecular biology study, ILP has been applied to predicting the mutagenicity of molecules [42]. Mutagenicity is highly correlated with carcinogenicity. Unlike the activity of a drug, mutagenicity is a property which pharmaceutical companies would like to avoid. Also, unlike drug development projects, the data on mutagenicity forms a highly heterogeneous set with large variations in molecular form, shape and charge distribution. The data comes from a wide number of sources, and the molecules involved are almost arbitrary in shape and bond connectivity (see Figure 2).

Figure 2: Examples of the diverse set of aromatic and heteroaromatic nitro compounds used in the mutagenesis study. A) 3,4,4'-trinitrobiphenyl B) 2-nitro-1,3,7,8-tetrachlorodibenzo-1,4-dioxin C) 1,6-dinitro-9,10,11,12-tetrahydrobenzo[e]pyrene D) nitrofurantoin

The 2D bond-and-atom molecular descriptions of 229 aromatic and heteroaromatic nitro compounds taken from [8] were given to the ILP system Progol [26]. The study was confined to the problem of obtaining structural descriptions that discriminate molecules with positive mutagenicity from those which have zero or negative mutagenicity.

A set of 8 compact rules was discovered by Progol. These rules suggested 3 previously unknown features leading to mutagenicity. Figure 3 shows the structural properties described by the rules in the Progol theory. The descriptions of these patterns are as follows. 1) Almost half of all mutagenic molecules contain 3 fused benzyl rings. Mutagenicity is enhanced if such a molecule also contains a pair of atoms connected by a single bond, where one of the atoms is connected to a third atom by an aromatic bond. 2) Another strong mutagenic indicator is 2 pairs of atoms connected by a single bond, where the 2 pairs are connected by an aromatic bond as shown. 3) The third mutagenic indicator discovered by Progol was an aromatic carbon with a partial charge of 0.191 in a molecule that has three NO2 groups substituted onto a biphenyl template (two benzene rings connected by a single bond).

Figure 3: Structural properties discovered by Progol

An interesting feature is that in the original regression analysis of the molecules by [8], a special 'indicator variable' was provided to flag the existence of three or more fused rings (this variable was then seen to play a vital role in the regression equation). The first rule (pattern 1 in Figure 3), expressing precisely this fact, was discovered by Progol without access to any specialist chemical knowledge. Further, the regression analysis suggested that electron-attracting elements conjugated with nitro groups enhance mutagenicity. A particular form of this pattern was also discovered by Progol (pattern 3 in Figure 3).

Previously published results for the same data [8] used linear discriminant techniques and hand-crafted features. The ILP technique allowed prediction of 49 cases which were statistical outliers, not predictable by the linear discriminant algorithm. In addition, the ILP rules are small and simple enough to provide insight into the molecular basis of mutagenicity. Such a low-level representation opens up a range of arbitrary chemical structures for analysis. This is of particular interest for drug discovery, since all compounds within a database of compounds have known atom and bond descriptions, while very few have computed features available.

Owing to the large range of molecules, the rules were highly non-deterministic (of the form "there exists somewhere in molecule M a structure S with property P"); a sketch of such an existential rule over a bond-and-atom representation is given below. Owing to their simplicity, the mutagenicity rules were the most highly favoured by the expert among the Molecular Biology domains studied. For all the domains in Molecular Biology, comprehensibility, rather than accuracy, provides the strongest incentive for an Occam distribution (i.e. shorter theories preferred a priori).
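As an illustration only, the following fragment shows what a bond-and-atom description and an existential rule in the spirit of pattern 2 might look like. The encoding is a hypothetical stand-in, not Progol's actual input format, and the reading of the pattern is deliberately loose, since the precise structural constraint is given by the figure.

```python
from itertools import permutations

# Hypothetical encoding: atoms maps atom id -> (element, partial_charge);
# bonds maps a frozenset of two atom ids -> bond type ("single", "aromatic", ...).
def bonded(bonds, a, b, bond_type):
    return bonds.get(frozenset((a, b))) == bond_type

def has_pattern2(atoms, bonds):
    """Existential rule of the form 'there exists somewhere in molecule M a
    structure S': two atom pairs, each joined by a single bond, with some
    aromatic bond joining the two pairs (a loose reading of pattern 2)."""
    ids = list(atoms)
    for a, b in permutations(ids, 2):
        if not bonded(bonds, a, b, "single"):
            continue
        for c, d in permutations(ids, 2):
            if len({a, b, c, d}) < 4 or not bonded(bonds, c, d, "single"):
                continue
            if any(bonded(bonds, x, y, "aromatic") for x in (a, b) for y in (c, d)):
                return True
    return False
```

The point of the sketch is only the quantifier structure: the rule succeeds if the substructure occurs anywhere in the molecule, which is why such rules are non-deterministic in the logic-programming sense.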

4 U-LEARNABILITY

The learnability model defined in this section allows for the incorporation of a priori distributions over hypotheses. The definition of this model, U-learnability, is motivated by the general problems associated with learning Universal (Turing-computable) representations, such as logic programs. Nevertheless, U-learnability can easily be applied to other representations, such as decision trees. U-learnability is defined using the traditional notation from computational learning theory rather than that typically used in ILP (see Section 2).
4.1 BACKGROUND FOR U-LEARNING

Let (ΣE, X, ΣC, R, c) be the representation of a learning problem (as in PAC-learning), where X is the domain of examples (finite strings over the alphabet ΣE), R is the class of concept representations (finite strings over the alphabet ΣC), and c : R → 2^X maps concept representations to concepts (subsets of X). Hence the concept class being learned is the actual range of c (a subset of 2^X). Let P be a subset of the class of polynomial functions in one variable. From this point on, a concept representation will be a pair (r, p), for r ∈ R and p ∈ P.

Let A(r, p, x) be an algorithm that maps any tuple (r, p, x), for r ∈ R, p ∈ P and x ∈ X, to 0 or 1. Let A run in time bounded by q(|r|, p(|x|)), where q is a polynomial in two variables. Furthermore, let A have the following properties. First, for all r ∈ R and x ∈ X, if x ∉ c(r) then for every p ∈ P: A(r, p, x) = 0. Second, for all r ∈ R and x ∈ X, if x ∈ c(r) then there exists p ∈ P such that for every p' ∈ P with p'(|x|) ≥ p(|x|): A(r, p', x) = 1. Intuitively, p specifies some form of time bound that is polynomially related to the size of the example. It might specify the maximum derivation length (number of resolution steps) for logic programs, or the maximum number of steps in a Turing Machine computation. Greater derivation lengths or more computation steps are allowed for larger inputs. A sketch of such a bounded evaluation algorithm follows.
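The role of A and of the bound p can be sketched as follows (an illustrative sketch, not from the paper). The actual bounded prover is passed in as a callback, since the paper fixes only A's input/output behaviour and the way the budget grows with example size.

```python
def make_evaluator(prove_within):
    """Build A(r, p, x) -> {0, 1} in the style of Section 4.1.

    prove_within(r, x, budget) is a stand-in for a bounded prover: it should
    succeed only if r derives x within `budget` steps (e.g. SLDNF resolution
    steps) and should never succeed on an example outside c(r)."""
    def A(r, p, x):
        budget = p(len(x))  # larger examples are allowed more derivation steps
        return 1 if prove_within(r, x, budget) else 0
    return A

def linear_bound(c):
    """A bound of the linear form p(x) = c*x, as used in Section 4.4."""
    return lambda size: c * size
```

4.2 PROTOCOL FOR U-LEARNING

Let F be any family of probability distributions over R × P, where the distributions in F have an associated parameter n ≥ 0. Let G be any family of probability distributions over X. In U-learning, a Teacher randomly chooses a pair (r, p), for r ∈ R and p ∈ P, according to some distribution D1 in F. The Teacher presents to a learning algorithm L an infinite stream of labelled examples (x1, l1), (x2, l2), ..., where each example xi is drawn randomly, independently of the preceding examples, from X according to some fixed, unknown distribution D2 in G, and is labelled by li = A(r, p, xi). After each example xi in the stream of examples, L outputs a hypothesis Hi = (ri, pi), where ri ∈ R and pi ∈ P.

4.3 DEFINITION OF U-LEARNABILITY

Definition 1 (Polynomial U-learnability). Let F be a family of distributions (with associated parameters) over R × P, and let G be a family of distributions over X. The pair (F, G) is polynomially U-learnable just if there exist polynomial functions p1(y) = y^c, for some constant c, p2(y) and p3(y1, y2), and a learning algorithm L, such that for every distribution D1 (with parameter n) in F, and every distribution D2 in G, the following hold.

1. The average-case time complexity of L at any point in a run is bounded by p1(M), where M is the sum of q(|r|, p(|xi|)) over the examples xi seen to that point.² (Recall that (r, p) is the target concept chosen randomly according to D1, and that q describes the time complexity of the evaluation algorithm A.)

2. For all m ≥ p2(n), there exist values ε and δ, 0 < ε, δ < 1, such that p3(1/ε, 1/δ) ≤ m, and with probability at least 1 − δ the hypothesis Hm = (rm, pm) is (1 − ε)-accurate. By (1 − ε)-accurate, we mean that the probability (according to D2) that A(rm, pm, xm+1) ≠ lm+1 is less than ε.

As with PAC-learning, we can also consider a prediction variant, in which the hypotheses Hm need not be built from R (in addition to P), but can come from a different class R' with a different associated evaluation algorithm A'.

²To be more precise, let m be any positive integer. Let PrD1(r, p) be the probability assigned to (r, p) by D1, and (with a slight abuse of notation) let PrD1,D2(r, p, (x1, ..., xm)) be the product of PrD1(r, p) and the probability of obtaining the sequence (x1, ..., xm) when drawing m examples randomly and independently according to D2. Let TIMEL(r, p, (x1, ..., xm)) be the time spent by L until output of its hypothesis Hm, when the target is (r, p) and the initial sequence of m examples is (x1, ..., xm). Then we say that the average-case time complexity of L to output Hm is bounded by p1(y) = y^c just if the sum (or integral), over all tuples (r, p, (x1, ..., xm)), of PrD1,D2(r, p, (x1, ..., xm)) · TIMEL(r, p, (x1, ..., xm)) / p1(M) is less than infinity, where M is the sum of q(|r|, p(|xi|)) over all xi in x1, ..., xm. See [2] for the motivation of this definition of average-case complexity.

The protocol and condition 2 of Definition 1 can be rendered as a toy simulation (illustrative only; D1, D2, the learner and A are all supplied by the caller and are assumptions of the sketch):

```python
def u_learning_run(draw_target, draw_example, A, learner, m, trials=1000):
    """Run the Section 4.2 protocol for m examples, then empirically estimate
    the error of the m-th hypothesis (condition 2 of Definition 1).

    draw_target()         -- samples a target (r, p) from D1
    draw_example()        -- samples an example x from D2
    A(r, p, x)            -- evaluation algorithm, returning 0 or 1
    learner.observe(x, l) -- consumes a labelled example, returns current (r_i, p_i)
    """
    r, p = draw_target()
    hypothesis = None
    for _ in range(m):
        x = draw_example()
        hypothesis = learner.observe(x, A(r, p, x))
    rm, pm = hypothesis
    # Estimate Pr_D2[A(rm, pm, x) != A(r, p, x)]; U-learnability asks that this
    # be below epsilon with probability at least 1 - delta over random runs.
    errors = 0
    for _ in range(trials):
        x = draw_example()
        errors += int(A(rm, pm, x) != A(r, p, x))
    return hypothesis, errors / trials
```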

4.4 INITIAL RESULTS AND PROPOSED DIRECTIONS

This section presents initial results about polynomial U-learnability and suggests directions for further work. To conserve space the proofs are omitted, but they appear in a full paper on U-learnability [30]. In each of these results, let (ΣE, X, ΣC, R, c) be the representation of any given learning problem. The following theorem shows that PAC-learnability is a special case of U-learnability.

Theorem 2 (U-learnability generalises PAC). Let p be the polynomial function p(x) = x. Let F be a family of distributions that contains one distribution Di for each ri ∈ R, such that Di assigns probability 1 to (ri, p). Let n = 0 be the parameter for each member of F. Let G be the family of all distributions over X. Then the representation (ΣE, X, ΣC, R, c) is PAC-learnable (PAC-predictable) if and only if (F, G) is polynomially U-learnable (U-predictable).

This observation raises the question of how, exactly, polynomial U-learnability differs from PAC-learnability. There is a cosmetic difference, in that polynomial U-learnability has been defined to look more similar to identification in the limit than does the definition of PAC-learnability. Ignoring this leaves the following distinguishing features of polynomial U-learnability.

- Assumption of a prior distribution on target concept representations.

- Sample size is related to ε, δ and (by means of the parameter n) the reciprocal of the probability of the target representation, rather than to the target representation size.

- Use of average-case time complexity, relative to the prior distribution on hypotheses and the distribution on examples, in place of worst-case time complexity.

- Concept representations need not be polynomially evaluable in the traditional sense, but only in the weaker sense that some polynomial p must exist for each concept representation such that the classification of an example x can be computed in time p(|x|). This distinction and the next are irrelevant for concept representations that are already polynomially evaluable and hence do not need time bounds.

- The learning algorithm runs in (expected) time polynomial in the (worst-case) time required to compute the classification, by the target, of the examples seen thus far, rather than in the sum of the sizes of the examples seen thus far.

Definition 3 (The distribution family DP). Let R be countable, let p'(y) be a polynomial function, and let Ep' = (E1, E2, ...) be an enumeration of subsets of R such that: (1) for all i ≥ 1, the cardinality of Ei is at most p'(i); (2) every r ∈ R is in some Ei; and (3) the members of Ei, for each i, can be determined efficiently (by an algorithm that runs in time polynomial in the sum of the sizes of the members of Ei). Let F' be the family of all distributions D', with finite variance, over the positive integers, and let the parameter n' of D' be the maximum of the mean and standard deviation of D'. For each such distribution D' in F', define a corresponding distribution D'' over R as follows: for each i ≥ 1, let the probabilities of the r ∈ Ei be uniformly distributed with sum PrD'(i). Let the parameter associated with D'' also be n'. Let P be the set of all linear functions of the form p(x) = cx, where c is a positive integer. Let Dc be any probability distribution over the positive integers that has a probability density function (p.d.f.) bounded above by k^(−x) for all integers x ≥ 1 and some k ≥ 3 (appropriately normalised so that it sums to 1). For each such distribution Dc, define a distribution DL over P to assign to the function p(x) = cx the probability assigned to c by Dc. Finally, for each pair of distributions D'' and DL as defined above, define a distribution D over R × P as follows: for each pair (r, p), where r ∈ R and p ∈ P, PrD(r, p) = PrD''(r) · PrDL(p). Let the parameter n of D be n'. DP is the family of all such distributions D, each with associated parameter n.

As an example, time-bounded logic programs are U-learnable under DP. Let R be the set of all logic programs that can be built from a given alphabet of predicate symbols P, function symbols F, and variables V. Notice that R is countable. Let P be the set of all bounds p(x) on derivation length of the form p(x) = cx, where c is a positive integer and x is the size of (number of terms in) the goal. Let the domain X of examples be the ground atomic subset of Cd, which is the set of definite clauses that can be built from the given alphabet. Notice also that a concept (r, p) classifies as positive just the definite clauses d ∈ X such that r ⊢p(|d|) d, where |d| is the number of terms in d. Let F be the family of distributions DP built using some particular enumeration of polynomially-growing subsets of R, and let G be the family of all distributions over examples. From Theorem 4, it follows that (F, G) is polynomially U-learnable.

Definition 6 (The distribution family DE). Let R be countable, let k be a positive integer, and let E = (E1, E2, ...) be an enumeration of subsets of R such that: (1) for all i ≥ 1, the cardinality of Ei is at most k^i; (2) every r ∈ R is in some Ei; and (3) the members of Ei, for each i, can be determined efficiently (by an algorithm that runs in time polynomial in the sum of the sizes of the members of Ei). Let D'k be the discrete exponential distribution over the positive integers, with probability density function (p.d.f.) PrD'k(x) proportional to k^(−x) for all integers x ≥ 1 (appropriately normalised so that it sums to 1), and let the parameter n' of D'k be its mean. From D'k, define a corresponding distribution Dk over R as follows: for each i ≥ 1, let the probabilities of the r ∈ Ei be uniformly distributed with sum PrD'k(i). Let the parameter associated with Dk also be n'. Let P be the set of all linear functions of the form p(x) = cx, where c is a positive integer. Let the distributions DL over P be as defined earlier (see the definition of DP). Finally, for the distribution Dk (with parameter n') and each distribution DL, define a distribution D over R × P as follows: for each pair (r, p), for r ∈ R and p ∈ P, PrD(r, p) = PrDk(r) · PrDL(p). Let the parameter n of D be the parameter n' of Dk. DE is the family of distributions consisting of each such distribution D.

Theorem 7 (Polynomial U-learnability under DE). Let F be the distribution family DE defined by a particular enumeration of subsets of R and a particular choice of k. Let G be any family of distributions over X. The pair of distributions (F, G) is polynomially U-learnable.

Example 8 (Time-bounded logic programs are U-learnable under DE). Again let R be the set of all logic programs that can be built from a given finite alphabet of predicate symbols P, function symbols F, and variables V. And again let P be the set of all bounds p(x) on derivation length of the form p(x) = cx, where c is a positive integer and x is the size of (number of terms in) the goal. Let the domain X of examples be the ground atomic subset of Cd.
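To make the flavour of these families concrete, the following sketch (illustrative only; the enumeration, the truncation level and the component distributions are placeholders) builds a sampler for (r, p) in the spirit of DE: level i of the enumeration receives mass decaying as k^(−i), shared uniformly inside Ei, and the constant c of the linear derivation bound p(x) = cx is drawn from a separate prior.

```python
import random

def de_style_sampler(enumerate_subset, k, max_level, bound_const_prior):
    """Sampler over (r, p) pairs in the spirit of the family DE.

    enumerate_subset(i) -- returns the finite subset E_i of representations
                           (keeping |E_i| <= k**i is the caller's responsibility)
    k                   -- decay base: level i gets mass proportional to k**(-i)
    max_level           -- truncation level, making this finite sketch a proper
                           distribution
    bound_const_prior   -- dict {c: probability} over constants of p(x) = c*x
    """
    levels = list(range(1, max_level + 1))
    level_mass = [k ** (-i) for i in levels]
    z = sum(level_mass)
    level_mass = [w / z for w in level_mass]

    def sample(rng=random):
        i = rng.choices(levels, weights=level_mass, k=1)[0]
        r = rng.choice(list(enumerate_subset(i)))      # uniform within E_i
        cs, ps = zip(*bound_const_prior.items())
        c = rng.choices(cs, weights=ps, k=1)[0]
        return r, (lambda size, c=c: c * size)         # the pair (r, p) with p(x) = c*x
    return sample
```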
