Avoiding Plagiarism In Markov Sequence Generation


Alexandre Papadopoulos (1,2), Pierre Roy (1), François Pachet (1,2)
(1) Sony CSL, 6 rue Amyot, 75005, Paris, France
(2) Sorbonne Universités, UPMC Univ Paris 06, UMR 7606, LIP6, F-75005, Paris, France
Copyright © 2014, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

Markov processes are widely used to generate sequences that imitate a given style, using random walk. Random walk generates sequences by iteratively concatenating states to prefixes of length equal to or less than the given Markov order. However, at higher orders, Markov chains tend to replicate chunks of the corpus with a size possibly higher than the order, a primary form of plagiarism. The Markov order defines a maximum length for training but not for generation. In the framework of constraint satisfaction (CSP), we introduce MAXORDER. This global constraint ensures that generated sequences do not include chunks larger than a given maximum order. We exhibit an automaton that recognises the solution set, with a size linear in the size of the corpus. We propose a linear-time procedure to generate this automaton from a corpus and a given maximum order. We then use this automaton to achieve generalised arc consistency for the MAXORDER constraint, holding on a sequence of size n, in O(n·T) time, where T is the size of the automaton. We illustrate our approach by generating text sequences from text corpora with a maximum order guarantee, effectively controlling plagiarism.

Introduction

Markov chains are a powerful, widely used technique to analyse and generate sequences that imitate a given style (Brooks et al. 1957; Pinkerton 1956), with applications to many areas of automatic content generation such as music, text, line drawing and, more generally, any kind of sequential data. A typical use of such models is to generate novel sequences that "look" like or "sound" like the original.

From a corpus of finite-length sequences considered representative of the style of an author, a Markov model of the style is estimated based on the Markov hypothesis, which states that the future state of a sequence depends only on the last state, i.e.:

p(s_{i+1} | s_1, ..., s_i) = p(s_{i+1} | s_i).

The equation above describes a Markov model of order 1. The definition can be extended to higher orders by considering prefixes of length k greater than 1:

p(s_{i+1} | s_1, ..., s_i) = p(s_{i+1} | s_{i-k+1}, ..., s_i).

In theory, higher order Markov models are equivalent to order 1 models. However, in practice, higher order models offer a better compromise between expressivity and representation cost. Variable order Markov models are often used to produce sequences with varying degrees of similarity with the corpus (Begleiter, El-Yaniv, and Yona 2004). Indeed, increasing the Markov order produces sequences that replicate larger chunks of the original corpus, thereby improving the impression of style imitation.

However, it has also long been observed (Brooks et al. 1957) that increasing the order tends to produce sequences that contain chunks of the corpus of size much larger than the Markov order.

We illustrate this phenomenon on a text corpus: Johnston's English translation of Pushkin's Eugene Onegin, a reference to Markov, as he used the same corpus (in Russian) for his pioneering studies. Here, an element of the Markov chain is a word of the text or a sentence separator, and a sequence is a succession of such elements. With a Markov order of 1, we obtain the following sequence:
Praskovya re-baptized "Polina". Walking her secret tome that rogue, backbiter, pantaloon, bribe-taker, glutton and still eats, and featherbeds, and enjoyment locked him all went inside a day wood below the flower was passion and theirs was one who taught her handkerchief has measured off in caravan the finest printer with pulses racing down, he'll be nothing could draw it abounded.

On top of the text, we draw the longest subsequences that appear verbatim in the corpus, or chunks, assigning different colours to different lengths. For example, this generated sequence contains the chunk "[.] that rogue, backbiter, pantaloon, bribe-taker, glutton and [.]", which is a subsequence of length 7 from the corpus. The maximum order of a sequence is the maximum length of its chunks (7, in our example).
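The notion of chunk and maximum order can be made concrete with a short sketch. The following Python function is ours, not part of the paper; the name max_order and the token-list representation are assumptions. It returns the length of the longest chunk of a generated sequence that occurs verbatim in the corpus.

```python
def max_order(sequence, corpus):
    """Length of the longest chunk of `sequence` occurring verbatim in `corpus`.
    Both arguments are lists of tokens (words or sentence separators)."""
    best = 0
    for length in range(1, len(sequence) + 1):
        corpus_grams = {tuple(corpus[i:i + length])
                        for i in range(len(corpus) - length + 1)}
        if any(tuple(sequence[i:i + length]) in corpus_grams
               for i in range(len(sequence) - length + 1)):
            best = length      # some chunk of this length exists; try a longer one
        else:
            break              # no chunk of length m implies none of length m + 1
    return best

corpus = "the cat sat on the mat and the dog sat on the rug".split()
generated = "the dog sat on the mat".split()
print(max_order(generated, corpus))  # 5: "the dog sat on the" is copied verbatim
```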

If we increase the order to 3, we obtain sequences such as the following one:

Love's frantic torments went on beating and racking with their strain and stress that youthful heart. It all seemed new – for two days only – the estate provides a setting for angry heirs, as one, to admire him – and replies: Wait, I'll present you – but inside a day, with custom, love would fade away. It's right and proper that you transcend in music's own bewitching fashion the foreign words a maiden's passion found for its utterance that night directed his.

This sequence makes arguably more sense than the one generated with order 1. However, its maximum order is 20 (i.e. it contains a 20-word-long subsequence copied verbatim from the corpus). To any reader familiar with the corpus, this would read like blatant plagiarism.

To illustrate this phenomenon in a more general way, we generated a few hundred sequences of varying order (from 1 to 6). Table 1 shows the maximum order observed for each Markov order: this value increases to values much higher than the Markov order. The Markov order affects training: when estimated from a corpus, the Markov model learns all continuations of sequences of k states or fewer. However, this parameter k does not limit the maximum order of the generated sequence. In particular, the whole corpus itself is trivially a valid Markov sequence of order k.

  Markov order      1     2     3     4     5     6
  median            2     2     9     45    78    78
  lower quartile    2.0   2.0   6.0   28.0  77.0  78.0
  upper quartile    3.0   3.0   13.0  65.0  79.0  79.0

Table 1: Maximum orders for different Markov orders

This paper addresses precisely the problem of generating Markov sequences with an imposed maximum order. Such sequences cannot be obtained with greedy approaches like random walk. However, the maximum order problem can be seen as a specific instance of a constrained Markov problem. Introduced by Pachet and Roy (2011), Markov constraints consist in reformulating the generation of Markov sequences as a combinatorial problem (CSP). By enforcing arc-consistency of the Markov constraints, we can solve such a CSP in a backtrack-free manner, thus simulating a greedy random walk generation procedure. As opposed to greedy approaches, the Markov constraint approach guarantees a complete exploration of the space of sequences. The variables of this CSP represent the elements of the sequence to generate. Markov constraints enforce that variables in the sequence should have valid Markov continuations. Sequence generation is obtained by CSP resolution. Additional constraints can easily be introduced to control the generation of the sequence. In practice, the problem boils down to the identification of efficient arc-consistency algorithms for the given constraints.

Under this framework, we introduce the MAXORDER global constraint. It holds on all the variables of the CSP, and states that 1) the sequence is a valid order-k Markov sequence, and 2) no L consecutive variables form a sequence that belongs to the training corpus. Following Pesant (2004) and Beldiceanu, Carlsson, and Petit (2004), we enforce generalised arc-consistency by feeding a constraint such as REGULAR an automaton that accepts the set of such sequences. However, we show that canonical methods are not satisfactory for building this automaton. The main contribution of this paper is a linear-time algorithm that builds this automaton.

Running Example

We consider the corpus made of ABRACADABRA, where each element is one of the symbols A, B, C, D or R. With k = 1, the Markov chain estimated on this corpus is given by the following transition matrix (rows: current element; columns: successor):

        A     B     C     D     R
  A     0     0.5   0.25  0.25  0
  B     0     0     0     0     1
  C     1     0     0     0     0
  D     1     0     0     0     0
  R     1     0     0     0     0

During a training phase, these probabilities are estimated according to their frequency in the corpus. Here, of the four continuations of A in the corpus, two are with B, one with C and one with D. A sequence is a Markov sequence, according to an order-k Markov chain, if every k-gram of the sequence has a continuation with a non-zero probability. For example, ABRADABRACA is a valid Markov sequence, but ABRACADABA is not a valid Markov sequence, because the probability of having A after B is zero.
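For the running example, the estimation and the validity test can be sketched as follows in Python (our illustration, not from the paper; the function names are assumptions). The transition table reproduces the matrix above, and the two test sequences are the ones discussed above.

```python
from collections import Counter, defaultdict

def transition_table(corpus, k=1):
    """Estimate order-k transition probabilities from a corpus of symbols."""
    counts = defaultdict(Counter)
    for i in range(len(corpus) - k):
        counts[tuple(corpus[i:i + k])][corpus[i + k]] += 1
    return {prefix: {s: n / sum(c.values()) for s, n in c.items()}
            for prefix, c in counts.items()}

def is_markov_sequence(sequence, table, k=1):
    """True iff every k-gram of the sequence has a continuation of non-zero probability."""
    return all(sequence[i + k] in table.get(tuple(sequence[i:i + k]), {})
               for i in range(len(sequence) - k))

table = transition_table("ABRACADABRA")
print(table[("A",)])                              # {'B': 0.5, 'C': 0.25, 'D': 0.25}
print(is_markov_sequence("ABRADABRACA", table))   # True
print(is_markov_sequence("ABRACADABA", table))    # False: A never follows B in the corpus
```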
Automaton Representation

Following the work on language automata and global constraints, we can encode a set of Markovian sequences using an automaton. Global constraints such as REGULAR can then take this automaton as input to generate sequences.

Definition 1 (Automaton). A deterministic finite-state automaton, or simply automaton, is a quintuple A = ⟨Q, Σ, δ, q0, F⟩, where:
- Q is a finite non-empty set of states;
- Σ is the alphabet, a finite non-empty set of symbols;
- q0 ∈ Q is the initial state of the automaton;
- δ: Q × Σ → Q is the transition function, which maps a state to its successor for a given symbol;
- F ⊆ Q is the set of final, or accepting, states.

Definition 2 (Accepted language). An automaton recognises, or accepts, a set of words, called a language, defined as follows.
- A word w ∈ Σ* is a sequence a_1 ... a_p of symbols a_i ∈ Σ. The word w is accepted by A if and only if there exists a sequence q_0, ..., q_p of states such that δ(q_{i-1}, a_i) = q_i for all i, 1 ≤ i ≤ p, and q_p ∈ F.
- L(A) = {w ∈ Σ* | w is accepted by A} is the accepted language of A.
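Definitions 1 and 2 translate directly into code. The sketch below is ours; the dict-of-dicts encoding of δ and the toy automaton are assumptions made for illustration.

```python
def accepts(word, delta, q0, finals):
    """Run the automaton of Definition 1 on `word` and apply Definition 2:
    the word is accepted iff a run from q0 ends in an accepting state."""
    state = q0
    for symbol in word:
        if symbol not in delta.get(state, {}):
            return False              # undefined transition: the word is rejected
        state = delta[state][symbol]
    return state in finals

# Toy automaton tracking the parity of 'a' (state 0 = even number of a's):
parity = {0: {"a": 1, "b": 0}, 1: {"a": 0, "b": 1}}
print(accepts("abab", parity, 0, {0}))    # True  (two a's)
print(accepts("abb", parity, 0, {0}))     # False (one a)
```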

In order to represent order-k Markov transitions, we create an alphabet Σ where each symbol corresponds to a unique k-gram of the corpus. A valid order-k Markov transition is then represented by two symbols whose corresponding k-grams overlap on their k − 1 common symbols. A valid Markov sequence of length n is represented by a word of length n − k + 1 on this alphabet.

For example, for k = 2, the sequence ABRA corresponds to a sequence of three 2-grams ⟨A, B⟩, ⟨B, R⟩, ⟨R, A⟩. We can assign three symbols a1, a2, a3 ∈ Σ to those three 2-grams in their respective order. The Markov transition (A, B) → R is represented by the word a1 a2, and the sequence ABRA by the word a1 a2 a3.

Definition 3 (Markov Automaton). A Markov automaton associated to a corpus is an automaton A such that L(A) = {a_1 ... a_n ∈ Σ* | a_i a_{i+1} is a Markov transition, for all i, 1 ≤ i < n}.

Figure 1 shows a Markov automaton for the corpus ABRACADABRA with k = 1.

[Figure 1: A Markov automaton for the ABRACADABRA corpus, with k = 1.]

We can now represent the set of Markov sequences satisfying the maximum order property as the following automaton.

Definition 4 (Maximum Order Automaton). Let M be a Markov automaton for a given corpus. For a given no-good a_1 ... a_L ∈ N, let A(a_1 ... a_L) be an automaton such that L(A(a_1 ... a_L)) = {w ∈ Σ* | a_1 ... a_L is a factor of w}, i.e. the language of words containing at least one occurrence of the no-good. An automaton MO is a maximum order automaton for C and N if:

L(MO) = L(M) ∩ ⋂_{a_1 ... a_L ∈ N} ( Σ* \ L(A(a_1 ... a_L)) )

Figure 2 shows a maximum order automaton for the corpus ABRACADABRA, with k = 1 and L = 4. The labels in the states correspond to prefixes of forbidden no-goods.

[Figure 2: A maximum order automaton for the ABRACADABRA corpus, with k = 1 and L = 4.]

The MAXORDER Constraint

The purpose of MAXORDER is to generate Markov sequences with a guaranteed maximum order. The constraint takes two parameters C and N: C ⊆ Σ² denotes a set of valid Markov transitions, each represented by an element a1 a2 ∈ C corresponding to a valid pair of k-grams; N ⊆ Σ^L denotes a set of forbidden sequences, or no-goods.

In practice, each no-good of size L = L0 − k + 1 corresponds to a sequence of size L0 that appears in the corpus, where k is the Markov order and L0 is the maximum order. MAXORDER is defined as follows.

Definition 5 (MAXORDER constraint). The constraint MAXORDER(C, N, X_1, ..., X_n) holds iff:
- for each i, 1 ≤ i < n, X_i X_{i+1} ∈ C;
- for each i, 1 ≤ i ≤ n − L + 1, X_i ... X_{i+L−1} ∉ N.

For example, consider the ABRACADABRA corpus, with k = 1 and L = 4. There are 7 no-goods: ABRA, BRAC, RACA, ACAD, CADA, ADAB, DABR. The Markovian sequence ABRADABRACA does not satisfy the maximum order property: it contains the no-goods ABRA, ADAB, DABR, BRAC, RACA. The Markovian sequence RADADACAB does not contain any no-good of size 4, and so satisfies the maximum order property for L = 4. In fact, any satisfying sequence is a substring of a string following the pattern BRA(DA)*(CA)*BR.
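For k = 1, where symbols and corpus elements coincide (so L = L0), Definition 5 can be checked naively by enumerating the no-goods of the corpus and sliding a window over the candidate sequence. The sketch below is ours, with illustrative names; it reproduces the example above.

```python
def no_goods(corpus, L):
    """The set N of forbidden factors: all length-L substrings of the corpus."""
    return {corpus[i:i + L] for i in range(len(corpus) - L + 1)}

def satisfies_max_order(seq, C, N, L):
    """Naive MAXORDER check (k = 1): consecutive pairs must be valid transitions,
    and no window of L consecutive symbols may be a no-good."""
    if any(seq[i:i + 2] not in C for i in range(len(seq) - 1)):
        return False
    return all(seq[i:i + L] not in N for i in range(len(seq) - L + 1))

corpus = "ABRACADABRA"
C = {corpus[i:i + 2] for i in range(len(corpus) - 1)}   # {AB, BR, RA, AC, CA, AD, DA}
N = no_goods(corpus, 4)
print(sorted(N))        # ['ABRA', 'ACAD', 'ADAB', 'BRAC', 'CADA', 'DABR', 'RACA']
print(satisfies_max_order("ABRADABRACA", C, N, 4))   # False: it contains ABRA, ADAB, ...
print(satisfies_max_order("RADADACAB", C, N, 4))     # True
```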
Propagating the MAXORDER Constraint

There is a canonical way to propagate MAXORDER by considering the maximum order automaton. We can then enforce the MAXORDER constraint by imposing that a sequence of variables forms a word recognised by this automaton, using the REGULAR constraint (Pesant 2004). If T is the size of the automaton (the number of transitions), generalised arc-consistency on a sequence of n variables is achieved in O(n·T) time.
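The O(n·T) figure corresponds to the classical forward/backward filtering scheme behind REGULAR: compute the states reachable from q0 after i symbols, the states from which an accepting state can still be reached, and keep only the values supported by both. The following Python sketch is ours, not the solver's actual propagator, and for brevity it runs on the plain order-1 Markov automaton of the running example rather than on the maximum order automaton of Figure 2.

```python
def regular_filter(domains, delta, q0, finals):
    """GAC-style filtering for REGULAR: keep in domains[i] only the symbols that
    lie on some accepting path of length n = len(domains).  Runs in O(n * T)."""
    n = len(domains)
    forward = [set() for _ in range(n + 1)]        # states reachable after i symbols
    forward[0] = {q0}
    for i in range(n):
        forward[i + 1] = {delta[q][a] for q in forward[i]
                          for a in domains[i] if a in delta.get(q, {})}
    backward = [set() for _ in range(n + 1)]       # reachable states that can still accept
    backward[n] = forward[n] & set(finals)
    for i in reversed(range(n)):
        backward[i] = {q for q in forward[i]
                       if any(delta.get(q, {}).get(a) in backward[i + 1]
                              for a in domains[i])}
    return [{a for a in domains[i]
             if any(delta.get(q, {}).get(a) in backward[i + 1] for q in backward[i])}
            for i in range(n)]

# Order-1 Markov automaton of ABRACADABRA, one state per symbol plus an initial state q0:
delta = {"q0": {s: s for s in "ABCDR"},
         "A": {"B": "B", "C": "C", "D": "D"},
         "B": {"R": "R"}, "C": {"A": "A"}, "D": {"A": "A"}, "R": {"A": "A"}}
domains = [set("ABCDR"), {"R"}, set("ABCDR"), set("ABCDR")]   # position 1 fixed to R
print(regular_filter(domains, delta, "q0", set("ABCDR")))
# -> position 0 is forced to B, position 2 to A, position 3 keeps {B, C, D}
```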

The maximum order automaton can be built using standard automata theory operations that implement Definition 4. Initially, we build a Markov automaton (we provide in this paper an algorithm for doing this). Then, for each no-good, this automaton is intersected with the negation of the automaton recognising sequences containing the no-good. However, the complexity of this procedure is dominated by the complexity of intersecting a number of automata. If O(t) is the size of each of the automata, and N = |N| is the number of no-goods, the complexity is O(t^N). It is unlikely that an algorithm with a better complexity exists (Karakostas, Lipton, and Viglas 2000; 2003). Furthermore, this method does not give any bound on the size of the final automaton (other than O(t^N)). In the following section, we propose a linear-time algorithm to build this automaton.

Algorithms

In this section, we build the maximum order automaton in time linear in the size of the input, the set of no-goods. As a corollary, we show that the size T of this automaton is linear in the size of the input, and, therefore, that propagating MAXORDER is polynomial too. To this end, we introduce two algorithms. The first algorithm builds the minimal Markov automaton. The second algorithm builds a trie with the no-goods, computes their overlaps, and uses it to remove from the Markov automaton all sequences containing a no-good.

Markov automaton

The first algorithm consists in building an automaton recognising all Markovian sequences, i.e. sequences where any two successive k-grams correspond to a (k+1)-gram of the corpus, for a Markov order of k. This automaton, when viewed as a labelled directed graph, is essentially a syntactic rewrite of the Markov chain, with a different semantics attached to its nodes and edges. In this automaton, each transition is labelled by a symbol corresponding to a Markov state. A notable property of a Markov automaton is that all transitions labelled with a given symbol point to the same state.

Definition 6 (State labels). We use the following notation to relate states and the labels of their ingoing transitions.
- Let q be a state; a(q) = {a ∈ Σ | ∃q' ∈ Q, δ(q', a) = q} is the set of labels of the transitions pointing to q.
- Let a be a symbol of the alphabet; Q(a) is the unique state q such that a ∈ a(q).

The interpretation of a Markov automaton A can be stated formally as follows. Let a1, a2 ∈ Σ be two symbols. A Markov transition between the k-grams corresponding to a1 and a2 is represented in A by a transition between Q(a1) and Q(a2), labelled by a2. Intuitively, when at state q = Q(a1), we can observe any one of the Markov states in a(q). If we observe a1, we can generate a2. Since the automaton accepts sequences of arbitrary length, all states are accepting.

We build the Markov automaton using Algorithm 1. First, a state is created that observes any Markov state (lines 2 to 7). Then, each Markov transition is inserted iteratively.

Algorithm 1: Markov automaton
Data: C, the set of valid Markovian transitions
Result: M, the minimal Markov automaton
 1: M ← ⟨Q, Σ, δ, q0, F⟩
 2: q ← NewState(Q)
 3: a(q) ← ∅
 4: forall the a ∈ Σ do
 5:     δ(q0, a) ← q
 6:     Q(a) ← q; a(q) ← a(q) ∪ {a}
 7: F ← F ∪ {q0, q}
 8: forall the a1 a2 ∈ C do
 9:     q1 ← Q(a1)
10:     q ← separate(q1, a1)
11:     q2 ← Q(a2)
12:     δ(q, a2) ← q2
13:     if ∃ q' ∈ Q such that q and q' are equivalent then
14:         Merge q with q'
15:         Q(a1) ← q'; a(q') ← a(q') ∪ a(q)
16: function separate(q1, a1)
17:     q ← NewState(A)
18:     accept(q) ← true
19:     forall the a ∈ Σ such that δ(q1, a) is defined do
20:         δ(q, a) ← δ(q1, a)
21:     forall the q' ∈ Q such that δ(q', a1) = q1 do
22:         δ(q', a1) ← q
23:     Q(a1) ← q; a(q) ← {a1}

When inserting (a1, a2), a new state q is created using the separate() function. This function creates a new state with the same outgoing transitions as q1, and redirects all ingoing a1-labelled transitions from q1 to the newly created state. An a2-labelled transition can then be added. Minimality is maintained incrementally by assuming the invariant that, before the call to separate(), all states are pairwise non-equivalent. The creation of q does not affect the equivalence of any other state (in particular, not the predecessors of q and q1). After we add an a2-labelled transition to q, q ceases to be equivalent to q1. However, with the addition of the transition, it might have become equivalent to a (unique) other state, in which case we merge them, thus maintaining the invariant.

The size of the resulting automaton, and the running time to build it, are both bounded by the number of transitions in C.
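Before minimisation, the Markov automaton is just this syntactic rewrite: one state Q(a) per symbol, a transition from Q(a1) to Q(a2) labelled a2 for every pair a1 a2 ∈ C, and an initial state from which any symbol can be observed. The Python sketch below is ours and builds only this non-minimised variant; Algorithm 1 additionally keeps the automaton minimal by merging equivalent states on the fly.

```python
def markov_automaton(C, alphabet):
    """Non-minimised Markov automaton: states are named after the symbols, and
    delta[state][symbol] gives the successor state.  All states accept."""
    delta = {"q0": {a: a for a in alphabet}}   # from q0, any Markov state can be observed
    for a1, a2 in C:
        delta.setdefault(a1, {})[a2] = a2      # transition Q(a1) --a2--> Q(a2)
    finals = set(alphabet) | {"q0"}
    return delta, "q0", finals

C = [("A", "B"), ("B", "R"), ("R", "A"), ("A", "C"),
     ("C", "A"), ("A", "D"), ("D", "A")]
delta, q0, finals = markov_automaton(C, "ABCDR")
print(delta["A"])   # {'B': 'B', 'C': 'C', 'D': 'D'}: from Q(A) one may generate B, C or D
```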

Maximum Order Automaton

The second algorithm consists in removing from the Markov automaton any sequence that contains at least one no-good, i.e. a sequence of forbidden length appearing in the corpus. This is performed by Algorithm 2. Intuitively, a trie of the no-goods is computed, in which all states except those corresponding to a full no-good are accepting. This ensures that a no-good is never accepted. However, this is not sufficient. The key part of the algorithm is to add connections between overlapping prefixes. For example, if we have two no-goods ABCD and BCEF, the prefix ABC and the no-good BCEF overlap on BC. This means that the automaton should not accept ABCD, but it should not accept ABCEF either. This connection is made using cross-prefix transitions. To achieve this, Algorithm 2 uses an adaptation of the Aho and Corasick (1975) string-matching algorithm.

In more detail, this algorithm first disconnects from the Markov automaton any transition that starts a no-good. Then, those transitions are extended into a full trie of all the no-goods. For a state q of the trie, w(q) records the prefix recognised by this state, i.e. the word accepted from q0 to q. By excluding the states that recognise a complete no-good (line 20 of Algorithm 2), we guarantee that no sequence including any no-good will be recognised. However, we are not removing those states yet. We first have to deal with a key aspect of this algorithm, which concerns the overlap between no-goods. For example, with two no-goods ABCD and BCEF, supposing we have started ABC, we cannot continue with D, as this would complete a no-good. However, if we continue with E, we start the no-good BCEF. Therefore, a cross-prefix transition between the state for ABC and the state for BCE must be added. Cross-prefix transitions ensure that, by avoiding a particular no-good, we do not lose track of another no-good that may already have started.
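Algorithm 2 itself is not reproduced in this excerpt. As a point of reference, the sketch below is ours and shows the classical Aho and Corasick (1975) construction that Algorithm 2 adapts: a trie of the no-goods whose failure links play the same role as the cross-prefix transitions, so that overlapping no-goods are all caught.

```python
from collections import deque

def build_aho_corasick(patterns):
    """Textbook Aho-Corasick automaton: a trie of the patterns plus failure links."""
    goto, fail, out = [dict()], [0], [False]
    for pat in patterns:                       # 1. build the trie of the no-goods
        s = 0
        for sym in pat:
            if sym not in goto[s]:
                goto.append(dict()); fail.append(0); out.append(False)
                goto[s][sym] = len(goto) - 1
            s = goto[s][sym]
        out[s] = True                          # this state recognises a complete no-good
    queue = deque(goto[0].values())            # 2. breadth-first failure-link computation
    while queue:
        s = queue.popleft()
        for sym, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f != 0 and sym not in goto[f]:
                f = fail[f]
            fail[t] = goto[f][sym] if sym in goto[f] and goto[f][sym] != t else 0
            out[t] = out[t] or out[fail[t]]
    return goto, fail, out

def first_no_good(sequence, goto, fail, out):
    """Index just past the first no-good occurring in `sequence`, or -1 if none."""
    s = 0
    for i, sym in enumerate(sequence):
        while s != 0 and sym not in goto[s]:
            s = fail[s]                        # follow failure (cross-prefix) links
        s = goto[s].get(sym, 0)
        if out[s]:
            return i + 1
    return -1

nogoods = ["ABRA", "BRAC", "RACA", "ACAD", "CADA", "ADAB", "DABR"]
ac = build_aho_corasick(nogoods)
print(first_no_good("RADADACAB", *ac))     # -1: no no-good of length 4
print(first_no_good("ABRADABRACA", *ac))   # 4: the prefix ABRA is already a no-good
```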
