Comprehensive Study And Comparison Of Information .

3y ago
15 Views
2 Downloads
466.62 KB
9 Pages
Last View : 16d ago
Last Download : 3m ago
Upload by : Olive Grimm
Transcription

(IJACSA) International Journal of Advanced Computer Science and Applications,Vol. 7, No. 1, 2016Comprehensive Study and Comparison ofInformation Retrieval Indexing TechniquesZohair MalkiInformation Systems DepartmentThe Collage of Computer Science and Engineering in YanbuTaibah University,Saudi ArabiaAbstract—This research is aimed at comparing techniques ofindexing that exist in the current information retrieval processes.The techniques being inverted files, suffix trees, and signaturefiles will be critically described and discussed. The differencesthat occur in their use will be discussed. The performance andstability of each indexing technique will be critically studied andcompared with the rest of the techniques. The paper also aims atshowing by the end the role that indexing plays in the process ofretrieving information. It is a comparison of the three indexingtechniques that will be introduced in this paper. However, thedetails arising from the detailed comparison will also enhancemore understanding of the indexing techniques.Keywords—Information Retrieval; IndexingInverted Files; Suffix Trees; Signature FilesI.Techniques;INTRODUCTIONInformation retrieval refers to the process of obtainingrelevant information from an existing database that consists ofdifferent data that has been collected together. The current stateof information retrieval depicts the existence of two searchindexing. The first one is metadata and the second one is fulltext. Metadata is formally outlined as data about other sets ofdata [1]. It is more precisely described as information regardingother information that is structured. Metadata tool ofinformation retrieval does not take into consideration thecomplexity of the search question. It can give relevant resultsto a simple query like the name of an author of a certain book.It can also provide relevant objects to other queries that arecomplex, like geographical codes. It is usually mostly utilizedin education institutions in libraries other resources with largedatabases [1]. Catalogs of Libraries represent a remotemetadata. Reviews of books, art collections as well assummaries also take the form of remote metadata.On the other hand, full-text tool of retrieving informationrefers to the use of techniques that search documents storedfrom single computer. It can also refer to the use of techniquesthat search of a document that exists in a collection ofdocuments in a collection of full-text database. A full-textsearch usually performs an examination of all the words thatexists in the documents stored in the attempt of matchingcriteria of searching [1]. Over the years, it has been commonlyused in online searches from databases of bibliography. Mostapplication programs, as well as websites, provide capabilitiesof the full-text searches. Most search engines of the webusually employ techniques of full-text search. However, thereare others that partially index the web pages. The onlycondition is that the web pages must undergo examination bytheir indexing systems.Baeza-Yates and Ribeiro-Neto [2] explain that indexingwith full text usually depends on the number of documents.Small numbers of documents can prompt direct scanning of thecontents. A strategy known as serial scanning is applied to eachquery. Serial scanning is the protocol that is usually followedby most tools in the searching process. An example of suchtools is Grep, which uses the strategy of serial scanning.Potential largeness of documents or increase of the quantity ofqueries to search prompts the division of full-text searchingprocess into two stages. The first is indexing and the secondone is searching. The first stage of indexing focuses on thescanning of all the existing documents. The stage also sees thebuilding of a search term list. This list of search terms that isusually built at the indexing stage is referred to as an index.However, there are people refer it to as a concordance. Thesecond stage of known as search only references the index inthe performance of specific queries. The stage does notreference the text of the documents that are original.However, this research will focus on a study of theindexing stage and the various techniques that are applied inthe process.There are several indexing techniques in informationretrieval. However, this research is going to focus on threeindexing techniques namely inverted files, suffix trees, andsignature files. The three are the most commonly usedtechniques in the current world of information retrieval. Theprocess of retrieving information usually begins with a queryfrom a user into the system. A query is a statement that isformal indicating the need for particular information. Anexample of a query is the search of information in onlinesearch engines. Information retrieval queries stated by users donot usually offer a specific object or solution to the problem.Rather, it gives a collection of related objects that match theproblem stated in the query. However, the objects havedifferent levels of relevance to the query. Depending on thetechnique that is used, the relevance of available information isdetermined with respect to the entered queries. The resultsgiven in a form of objects are based on their relevance to thequeries. The techniques have proven to the most reliable andusually generate desirable results. However, the indexingtechniques differ in many ways. They usually differ in the way132 P a g ewww.ijacsa.thesai.org

(IJACSA) International Journal of Advanced Computer Science and Applications,Vol. 7, No. 1, 2016they perform the relevancy tests. They also differ in theirsimplicity of application. The indexing that is performed by thetechniques does not take a similar route. The research in thepaper will outline the processes of indexing that the varioustechniques undertake.Accuracy is a key factor in retrieving information [3].Users expect accurate answers from the objects offered inrespect to their queries. The accuracy expectation usually cutsacross all the information retrieval methods as well as indexingtechniques. Users expect to have accurate information nomatter the technique that is used in the indexing process.However, it will be shown in the paper that the techniquesdiffer in their accuracy. This research will compare theaccuracy levels of the three mentioned techniques.Despite the high preference of inverted files, suffix trees aswell as signature files, they all have limitations. There arevarious challenges that are associated with each technique ofindexing. The level of challenges associated with theapplication of each technique will be measured in the paper. Adetailed comparison of the challenges will offer anunderstanding about which technique is more limited ascompared to the others. Further, the benefits associated withthe use of each technique will be outlined. Each indexingtechnique has benefits that are associated with its use. Thesebenefits and advantages will be critically evaluated andcompared. This comparison will offer information about whichtechnique among the three has the most accrued benefits uponits application in the process of retrieving information. Finally,each indexing technique has an objective. The objectives of thevarious techniques differ across the techniques. This paper willalso undertake a study of the main objectives of the techniqueswhich they focus their performance.II.INDEXING TECHNIQUESA basic definition of indexing was given in 1988 by Salton[4] as the facilitation of information retrieval accuracy bycollecting, parsing and storing data. The accuracy facilitation isperformed by use of various methods and techniques. Asearlier stated, users need accuracy in the information retrievalprocess. The indexing process usually has an incorporationmechanism that allows use of concepts from variousdisciplines. It has been stated that there exists manyinformation retrieval techniques with the common ones beinginverted files, suffix trees as well as signature files. Thissection will discuss each technique into details as well as theway they work.A. Inverted FilesInverted files are defined as central components of anindexing algorithm in a search engine. The engine that searchesinformation has a goal of query speed optimization. Thismeans finding documents where a certain word occurs. Thenthe next step is developing a forward index. The index that isdeveloped plays a role of storing the lists of words in everydocument. The document is then inverted, leading to adeveloped inverted index. Sequential iteration is usuallyrequired in order to query the forward index. In 2006, Belew[5] suggested that the iteration requires to be performed in eachword and document in order to allow the verification of adocument that matches the query. Technically, the resource interms of time and memory that is required in the performanceof such a query lacks an aspect of being realistic. However, thestructure of the inverted files that is developed lists thedocuments per every word. This is done in place of listing thevice versa, where the words would be listed per everydocument. To perform a clear illustration of the inverted fileconcept, we assume an existing set of documents. Further, weassume that every document in the set is assigned a list thatcomprises of keywords. These keywords can also be referred toas attributes. We also assume that there are optional weights ofrelevance for every keyword. With the assumptions, the sortedlist of keywords will be the inverted file. Each attribute willhave a link to the documents that contains the specifickeyword. Fig. 1 shows how the concept of inverted files works[5].Fig. 1. Inverted files, the index file contains all words in the document andtheir index, the posting file contains a link to each word with thecorresponding frequency, the document file contains the documentsAccording to Belew [5], the concept of inverted files ismostly used in library systems that are commercial. It is alsoused in libraries that belong to various education institutions.The reason for the popularity is that inverted files haveenhanced efficiency in searching. Basically, the efficiencyassociated with inverted files is usually necessary when dealingwith files that comprise large texts. This is the case for suchinstitutions, which justifies their preference of inverted files.1) Structures Used in the Inverted Files: There are severalstructures that are usually used in the implementation ofinverted files. The most commonly used structures are sortedarrays, B-trees as well as tries. These structures will bediscussed in this section. This will help in giving moreinformation about the concept of inverted files. It will givemore understanding of the relationship between inverted filesand the various structures that are used in the files’implementation process.a) Sorted Arrays: The implementation of an invertedfile through this structure enables the file to support storage ofkeywords’ lists in a sorted array. This includes severaldocuments that are associated with each attribute. Further, italso includes a link to the documents that usually contain theattributes. Primary storage based systems use a binary searchthat is standard and are the most commonly used in searchinga sorted array. On the other hand, systems that are based onsecondary storage usually adapt the sorted array in conformityto their secondary storage’s characteristics. Fig. 2. shows thestructure sorted arrays as outlined by Barto, et al [1].133 P a g ewww.ijacsa.thesai.org

(IJACSA) International Journal of Advanced Computer Science and Applications,Vol. 7, No. 1, 2016Fig. 2. Sorted arrays, The terms are sorted in lexical ascending order, then the duplicate words are removedThe figure contains two documents, Doc#1 and Doc#2. Thewords in each document are extracted and inserted in table.The words from both documents are sorted and possibleduplication in single document is removed. For example theword trees are repeated in document 1, so it has been removed.The word results are found in both documents so it can't beremoved.The sorted arrays structure has an easy implementationprocess. It has a reasonable speed that enhances itsperformance. However, the structure is limited in that itrequires frequent update of the index. The frequent updatingsometimes is expensive.b) B-trees: The most common type of the B-treestructure is prefix B tree. It utilizes word prefixes as theprimary keys in an index of B-tree. This makes it wellstructured for the storage of indices that are textual. Everynode that is internal usually carries a number of keys that arevariable. The shortest word distinguishing the keys stored inthe next level is usually named as the key. It is not necessaryfor the key to be a prefix in the index that is an actual term.The last level in the structure is known as leaf level, as shownin Fig 3. It carries the mandate of storing the attributes withthe data associated with them. The order of every node of theprefix B-tree varies because there is dependence on attributesby the internal node keys as well as their lengths [6].Fig. 3 shows simple Prefix B tree, the first level containstwo keys, B and T. The two keys represent separators of thefollowing leaves, Words beginning with letter less than or equal B suchas Ar and Am, Words between B and T such as Co Fi and Ja, Words after T such as Un and Wa.The second level represents other keys for the leavesbeneath them, and so on. The last level contains the words ofthe documents with pointers to the corresponding document.The B-Tree requires continuous update for maintenance ofbalance in the tree. The structure has a limitation in that it isnot capable of handling many words within the same prefix.The B-tree method is broken down in cases of multiple words.The prefixes that are common usually call for division to avoidspace wastage. B-trees usually occupy more space ascompared to sorted arrays. However, updates are easier toimplement and are faster in comparison with the sorted arrays[6].Fig. 3. Binary Tree indicating three levels with keys rpresenting each nodec) Tries: The structure’s name was generated from theword retrieval. This structure is widely used to implementinverted files. Digital decompositions of the attributes arehighly used by the structure in the representation of the same134 P a g ewww.ijacsa.thesai.org

(IJACSA) International Journal of Advanced Computer Science and Applications,Vol. 7, No. 1, 2016keywords [1]. In structures of tries, keys associated to specificnodes are not stored by the nodes. There is similarity of theprefixes of a string in all the descendants of a particular node.There is no necessity of associating values with every node.Values are instead associated with leaves alongside severalinner nodes that have correspondence to keys of interest [7].Fig. 4 shows an example of the Trie structure.children with the exceptional of the root. The labeling of theedge is done with non empty S substrings [13]. Any two edgesthat start out of a node should have strings labels that start witha different character. This condition means that it is notpossible for a suffix to be a prefix that is proper for another.The digit that comes last in the data is a, and it appears twotimes in the data. Lastly, suffix S[i.n] is spelt out by the stringobtained after the concatenation of string labels. These labelsare the one present in the path of root to leaf.Let us assume a string s peeper. The non empty suffixesof the string will be peeper, eeper, eper, per, er and lastly r.Developing a suffix tree for the string peeper will comprise acompressed trie containing elements peeper, eeper, eper, per, erand r. The alphabet of the string is e, p, and r. This means thatthe radix of the trie that is compressed is 3. Fig. 5 based on [10]indicates a Trie for suffixes of the word "peeper".Fig. 4. An example of Trie structure showing no association of values withevery node. Valures are instead associated with leaves alongside several innernodesThe example above shows listing of some keys in the nodeswhile the values are indicated below the nodes. There is anarbitrary integer value associated with every English word thatis complete. The example reveals a Trie as a finite automatonthat is deterministic and tree shaped. There is generation offinite language by automaton tries. Further, compression ofeach trie into a state automaton that is deterministic acyclic isimplemented as clearly shown in the above example. Tries arealso fast in terms of the time used in implementing invertedfiles. It is also easy to implement as the example is easy tounderstand.B. Suffix TreesA suffix tree is a Trie that is compressed and containssuffixes of the given texts as the keys that belong to them aswell as their values as the positions present in the text. The ideaof compressing tries makes suffix trees be referred to as tries.Consequently, the sub-trees are referred to as sub-tries. Theconcept of suffix trees was developed in the year 1973 byWeiner [8]. The first online construction of suffix was to bedeveloped by Ukkonen [9].The running time associated with the algorithm was rankedas one of the fastest at that time. However, the algorithms wereall linear-time for a size alphabet that was constant [10].Generally, they had a running time of.In 1997,Farach [11] designed an algorithm of suffix tree constructionthat had optimism for all alphabets. It was the first algorithm oflinear-time for strings that were drawn from integers of thealphabet, in a range that was polynomial. This was thefoundation of new algorithms that have been later developed inthe construction of both suffix trees as well as suffix arrays.Assuming a suffix tree for the string and length , thedefinition must meet several requirements [12]. Firstly, theremust be exactly n leaves that are numbered from 1 to n in thetree. Every node that is internal must also have at least twoFig. 5. Trie representation of the word "peeper", the compressed version theTrie is done with eliminating leaf white nodesThe use of suffix trees is applied when solving multiplestring problems occurring in free text search as well as textediting. Suffix trees are also used in computational biology aswell as other areas of application. However, there are severalprimary main applications of suffix trees. Firstly, they performa search of a string in O (m) complexity. In such an application,M represents the substring’s length. However, there is amandatory requirement that there be time O (n) sufficient forbuilding the string’s suffix tree. Secondly, suffix trees are usedto find the repeated string that is longest. It is also applied inthe process of finding the common substring that is longest.Lastly, the longest palindrome in a string is found throughapplication of suffix trees [14].The above mentioned applications are useful as theyexpand the use of suffix trees. They enable them to be used inreal life processes. For example, they are widely used inbioinformatics applications. They are also widely used in thesearch of DNA patterns as well as sequences of proteins.135 P a g ewww.ijacsa.thesai.org

(IJACSA) International Journal of Advanced Computer Science and Applications,Vol. 7, No. 1, 2016Fig. 6. The signature file is created by hashing every uncommon word to a given number of bits with fixed widthHowever, the sequences must be viewed as characters thatare long stringed. According to Callan et al [15] the greatestadvantage that makes suffix trees popular is their ability tomake long searches with minimal mismatches. This makesthem candidates to be used in data compression whey theyenable the finding of data that is repeated. Lastly, most searchengines also use suffix trees in the process of clustering data.C. Signature FilesA signature file is a technique of indexing that usuallycreates a filter that is “dirty”. An example of such a filter is theBloom filter that keeps all the existing documents m

Information Retrieval Indexing Techniques Zohair Malki . It is a comparison of the three indexing techniques that will be introduced in this paper. However, the details arising from the detailed comparison will also enhance more understanding of the indexing techniques. . collecting, parsing and storing data. The accuracy facilitation is

Related Documents:

Comparison table descriptions 8 Water bill comparison summary (table 3) 10 Wastewater bill comparison summary (table 4) 11 Combined bill comparison summary (table 5) 12 Water bill comparison – Phoenix Metro chart 13 Water bill comparison – Southwest Region chart 14

chart no. title page no. 1 age distribution 55 2 sex distribution 56 3 weight distribution 57 4 comparison of asa 58 5 comparison of mpc 59 6 comparison of trends of heart rate 61 7 comparison of trends of systolic blood pressure 64 8 comparison of trends of diastolic blood pressure 68 9 comparison of trends of mean arterial pressure

Water bill comparison summary (table 3) 10 Wastewater bill comparison summary (table 4) 11 Combined bill comparison summary (table 5) 12 Water bill comparison - Phoenix Metro chart 13 Water bill comparison - Southwest Region chart 14 Water bill comparison - 20 largest US cities chart 15

figure 8.29 sqt comparison map: superior bay (top of sediment, 0-0.5 ft) figure 8.30 sqt comparison map: 21st avenue bay figure 8.31 sqt comparison map: agp slip figure 8.32 sqt comparison map: azcon slip figure 8.33 sqt comparison map: boat landing figure 8.34 sqt comparison map: cargill slip figure

Middle School - Functional Skills and Adaptive Functional Skills Classes Class Type Abbreviation Comprehensive English ENG Comprehensive Reading READ Comprehensive Independent Living Skills ILS Comprehensive Mathematics MATH Comprehensive Science SCI Comprehensive Social Studies SS 20

2.1 A comparison of the existing bus ticketing systems 14 2.2 Comparison between Linux, Window and Mac 18 2.3 Comparison between Chrome , Mozilla and IE 20 2.4 Comparison between PHP,ASP.NET and JSP 22 2.5 Comparison between MySQL and Oracle 24 3.1 Data dictionary for AgentBasicInfotable 44 3.2 Data dictionary for feedbacktable 45

JavaScript, and in most other programming languages, conditional statements ask a question by using comparison operators. Before we discuss the syntax of the if statement, we need to explore the topic of comparison operators. Comparison operators Comparison operators are used to make comparisons. For example you can compare two

Sten 2: higher than about 5% of the comparison group Sten 3: higher than about 10% of the comparison group Sten 4: higher than about 25% of the comparison group Sten 5: higher than about 40% of the comparison group Sten 6: higher than about 60% of the comparison group Sten