Wiki Search Engine

A Search Engine Wiki Dump

download .ZIP download .TGZ

Kalpish Singhal

           Indexing mechanism for search egine

-> The code basically implement INDEX CREATION MODULE for seach engin

-> It reads the content of XML File and SAX Parser passes it

-> Character by character it scans the text and extracts words for indexing

-> sorted posting list on term frequency

-> Amended every word with D.F in indexing

           Files LIST With One Line Disciption

(i) Driver.java //contains main function takes Xml "wiki dump" as comandline input and calls Handller_pars for handling

(ii) Handller_pars.java //contains parsing handling mechanism to remove scrap and send words for parsing into map

(iii)help_parsing.java //helps in parsing the content removing stop words stemming it using Stemmer.java

(iv) Stemmer.java //stemm the word to its root form and return it to the calling function

(iv) Writeing_to_file.java //it writes o/p to the given comaand line input filr if file not exist it will create new file

(v) listing.java //Class defination for using ojects to keep count to be maped into the map

vi) merge_index.java //this merges the complete index files created by writing the file

Vii) post_index_processing.java //This processes the merged index to built sorted index with DF of each term

viii) Driver_title.java //This is main class to extract title corrsponding to the document id and save to file#this can be merged with driver.java but title document number mapping where not initially part of the statement project

ix) Map_title.java//this writes title to the file with wiki_doc_id

X) title_id_processor.java //processes the titel built above to make secondary and tertiary title_index

xi)Query_processor.java //used for proceesing user quries to give the desired document

            Implemented characteristics

-> Used tree Hash map for SortedMap order of the keys will be sorted. so sorted according to word A-Z

-> Not storing zeros in index file. helps in reducing

-> One go parssing :- data is parsed in one go to reduce time

-> Stop words are removed to make index efficent except from title in the view for seacrch of titles like "to be not to be" type query

-> infobox is parsed to get relevent data.

->Query time: <=1sec per query Query can be single word (Sachin) or multi-word(Sachin Tendulkar India) or fielded query(t:sachin b:india), where t=title,b=body,c=category,e:external links,i:info-box

            Output Format

Index file

-> word-?doc_id:,,,,;....;

eg:-ledzi-1?6256:,,,1,; it represnt ledzi is in document 6256 1 time in body .

Query processor

a) enter the number of quarries you want to process:eg:-2

b) enter your query:

c) Time taken in Result Generation: ...milisec

Assumption

i) xml-wiki dump #present in parent directory of code directory



ii) output folders should be present in parent directory