Search

This is a tool that, using an index file created with indexer, evaluates queries and lists the files in which the queried words appear. In short, it is a small search engine.

Design

The search tool is a simple REPL that uses an index file as its data source. The code is broken into two parts: a function that builds a hash table from the contents of a given index file, and a REPL that parses and fulfills queries using that hash table. Since the hash-table code is already fairly modular, the only real difficulty was how to implement AND and OR.
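For orientation, here is a minimal sketch of that two-part layout. It is not the project's actual source: build_table, or_query, and and_query are hypothetical stand-ins with stub bodies, and the sa command for AND is an assumption (only the so command is named in this README).

```c
#include <stdio.h>
#include <string.h>

typedef struct HashTable HashTable;     /* opaque in this sketch */

/* Stubs standing in for the real index-loading and query code. */
static HashTable *build_table(FILE *index_file) { (void)index_file; return NULL; }
static void or_query(HashTable *t, const char *words)  { (void)t; printf("OR:  %s", words); }
static void and_query(HashTable *t, const char *words) { (void)t; printf("AND: %s", words); }

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s index_file\n", argv[0]);
        return 1;
    }

    FILE *index_file = fopen(argv[1], "r");
    if (index_file == NULL) {
        perror(argv[1]);
        return 1;
    }

    /* Part 1: build the hash table from the contents of the index file. */
    HashTable *table = build_table(index_file);
    fclose(index_file);

    /* Part 2: REPL that parses and fulfills queries against the table. */
    char line[1024];
    while (fgets(line, sizeof line, stdin) != NULL) {
        if (strncmp(line, "so ", 3) == 0)
            or_query(table, line + 3);
        else if (strncmp(line, "sa ", 3) == 0)  /* AND command name assumed */
            and_query(table, line + 3);
        else if (line[0] == 'q')
            break;
        else
            fprintf(stderr, "unrecognized query\n");
    }
    return 0;
}
```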

OR was the simplest. For every word after the so command, the REPL looks the word up in the hash table, walks through every file in which the word occurred, and adds each file to a sorted list if it is not already present.
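The sketch below illustrates that union strategy. The FileNode type, sorted_insert_unique, and the lookup callback are illustrative stand-ins, not the project's real identifiers.

```c
#include <stdlib.h>
#include <string.h>

typedef struct FileNode {
    char *filename;
    struct FileNode *next;
} FileNode;

/* Insert filename into the sorted list unless it is already present. */
static FileNode *sorted_insert_unique(FileNode *head, const char *filename)
{
    FileNode **cur = &head;
    while (*cur != NULL) {
        int cmp = strcmp(filename, (*cur)->filename);
        if (cmp == 0)
            return head;            /* already in the result set */
        if (cmp < 0)
            break;                  /* insertion point found */
        cur = &(*cur)->next;
    }
    FileNode *node = malloc(sizeof *node);
    node->filename = strdup(filename);
    node->next = *cur;
    *cur = node;
    return head;
}

/* OR: take the union of the occurrence lists of every queried word. */
FileNode *or_query(FileNode *(*lookup)(const char *word), char **words, int nwords)
{
    FileNode *results = NULL;
    for (int i = 0; i < nwords; i++)
        for (FileNode *f = lookup(words[i]); f != NULL; f = f->next)
            results = sorted_insert_unique(results, f->filename);
    return results;
}
```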

AND required the use of a second sorted list. The first word's file occurrences are all added to list 1. The second word, and every subsequent word, then checks the files it occurs in and adds them to the second list only if they already exist in the first. That way, only files that contain all of the words remain.
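Here is a hedged sketch of that intersection, not the project's exact code. In this version list 2 replaces list 1 after each word so later words are checked against the already-narrowed set, sorting and freeing are omitted, and the names are assumptions.

```c
#include <stdlib.h>
#include <string.h>

typedef struct FileNode {
    char *filename;
    struct FileNode *next;
} FileNode;

/* Prepend a copy of filename to a list (sorting omitted in this sketch). */
static FileNode *push(FileNode *head, const char *filename)
{
    FileNode *node = malloc(sizeof *node);
    node->filename = strdup(filename);
    node->next = head;
    return node;
}

static int list_contains(const FileNode *head, const char *filename)
{
    for (; head != NULL; head = head->next)
        if (strcmp(head->filename, filename) == 0)
            return 1;
    return 0;
}

/* AND: list 1 holds the files matched so far; each later word builds
 * list 2 from the files it shares with list 1, and list 2 then becomes
 * list 1 for the next word. Files that survive every pass contained
 * every queried word. */
FileNode *and_query(FileNode *(*lookup)(const char *word), char **words, int nwords)
{
    FileNode *list1 = NULL;

    /* Seed list 1 with every file the first word occurs in. */
    for (FileNode *f = lookup(words[0]); f != NULL; f = f->next)
        list1 = push(list1, f->filename);

    for (int i = 1; i < nwords; i++) {
        FileNode *list2 = NULL;
        for (FileNode *f = lookup(words[i]); f != NULL; f = f->next)
            if (list_contains(list1, f->filename))
                list2 = push(list2, f->filename);
        list1 = list2;      /* list 2 replaces list 1 (freeing elided) */
    }
    return list1;
}
```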

The main data structures used were a sorted linked list that can act as a set, a hash table, and several specialized linked lists. The hash table's buckets hold nodes that chain into a linked list to handle collisions; each node carries a pointer to a linked list of all the text files whose contents contain that word. The sorted linked list is used to collect file names uniquely and later print them out for the user.
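The declarations below illustrate that layout. The type and field names are assumptions made for the sketch, not the project's real identifiers.

```c
#include <stddef.h>

/* One text file in which a given word appears. */
typedef struct FileEntry {
    char *filename;
    struct FileEntry *next;
} FileEntry;

/* A hash-table entry: collisions chain through `next`, and each word
 * carries a pointer to the list of files whose contents contain it. */
typedef struct WordNode {
    char *word;
    FileEntry *files;
    struct WordNode *next;
} WordNode;

/* The table itself: an array of bucket heads. */
typedef struct HashTable {
    WordNode **buckets;
    size_t nbuckets;
} HashTable;
```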

Valgrind Complaints

There are no valgrind complaints. Valgrind testing is part of the build process.

Profiling

I used valgrind to ensure that my code ran without memory leaks or errors. Once everything was in order, I used the callgrind tool that valgrind provides ($ valgrind --tool=callgrind) to measure the cost of calling the various functions in my code.

Unsurprisingly, various ld (dynamic linker) and libc (glibc) routines created a lot of overhead and accounted for more than half of the program's running cost. The trade-off is welcome, since libc and ld perform many useful tasks that have been heavily optimized over the decades.

One clear target for further optimization is finding a more efficient way to use fscanf. Changes to the index file layout have already cut down the number of fscanf calls, though I suspect the gains from optimizing fscanf usage have reached a plateau.
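For illustration, here is the kind of fscanf-driven loading loop being profiled. The record layout shown (word, file, count triples) is an assumption made for the sketch; the real index_file format may differ.

```c
#include <stdio.h>

/* Load the index, invoking `record` once per (word, file, count) entry. */
void load_index(FILE *index,
                void (*record)(const char *word, const char *file, int count))
{
    char word[256], file[1024];
    int count;

    /* One fscanf call per record, rather than one per token, is the kind
     * of layout-driven change that reduces the total call count. */
    while (fscanf(index, "%255s %1023s %d", word, file, &count) == 3)
        record(word, file, count);
}
```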

Another optimization that comes to mind is using merge sort on the various linked lists in the application; for large numbers of strings there is a clear bottleneck in this area. Furthermore, assigning each file name a numeric ID as part of the index_file protocol could lead to faster sorting, since number comparisons are usually much faster than string comparisons.
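A merge sort on a singly linked list of file names, the optimization floated above, could look roughly like this. The Node type is a stand-in; the project's real lists may differ.

```c
#include <string.h>

typedef struct Node {
    char *filename;
    struct Node *next;
} Node;

/* Merge two already-sorted lists into one sorted list. */
static Node *merge(Node *a, Node *b)
{
    Node dummy = { 0 }, *tail = &dummy;
    while (a != NULL && b != NULL) {
        if (strcmp(a->filename, b->filename) <= 0) {
            tail->next = a;
            a = a->next;
        } else {
            tail->next = b;
            b = b->next;
        }
        tail = tail->next;
    }
    tail->next = (a != NULL) ? a : b;
    return dummy.next;
}

/* Split the list with slow/fast pointers, sort each half, then merge. */
Node *merge_sort(Node *head)
{
    if (head == NULL || head->next == NULL)
        return head;

    Node *slow = head, *fast = head->next;
    while (fast != NULL && fast->next != NULL) {
        slow = slow->next;
        fast = fast->next->next;
    }
    Node *second = slow->next;
    slow->next = NULL;

    return merge(merge_sort(head), merge_sort(second));
}
```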

These optimizations will be kept in mind during development in the near future.

Authors: Artem Titoulenko, Jeffrey Adler
