R10 - a web search engine

An assignment for CS6913.

Crawler File List


|- config.conf: The configuration file of this program.
|- crawler.log: The log file of this program. The log file name is configurable.
|- crawler.py: The entry point of this program.
|- crconfig.py: Read the configuration and perform the MIME type checks.
|- crdb.py: Save pages to the hard disk and read them back into memory.
|- crdownloader.py: Downloader of the crawler.
|- crlogger.py: Return a logger.
|- crparser.py: Parser of the crawler.
|- crrobot.py: Cache the robots.txt rules for different sites and check those rules for a specific URL. Thread-safe. (See the sketch after this list.)
|- crscheduler.py: Maintain the set of already-loaded URLs and the queue of URLs to be downloaded. Thread-safe.
|- crsummary.py: Maintain the result list and write it to a summary file.
|- googleapi.py: Get the initial URLs from Google using Google's API.
|- page.py: Maintain the page data structure.
|- singleton.py: Define the singleton attribute.
|- explain.txt: Explanation file of this program.
|- readme.txt: Readme file of this program.
|- htmlData: The folder for downloaded pages. I do NOT include the downloaded pages, to reduce the package size.
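
crrobot.py is described above as a thread-safe per-site robots.txt cache. A minimal sketch of that idea using the standard library follows; the class and method names here are my own invention, not necessarily what crrobot.py actually uses.

import threading
import urllib.robotparser
from urllib.parse import urlparse

class RobotCache:
    # Thread-safe cache of robots.txt rules, keyed by site.
    # "RobotCache" and "R10Crawler" are hypothetical names for illustration.

    def __init__(self, user_agent="R10Crawler"):
        self.user_agent = user_agent
        self._parsers = {}             # netloc -> RobotFileParser
        self._lock = threading.Lock()

    def can_fetch(self, url):
        # Return True if the site's robots rules allow fetching this URL.
        parts = urlparse(url)
        with self._lock:
            parser = self._parsers.get(parts.netloc)
            if parser is None:
                parser = urllib.robotparser.RobotFileParser()
                parser.set_url("%s://%s/robots.txt" % (parts.scheme or "http", parts.netloc))
                try:
                    parser.read()              # one fetch per site, then cached
                except OSError:
                    parser.allow_all = True    # unreachable robots.txt: allow by default
                self._parsers[parts.netloc] = parser
        return parser.can_fetch(self.user_agent, url)

Holding one lock for the whole cache is the simplest safe choice; a per-site lock would avoid blocking unrelated threads while a site's robots.txt is being fetched.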

Crawler Usage


python crawler.py [-h] -q QUERY_STRING [QUERY_STRING ...] -n PAGES_NUMBER
Example:
python crawler.py -q poly nyu -n 500
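
The flags above map directly onto Python's argparse. A minimal sketch of how such an entry point could be wired up (the dest names and help strings are assumptions; crawler.py's actual internals may differ):

import argparse

def parse_args():
    # Mirrors the documented usage: -q takes one or more query words,
    # -n takes the target number of pages to crawl.
    parser = argparse.ArgumentParser(description="R10 web crawler")
    parser.add_argument("-q", dest="query_string", nargs="+", required=True,
                        help="query words used to seed the crawl")
    parser.add_argument("-n", dest="pages_number", type=int, required=True,
                        help="number of pages to download")
    return parser.parse_args()

if __name__ == "__main__":
    args = parse_args()
    print("query:", " ".join(args.query_string))
    print("pages:", args.pages_number)

With the example invocation above, args.query_string would be ['poly', 'nyu'] and args.pages_number would be 500.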