pang_kaiju

A web spidering system that processes webpage information on the fly. It stores the processed information (title, keywords, description, other meta tags, charset, language, etc.) in a MongoDB database and stores outgoing links as edges in a MySQL table. It uses a RabbitMQ queue (if available) to hold the URLs to visit and a Redis server (if available) to store visited URLs. If the RabbitMQ and/or Redis servers are not available, it relies on the STXXL library to implement a large queue and a map structure that hold the URLs to visit and the visited URLs, respectively.
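
The split between a queue of URLs to visit and a store of already-seen URLs can be illustrated with a minimal in-memory sketch. This is not the project's code: the `Frontier` class and its method names are hypothetical, and `std::queue` and `std::unordered_set` stand in for the RabbitMQ/Redis backends (or their STXXL fallbacks). As a simplification, a URL is marked as seen when it is enqueued rather than when it is fetched.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <string>
#include <unordered_set>

// Hypothetical in-memory stand-in for the two stores described above:
// a FIFO queue of URLs to visit and a set of URLs already seen.
class Frontier {
public:
    // Enqueue a URL unless it was seen before.
    // Returns true if the URL was newly enqueued.
    bool push(const std::string& url) {
        if (!seen_.insert(url).second) {
            return false;  // duplicate: already queued or visited
        }
        pending_.push(url);
        return true;
    }

    // True when there is nothing left to visit.
    bool empty() const { return pending_.empty(); }

    // Remove and return the next URL to visit (queue must not be empty).
    std::string pop() {
        assert(!pending_.empty());
        std::string url = pending_.front();
        pending_.pop();
        return url;
    }

    // Number of distinct URLs seen so far.
    std::size_t seen_count() const { return seen_.size(); }

private:
    std::queue<std::string> pending_;       // URLs to visit
    std::unordered_set<std::string> seen_;  // URLs already seen
};
```

A crawl loop under these assumptions would call `pop()` while `empty()` is false, fetch and parse the page, and `push()` each outgoing link. Duplicates are filtered by the set, so every URL is queued at most once.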

To compile the application on Linux, change to the directory containing the source files and run the following command:

g++ -std=c++11 main.cpp liboptions.cpp spider.cpp rabbitmq.cpp redisclient.cpp htmlparser.cpp uriparse.cpp tokenize.cpp curlclass.cpp md5.cpp -o pang_kaiju -lboost_regex -lboost_filesystem -lboost_system -lboost_thread -lcurl -lgumbo -lstxxl_debug  -lmysqlcppconn -lmysqlclient -lpthread -lrabbitmq -lmongoclient -liconv -lssl -lcrypto

