Web Structured Extraction

webMining binaries

Usage

For offline documents:

./run.sh file.html

For online documents, using a remotely controlled browser (Chrome + W3C WebDriver):

./webDriver.sh http://host/uri

Example script (basic.lua)

To run the script below, start chromedriver in another terminal, then run:

./webMining -debug -i basic.lua http://url > output.txt 2> output.debug.txt

basic.lua

CRLF = '\n'
extractURL = function(url, html, minPSD, minCV)
	-- loads and parses HTML into a DOM tree
	local dom = DOM.new(url, html)

	-- instantiates an extractor
	local dsre = DSRE.new() 

	-- sets z-score and CV thresholds
	dsre:setMinPSD(minPSD)
	dsre:setMinCV(minCV)

	-- extract records
	dsre:extract(dom)

	local regions = dsre:regionCount()

	-- iterates over regions
	for i=1,regions do
		local dr = dsre:getDataRegion(i-1)
		local rows = dr:recordCount()
		local cols = dr:recordSize()
		local content = ''
		if dr:isContent() then
			content = 'CONTENT DETECTED'
		end
		print('Region #', i, content, ': ')

		-- iterates over the current region's
		-- rows and columns
		for r=1,rows do  
			local record = dr:getRecord(r-1)
			for c=1,cols do
				-- outputs the field value
				if (record[c] ~= nil) then
					io.write(record[c]:toString())
				end
				io.write(';')
			end
			io.write(CRLF)
		end
	end
end

local driver = webDriver.chrome

-- the target URL is read from the fifth entry of args
if #args > 4 then
	local url = args[5]
	driver:newSession()
	driver:go(url)
	local html = driver:getPageSource()
	extractURL(url, html, 9.0, 0.30)
end
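
The output loop in basic.lua writes one line per record, with every field followed by a semicolon, so a missing field shows up as an empty column. The sketch below reproduces only that formatting in plain Lua, so it can be tried without the webMining binary or chromedriver. `makeField` and `formatRecord` are hypothetical helpers introduced for illustration, and the tables are stand-ins for the objects returned by `dr:getRecord`; the only thing taken from the script above is that a field has a `toString` method.

```lua
-- Hypothetical stand-in for a field object. In webMining the real objects
-- come from dr:getRecord(); only toString() is taken from basic.lua.
local function makeField(text)
	return { toString = function(self) return text end }
end

-- Builds one output line the same way the loop in basic.lua does:
-- every column is followed by ';', and a nil field yields an empty column.
local function formatRecord(record, cols)
	local parts = {}
	for c = 1, cols do
		if record[c] ~= nil then
			parts[#parts + 1] = record[c]:toString()
		end
		parts[#parts + 1] = ';'
	end
	return table.concat(parts)
end

-- Example record with three columns, the second one missing.
local record = { makeField('Widget'), nil, makeField('9.99') }
local line = formatRecord(record, 3)
print(line) --> Widget;;9.99;
```

Because the separator is written after every column, including the last one, each line ends with a trailing `;`. Tools that read output.txt as semicolon-separated values may therefore see one extra empty column.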

Publications

Paper published in CIKM'17

Paper published in ICWE'19

Paper published in SBBD'20

Building from source

make

deps (Ubuntu and Windows)

libcurl3-dev, liblua5.3-dev, python3-dev
