qdstreaming

Query-driven entity resolution over a streaming data set

Compile

You need to have Spark installed and running somewhere.

sbt compile
sbt -mem 8000 assembly

This creates an assembly jar that contains the entire qdstreaming package. To launch a class such as edu.ufl.cise.dsr.examples.WikiLink, use the following command.

~/projects/spark/bin/spark-submit\
 --class "edu.ufl.cise.dsr.examples.WikiLink"\
 --master "local[4]"\
 /home/cgrant/projects/qdstreaming/code/target/scala-2.10/qdstreaming-assembly-0.01.jar

Even better, to run on the sm321 server, use the following command.

time ~/projects/spark/bin/spark-submit\
 --class "edu.ufl.cise.dsr.examples.DistributedER"\
 --supervise --driver-cores 1 --total-executor-cores 32 --executor-memory 2G\
 --driver-memory 6G -v --master spark://sm321-01.cise.ufl.edu:7077\
 /home/cgrant/projects/qdstreaming/code/target/scala-2.10/qdstreaming-assembly-0.01.jar

To launch a process in the REPL, use the following command.

~/projects/spark/bin/spark-shell\
 --master "local[8]"\
 --jars /home/cgrant/projects/qdstreaming/code/target/scala-2.10/qdstreaming-assembly-0.01.jar

Algorithm

We apply the doubling algorithm in minibatches, accepting a stream of documents and monitoring a set of queries while keeping the clusters about the same size.
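
The minibatch loop described above can be sketched roughly as follows. This is a simplified, language-agnostic illustration in Python, not the project's NewDoublingAlgorithm: the class name `DoublingClusterer`, the Euclidean distance, and the starting radius are all assumptions, and query monitoring is left out. Each batch is inserted greedily, and whenever the cluster count exceeds the budget the radius is doubled and nearby clusters are merged.

```python
import math


class DoublingClusterer:
    """Hypothetical sketch of a doubling-style incremental clusterer."""

    def __init__(self, max_clusters, radius=1.0):
        if max_clusters < 1:
            raise ValueError("max_clusters must be at least 1")
        self.max_clusters = max_clusters
        self.radius = radius          # assumed starting radius
        self.clusters = []            # list of (center, members) pairs

    def _insert(self, point):
        # Update step: join the first cluster whose center is close enough,
        # otherwise open a new cluster centered on the point.
        for center, members in self.clusters:
            if math.dist(center, point) <= self.radius:
                members.append(point)
                return
        self.clusters.append((point, [point]))

    def _merge(self):
        # Merge step: double the radius until the cluster budget is met.
        while len(self.clusters) > self.max_clusters:
            self.radius *= 2
            merged = []
            for center, members in self.clusters:
                for kept_center, kept_members in merged:
                    if math.dist(kept_center, center) <= self.radius:
                        kept_members.extend(members)
                        break
                else:
                    merged.append((center, list(members)))
            self.clusters = merged

    def add_batch(self, batch):
        for point in batch:
            self._insert(point)
        self._merge()


clusterer = DoublingClusterer(max_clusters=2)
clusterer.add_batch([(0.0, 0.0), (0.5, 0.0), (10.0, 0.0)])
clusterer.add_batch([(20.0, 0.0)])
print(len(clusterer.clusters), clusterer.radius)
```

Merging only at the end of each minibatch keeps the per-document cost low; a real implementation would also need to recompute centers and respect the query set.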

IngestorActor

Pulls documents from a data stream
Pushes documents to a classifier actor

ClassifierActor

Vectorizes the documents
- Extracts entity chains and contexts
Accepts documents
Discards bad documents
- If a new doc is farther than the difference of the individual cluster centers
Pushes docs to the CorefActor
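
The discard rule above is stated loosely. One possible reading, sketched here in Python under stated assumptions, is to drop a document when its distance to the nearest cluster center exceeds the largest distance between any two centers. The bag-of-words `vectorize`, the cosine distance, and the function names are illustrative stand-ins, not the project's actual entity chain extraction.

```python
import math
from collections import Counter


def vectorize(text):
    # Assumption: plain term counts stand in for entity chains and contexts.
    return Counter(text.lower().split())


def cosine_distance(a, b):
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def is_bad_document(doc_vec, centers):
    # With fewer than two centers there is no spread to compare against.
    if len(centers) < 2:
        return False
    spread = max(
        cosine_distance(c1, c2)
        for i, c1 in enumerate(centers)
        for c2 in centers[i + 1:]
    )
    nearest = min(cosine_distance(doc_vec, c) for c in centers)
    return nearest > spread


centers = [vectorize("barack obama president"), vectorize("barack obama senator")]
print(is_bad_document(vectorize("barack obama"), centers))
print(is_bad_document(vectorize("stock market prices"), centers))
```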

CorefActor

Accepts vectorized documents
Knows the centroid of all CanopyActors
Runs NewDoublingAlgorithm
Defaults to update mode
When CanopyActors complain
- Sends a signal to the MergeActor

CanopyActor

Stores local data structures
- Pairwise similarity
- Term-document frequencies (ambiguity)
Accepts vectorized documents
Represents a cluster
Keeps track of its containing entity nodes
Performs incremental entity resolution
If its size is large
- Performs a SCRUB step
- If SCRUB does not help
  - Complains to the CorefActor

Possibly run several duplicate canopies that perform random coreference, with periodic merges for consensus
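
The size check, SCRUB step, and complaint described for the CanopyActor can be sketched as below. This is a plain-object illustration in Python, not actor code: the `Canopy` class, the `complain` callback standing in for a message to the CorefActor, and the rule that SCRUB keeps only mentions sharing a term with the query set are all assumptions.

```python
class Canopy:
    """Hypothetical stand-in for a CanopyActor's size handling."""

    def __init__(self, max_size, query_terms):
        self.max_size = max_size
        self.query_terms = set(query_terms)
        self.mentions = []

    def scrub(self):
        # Assumed SCRUB rule: keep only mentions that touch a query term.
        before = len(self.mentions)
        self.mentions = [m for m in self.mentions if self.query_terms & set(m)]
        return before - len(self.mentions)

    def accept(self, mention, complain):
        self.mentions.append(mention)
        if len(self.mentions) > self.max_size:
            self.scrub()
            # If SCRUB did not bring the canopy back under its limit, complain.
            if len(self.mentions) > self.max_size:
                complain(self)


complaints = []
canopy = Canopy(max_size=2, query_terms={"obama"})
for mention in [("obama", "president"), ("weather",), ("obama", "senator")]:
    canopy.accept(mention, complaints.append)
canopy.accept(("barack", "obama"), complaints.append)
print(len(canopy.mentions), len(complaints))
```

In the example, the third mention triggers a SCRUB that removes the off-query item, so no complaint is raised until the fourth mention pushes the canopy over its limit again.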

MergeActor

Performs MERGE on CanopyActors
Calls UNLOAD on CanopyActors

Technical Merit

Adds SCRUB/UNLOAD to the doubling algorithm
- SCRUB --- Get rid of non-query items, or "weird" items
- UNLOAD --- Create a summary for ER use
Self-managing canopies
- We can also have clone canopies; these canopies communicate to make merge decisions.
  - (1) Clone canopies only keep mentions that appear in at least one of the query nodes.
  - (2) Canopies also throw out any instances of duplicate information. Obvious duplicates (redundant information) are not needed to make decisions.
Query distribution: the number of query nodes vs. the number of clusters/canopies
- If canopies become large, create more sub-entities
- Swaps at the sub-entity level
Coreference models
- Query-driven within a canopy
- Lifted inference (Percy Liang)
- Hierarchical merge proposals (wick12hierarchical.pdf)
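
As a rough illustration of the UNLOAD and duplicate-dropping ideas listed above, the sketch below removes exact duplicate mentions and then summarizes what remains. The shape of the summary (a size plus the most frequent terms by document frequency) and the names `drop_duplicates` and `unload` are assumptions made for the example; the source does not specify what the summary contains.

```python
from collections import Counter


def drop_duplicates(mentions):
    # Treat two mentions as duplicates when they contain the same set of terms.
    seen = set()
    unique = []
    for mention in mentions:
        key = frozenset(mention)
        if key not in seen:
            seen.add(key)
            unique.append(mention)
    return unique


def unload(mentions, top_n=3):
    # Assumed summary: canopy size and its highest document-frequency terms.
    unique = drop_duplicates(mentions)
    doc_freq = Counter(term for mention in unique for term in set(mention))
    return {"size": len(unique), "top_terms": doc_freq.most_common(top_n)}


summary = unload([
    ("barack", "obama"),
    ("obama", "barack"),
    ("obama", "president"),
    ("senator", "obama"),
])
print(summary["size"], summary["top_terms"][0])
```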

Experiments

Successful merges/second for increased number of query nodes
- This should increase with more query nodes
- If not, the query-driven approach is pointless
([Merges/Second] and F1-Score) vs Scrub rate
- Find the optimal scrub rate
[Document Arrival Rate] vs [Merges/Second]
- As the arrival rate increases, the merges/second should not decrease
- A constant or increasing rate means the system can handle the increased load
[Entity Chains Seen] vs [# Clusters produced] vs [# query nodes]
- Clusters produced should increase with entity chains seen
- Clusters produced should increase with number of query nodes
[Coref Chains processed] vs [Time] vs [Query nodes]
- I'm not sure
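
The two quantities plotted in these experiments could be computed along the following lines. The source does not say which F1 variant is intended; this sketch assumes pairwise F1 over coreferent mention pairs, a common choice for entity resolution, and the function names are illustrative.

```python
from itertools import combinations


def coreferent_pairs(clusters):
    # Every unordered pair of mentions that share a cluster.
    pairs = set()
    for cluster in clusters:
        pairs.update(combinations(sorted(cluster), 2))
    return pairs


def pairwise_f1(predicted, gold):
    predicted_pairs = coreferent_pairs(predicted)
    gold_pairs = coreferent_pairs(gold)
    true_positives = len(predicted_pairs & gold_pairs)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted_pairs)
    recall = true_positives / len(gold_pairs)
    return 2 * precision * recall / (precision + recall)


def merges_per_second(successful_merges, elapsed_seconds):
    return successful_merges / elapsed_seconds


print(pairwise_f1([{1, 2, 3}, {4}], [{1, 2}, {3, 4}]))
print(merges_per_second(30, 12.0))
```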

Data Sets

Use the BibTeX data set: http://www.iesl.cs.umass.edu/data/bibtex

Visualization

Large-scale real-time visualization using Superconductor [http://superconductor.github.io/superconductor/]
