GitHub - abhowmick22/DistributedKMeans: KMeans implementation for point / DNA Strand clustering using OpenMPI

FILE AND DIRECTORY ORGANIZATION:

1. D2Impl: Contains the source files for the 2D dataset code.
2. DNAImpl: Contains the source files for the DNA strand dataset code.
3. Makefile (within the D2Impl and DNAImpl directories): Contains the make targets for building the sequential and MPI implementations.
4. kmeans.sh: Located in the DataGeneratorScripts directory; generates the 2D dataset.
5. dnaClusterPoints.sh: Located in the DataGeneratorScripts directory; generates the DNA dataset given the number of clusters and the number of points per cluster.
6. dnaTotPoints.sh: Located in the DataGeneratorScripts directory; generates the DNA dataset given the total number of points.
7. machines: Lists the machines on which to run MPI.
8. Lab4Report.pdf: The lab report describing the implementation and analysis.


COMPILE:

-----For the 2D dataset, please enter the D2Impl directory and do the following:

make outmpi
make outseq

These commands create the executables outmpi (the OpenMPI version) and outseq (the sequential version).

-----For the DNA dataset, please enter the DNAImpl directory and do the following:

make dnampi
make dnaseq

These commands create the executables dnampi (the OpenMPI version) and dnaseq (the sequential version).

EXECUTE:

-----For the 2D dataset:
1. Edit and run kmeans.sh (b=points per cluster, k=number of clusters), located in DataGeneratorScripts. This creates the corresponding CSV file in the input directory inside DataGeneratorScripts.
2. Place the executables (e.g. outmpi) in the public directory on the GHC machines.
3. Go to your home directory on the GHC machines (one which contains public, private directories etc.) and run the following commands from there:

For running sequential code:
	time ./public/outseq -n $n -p $p -i $inputFile
where $n is the number of clusters, $p is the number of points, and $inputFile is the input file containing the 2D points.
For example,
	time ./public/outseq -n 5 -p 50 -i public/test.csv
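
To make the role of the -n, -t and -m options concrete, here is a minimal Python sketch of sequential k-means (Lloyd's algorithm) on 2D points. It is not the code behind outseq. The function name kmeans_2d, the random choice of initial centroids, and the stopping rule (largest centroid movement below the threshold) are all assumptions made for illustration; the lab report describes what the actual implementation does.

```python
import math
import random


def kmeans_2d(points, k, threshold=0.001, max_iters=100000, seed=0):
    """Cluster 2D points with Lloyd's algorithm; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Assumption: initial centroids are k distinct input points chosen at random.
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(max_iters):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                mean_x = sum(x for x, _ in members) / len(members)
                mean_y = sum(y for _, y in members) / len(members)
                new_centroids.append((mean_x, mean_y))
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[c])
        # Assumed stopping rule: largest centroid movement is below threshold.
        shift = max(math.dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift < threshold:
            break
    return centroids, labels
```

In this sketch, k plays the role of -n, threshold of -t, and max_iters of -m; the defaults mirror the ones listed in the options table.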

For running the MPI code:
	time /usr/lib64/openmpi/bin/mpirun --mca btl_tcp_if_include eth0 -np $numProcs -machinefile $machFile public/outmpi -n $n -p $p -i $inputFile
where $numProcs is the number of MPI processes to launch (worker nodes), $machFile is the machines file, and $n, $p and $inputFile are as above.
For example:
	/usr/lib64/openmpi/bin/mpirun --mca btl_tcp_if_include eth0 -np 4 -machinefile public/machines public/outmpi -n 5 -p 50 -i public/test.csv
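
The README does not spell out how work is divided among the $numProcs processes. A common data-parallel scheme, sketched below in plain Python without any MPI library, is for each process to handle one chunk of the points, compute per-cluster partial sums and counts, and then combine them with a sum reduction to get the new centroids. The function names split_points, partial_sums and reduce_and_update are hypothetical, and the scheme itself is an assumption, not a description of outmpi.

```python
import math


def split_points(points, num_procs):
    """Split points into num_procs nearly equal contiguous chunks."""
    base, extra = divmod(len(points), num_procs)
    chunks, start = [], 0
    for rank in range(num_procs):
        # The first `extra` ranks take one additional point each.
        size = base + (1 if rank < extra else 0)
        chunks.append(points[start:start + size])
        start += size
    return chunks


def partial_sums(chunk, centroids):
    """Work of one rank: per-cluster coordinate sums and member counts."""
    k = len(centroids)
    sums = [[0.0, 0.0] for _ in range(k)]
    counts = [0] * k
    for p in chunk:
        c = min(range(k), key=lambda j: math.dist(p, centroids[j]))
        sums[c][0] += p[0]
        sums[c][1] += p[1]
        counts[c] += 1
    return sums, counts


def reduce_and_update(partials, centroids):
    """Combine every rank's partial result (the role of a sum reduction)."""
    new_centroids = []
    for c in range(len(centroids)):
        total = sum(counts[c] for _, counts in partials)
        if total == 0:
            # No rank assigned a point here; keep the old centroid.
            new_centroids.append(centroids[c])
            continue
        sx = sum(sums[c][0] for sums, _ in partials)
        sy = sum(sums[c][1] for sums, _ in partials)
        new_centroids.append((sx / total, sy / total))
    return new_centroids
```

One iteration is then `reduce_and_update([partial_sums(ch, cents) for ch in split_points(points, 4)], cents)`, where the list comprehension stands in for the four processes running concurrently.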

The available options for execution are:
-i : input file path (M)
-n : number of clusters (M)
-p : number of points (M)
-d : number of dimensions of data (O for 2D dataset, M for DNA dataset). (Default: 2 for 2D dataset)
-o : output file (O)
-t : threshold for stopping (double) (O) (default:0.001)
-m : maximum number of iterations (O) (default:100000)
-v : print time calculated by our program (O). 1 for printing time, 0 for not printing time (default:0)

where M: mandatory, O: optional
Please note that in our implementation, optional parameters must be supplied without a space between the option and its value, e.g., -v1.
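
When the executables are driven from a script, the spacing rule is easy to get wrong. The following Python sketch assembles an argument list that writes the mandatory options with a space and attaches the values of the optional -o, -t, -m and -v options directly to the flag. The helper name build_args is hypothetical. The -d option is written with a space here because that is how this README's DNA examples pass it; how -d must be written for the 2D executables is not covered.

```python
def build_args(executable, n, p, input_file, d=None,
               output=None, threshold=None, max_iters=None, verbose=None):
    """Return an argv list for one of the executables in this README."""
    # Mandatory options: flag and value are separate arguments.
    args = [executable, "-n", str(n), "-p", str(p)]
    if d is not None:
        # Written with a space, mirroring the DNA examples (-d 20).
        args += ["-d", str(d)]
    args += ["-i", input_file]
    # Optional options: the value is attached to the flag, e.g. -v1.
    for flag, value in (("-o", output), ("-t", threshold),
                        ("-m", max_iters), ("-v", verbose)):
        if value is not None:
            args.append(f"{flag}{value}")
    return args
```

For example, `build_args("public/outseq", 5, 50, "public/test.csv", verbose=1)` ends with the single argument `-v1`, and the resulting list can be passed to `subprocess.run`.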


-----For the DNA dataset
1. Edit and run dnaClusterPoints.sh (o=relative path of output file, s=size of DNA strand, c=number of clusters, p=number of points per cluster), located in DataGeneratorScripts. This creates the corresponding CSV file in the input directory inside DataGeneratorScripts. dnaTotPoints.sh can instead generate data points given just the total number of points, but we do not use it for testing.
2. Place the executables (e.g. dnampi) in the public directory on the GHC machines.
3. Go to your home directory on the GHC machines (one which contains public, private directories etc.) and run the following commands from there:

For running sequential code:
	time ./public/dnaseq -n $n -p $p -d $d -i $inputFile
where $n is the number of clusters, $p is the number of points, $d is the size of DNA strands and $inputFile is the input file containing the DNA strands.
For example,
	time ./public/dnaseq -n 5 -p 50 -d 20 -i public/test.csv
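
K-means on DNA strands needs a distance between strands and a way to recompute a centroid from its members. This README does not say which ones dnaseq uses, so the Python sketch below simply assumes equal-length strands over the bases A, C, G, T, Hamming distance (number of differing positions), and a centroid formed from the most common base at each position. The function names are hypothetical.

```python
from collections import Counter


def hamming(a, b):
    """Number of positions at which two equal-length strands differ."""
    if len(a) != len(b):
        raise ValueError("strands must have equal length")
    return sum(x != y for x, y in zip(a, b))


def consensus(strands):
    """Centroid of a cluster: the most common base at each position.

    Ties are broken alphabetically so the result is deterministic.
    """
    result = []
    for column in zip(*strands):
        counts = Counter(column)
        best = max(counts.values())
        result.append(min(b for b, n in counts.items() if n == best))
    return "".join(result)


def nearest_centroid(strand, centroids):
    """Index of the centroid with the smallest Hamming distance."""
    return min(range(len(centroids)),
               key=lambda c: hamming(strand, centroids[c]))
```

With these two pieces, the assignment and update steps of k-means carry over from the 2D case unchanged; only the distance and the averaging differ.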

For running the MPI code:
	time /usr/lib64/openmpi/bin/mpirun --mca btl_tcp_if_include eth0 -np $numProcs -machinefile $machFile public/dnampi -n $n -p $p -d $dim -i $inputFile
where $numProcs is the number of MPI processes to launch (worker nodes), $machFile is the machines file, $dim is the size of the DNA strands, and $n, $p and $inputFile are as above.
For example:
	/usr/lib64/openmpi/bin/mpirun --mca btl_tcp_if_include eth0 -np 4 -machinefile public/machines public/dnampi -n 5 -p 50 -d 20 -i public/test.csv

The available options for execution are:
-i : input file path (M)
-n : number of clusters (M)
-p : number of points (M)
-d : number of dimensions of data (O for 2D dataset, M for DNA dataset). (Default: 2 for 2D dataset)
-o : output file (O)
-t : threshold for stopping (double) (O) (default:0.001)
-m : maximum number of iterations (O) (default:100000)
-v : print time calculated by our program (O). 1 for printing time, 0 for not printing time (default:0)

where M: mandatory, O: optional
Please note that in our implementation, optional parameters must be supplied without a space between the option and its value, e.g., -v1.
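
For quick experiments without the generator scripts, clustered DNA test data can be approximated in memory. The Python sketch below is not a port of dnaClusterPoints.sh, whose method and CSV layout are not described here. It assumes one random center strand per cluster and produces each point by replacing bases of the center with random ones at a given rate. The function name generate_strands and the mutation_rate parameter are hypothetical; the other parameters correspond to the c, p and s settings of the script.

```python
import random

BASES = "ACGT"


def generate_strands(num_clusters, points_per_cluster, strand_size,
                     mutation_rate=0.1, seed=0):
    """Return num_clusters * points_per_cluster strands of length strand_size."""
    rng = random.Random(seed)
    strands = []
    for _ in range(num_clusters):
        # One random center strand per cluster.
        center = [rng.choice(BASES) for _ in range(strand_size)]
        for _ in range(points_per_cluster):
            # Each base is redrawn at random with probability mutation_rate.
            strand = [
                rng.choice(BASES) if rng.random() < mutation_rate else base
                for base in center
            ]
            strands.append("".join(strand))
    return strands
```

The total number of strands returned is the value to pass as -p, and strand_size is the value to pass as -d.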
