abhowmick22/DistributedKMeans
FILE AND DIRECTORY ORGANIZATION:

1. D2Impl: Contains the source files for the 2D dataset code.
2. DNAImpl: Contains the source files for the DNA strand dataset code.
3. Makefile (within the D2Impl and DNAImpl directories): Contains the make commands for building the sequential and MPI implementations.
4. kmeans.sh: Located in the DataSetGenerator directory; generates the 2D dataset.
5. dnaClusterPoints.sh: Located in the DataSetGenerator directory; generates the DNA dataset given the number of clusters and the number of points per cluster.
6. dnaTotPoints.sh: Located in the DataSetGenerator directory; generates the DNA dataset given the total number of points.
7. machines: Specifies the machines on which to run MPI.
8. Lab4Report.pdf: The lab report describing the implementation and analysis.

COMPILE:

-----For the 2D dataset, enter the D2Impl directory and run:

    make outmpi
    make outseq

These commands create the executables outmpi (uses OpenMPI) and outseq (sequential code).

-----For the DNA dataset, enter the DNAImpl directory and run:

    make dnampi
    make dnaseq

These commands create the executables dnampi (uses OpenMPI) and dnaseq (sequential code).

EXECUTE:

-----For the 2D dataset:

1. Edit and run kmeans.sh (b=points per cluster, k=number of clusters), located in DataGeneratorScripts. This creates the corresponding CSV file in the input directory inside the DataSetGenerator directory.
2. Place the executables (e.g. outmpi) in the public directory on the GHC machines.
3. Go to your home directory on the GHC machines (the one that contains the public and private directories, etc.) and run the following commands from there.

For running the sequential code:

    time ./public/outseq -n $n -p $p -i $inputFile

where $n is the number of clusters, $p is the number of points, and $inputFile is the input file containing the 2D points. For example:

    time ./public/outseq -n 5 -p 50 -i public/test.csv

For running the MPI code:

    time /usr/lib64/openmpi/bin/mpirun --mca btl_tcp_if_include eth0 -np $numProcs -machinefile $machFile public/outmpi -n $n -p $p -i $inputFile

where $numProcs is the number of processors (worker nodes), $machFile is the machines file, and $n, $p and $inputFile are as above. For example:

    /usr/lib64/openmpi/bin/mpirun --mca btl_tcp_if_include eth0 -np 4 -machinefile public/machines public/outmpi -n 5 -p 50 -i public/test.csv

The available options for execution are:

    -i : input file path (M)
    -n : number of clusters (M)
    -p : number of points (M)
    -d : number of dimensions of the data (O for the 2D dataset, M for the DNA dataset) (default: 2 for the 2D dataset)
    -o : output file (O)
    -t : threshold for stopping (double) (O) (default: 0.001)
    -m : maximum number of iterations (O) (default: 100000)
    -v : print the time calculated by our program (O); 1 prints the time, 0 does not (default: 0)

where M: mandatory, O: optional

Please note that in our implementation, we had to supply optional parameters without a space between the option and the parameter, e.g., -v1.

-----For the DNA dataset:

1. Edit and run dnaClusterPoints.sh (o=relative path of output file, s=size of DNA strand, c=number of clusters, p=number of points per cluster), located in DataGeneratorScripts. This creates the corresponding CSV file in the input directory inside the DataSetGenerator directory. dnaTotPoints.sh can also generate data points given just the total number of points, but we do not use it for testing.
2. Place the executables (e.g. dnampi) in the public directory on the GHC machines.
3. Go to your home directory on the GHC machines (the one that contains the public and private directories, etc.) and run the following commands from there.

For running the sequential code:

    time ./public/dnaseq -n $n -p $p -d $d -i $inputFile

where $n is the number of clusters, $p is the number of points, $d is the size of the DNA strands, and $inputFile is the input file containing the DNA strands. For example:

    time ./public/dnaseq -n 5 -p 50 -d 20 -i public/test.csv

For running the MPI code:

    time /usr/lib64/openmpi/bin/mpirun --mca btl_tcp_if_include eth0 -np $numProcs -machinefile $machFile public/dnampi -n $n -p $p -d $dim -i $inputFile

where $numProcs is the number of processors (worker nodes), $machFile is the machines file, $dim is the size of the DNA strands, and $n, $p and $inputFile are as above. For example:

    /usr/lib64/openmpi/bin/mpirun --mca btl_tcp_if_include eth0 -np 4 -machinefile public/machines public/dnampi -n 5 -p 50 -d 20 -i public/test.csv

The available options, and the note about supplying optional parameters without a space (e.g., -v1), are the same as for the 2D dataset above. For the DNA dataset, -d is mandatory.
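
The MPI invocations above are long and differ only in the executable and the presence of -d, so it can help to assemble them in one place before launching a job. The following is a minimal sketch, not part of the repository: the function name build_mpi_cmd and its argument order are hypothetical, and it only prints the command line rather than running it. The mpirun path and flags are the ones given above.

```shell
#!/bin/sh
# Hypothetical helper (not in the repository): prints, but does not execute,
# the mpirun command line described in this README.
# Usage: build_mpi_cmd EXE NUM_PROCS MACHINE_FILE CLUSTERS POINTS INPUT [DIM]
build_mpi_cmd() {
    exe=$1; num_procs=$2; mach_file=$3; clusters=$4; points=$5; input=$6
    dim=${7:-}   # only needed for the DNA dataset

    cmd="/usr/lib64/openmpi/bin/mpirun --mca btl_tcp_if_include eth0"
    cmd="$cmd -np $num_procs -machinefile $mach_file $exe"
    cmd="$cmd -n $clusters -p $points"
    if [ -n "$dim" ]; then
        cmd="$cmd -d $dim"
    fi
    # -v1 is written without a space, following the note on optional parameters.
    cmd="$cmd -i $input -v1"
    printf '%s\n' "$cmd"
}

# 2D dataset: 4 processes, 5 clusters, 50 points.
build_mpi_cmd public/outmpi 4 public/machines 5 50 public/test.csv
# DNA dataset: same settings, with strands of length 20.
build_mpi_cmd public/dnampi 4 public/machines 5 50 public/test.csv 20
```

Printing the command first makes it easy to check that the DNA run uses dnampi and passes -d before any time is spent on the cluster. To actually launch the job, the printed line can be pasted into the shell on a GHC machine.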
KMeans implementation for point / DNA Strand clustering using OpenMPI