Parallel PAM – a PAM implementation using Parallel Python

pyCluster is a Python implementation for clustering algorithms, including PAM and CLARA, which are widely used in Data Mining. To better the performance of PAM, parallel PAM is designed and implemented using Parallel Python. The experiment shows that parallel PAM, running in 4-CPU system using 4 working tasks, doubles the performance of serial PAM.

1. PAM

1. Initialize: randomly select k of the n data points as the medoids

2. Associate each data point to the closest medoid. (“closest” here is defined using any valid distance metric, most commonly Euclidean distance, Manhattan distance or Minkowski distance)

3. For each medoid m

For each non-medoid data point o

Swap m and o and compute the total cost of the configuration

4. Select the configuration with the lowest cost.

5. repeat steps 2 to 4 until there is no change in the medoid.

2. Parallel PAM

1. Initialize: randomly select k of the n data points as the medoids

2. Associate each data point to the closest medoid. (“closest” here is defined using any valid distance metric, most commonly Euclidean distance, Manhattan distance or Minkowski distance)

3. For each medoid m

parallel task running here using parallel Python

4. Select the configuration with the lowest cost.

5. repeat steps 2 to 4 until there is no change in the medoid.

3. Performance Comparison

For parallel PAM, there is a complex relationship among the number of medoids (k), the number of CPUs (c) and the number of working tasks (t). To make the performance comparison straightforward, we choose k=c=t, assuming that each CPU will be occupied by each working task for each medoid processing. The testing commands are like below:

python pam_parallel.py euroTry.txt 4 4 > euroTry_parallel_4_4.log
python pam.py euroTry.txt 4 > euroTry_4.log

And time consumed (seconds):

parallel PAM: 89.80295515060425
serial PAM: 172.6653060913086

4. Pitfalls in Parallel Python

To get the real parallel thing working, all the jobs in parallel Python should be submitted to the job servers accordingly without calling job(), which will try to wait for the result for each job and hence block the parallel running.

Project Name: pyCluster
Destination: Python Clustering
Language: Python
IDE: Vim
Library:
Project Web: https://github.com/daveti/pycluster
Git Read Only: https://github.com/daveti/pycluster.git

About daveti

Interested in kernel hacking, compilers, machine learning and guitars.

View all posts by daveti →

Parallel PAM – a PAM implementation using Parallel Python

About daveti

1 Response to Parallel PAM – a PAM implementation using Parallel Python

Leave a comment Cancel reply

Categories

Tags

Blog Stats