Parallel PAM – a PAM implementation using Parallel Python

pyCluster is a Python implementation for clustering algorithms, including PAM and CLARA, which are widely used in Data Mining. To better the performance of PAM, parallel PAM is designed and implemented using Parallel Python. The experiment shows that parallel PAM, running in 4-CPU system using 4 working tasks, doubles the performance of serial PAM.

1. PAM

1. Initialize: randomly select k of the n data points as the medoids
2. Associate each data point to the closest medoid. (“closest” here is defined using any valid distance metric, most commonly Euclidean distance, Manhattan distance or Minkowski distance)
3. For each medoid m
        For each non-medoid data point o
            Swap m and o and compute the total cost of the configuration
4. Select the configuration with the lowest cost.
5. repeat steps 2 to 4 until there is no change in the medoid.

 
2. Parallel PAM

1. Initialize: randomly select k of the n data points as the medoids
2. Associate each data point to the closest medoid. (“closest” here is defined using any valid distance metric, most commonly Euclidean distance, Manhattan distance or Minkowski distance)
3. For each medoid m
        parallel task running here using parallel Python
4. Select the configuration with the lowest cost.
5. repeat steps 2 to 4 until there is no change in the medoid.

 
3. Performance Comparison

For parallel PAM, there is a complex relationship among the number of medoids (k), the number of CPUs (c) and the number of working tasks (t). To make the performance comparison straightforward, we choose k=c=t, assuming that each CPU will be occupied by each working task for each medoid processing. The testing commands are like below:

python pam_parallel.py euroTry.txt 4 4 > euroTry_parallel_4_4.log
python pam.py euroTry.txt 4 > euroTry_4.log

And time consumed (seconds):

parallel PAM: 89.80295515060425
serial PAM: 172.6653060913086

4. Pitfalls in Parallel Python

To get the real parallel thing working, all the jobs in parallel Python should be submitted to the job servers accordingly without calling job(), which will try to wait for the result for each job and hence block the parallel running.

Project Name: pyCluster
Destination: Python Clustering
Language: Python
IDE: Vim
Library:
Project Web: https://github.com/daveti/pycluster
Git Read Only: https://github.com/daveti/pycluster.git

About daveti

Interested in kernel hacking, compilers, machine learning and guitars.
This entry was posted in AI/ML, Dave's Tools, Programming and tagged , , , . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s