pyCluster is a Python implementation for clustering algorithms, including PAM and CLARA, which are widely used in Data Mining. To better the performance of PAM, parallel PAM is designed and implemented using Parallel Python. The experiment shows that parallel PAM, running in 4-CPU system using 4 working tasks, doubles the performance of serial PAM.
2. Parallel PAM
3. Performance Comparison
For parallel PAM, there is a complex relationship among the number of medoids (k), the number of CPUs (c) and the number of working tasks (t). To make the performance comparison straightforward, we choose k=c=t, assuming that each CPU will be occupied by each working task for each medoid processing. The testing commands are like below:
python pam_parallel.py euroTry.txt 4 4 > euroTry_parallel_4_4.log
python pam.py euroTry.txt 4 > euroTry_4.log
And time consumed (seconds):
parallel PAM: 89.80295515060425
serial PAM: 172.6653060913086
4. Pitfalls in Parallel Python
To get the real parallel thing working, all the jobs in parallel Python should be submitted to the job servers accordingly without calling job(), which will try to wait for the result for each job and hence block the parallel running.