As a graduate student supervised by Allan Gottlieb, I investigated support for scalable inter-process coordination on shared-memory MIMD systems. My contributions include detection and analysis of problems in architectures that implmement hardware combining. I propose design modifications that significantly mitigate these effects. I also have contributed centralized algorithms that have lower synchronization latency than those previously known (and superior to commonly used alternatives).
Below I describe two aspects of this research.
As is typical for network routers, the original combining switch included large capacity queues in order to support high bandwidth memory reference patterns that do not include hot spot traffic. As has been observed for flash-crowd network access patterns, I observed that hot spot memory references filled these queues despite the availability of combining, resulting in large queuing delays and low rates of combining in switches whose queues were not full. I modified this switch design to adaptively reduce queue capacity when combining occurs, resulting in lower queuing delays, higher rates of combining in other switches, and higher effective ``parallelism'' represented by each memory reference. Systems built using this {\em adaptive} design have substantially reduced memory latency for hot spot accesses while maintaining the high bandwidth for uniform loads.\cite{sharedMemWorkshop, ipdpsUltra}.
I also investigated the performance of centralized coordination algorithms designed for systems that dynamically combine references, and compared their performance to the more commonly used algorithms of Mellor-Crummey and Scott. These algorithms, which benefit from the availability of processor-local shared NUMA memory, minimize hot-spot congestion by instead statically assigning a separate busy-wait variable to each processor.
I discovered several centralized serialization-free algorithms for readers-writers and barrier coordination that, when no busy-waiting is required by the problem specification, generate fewer shared memory references than those previously known. When executed on systems with adaptive combining queues, these coordination algorithms have performance competitive with or superior to the commonly used alternatives.
More details are available in Technical Report TR2003-849 and my full dissertation is available from the NYU Computer Science department web site.