Abstract
This paper describes a new parallel execution model motivated by 1) the idea that computation should move to, and execute near, the global data it accesses, 2) a set of extended memory semantics that provide fine-grained global synchronization, 3) matching shared-memory architecture research, and 4) the need for high-performance languages to provide protected system transparency. We compare this new model to MPI, Chapel, X10, and UPC in terms of 1) expressiveness of parallel structures, 2) shared-memory synchronization, and 3) performance tuning. Initial simulation results of a graph traversal kernel on a research architecture show good speedup up to 256 multicore nodes while supporting over 1 million simultaneous threadlets.