Mikhail Smelyanskiy

Dhiraj D. Kalamkar

Md. Mostofa Ali Patwary

Int. J. High Perform. Comput. Appl., 2016

Scaling up Hartree-Fock calculations on Tianhe-2.

Int. J. High Perform. Comput. Appl., 2016

qHiPSTER: The Quantum High Performance Software Testing Environment.

Nicolas P. D. Sawaya

Alán Aspuru-Guzik

CoRR, 2016

Large Scale Distributed Hessian-Free Optimization for Deep Neural Network.

CoRR, 2016

High performance emulation of quantum circuits.

Proceedings of the International Conference for High Performance Computing, 2016

High Performance Parallel Stochastic Gradient Descent in Shared Memory.

Proceedings of the 2016 IEEE International Parallel and Distributed Processing Symposium, 2016

Sparso: Context-driven Optimizations of Sparse Linear Algebra.

Proceedings of the 2016 International Conference on Parallel Architectures and Compilation, 2016

2015

Can traditional programming bridge the ninja performance gap for parallel computing applications?

Commun. ACM, 2015

High-performance algebraic multigrid solver optimized for multi-core based distributed parallel systems.

Proceedings of the International Conference for High Performance Computing, 2015

Exploring Shared-Memory Optimizations for an Unstructured Mesh CFD Application on Modern Parallel Systems.

Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium, 2015

2014

Sparsifying Synchronization for High-Performance Shared-Memory Sparse Triangular Solver.

Proceedings of the Supercomputing - 29th International Conference, 2014

Efficient Shared-Memory Implementation of High-Performance Conjugate Gradient Benchmark and its Application to Unstructured Matrices.

Jongsoo Park

Dhiraj D. Kalamkar

Xing Liu

Md. Mostofa Ali Patwary

Yutong Lu

Proceedings of the International Conference for High Performance Computing, 2014

Lattice QCD with Domain Decomposition on Intel® Xeon Phi Co-Processors.

Tilo Wettig

Proceedings of the International Conference for High Performance Computing, 2014

Petascale High Order Dynamic Rupture Earthquake Simulations on Heterogeneous Supercomputers.

Alexander Breuer

Sebastian Rettenberger

Proceedings of the International Conference for High Performance Computing, 2014

Improving Communication Performance and Scalability of Native Applications on Intel Xeon Phi Coprocessor Clusters.

Proceedings of the 2014 IEEE 28th International Parallel and Distributed Processing Symposium, 2014

Anatomy of High-Performance Many-Threaded Matrix Multiplication.

Tyler M. Smith

Robert A. van de Geijn

Jeff R. Hammond

Field G. Van Zee

Proceedings of the 2014 IEEE 28th International Parallel and Distributed Processing Symposium, 2014

2013

Efficient backprojection-based synthetic aperture radar computation with many-core processors.

Sci. Program., 2013

Lattice QCD on Intel® Xeon Phi™ Coprocessors.

Bálint Joó

Dhiraj D. Kalamkar

William A. Watson III

Proceedings of the Supercomputing - 28th International Supercomputing Conference, 2013

Exploring SIMD for Molecular Dynamics, Using Intel® Xeon® Processors and Intel® Xeon Phi Coprocessors.

Simon J. Pennycook

Christopher J. Hughes

Stephen A. Jarvis

Proceedings of the 27th IEEE International Symposium on Parallel and Distributed Processing, 2013

Design and Implementation of the Linpack Benchmark for Single and Multi-node Systems Based on Intel® Xeon Phi Coprocessor.

Proceedings of the 27th IEEE International Symposium on Parallel and Distributed Processing, 2013

Efficient sparse matrix-vector multiplication on x86-based many-core processors.

Proceedings of the International Conference on Supercomputing, 2013

2012

Optimization of geometric multigrid for emerging multi- and manycore processors.

Proceedings of the SC Conference on High Performance Computing Networking, 2012

Analysis and Optimization of Financial Analytics Benchmark on Modern Multi- and Many-core IA-Based Architectures.

Proceedings of the 2012 SC Companion: High Performance Computing, 2012

Improving the Performance of Dynamical Simulations Via Multiple Right-Hand Sides.

Xing Liu

Edmond Chow

Proceedings of the 26th IEEE International Parallel and Distributed Processing Symposium, 2012

High Performance Non-uniform FFT on Modern X86-based Multi-core Systems.

Proceedings of the 26th IEEE International Parallel and Distributed Processing Symposium, 2012

2011

High-Performance 3D Compressive Sensing MRI Reconstruction Using Many-Core Architectures.

Int. J. Biomed. Imaging, 2011

Designing and dynamically load balancing hybrid LU for multi/many-core.

Comput. Sci. Res. Dev., 2011

High-performance lattice QCD for multi-core based parallel systems using a cache-friendly hybrid threaded-MPI approach.

Proceedings of the Conference on High Performance Computing Networking, 2011

2010

Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU.

Proceedings of the 37th International Symposium on Computer Architecture (ISCA 2010), 2010

2009

Mapping High-Fidelity Volume Rendering for Medical Imaging to CPU, GPU and Many-Core Architectures.

IEEE Trans. Vis. Comput. Graph., 2009

2008

Convergence of Recognition, Mining, and Synthesis Workloads and Its Implications.

Yen-Kuang Chen

Jatin Chhugani

Christopher J. Hughes

Proc. IEEE, 2008

An algorithm for the fast solution of symmetric linear complementarity problems.

José Luis Morales

Jorge Nocedal

Numerische Mathematik, 2008

Atomic Vector Operations on Chip Multiprocessors.

Christopher J. Hughes

Changkyu Kim

Victor W. Lee

Anthony D. Nguyen

Proceedings of the 35th International Symposium on Computer Architecture (ISCA 2008), 2008

2007

Scaling performance of interior-point method on large-scale chip multiprocessor system.

Proceedings of the ACM/IEEE Conference on High Performance Networking and Computing, 2007

2004

Hardware/software mechanisms for increasing resource utilization on VLIW/EPIC processors.

PhD thesis, 2004

Probabilistic Predicate-Aware Modulo Scheduling.

Scott A. Mahlke

Edward S. Davidson

Proceedings of the 2nd IEEE / ACM International Symposium on Code Generation and Optimization (CGO 2004), 2004

2003

Predicate-Aware Scheduling: A Technique for Reducing Resource Constraints.

Proceedings of the 1st IEEE / ACM International Symposium on Code Generation and Optimization (CGO 2003), 2003

Systematic Register Bypass Customization for Application-Specific Processors.

Proceedings of the 14th IEEE International Conference on Application-Specific Systems, 2003

2001

Stack Value File: Custom Microarchitecture for the Stack.

Proceedings of the Seventh International Symposium on High-Performance Computer Architecture (HPCA'01), 2001

2000
