Async Learned User Embeddings for Ads Delivery Optimization.
CoRR, 2024
Enhancing Performance and Scalability of Large-Scale Recommendation Systems with Jagged Flash Attention.
Proceedings of the 18th ACM Conference on Recommender Systems, 2024
Densifying Assumed-Sparse Tensors - Improving Memory Efficiency and MPI Collective Performance During Tensor Accumulation for Parallelized Training of Neural Machine Translation Models.
Proceedings of the High Performance Computing - 34th International Conference, 2019
The OpenACC data model: Preliminary study on its major challenges and implementations.
Parallel Comput., 2018
Deep Learning at Scale on NVIDIA V100 Accelerators.
Proceedings of the 2018 IEEE/ACM Performance Modeling, 2018
Implementing the OpenACC Data Model.
Proceedings of the 2017 IEEE International Parallel and Distributed Processing Symposium Workshops, 2017
Compiler transformation of nested loops for general purpose GPUs.
Concurr. Comput. Pract. Exp., 2016
An Analytical Model-Based Auto-tuning Framework for Locality-Aware Loop Scheduling.
Proceedings of the High Performance Computing - 31st International Conference, 2016
Optimizing GPU Register Usage: Extensions to OpenACC and Compiler Optimizations.
Proceedings of the 45th International Conference on Parallel Processing, 2016
Multi-GPU Support on Single Node Using Directive-Based Programming Model.
Sci. Program., 2015
Accelerating Kirchhoff migration on GPU using directives.
Proceedings of the First Workshop on Accelerator Programming using Directives, 2014
SPEC ACCEL: A Standard Application Suite for Measuring Hardware Accelerator Performance.
Proceedings of the High Performance Computing Systems. Performance Modeling, Benchmarking, and Simulation, 2014
Reduction Operations in Parallel Loops for GPGPUs.
Proceedings of the 2014 PPOPP International Workshop on Programming Models and Applications for Multicores and Manycores, 2014
NAS Parallel Benchmarks for GPGPUs Using a Directive-Based Programming Model.
Proceedings of the Languages and Compilers for Parallel Computing, 2014
A Validation Testsuite for OpenACC 1.0.
Proceedings of the 2014 IEEE International Parallel & Distributed Processing Symposium Workshops, 2014
Compiling a High-Level Directive-Based Programming Model for GPGPUs.
Proceedings of the Languages and Compilers for Parallel Computing, 2013
Exploring Programming Multi-GPUs Using OpenMP and OpenACC-Based Hybrid Model.
Proceedings of the 2013 IEEE International Symposium on Parallel & Distributed Processing, 2013
Filesystem Aware Scalable I/O Framework for Data-Intensive Parallel Applications.
Proceedings of the 2013 IEEE International Symposium on Parallel & Distributed Processing, 2013