Aamir Shafi

Orcid: 0000-0002-1924-2769

According to our database, Aamir Shafi authored at least 80 papers between 2003 and 2024.

Collaborative distances:
  • Dijkstra number of four.
  • Erdős number of four.

Bibliography

2024
Training Ultra Long Context Language Model with Fully Pipelined Distributed Transformer.
CoRR, 2024

Accelerating communication with multi-HCA aware collectives in MPI.
Concurr. Comput. Pract. Exp., 2024

Infer-HiRes: Accelerating Inference for High-Resolution Images with Quantization and Distributed Deep Learning.
Proceedings of the Practice and Experience in Advanced Research Computing 2024: Human Powered Computing, 2024

Accelerating MPI AllReduce Communication with Efficient GPU-Based Compression Schemes on Modern GPU Clusters.
Proceedings of the ISC High Performance 2024 Research Paper Proceedings (39th International Conference), 2024

Exploiting Inter-Layer Expert Affinity for Accelerating Mixture-of-Experts Model Inference.
Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2024

PML-MPI: A Pre-Trained ML Framework for Efficient Collective Algorithm Selection in MPI.
Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2024

HINT: Designing Cache-Efficient MPI_Alltoall using Hybrid Memory Copy Ordering and Non-Temporal Instructions.
Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2024

The Case for Co-Designing Model Architectures with Hardware.
Proceedings of the 53rd International Conference on Parallel Processing, 2024

Demystifying the Communication Characteristics for Distributed Transformer Models.
Proceedings of the IEEE Symposium on High-Performance Interconnects, 2024

Characterizing Communication in Distributed Parameter-Efficient Fine-Tuning for Large Language Models.
Proceedings of the IEEE Symposium on High-Performance Interconnects, 2024

Accelerating Large Language Model Training with Hybrid GPU-based Compression.
Proceedings of the 24th IEEE International Symposium on Cluster, 2024

2023
High Performance MPI over the Slingshot Interconnect.
J. Comput. Sci. Technol., February, 2023

Network-Assisted Noncontiguous Transfers for GPU-Aware MPI Libraries.
IEEE Micro, 2023

Performance Characterization of using Quantization for DNN Inference on Edge Devices: Extended Version.
CoRR, 2023

Accelerating Distributed Deep Learning Training with Compression Assisted Allgather and Reduce-Scatter Communication.
Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2023

A Novel Framework for Efficient Offloading of Communication Operations to Bluefield SmartNICs.
Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2023

In-Depth Evaluation of a Lower-Level Direct-Verbs API on InfiniBand-based Clusters: Early Experiences.
Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2023

MCR-DL: Mix-and-Match Communication Runtime for Deep Learning.
Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2023

Performance Characterization of Using Quantization for DNN Inference on Edge Devices.
Proceedings of the 7th IEEE International Conference on Fog and Edge Computing, 2023

Designing In-network Computing Aware Reduction Collectives in MPI.
Proceedings of the IEEE Symposium on High-Performance Interconnects, 2023

Flover: A Temporal Fusion Framework for Efficient Autoregressive Model Parallel Inference.
Proceedings of the 30th IEEE International Conference on High Performance Computing, 2023

Implementing and Optimizing a GPU-aware MPI Library for Intel GPUs: Early Experiences.
Proceedings of the 23rd IEEE/ACM International Symposium on Cluster, 2023

ScaMP: Scalable Meta-Parallelism for Deep Learning Search.
Proceedings of the 23rd IEEE/ACM International Symposium on Cluster, 2023

HARVEST: High-Performance Artificial Vision Framework for Expert Labeling using Semi-Supervised Training.
Proceedings of the IEEE International Conference on Big Data, 2023

MPI4Spark Meets YARN: Enhancing MPI4Spark through YARN support for HPC.
Proceedings of the IEEE International Conference on Big Data, 2023

2022
Optimizing Distributed DNN Training Using CPUs and BlueField-2 DPUs.
IEEE Micro, 2022

High Performance MPI over the Slingshot Interconnect: Early Experiences.
Proceedings of the PEARC '22: Practice and Experience in Advanced Research Computing, Boston, MA, USA, July 10, 2022

Accelerating MPI All-to-All Communication with Online Compression on Modern GPU Clusters.
Proceedings of the High Performance Computing - 37th International Conference, 2022

"Hey CAI" - Conversational AI Enabled User Interface for HPC Tools.
Proceedings of the High Performance Computing - 37th International Conference, 2022

Hy-Fi: Hybrid Five-Dimensional Parallel DNN Training on High-Performance GPU Clusters.
Proceedings of the High Performance Computing - 37th International Conference, 2022

Arm meets Cloud: A Case Study of MPI Library Performance on AWS Arm-based HPC Cloud with Elastic Fabric Adapter.
Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2022

Highly Efficient Alltoall and Alltoallv Communication Algorithms for GPU Systems.
Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2022

OMB-Py: Python Micro-Benchmarks for Evaluating Performance of MPI Libraries on HPC Systems.
Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2022

Towards Java-based HPC using the MVAPICH2 Library: Early Experiences.
Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2022

Designing Hierarchical Multi-HCA Aware Allgather in MPI.
Proceedings of the Workshop Proceedings of the 51st International Conference on Parallel Processing, 2022

Network Assisted Non-Contiguous Transfers for GPU-Aware MPI Libraries.
Proceedings of the IEEE Symposium on High-Performance Interconnects, 2022

Accelerating Broadcast Communication with GPU Compression for Deep Learning Workloads.
Proceedings of the 29th IEEE International Conference on High Performance Computing, 2022

Efficient Personalized and Non-Personalized Alltoall Communication for Modern Multi-HCA GPU-Based Clusters.
Proceedings of the 29th IEEE International Conference on High Performance Computing, 2022

Designing Efficient Pipelined Communication Schemes using Compression in MPI Libraries.
Proceedings of the 29th IEEE International Conference on High Performance Computing, 2022

AccDP: Accelerated Data-Parallel Distributed DNN Training for Modern GPU-Based HPC Clusters.
Proceedings of the 29th IEEE International Conference on High Performance Computing, 2022

Spark Meets MPI: Towards High-Performance Communication Framework for Spark using MPI.
Proceedings of the IEEE International Conference on Cluster Computing, 2022

2021
INAM: Cross-stack Profiling and Analysis of Communication in MPI-based Applications.
Proceedings of the PEARC '21: Practice and Experience in Advanced Research Computing, 2021

Accelerating CPU-based Distributed DNN Training on Modern HPC Clusters using BlueField-2 DPUs.
Proceedings of the IEEE Symposium on High-Performance Interconnects, 2021

Layout-aware Hardware-assisted Designs for Derived Data Types in MPI.
Proceedings of the 28th IEEE International Conference on High Performance Computing, 2021

Towards Architecture-aware Hierarchical Communication Trees on Modern HPC Systems.
Proceedings of the 28th IEEE International Conference on High Performance Computing, 2021

Efficient MPI-based Communication for GPU-Accelerated Dask Applications.
Proceedings of the 21st IEEE/ACM International Symposium on Cluster, 2021

2020
Accelerating GPU-based Machine Learning in Python using MPI Library: A Case Study with MVAPICH2-GDR.
Proceedings of the 6th IEEE/ACM Workshop on Machine Learning in High Performance Computing Environments, 2020

Blink: Towards Efficient RDMA-based Communication Coroutines for Parallel Python Applications.
Proceedings of the 27th IEEE International Conference on High Performance Computing, 2020

2019
Student Outcomes Assessment Methodology for ABET Accreditation: A Case Study of Computer Science and Computer Information Systems Programs.
IEEE Access, 2019

2018
Parameter estimation of qualitative biological regulatory networks on high performance computing hardware.
BMC Syst. Biol., 2018

Performance Comparison of a Parallel Recommender Algorithm Across Three Hadoop-Based Frameworks.
Proceedings of the 30th International Symposium on Computer Architecture and High Performance Computing, 2018

2016
An efficient schedulability condition for non-preemptive real-time systems at common scheduling points.
J. Supercomput., 2016

Towards Scalable Java HPC with Hybrid and Native Communication Devices in MPJ Express.
Int. J. Parallel Program., 2016

2015
Virtual TCAM for Data Center switches.
Proceedings of the IEEE Conference on Network Function Virtualization and Software Defined Networks, 2015

MPJ Express Meets YARN: Towards Java HPC on Hadoop Systems.
Proceedings of the International Conference on Computational Science, 2015

2014
Design and Implementation of Parallel Debugger and Profiler for MPJ Express.
CoRR, 2014

Teaching parallel programming using Java.
Proceedings of the Workshop on Education for High-Performance Computing, 2014

Design and Implementation of Hybrid and Native Communication Devices for Java HPC.
Proceedings of the International Conference on Computational Science, 2014

High Performance Message-passing InfiniBand Communication Device for Java HPC.
Proceedings of the International Conference on Computational Science, 2014

2013
An architectural evaluation of SDN controllers.
Proceedings of IEEE International Conference on Communications, 2013

An MPI-IO Compliant Java Based Parallel I/O Library.
Proceedings of the 13th IEEE/ACM International Symposium on Cluster, 2013

2012
Memory-mapping support for reducer hyperobjects.
Proceedings of the 24th ACM Symposium on Parallelism in Algorithms and Architectures, 2012

Towards Efficient Support for Parallel I/O in Java HPC.
Proceedings of the 13th International Conference on Parallel and Distributed Computing, 2012

High performance Java sockets (HPJS) for scientific health clouds.
Proceedings of the IEEE 14th International Conference on e-Health Networking, 2012

2011
Collective Asynchronous Remote Invocation (CARI): A High-Level and Efficient Communication API for Irregular Applications.
Proceedings of the International Conference on Computational Science, 2011

Device level communication libraries for high-performance computing in Java.
Concurr. Comput. Pract. Exp., 2011

2010
Multicore-enabling the MPJ express messaging library.
Proceedings of the 8th International Conference on Principles and Practice of Programming in Java, 2010

2009
Nested parallelism for multi-core HPC systems using Java.
J. Parallel Distributed Comput., 2009

A comparative study of Java and C performance in two large-scale parallel applications.
Concurr. Comput. Pract. Exp., 2009

Towards efficient shared memory communications in MPJ express.
Proceedings of the 23rd IEEE International Symposium on Parallel and Distributed Processing, 2009

2008
A parallel implementation of the Finite-Domain Time-Difference algorithm using MPJ express.
Proceedings of the 22nd IEEE International Symposium on Parallel and Distributed Processing, 2008

2007
A Buffering Layer to Support Derived Types and Proprietary Networks for Java HPC.
Scalable Comput. Pract. Exp., 2007

2006
Nested parallelism for multi-core systems using Java.
PhD thesis, 2006

MPJ Express Meets Gadget: Towards a Java Code for Cosmological Simulations.
Proceedings of the Recent Advances in Parallel Virtual Machine and Message Passing Interface, 2006

Parallel and Distributed Computing with Java.
Proceedings of the 5th International Symposium on Parallel and Distributed Computing (ISPDC 2006), 2006

An Approach to Buffer Management in Java HPC Messaging.
Proceedings of the Computational Science, 2006

MPJ Express: Towards Thread Safe Java HPC.
Proceedings of the 2006 IEEE International Conference on Cluster Computing, 2006

2005
Cluster Computing and Grid 2005 Works in Progress.
IEEE Distributed Syst. Online, 2005

2003
DIAMOnDS - DIstributed Agents for MObile & Dynamic Services.
CoRR, 2003