Saeed Maleki

ORCID: 0000-0002-7998-3681

According to our database, Saeed Maleki authored at least 39 papers between 2011 and 2024.

Bibliography

2024
Efficient Schedule Construction for Distributed Execution of Large DNN Models.
IEEE Trans. Parallel Distributed Syst., December, 2024

ForestColl: Efficient Collective Communications on Heterogeneous Network Fabrics.
CoRR, 2024

nnScaler: Constraint-Guided Parallelization Plan Generation for Deep Learning Training.
Proceedings of the 18th USENIX Symposium on Operating Systems Design and Implementation, 2024

Splitwise: Efficient Generative LLM Inference Using Phase Splitting.
Proceedings of the 51st ACM/IEEE Annual International Symposium on Computer Architecture, 2024

Tessel: Boosting Distributed Execution of Large DNN Models via Flexible Schedule Search.
Proceedings of the IEEE International Symposium on High-Performance Computer Architecture, 2024

Aceso: Efficient Parallel DNN Training through Iterative Bottleneck Alleviation.
Proceedings of the Nineteenth European Conference on Computer Systems, 2024

A Framework for Fine-Grained Synchronization of Dependent GPU Kernels.
Proceedings of the IEEE/ACM International Symposium on Code Generation and Optimization, 2024

2023
Look-Up mAI GeMM: Increasing AI GeMMs Performance by Nearly 2.5x via msGeMM.
CoRR, 2023

Rethinking Machine Learning Collective Communication as a Multi-Commodity Flow Problem.
CoRR, 2023

SuperScaler: Supporting Flexible DNN Parallelization via a Unified Abstraction.
CoRR, 2023

TACCL: Guiding Collective Algorithm Synthesis using Communication Sketches.
Proceedings of the 20th USENIX Symposium on Networked Systems Design and Implementation, 2023

MSCCLang: Microsoft Collective Communication Language.
Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2023

2022
Error-Covariance Analysis of Monocular Pose Estimation Using Total Least Squares.
CoRR, 2022

Optimal Pose Estimation and Covariance Analysis with Simultaneous Localization and Mapping Applications.
CoRR, 2022

MSCCL: Microsoft Collective Communication Library.
CoRR, 2022

Breaking the computation and communication abstraction barrier in distributed machine learning workloads.
Proceedings of ASPLOS '22: 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Lausanne, Switzerland, 28 February 2022

2021
Synthesizing Collective Communication Algorithms for Heterogeneous Networks with TACCL.
CoRR, 2021

Total Least Squares for Optimal Pose Estimation.
CoRR, 2021

CoCoNet: Co-Optimizing Computation and Communication for Distributed Machine Learning.
CoRR, 2021

Synthesizing optimal collective algorithms.
Proceedings of PPoPP '21: 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2021

Scaling Distributed Training with Adaptive Summation.
Proceedings of the Fourth Conference on Machine Learning and Systems, 2021

Distributed Training of Embeddings using Graph Analytics.
Proceedings of the 35th IEEE International Parallel and Distributed Processing Symposium, 2021

2019
Distributed Word2Vec using Graph Analytics Frameworks.
CoRR, 2019

CHET: an optimizing compiler for fully-homomorphic neural-network inferencing.
Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation, 2019

2018
An empirical study of the effect of source-level loop transformations on compiler stability.
Proc. ACM Program. Lang., 2018

CHET: Compiler and Runtime for Homomorphic Evaluation of Tensor Programs.
CoRR, 2018

Semantics-Preserving Parallelization of Stochastic Gradient Descent.
Proceedings of the 2018 IEEE International Parallel and Distributed Processing Symposium, 2018

2017
Parallel Stochastic Gradient Descent with Sound Combiners.
CoRR, 2017

LORE: A loop repository for the evaluation of compilers.
Proceedings of the 2017 IEEE International Symposium on Workload Characterization, 2017

2016
Low-Rank Methods for Parallelizing Dynamic Programming Algorithms.
ACM Trans. Parallel Comput., 2016

Efficient parallelization using rank convergence in dynamic programming algorithms.
Commun. ACM, 2016

DSMR: a shared and distributed memory algorithm for single-source shortest path problem.
Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2016

DSMR: A Parallel Algorithm for Single-Source Shortest Path Problem.
Proceedings of the 2016 International Conference on Supercomputing, 2016

Parallelizing WFST speech decoders.
Proceedings of the 2016 IEEE International Conference on Acoustics, 2016

2015
Communication avoiding parallel algorithms for amorphous problems.
PhD thesis, 2015

2014
Parallelizing dynamic programming through rank convergence.
Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2014

Tiled Linear Algebra a System for Parallel Graph Algorithms.
Proceedings of the Languages and Compilers for Parallel Computing, 2014

2012
Performance Portability with the Chapel Language.
Proceedings of the 26th IEEE International Parallel and Distributed Processing Symposium, 2012

2011
An Evaluation of Vectorizing Compilers.
Proceedings of the 2011 International Conference on Parallel Architectures and Compilation Techniques, 2011
