2025
Triton-distributed: Programming Overlapping Kernels on Distributed AI Systems with the Triton Compiler.
CoRR, April, 2025

TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives.
CoRR, March, 2025

Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts.
CoRR, February, 2025

2024
ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference.
CoRR, 2024

FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion.
CoRR, 2024

2019
NGEMM: Optimizing GEMM for Deep Learning via Compiler-based Techniques.
CoRR, 2019

2018
Analytical modeling of cache behavior for affine programs.
Proc. ACM Program. Lang., 2018

2017
Efficient Cache Simulation for Affine Computations.
Proceedings of the 30th International Workshop on Languages and Compilers for Parallel Computing, 2017

2016
Static and Dynamic Frequency Scaling on Multicore CPUs.
ACM Trans. Archit. Code Optim., 2016

PolyCheck: dynamic verification of iteration space transformations on affine programs.
Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, 2016

Effective padding of multidimensional arrays to avoid cache conflict misses.
Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation, 2016

2014
PWCET: Power-Aware Worst Case Execution Time Analysis.
Proceedings of the 43rd International Conference on Parallel Processing Workshops, 2014