Triton-distributed: Programming Overlapping Kernels on Distributed AI Systems with the Triton Compiler.
CoRR, April, 2025
TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives.
CoRR, March, 2025
Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts.
CoRR, February, 2025
ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference.
CoRR, 2024
FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion.
CoRR, 2024
NGEMM: Optimizing GEMM for Deep Learning via Compiler-based Techniques.
CoRR, 2019
Analytical modeling of cache behavior for affine programs.
Proc. ACM Program. Lang., 2018
Efficient Cache Simulation for Affine Computations.
Proceedings of the Languages and Compilers for Parallel Computing, 2017
Static and Dynamic Frequency Scaling on Multicore CPUs.
ACM Trans. Archit. Code Optim., 2016
PolyCheck: dynamic verification of iteration space transformations on affine programs.
Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, 2016
Effective padding of multidimensional arrays to avoid cache conflict misses.
Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation, 2016
PWCET: Power-Aware Worst Case Execution Time Analysis.
Proceedings of the 43rd International Conference on Parallel Processing Workshops, 2014