Torsten Hoefler

Orcid: 0000-0002-1333-9797

Affiliations:
  • ETH Zürich


According to our database, Torsten Hoefler authored at least 420 papers between 2005 and 2024.

Awards

ACM Fellow 2022, "For foundational contributions to High-Performance Computing and the application of HPC techniques to machine learning".

Bibliography

2024
AutoDDL: Automatic Distributed Deep Learning With Near-Optimal Bandwidth Cost.
IEEE Trans. Parallel Distributed Syst., August, 2024

Parallel and Distributed Graph Neural Networks: An In-Depth Concurrency Analysis.
IEEE Trans. Pattern Anal. Mach. Intell., May, 2024

Canary: Congestion-aware in-network allreduce using dynamic trees.
Future Gener. Comput. Syst., March, 2024

Demystifying Graph Databases: Analysis and Taxonomy of Data Organization, System Designs, and Graph Queries.
ACM Comput. Surv., February, 2024

A High-Performance, Energy-Efficient Modular DMA Engine Architecture.
IEEE Trans. Computers, January, 2024

Digital twins of Earth and the computing challenge of human interaction.
Nat. Comput. Sci., 2024

RED-SEA Project: Towards a new-generation European interconnect.
Microprocess. Microsystems, 2024

All models are wrong, some are useful: Model Selection with Limited Labels.
CoRR, 2024

Fortify Your Foundations: Practical Privacy and Security for Foundation Model Deployments In The Cloud.
CoRR, 2024

SeBS-Flow: Benchmarking Serverless Cloud Function Workflows.
CoRR, 2024

Exploring GPU-to-GPU Communication: Insights into Supercomputer Interconnects.
CoRR, 2024

Network-Offloaded Bandwidth-Optimal Broadcast and Allgather for Distributed AI.
CoRR, 2024

Hardware Acceleration for Knowledge Graph Processing: Challenges & Recent Developments.
CoRR, 2024

MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models.
CoRR, 2024

Understanding Data Movement in Tightly Coupled Heterogeneous Systems: A Case Study with the Grace Hopper Superchip.
CoRR, 2024

High Performance Unstructured SpMM Computation Using Tensor Cores.
CoRR, 2024

REPS: Recycling Entropies for Packet Spraying to Adaptively Explore Paths and Mitigate Failures.
CoRR, 2024

Demystifying Higher-Order Graph Neural Networks.
CoRR, 2024

Accelerating Graph-based Vector Search via Delayed-Synchronization Traversal.
CoRR, 2024

Multi-Head RAG: Solving Multi-Aspect Problems with LLMs.
CoRR, 2024

CheckEmbed: Effective Verification of LLM Solutions to Open-Ended Tasks.
CoRR, 2024

FPsPIN: An FPGA-based Open-Hardware Research Platform for Processing in the Network.
CoRR, 2024

Towards Specialized Supercomputers for Climate Sciences: Computational Requirements of the Icosahedral Nonhydrostatic Weather and Climate Model.
CoRR, 2024

SpComm3D: A Framework for Enabling Sparse Communication in 3D Sparse Kernels.
CoRR, 2024

LLAMP: Assessing Network Latency Tolerance of HPC Applications with Linear Programming.
CoRR, 2024

SMaRTT-REPS: Sender-based Marked Rapidly-adapting Trimmed & Timed Transport with Recycled Entropies.
CoRR, 2024

QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs.
CoRR, 2024

Topologies of Reasoning: Demystifying Chains, Trees, and Graphs of Thoughts.
CoRR, 2024

Cppless: Productive and Performant Serverless Programming in C++.
CoRR, 2024

XaaS: Acceleration as a Service to Enable Productive High-Performance Cloud Computing.
CoRR, 2024

OSMOSIS: Enabling Multi-Tenancy in Datacenter SmartNICs.
Proceedings of the 2024 USENIX Annual Technical Conference, 2024

PolarStar: Expanding the Horizon of Diameter-3 Networks.
Proceedings of the 36th ACM Symposium on Parallelism in Algorithms and Architectures, 2024

Arrow Matrix Decomposition: A Novel Approach for Communication-Efficient Sparse Matrix Multiplication.
Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, 2024

Swing: Short-cutting Rings for Higher Bandwidth Allreduce.
Proceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation, 2024

A High-Performance Design, Implementation, Deployment, and Evaluation of The Slim Fly Network.
Proceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation, 2024

Software Resource Disaggregation for HPC with Serverless Computing.
Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2024

Low-Depth Spatial Tree Algorithms.
Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2024

DiffDA: a Diffusion model for weather-scale Data Assimilation.
Proceedings of the Forty-first International Conference on Machine Learning, 2024

SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression.
Proceedings of the Twelfth International Conference on Learning Representations, 2024

SliceGPT: Compress Large Language Models by Deleting Rows and Columns.
Proceedings of the Twelfth International Conference on Learning Representations, 2024

Near-Optimal Wafer-Scale Reduce.
Proceedings of the 33rd International Symposium on High-Performance Parallel and Distributed Computing, 2024

FaaSKeeper: Learning from Building Serverless Services with ZooKeeper as an Example.
Proceedings of the 33rd International Symposium on High-Performance Parallel and Distributed Computing, 2024

QUIK: Towards End-to-end 4-Bit Inference on Generative Large Language Models.
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024

LRSCwait: Enabling Scalable and Efficient Synchronization in Manycore Systems Through Polling-Free and Retry-Free Operation.
Proceedings of the Design, Automation & Test in Europe Conference & Exhibition, 2024

Process-as-a-Service: Unifying Elastic and Stateful Clouds with Serverless Processes.
Proceedings of the 2024 ACM Symposium on Cloud Computing, 2024

Graph of Thoughts: Solving Elaborate Problems with Large Language Models.
Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence, 2024

2023
Sparse Stream Semantic Registers: A Lightweight ISA Extension Accelerating General Sparse Linear Algebra.
IEEE Trans. Parallel Distributed Syst., December, 2023

Performance Measurement Dataset of the HPC Benchmarks FASTEST, Kripke, LULESH, MiniFE, Quicksilver, and RELeARN for Scalability Studies with Extra-P.
Dataset, November, 2023

Myths and legends in high-performance computing.
Int. J. High Perform. Comput. Appl., July, 2023

Practice of Streaming Processing of Dynamic Graphs: Concepts, Models, and Systems.
IEEE Trans. Parallel Distributed Syst., June, 2023

Disentangling Hype from Practicality: On Realistically Achieving Quantum Advantage.
Commun. ACM, May, 2023

Earth Virtualization Engines: A Technical Perspective.
Comput. Sci. Eng., 2023

How to Prune Your Language Model: Recovering Accuracy on the "Sparsity May Cry" Benchmark.
CoRR, 2023

RapidChiplet: A Toolchain for Rapid Design Space Exploration of Chiplet Architectures.
CoRR, 2023

Chameleon: a Heterogeneous and Disaggregated Accelerator System for Retrieval-Augmented Language Models.
CoRR, 2023

Towards End-to-end 4-Bit Inference on Generative Large Language Models.
CoRR, 2023

Cached Operator Reordering: A Unified View for Fast GNN Training.
CoRR, 2023

High-Performance Graph Databases That Are Portable, Programmable, and Scale to Hundreds of Thousands of Cores.
CoRR, 2023

ASDL: A Unified Interface for Gradient Preconditioning in PyTorch.
CoRR, 2023

STen: Productive and Efficient Sparsity in PyTorch.
CoRR, 2023

Performance Embeddings: A Similarity-based Approach to Automatic Performance Optimization.
CoRR, 2023

PolarStar: Expanding the Scalability Horizon of Diameter-3 Networks.
CoRR, 2023

Datacenter Ethernet and RDMA: Issues at Hyperscale.
CoRR, 2023

Approximate Reversible Circuits for NISQ-Era Quantum Computers.
CoRR, 2023

AutoDDL: Automatic Distributed Deep Learning with Asymptotically Optimal Communication.
CoRR, 2023

A Theory of I/O-Efficient Sparse Neural Network Inference.
CoRR, 2023

Data Center Ethernet and Remote Direct Memory Access: Issues at Hyperscale.
Computer, 2023

SAGE: Software-based Attestation for GPU Execution.
Proceedings of the 2023 USENIX Annual Technical Conference, 2023

In-network Allreduce with Multiple Spanning Trees on PolarFly.
Proceedings of the 35th ACM Symposium on Parallelism in Algorithms and Architectures, 2023

A Reference Implementation for a Quantum Message Passing Interface.
Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, 2023

FuzzyFlow: Leveraging Dataflow To Find and Squash Program Optimization Bugs.
Proceedings of the International Conference for High Performance Computing, 2023

Co-design Hardware and Algorithm for Vector Search.
Proceedings of the International Conference for High Performance Computing, 2023

HEAR: Homomorphically Encrypted Allreduce.
Proceedings of the International Conference for High Performance Computing, 2023

VENOM: A Vectorized N:M Format for Unleashing the Power of Sparse Tensor Cores.
Proceedings of the International Conference for High Performance Computing, 2023

High-Performance and Programmable Attentional Graph Neural Networks with Global Tensor Formulations.
Proceedings of the International Conference for High Performance Computing, 2023

The Graph Database Interface: Scaling Online Transactional and Analytical Graph Workloads to Hundreds of Thousands of Cores.
Proceedings of the International Conference for High Performance Computing, 2023

PipeFisher: Efficient Training of Large Language Models Using Pipelining and Fisher Information Matrices.
Proceedings of the Sixth Conference on Machine Learning and Systems, 2023

HOT: Higher-Order Dynamic Graph Representation Learning With Efficient Transformers.
Proceedings of the Learning on Graphs Conference, 27-30 November 2023, Virtual Event, 2023

rFaaS: Enabling High Performance Serverless with RDMA and Leases.
Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2023

Performance Embeddings: A Similarity-Based Transfer Tuning Approach to Performance Optimization.
Proceedings of the 37th International Conference on Supercomputing, 2023

FMI: Fast and Cheap Message Passing for Serverless Functions.
Proceedings of the 37th International Conference on Supercomputing, 2023

Compressing multidimensional weather and climate data into neural networks.
Proceedings of the Eleventh International Conference on Learning Representations, 2023

OPTQ: Accurate Quantization for Generative Pre-trained Transformers.
Proceedings of the Eleventh International Conference on Learning Representations, 2023

Differentiable Transportation Pruning.
Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

Streaming Task Graph Scheduling for Dataflow Architectures.
Proceedings of the 32nd International Symposium on High-Performance Parallel and Distributed Computing, 2023

HexaMesh: Scaling to Hundreds of Chiplets with an Optimized Chiplet Arrangement.
Proceedings of the 60th ACM/IEEE Design Automation Conference, 2023

Sparse Hamming Graph: A Customizable Network-on-Chip Topology.
Proceedings of the 60th ACM/IEEE Design Automation Conference, 2023

Maximum Flows in Parametric Graph Templates.
Proceedings of the Algorithms and Complexity - 13th International Conference, 2023

Bridging Control-Centric and Data-Centric Optimization.
Proceedings of the 21st ACM/IEEE International Symposium on Code Generation and Optimization, 2023

User-guided Page Merging for Memory Deduplication in Serverless Systems.
Proceedings of the IEEE International Conference on Big Data, 2023

2022
Work-Stealing Prefix Scan: Addressing Load Imbalance in Large-Scale Image Registration.
IEEE Trans. Parallel Distributed Syst., 2022

Python FPGA Programming with Data-Centric Multi-Level Design.
CoRR, 2022

Efficient RDMA Communication Protocols.
CoRR, 2022

Assessing requirements to scale to practical quantum advantage.
CoRR, 2022

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers.
CoRR, 2022

ENS-10: A Dataset For Post-Processing Ensemble Weather Forecast.
CoRR, 2022

Deinsum: Practically I/O Optimal Multilinear Algebra.
CoRR, 2022

The spatial computer: A model for energy-efficient parallel computation.
CoRR, 2022

FaaSKeeper: a Blueprint for Serverless Services.
CoRR, 2022

The Convergence of Hyperscale Data Center and High-Performance Computing Networks.
Computer, 2022

Benchmarking Data Science: 12 Ways to Lie With Statistics and Performance on Parallel Computers.
Computer, 2022

The Red-Blue Pebble Game on Trees and DAGs with Large Input.
Proceedings of the Structural Information and Communication Complexity, 2022

KafkaDirect: Zero-copy Data Access for Apache Kafka over RDMA Networks.
Proceedings of the SIGMOD '22: International Conference on Management of Data, Philadelphia, PA, USA, June 12, 2022

Deinsum: Practically I/O Optimal Multi-Linear Algebra.
Proceedings of the SC22: International Conference for High Performance Computing, 2022

Boosting Performance Optimization with Interactive Data Movement Visualization.
Proceedings of the SC22: International Conference for High Performance Computing, 2022

Efficient Quantized Sparse Matrix Operations on Tensor Cores.
Proceedings of the SC22: International Conference for High Performance Computing, 2022

PolarFly: A Cost-Effective and Flexible Low-Diameter Topology.
Proceedings of the SC22: International Conference for High Performance Computing, 2022

HammingMesh: A Network Topology for Large-Scale Deep Learning.
Proceedings of the SC22: International Conference for High Performance Computing, 2022

Building Blocks for Network-Accelerated Distributed File Systems.
Proceedings of the SC22: International Conference for High Performance Computing, 2022

ProbGraph: High-Performance and High-Accuracy Graph Mining with Probabilistic Set Representations.
Proceedings of the SC22: International Conference for High Performance Computing, 2022

Productive Performance Engineering for Weather and Climate Modeling with Python.
Proceedings of the SC22: International Conference for High Performance Computing, 2022

Near-optimal sparse allreduce for distributed deep learning.
Proceedings of the PPoPP '22: 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Seoul, Republic of Korea, April 2, 2022

Spatial Mixture-of-Experts.
Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, 2022

ENS-10: A Dataset For Post-Processing Ensemble Weather Forecasts.
Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, 2022

Neural Graph Databases.
Proceedings of the Learning on Graphs Conference, 2022

Motif Prediction with Graph Neural Networks.
Proceedings of the KDD '22: The 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, August 14, 2022

Asynchronous Distributed-Memory Triangle Counting and LCC with RMA Caching.
Proceedings of the 2022 IEEE International Parallel and Distributed Processing Symposium, 2022

I/O-Optimal Cache-Oblivious Sparse Matrix-Sparse Matrix Multiplication.
Proceedings of the 2022 IEEE International Parallel and Distributed Processing Symposium, 2022

Metamorphic Fuzzing of C++ Libraries.
Proceedings of the 15th IEEE Conference on Software Testing, Verification and Validation, 2022

Performance-detective: automatic deduction of cheap and accurate performance models.
Proceedings of the ICS '22: 2022 International Conference on Supercomputing, Virtual Event, June 28, 2022

A data-centric optimization framework for machine learning.
Proceedings of the ICS '22: 2022 International Conference on Supercomputing, Virtual Event, June 28, 2022

Lifting C semantics for dataflow optimization.
Proceedings of the ICS '22: 2022 International Conference on Supercomputing, Virtual Event, June 28, 2022

Neural Parameter Allocation Search.
Proceedings of the Tenth International Conference on Learning Representations, 2022

Temporal Vectorization: A Compiler Approach to Automatic Multi-Pumping.
Proceedings of the 41st IEEE/ACM International Conference on Computer-Aided Design, 2022

Fast Arbitrary Precision Floating Point on FPGA.
Proceedings of the 30th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines, 2022

Accelerating Data Serialization/Deserialization Protocols with In-Network Compute.
Proceedings of the IEEE/ACM International Workshop on Exascale MPI, 2022

Circuits for Measurement Based Quantum State Preparation.
Proceedings of the 2022 Design, Automation & Test in Europe Conference & Exhibition, 2022

A RDMA Interface for Ultra-Fast Ultrasound Data-Streaming over an Optical Link.
Proceedings of the 2022 Design, Automation & Test in Europe Conference & Exhibition, 2022

NeVerMore: Exploiting RDMA Mistakes in NVMe-oF Storage Applications.
Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security, 2022

2021
Transformations of High-Level Synthesis Codes for High-Performance Computing.
IEEE Trans. Parallel Distributed Syst., 2021

Breaking (Global) Barriers in Parallel Stochastic Optimization With Wait-Avoiding Group Averaging.
IEEE Trans. Parallel Distributed Syst., 2021

High-Performance Routing With Multipathing and Path Diversity in Ethernet and HPC Networks.
IEEE Trans. Parallel Distributed Syst., 2021

Snitch: A Tiny Pseudo Dual-Issue Processor for Area and Energy Efficient Execution of Floating-Point Intensive Workloads.
IEEE Trans. Computers, 2021

Stream Semantic Registers: A Lightweight RISC-V ISA Extension Achieving Full Compute Utilization in Single-Issue Cores.
IEEE Trans. Computers, 2021

Domain-Specific Multi-Level IR Rewriting for GPU: The Open Earth Compiler for GPU-accelerated Climate Simulation.
ACM Trans. Archit. Code Optim., 2021

Communication Lower Bounds of Bilinear Algorithms for Symmetric Tensor Contractions.
SIAM J. Sci. Comput., 2021

GraphMineSuite: Enabling High-Performance and Programmable Graph Mining Algorithms with Set Algebra.
Proc. VLDB Endow., 2021

Noise in the Clouds: Influence of Network Performance Variability on Application Scalability.
Proc. ACM Meas. Anal. Comput. Syst., 2021

FPL: fast Presburger arithmetic through transprecision.
Proc. ACM Program. Lang., 2021

The digital revolution of Earth-system science.
Nat. Comput. Sci., 2021

Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks.
J. Mach. Learn. Res., 2021

RFaaS: RDMA-Enabled FaaS Platform for Serverless High-Performance Computing.
CoRR, 2021

Learning Combinatorial Node Labeling Algorithms.
CoRR, 2021

Towards Million-Server Network Simulations on Just a Laptop.
CoRR, 2021

SISA: Set-Centric Instruction Set Architecture for Graph Mining on Processing-in-Memory Systems.
CoRR, 2021

GraphMineSuite: Enabling High-Performance and Programmable Graph Mining Algorithms with Set Algebra.
CoRR, 2021

Enabling Dataflow Optimization for Quantum Programs.
CoRR, 2021

ReDMArk: Bypassing RDMA Security Mechanisms.
Proceedings of the 30th USENIX Security Symposium, 2021

Naos: Serialization-free RDMA networking in Java.
Proceedings of the 2021 USENIX Annual Technical Conference, 2021

MigrOS: Transparent Live-Migration Support for Containerised RDMA Applications.
Proceedings of the 2021 USENIX Annual Technical Conference, 2021

Pebbles, Graphs, and a Pinch of Combinatorics: Towards Tight I/O Lower Bounds for Statically Analyzable Programs.
Proceedings of the SPAA '21: 33rd ACM Symposium on Parallelism in Algorithms and Architectures, 2021

Parallel Algorithms for Finding Large Cliques in Sparse Graphs.
Proceedings of the SPAA '21: 33rd ACM Symposium on Parallelism in Algorithms and Architectures, 2021

CoRM: Compactable Remote Memory over RDMA.
Proceedings of the SIGMOD '21: International Conference on Management of Data, 2021

Productivity, portability, performance: data-centric Python.
Proceedings of the International Conference for High Performance Computing, 2021

Flare: flexible in-network allreduce.
Proceedings of the International Conference for High Performance Computing, 2021

On the parallel I/O optimality of linear algebra kernels: near-optimal matrix factorizations.
Proceedings of the International Conference for High Performance Computing, 2021

Distributed quantum computing with QMPI.
Proceedings of the International Conference for High Performance Computing, 2021

Clairvoyant prefetching for distributed machine learning I/O.
Proceedings of the International Conference for High Performance Computing, 2021

Chimera: efficiently training large-scale neural networks with bidirectional pipelines.
Proceedings of the International Conference for High Performance Computing, 2021

On the parallel I/O optimality of linear algebra kernels: near-optimal LU factorization.
Proceedings of the PPoPP '21: 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2021

Extracting clean performance models from tainted programs.
Proceedings of the PPoPP '21: 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2021

Data Movement Is All You Need: A Case Study on Optimizing Transformers.
Proceedings of the Fourth Conference on Machine Learning and Systems, 2021

SeBS: a serverless benchmark suite for function-as-a-service computing.
Proceedings of the Middleware '21: 22nd International Middleware Conference, Québec City, Canada, December 6, 2021

SISA: Set-Centric Instruction Set Architecture for Graph Mining on Processing-in-Memory Systems.
Proceedings of the MICRO '21: 54th Annual IEEE/ACM International Symposium on Microarchitecture, 2021

A RISC-V in-network accelerator for flexible high-performance low-power packet processing.
Proceedings of the 48th ACM/IEEE Annual International Symposium on Computer Architecture, 2021

Noise-Resilient Empirical Performance Modeling with Deep Neural Networks.
Proceedings of the 35th IEEE International Parallel and Distributed Processing Symposium, 2021

NPBench: a benchmarking suite for high-performance NumPy.
Proceedings of the ICS '21: 2021 International Conference on Supercomputing, 2021

ProGraML: A Graph-based Program Representation for Data Flow Analysis and Compiler Optimizations.
Proceedings of the 38th International Conference on Machine Learning, 2021

Indirection Stream Semantic Register Architecture for Efficient Sparse-Dense Linear Algebra.
Proceedings of the Design, Automation & Test in Europe Conference & Exhibition, 2021

An Efficient Algorithm for Sparse Quantum State Preparation.
Proceedings of the 58th ACM/IEEE Design Automation Conference, 2021

StencilFlow: Mapping Large Stencil Programs to Distributed Spatial Computing Systems.
Proceedings of the IEEE/ACM International Symposium on Code Generation and Optimization, 2021

Hermes: Enabling efficient large-scale simulation in MATSim.
Proceedings of the 12th International Conference on Ambient Systems, 2021

2020
ExtraPeak: Advanced Automatic Performance Modeling for HPC Applications.
Proceedings of the Software for Exascale Computing - SPPEXA 2016-2019, 2020

Substream-Centric Maximum Matchings on FPGA.
ACM Trans. Reconfigurable Technol. Syst., 2020

Polyhedral Compilation for Racetrack Memories.
IEEE Trans. Comput. Aided Des. Integr. Circuits Syst., 2020

Dawn: a High-level Domain-Specific Language Compiler Toolchain for Weather and Climate Applications.
Supercomput. Front. Innov., 2020

Special issue: Selected papers from EuroMPI 2019.
Parallel Comput., 2020

Assertion-based optimization of Quantum programs.
Proc. ACM Program. Lang., 2020

Fast linear programming through transprecision computing on small and sparse data.
Proc. ACM Program. Lang., 2020

Deep Data Flow Analysis.
CoRR, 2020

Parametric Graph Templates: Properties and Algorithms.
CoRR, 2020

PsPIN: A high-performance low-power architecture for flexible in-network compute.
CoRR, 2020

TardiS: Migrating Containers with RDMA Networks.
CoRR, 2020

High-Performance Routing with Multipathing and Path Diversity in Supercomputers and Data Centers.
CoRR, 2020

Shapeshifter Networks: Cross-layer Parameter Sharing for Scalable and Effective Deep Learning.
CoRR, 2020

Domain-Specific Multi-Level IR Rewriting for GPU.
CoRR, 2020

Deep Learning for Post-Processing Ensemble Weather Forecasts.
CoRR, 2020

Breaking (Global) Barriers in Parallel Stochastic Optimization with Wait-Avoiding Group Averaging.
CoRR, 2020

ProGraML: Graph-based Deep Learning for Program Optimization and Analysis.
CoRR, 2020

Snitch: A 10 kGE Pseudo Dual-Issue Processor for Area and Energy Efficient Execution of Floating-Point Intensive Workloads.
CoRR, 2020

sRDMA - Efficient NIC-based Authentication and Encryption for Remote Direct Memory Access.
Proceedings of the 2020 USENIX Annual Technical Conference, 2020

Parallel Planar Subgraph Isomorphism and Vertex Connectivity.
Proceedings of the SPAA '20: 32nd ACM Symposium on Parallelism in Algorithms and Architectures, 2020

An in-depth analysis of the slingshot interconnect.
Proceedings of the International Conference for High Performance Computing, 2020

fBLAS: streaming linear algebra on FPGA.
Proceedings of the International Conference for High Performance Computing, 2020

ScalAna: automating scaling loss detection with graph analysis.
Proceedings of the International Conference for High Performance Computing, 2020

Empirical Modeling of Spatially Diverging Performance.
Proceedings of the IEEE/ACM International Workshop on HPC User Support Tools and Workshop on Programming and Performance Visualization Tools, 2020

FatPaths: routing in supercomputers and data centers when shortest paths fall short.
Proceedings of the International Conference for High Performance Computing, 2020

High-performance parallel graph coloring with strong guarantees on work, depth, and quality.
Proceedings of the International Conference for High Performance Computing, 2020

Communication and Timing Issues with MPI Virtualization.
Proceedings of the EuroMPI/USA '20: 27th European MPI Users' Group Meeting, 2020

Taming unbalanced training workloads in deep learning with partial collective operations.
Proceedings of the PPoPP '20: 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2020

Identifying scalability bottlenecks for large-scale parallel programs with graph analysis.
Proceedings of the PPoPP '20: 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2020

Learning Cost-Effective Sampling Strategies for Empirical Performance Modeling.
Proceedings of the 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2020

Communication-Efficient Jaccard similarity for High-Performance Distributed Genome Comparisons.
Proceedings of the 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2020

Flexible Communication Avoiding Matrix Multiplication on FPGA with High-Level Synthesis.
Proceedings of the FPGA '20: The 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2020

ATUNs: Modular and Scalable Support for Atomic Operations in a Shared Memory Multiprocessor.
Proceedings of the 57th ACM/IEEE Design Automation Conference, 2020

Augment Your Batch: Improving Generalization Through Instance Repetition.
Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020

2019
Engineering Algorithms for Scalability through Continuous Validation of Performance Expectations.
IEEE Trans. Parallel Distributed Syst., 2019

Strong consistency is not hard to get: Two-Phase Locking and Two-Phase Commit on Thousands of Cores.
Proc. VLDB Endow., 2019

Demystifying Parallel and Distributed Deep Learning: An In-depth Concurrency Analysis.
ACM Comput. Surv., 2019

Reflecting on the Goal and Baseline for Exascale Computing: A Roadmap Based on Weather and Climate Simulations.
Comput. Sci. Eng., 2019

Practice of Streaming and Dynamic Graphs: Concepts, Models, Systems, and Parallelism.
CoRR, 2019

A Data-Centric Approach to Extreme-Scale Ab initio Dissipative Quantum Transport Simulations.
CoRR, 2019

Predicting Weather Uncertainty with Deep Convnets.
CoRR, 2019

hlslib: Software Engineering for Hardware Design.
CoRR, 2019

Mix & Match: training convnets with mixed image sizes for improved accuracy, speed and scale resiliency.
CoRR, 2019

FatPaths: Routing in Supercomputers, Data Centers, and Clouds with Low-Diameter Networks when Shortest Paths Fall Short.
CoRR, 2019

Graph Processing on FPGAs: Taxonomy, Survey, Challenges.
CoRR, 2019

Stateful Dataflow Multigraphs: A Data-Centric Model for High-Performance Parallel Programs.
CoRR, 2019

Augment your batch: better training with larger batches.
CoRR, 2019

Head-of-line blocking avoidance in Slim Fly networks using deadlock-free non-minimal and adaptive routing.
Concurr. Comput. Pract. Exp., 2019

Optimizing the data movement in quantum transport simulations via data-centric parallel programming.
Proceedings of the International Conference for High Performance Computing, 2019

A data-centric approach to extreme-scale ab initio dissipative quantum transport simulations.
Proceedings of the International Conference for High Performance Computing, 2019

Mitigating network noise on Dragonfly networks through application-aware routing.
Proceedings of the International Conference for High Performance Computing, 2019

SparCML: high-performance sparse communication for machine learning.
Proceedings of the International Conference for High Performance Computing, 2019

Streaming message interface: high-performance distributed memory programming on reconfigurable hardware.
Proceedings of the International Conference for High Performance Computing, 2019

Red-blue pebbling revisited: near optimal parallel matrix-matrix multiplication.
Proceedings of the International Conference for High Performance Computing, 2019

Network-accelerated non-contiguous memory transfers.
Proceedings of the International Conference for High Performance Computing, 2019

Slim graph: practical lossy graph compression for approximate graph processing, storage, and analytics.
Proceedings of the International Conference for High Performance Computing, 2019

Stateful dataflow multigraphs: a data-centric model for performance portability on heterogeneous architectures.
Proceedings of the International Conference for High Performance Computing, 2019

Foreword EuroMPI 2019.
Proceedings of the 26th European MPI Users' Group Meeting, 2019

Corrected trees for reliable group communication.
Proceedings of the 24th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2019

A fast analytical model of fully associative caches.
Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation, 2019

Porting the COSMO Weather Model to Manycore CPUs.
Proceedings of the Platform for Advanced Scientific Computing Conference, 2019

Invited Talk 2.
Proceedings of the IEEE International Parallel and Distributed Processing Symposium Workshops, 2019

SimFS: A Simulation Data Virtualizing File System Interface.
Proceedings of the 2019 IEEE International Parallel and Distributed Processing Symposium, 2019

A Modular Benchmarking Infrastructure for High-Performance and Reproducible Deep Learning.
Proceedings of the 2019 IEEE International Parallel and Distributed Processing Symposium, 2019

Using performance models to understand scalable Krylov solver performance at scale for structured grid problems.
Proceedings of the ACM International Conference on Supercomputing, 2019

Substream-Centric Maximum Matchings on FPGA.
Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2019

Embedding Functions Into Reversible Circuits: A Probabilistic Approach to the Number of Lines.
Proceedings of the 56th Annual Design Automation Conference 2019, 2019

Absinthe: Learning an Analytical Performance Model to Fuse and Tile Stencil Codes in One Shot.
Proceedings of the 28th International Conference on Parallel Architectures and Compilation Techniques, 2019

2018
Cache-Oblivious MPI All-to-All Communications Based on Morton Order.
IEEE Trans. Parallel Distributed Syst., 2018

Using Hoare logic for quantum circuit optimization.
CoRR, 2018

Survey and Taxonomy of Lossless Graph Compression and Space-Efficient Graph Representations.
CoRR, 2018

μ-cuDNN: Accelerating Deep Learning Frameworks with Micro-Batching.
CoRR, 2018

SparCML: High-Performance Sparse Communication for Machine Learning.
CoRR, 2018

Automatic Verification of RMA Programs via Abstraction Extrapolation.
Proceedings of the Verification, Model Checking, and Abstract Interpretation, 2018

ShenTu: processing multi-trillion edge graphs on millions of cores in seconds.
Proceedings of the International Conference for High Performance Computing, 2018

Designing scalable FPGA architectures using high-level synthesis.
Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2018

Communication-avoiding parallel minimum cuts and connected components.
Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2018

Neural Code Comprehension: A Learnable Representation of Code Semantics.
Proceedings of the Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, 2018

The Convergence of Sparsified Gradient Methods.
Proceedings of the Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, 2018

Reproducible Floating-Point Aggregation in RDBMSs.
Proceedings of the 34th IEEE International Conference on Data Engineering, 2018

Fast and strongly-consistent per-item resilience in key-value stores.
Proceedings of the Thirteenth EuroSys Conference, 2018

Accelerating Deep Learning Frameworks with Micro-Batches.
Proceedings of the IEEE International Conference on Cluster Computing, 2018

Lightweight Requirements Engineering for Exascale Co-design.
Proceedings of the IEEE International Conference on Cluster Computing, 2018

Slim NoC: A Low-Diameter On-Chip Network Topology for High Energy Efficiency and Scalability.
Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems, 2018

Log(graph): a near-optimal high-performance graph representation.
Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques, 2018

2017
Trends in Data Locality Abstractions for HPC Systems.
IEEE Trans. Parallel Distributed Syst., 2017

Distributed Join Algorithms on Thousands of Cores.
Proc. VLDB Endow., 2017

Designing Databases for Future High-Performance Networks.
IEEE Data Eng. Bull., 2017

A Communication-Avoiding Parallel Algorithm for the Symmetric Eigenvalue Problem.
Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures, 2017

Scaling betweenness centrality using communication-efficient sparse matrix multiplication.
Proceedings of the International Conference for High Performance Computing, 2017

sPIN: high-performance streaming processing in the network.
Proceedings of the International Conference for High Performance Computing, 2017

Isoefficiency in Practice: Configuring and Understanding the Performance of Task-based Applications.
Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2017

POSTER: Cache-Oblivious MPI All-to-All Communications on Many-Core Architectures.
Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2017

Communication-Avoiding Parallel Algorithms for Solving Triangular Systems of Linear Equations.
Proceedings of the 2017 IEEE International Parallel and Distributed Processing Symposium, 2017

IPDRM Workshop Introduction.
Proceedings of the 2017 IEEE International Parallel and Distributed Processing Symposium Workshops, 2017

Capability Models for Manycore Memory Systems: A Case-Study with Xeon Phi KNL.
Proceedings of the 2017 IEEE International Parallel and Distributed Processing Symposium, 2017

Corrected Gossip Algorithms for Fast Reliable Broadcast on Unreliable Systems.
Proceedings of the 2017 IEEE International Parallel and Distributed Processing Symposium, 2017

EMBRACE Keynote.
Proceedings of the 2017 IEEE International Parallel and Distributed Processing Symposium Workshops, 2017

Transparent Caching for RMA Systems.
Proceedings of the 2017 IEEE International Parallel and Distributed Processing Symposium, 2017

SlimSell: A Vectorizable Graph Representation for Breadth-First Search.
Proceedings of the 2017 IEEE International Parallel and Distributed Processing Symposium, 2017

Model-Driven Choice of Numerical Methods for the Solution of the Linear Advection Equation.
Proceedings of the International Conference on Computational Science, 2017

AllConcur: Leaderless Concurrent Atomic Broadcast.
Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing, 2017

To Push or To Pull: On Reducing Communication and Synchronization in Graph Computations.
Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing, 2017

An Effective Queuing Scheme to Provide Slim Fly Topologies with HoL Blocking Reduction and Deadlock Freedom for Minimal-Path Routing.
Proceedings of the 3rd IEEE International Workshop on High-Performance Interconnection Networks in the Exascale and Big-Data Era, 2017

Improving Non-minimal and Adaptive Routing Algorithms in Slim Fly Networks.
Proceedings of the 25th IEEE Annual Symposium on High-Performance Interconnects, 2017

Fast Networks and Slow Memories: A Mechanism for Mitigating Bandwidth Mismatches.
Proceedings of the 25th IEEE Annual Symposium on High-Performance Interconnects, 2017

Multi-agent Pathfinding with n Agents on Graphs with n Vertices: Combinatorial Classification and Tight Algorithmic Bounds.
Proceedings of the Algorithms and Complexity - 10th International Conference, 2017

2016
Automatic Performance Modeling of HPC Applications.
Proceedings of the Software for Exascale Computing - SPPEXA 2013-2015, 2016

Cache Line Aware Algorithm Design for Cache-Coherent Architectures.
IEEE Trans. Parallel Distributed Syst., 2016

Exploiting Offload-Enabled Network Interfaces.
IEEE Micro, 2016

On noise and the performance benefit of nonblocking collectives.
Int. J. High Perform. Comput. Appl., 2016

Betweenness Centrality is more Parallelizable than Dense Matrix Multiplication.
CoRR, 2016

AllConcur: Leaderless Concurrent Atomic Broadcast (Extended Version).
CoRR, 2016

Extreme scale plasma turbulence simulations on top supercomputers worldwide.
Proceedings of the International Conference for High Performance Computing, 2016

A PCIe congestion-aware performance model for densely populated accelerator servers.
Proceedings of the International Conference for High Performance Computing, 2016

dCUDA: hardware supported overlap of computation and communication.
Proceedings of the International Conference for High Performance Computing, 2016

Scheduling-aware routing for supercomputers.
Proceedings of the International Conference for High Performance Computing, 2016

Selecting Technical Papers for an Interdisciplinary Conference: The PASC Review Process.
Proceedings of the Platform for Advanced Scientific Computing Conference, 2016

Modeling and analysis of remote memory access programming.
Proceedings of the 2016 ACM SIGPLAN International Conference on Object-Oriented Programming, 2016

Polly-ACC: Transparent compilation to heterogeneous hardware.
Proceedings of the 2016 International Conference on Supercomputing, 2016

SDNsec: Forwarding Accountability for the SDN Data Plane.
Proceedings of the 25th International Conference on Computer Communication and Networks, 2016

High-Performance Distributed RMA Locks.
Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing, 2016

Routing on the Dependency Graph: A New Approach to Deadlock-Free High-Performance Routing.
Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing, 2016

Ensuring Deadlock-Freedom in Low-Diameter InfiniBand Networks.
Proceedings of the 24th IEEE Annual Symposium on High-Performance Interconnects, 2016

Fast Multi-parameter Performance Modeling.
Proceedings of the 2016 IEEE International Conference on Cluster Computing, 2016

2015
Remote Memory Access Programming in MPI-3.
ACM Trans. Parallel Comput., 2015

Introduction to the Special Issue on SPAA 2013.
ACM Trans. Parallel Comput., 2015

Sparse Tensor Algebra as a Parallel Programming Model.
CoRR, 2015

Cost-effective diameter-two topologies: analysis and evaluation.
Proceedings of the International Conference for High Performance Computing, 2015

Scientific benchmarking of parallel computing systems: twelve ways to tell the masses when reporting performance results.
Proceedings of the International Conference for High Performance Computing, 2015

HIPS-LSPP Keynotes.
Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium Workshop, 2015

Notified Access: Extending Remote Memory Access Programming Models for Producer-Consumer Synchronization.
Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium, 2015

Exascaling Your Library: Will Your Implementation Meet Your Expectations?
Proceedings of the 29th ACM on International Conference on Supercomputing, 2015

MODESTO: Data-centric Analytic Optimization of Complex Stencil Programs on Heterogeneous Architectures.
Proceedings of the 29th ACM on International Conference on Supercomputing, 2015

Active Access: A Mechanism for High-Performance Distributed Data-Centric Computations.
Proceedings of the 29th ACM on International Conference on Supercomputing, 2015

Cache Line Aware Optimizations for ccNUMA Systems.
Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing, 2015

DARE: High-Performance State Machine Replication on RDMA Networks.
Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing, 2015

Accelerating Irregular Computations with Hardware Transactional Memory and Active Messages.
Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing, 2015

Distributing the Data Plane for Remote Storage Access.
Proceedings of the 15th Workshop on Hot Topics in Operating Systems, 2015

Source-Based Path Selection: The Data Plane Perspective.
Proceedings of the 10th International Conference on Future Internet, 2015

Evaluating the Cost of Atomic Operations on Modern Architectures.
Proceedings of the 2015 International Conference on Parallel Architectures and Compilation, 2015

Using Compiler Techniques to Improve Automatic Performance Modeling.
Proceedings of the 2015 International Conference on Parallel Architectures and Compilation, 2015

2014
Energy, Memory, and Runtime Tradeoffs for Implementing Collective Communication Operations.
Supercomput. Front. Innov., 2014

Enabling highly-scalable remote memory access programming with MPI-3 One Sided.
Sci. Program., 2014

Application-oriented ping-pong benchmarking: how to assess the real communication overheads.
Computing, 2014

Improved MPI collectives for MPI processes in shared address spaces.
Clust. Comput., 2014

Automatic complexity analysis of explicitly parallel programs.
Proceedings of the 26th ACM Symposium on Parallelism in Algorithms and Architectures, 2014

Understanding the Effects of Communication and Coordination on Checkpointing at Scale.
Proceedings of the International Conference for High Performance Computing, 2014

Fail-in-Place Network Design: Interaction Between Topology, Routing Algorithm and Failures.
Proceedings of the International Conference for High Performance Computing, 2014

Slim Fly: A Cost Effective Low-Diameter Network Topology.
Proceedings of the International Conference for High Performance Computing, 2014

Exploring the effect of noise on the performance benefit of nonblocking allreduce.
Proceedings of the 21st European MPI Users' Group Meeting, 2014

Designing Bit-Reproducible Portable High-Performance Applications.
Proceedings of the 2014 IEEE 28th International Parallel and Distributed Processing Symposium, 2014

Efficient task placement and routing of nearest neighbor exchanges in dragonfly networks.
Proceedings of the 23rd International Symposium on High-Performance Parallel and Distributed Computing, 2014

Fault tolerance for remote memory access programming models.
Proceedings of the 23rd International Symposium on High-Performance Parallel and Distributed Computing, 2014

Catwalk: A Quick Development Path for Performance Models.
Proceedings of the Euro-Par 2014: Parallel Processing Workshops, 2014

PEMOGEN: automatic adaptive performance modeling during program runtime.
Proceedings of the International Conference on Parallel Architectures and Compilation, 2014

2013
Fast pattern-specific routing for fat tree networks.
ACM Trans. Archit. Code Optim., 2013

Operating systems and runtime environments on supercomputers.
Int. J. High Perform. Comput. Appl., 2013

MPI + MPI: a new hybrid approach to parallel programming with MPI plus shared memory.
Computing, 2013

Using Simulation to Evaluate the Performance of Resilience Strategies at Scale.
Proceedings of the High Performance Computing Systems. Performance Modeling, Benchmarking and Simulation, 2013

Hybrid MPI: efficient message passing for multi-core systems.
Proceedings of the International Conference for High Performance Computing, 2013

Using automated performance modeling to find scalability bugs in complex codes.
Proceedings of the International Conference for High Performance Computing, 2013

MPI datatype processing using runtime compilation.
Proceedings of the 20th European MPI Users' Group Meeting, 2013

Ownership passing: efficient distributed memory programming on multi-core systems.
Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2013

Compiler Optimizations for Non-contiguous Remote Data Movement.
Proceedings of the Languages and Compilers for Parallel Computing, 2013

Bandwidth-optimal all-to-all exchanges in fat tree networks.
Proceedings of the International Conference on Supercomputing, 2013

Protocols for Fully Offloaded Collective Operations on Accelerated Network Adapters.
Proceedings of the 42nd International Conference on Parallel Processing, 2013

Modeling communication in cache-coherent SMP systems: a case-study with Xeon Phi.
Proceedings of the 22nd International Symposium on High-Performance Parallel and Distributed Computing, 2013

NUMA-aware shared-memory collective communication for MPI.
Proceedings of the 22nd International Symposium on High-Performance Parallel and Distributed Computing, 2013

Topic 13: High-Performance Networks and Communication - (Introduction).
Proceedings of the Euro-Par 2013 Parallel Processing, 2013

2012
Extensions for next-generation parallel programming models.
Parallel Comput., 2012

Top Picks from Hot Interconnects 2011: Petascale Network Architectures.
IEEE Micro, 2012

Abstract: Slack-Conscious Lightweight Loop Scheduling for Improving Scalability of Bulk-synchronous MPI Applications.
Proceedings of the 2012 SC Companion: High Performance Computing, 2012

Optimization principles for collective neighborhood communications.
Proceedings of the SC Conference on High Performance Computing Networking, 2012

Micro-applications for Communication Data Access Patterns and MPI Datatypes.
Proceedings of the Recent Advances in the Message Passing Interface, 2012

Exact Dependence Analysis for Increased Communication Overlap.
Proceedings of the Recent Advances in the Message Passing Interface, 2012

Leveraging MPI's One-Sided Communication Interface for Shared-Memory Programming.
Proceedings of the Recent Advances in the Message Passing Interface, 2012

Automatic datatype generation and optimization.
Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2012

Communication-centric optimizations by dynamically detecting collective operations.
Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2012

Assessing HPC Failure Detectors for MPI Jobs.
Proceedings of the 20th Euromicro International Conference on Parallel, 2012

On the Effects of CPU Caches on MPI Point-to-Point Communications.
Proceedings of the 2012 IEEE International Conference on Cluster Computing, 2012

Productive Parallel Linear Algebra Programming with Unstructured Topology Adaption.
Proceedings of the 12th IEEE/ACM International Symposium on Cluster, 2012

Performance Modeling and Comparative Analysis of the MILC Lattice QCD Application su3_rmd.
Proceedings of the 12th IEEE/ACM International Symposium on Cluster, 2012

Runtime detection and optimization of collective communication patterns.
Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, 2012

2011
MPI on Millions of Cores.
Parallel Process. Lett., 2011

The scalable process topology interface of MPI 2.2.
Concurr. Comput. Pract. Exp., 2011

Methods of creating student cluster competition teams.
Proceedings of the 2011 TeraGrid Conference - Extreme Digital Discovery, 2011

Performance modeling for systematic performance tuning.
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, 2011

Design and Evaluation of Nonblocking Collective I/O Operations.
Proceedings of the Recent Advances in the Message Passing Interface, 2011

Writing Parallel Libraries with MPI - Common Practice, Issues, and Extensions.
Proceedings of the Recent Advances in the Message Passing Interface, 2011

Performance Expectations and Guidelines for MPI Derived Datatypes.
Proceedings of the Recent Advances in the Message Passing Interface, 2011

Active pebbles: a programming model for highly parallel fine-grained data-driven computations.
Proceedings of the 16th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2011

Kanor - A Declarative Language for Explicit Communication.
Proceedings of the Practical Aspects of Declarative Languages, 2011

HIPS Introduction.
Proceedings of the 25th IEEE International Symposium on Parallel and Distributed Processing, 2011

Deadlock-Free Oblivious Routing for Arbitrary Topologies.
Proceedings of the 25th IEEE International Symposium on Parallel and Distributed Processing, 2011

Active pebbles: parallel programming for data-driven applications.
Proceedings of the 25th International Conference on Supercomputing, 2011, Tucson, AZ, USA, May 31, 2011

Generic topology mapping strategies for large-scale parallel architectures.
Proceedings of the 25th International Conference on Supercomputing, 2011, Tucson, AZ, USA, May 31, 2011

Kernel-Based Offload of Collective Operations - Implementation, Evaluation and Lessons Learned.
Proceedings of the Euro-Par 2011 Parallel Processing - 17th International Conference, 2011

2010
Accurately measuring overhead, communication time and progression of blocking and nonblocking collective operations at massive scale.
Int. J. Parallel Emergent Distributed Syst., 2010

Software and Hardware Techniques for Power-Efficient HPC Networking.
Comput. Sci. Eng., 2010

Characterizing the Influence of System Noise on Large-Scale Applications by Simulation.
Proceedings of the Conference on High Performance Computing Networking, 2010

Toward Performance Models of MPI Implementations for Understanding Application Scaling Issues.
Proceedings of the Recent Advances in the Message Passing Interface, 2010

Parallel Zero-Copy Algorithms for Fast Fourier Transform and Conjugate Gradient Using MPI Datatypes.
Proceedings of the Recent Advances in the Message Passing Interface, 2010

Efficient MPI Support for Advanced Hybrid Programming Models.
Proceedings of the Recent Advances in the Message Passing Interface, 2010

Scalable communication protocols for dynamic sparse data exchange.
Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2010

LogGOPSim: simulating large-scale applications in the LogGOPS model.
Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, 2010

The PERCS High-Performance Interconnect.
Proceedings of the IEEE 18th Annual Symposium on High Performance Interconnects, 2010

A space-efficient parallel algorithm for computing betweenness centrality in distributed memory.
Proceedings of the 2010 International Conference on High Performance Computing, 2010

Bridging Performance Analysis Tools and Analytic Performance Modeling for HPC.
Proceedings of the Euro-Par 2010 Parallel Processing Workshops, 2010

AM++: a generalized active message framework.
Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques, 2010

2009
LogGP in theory and practice - An in-depth analysis of modern interconnection networks and benchmarking methods for collective operations.
Simul. Model. Pract. Theory, 2009

The Effect of Network Noise on Large-Scale Collective Communications.
Parallel Process. Lett., 2009

Towards Efficient MapReduce Using MPI.
Proceedings of the Recent Advances in Parallel Virtual Machine and Message Passing Interface, 2009

Implementation and analysis of nonblocking collective operations on SCI networks.
Proceedings of the 23rd IEEE International Symposium on Parallel and Distributed Processing, 2009

Sparse collective operations for MPI.
Proceedings of the 23rd IEEE International Symposium on Parallel and Distributed Processing, 2009

The impact of network noise at large-scale communication performance.
Proceedings of the 23rd IEEE International Symposium on Parallel and Distributed Processing, 2009

A power-aware, application-based performance study of modern commodity cluster interconnection networks.
Proceedings of the 23rd IEEE International Symposium on Parallel and Distributed Processing, 2009

Group Operation Assembly Language - A Flexible Way to Express Collective Communication.
Proceedings of the ICPP 2009, 2009

Optimized Routing for Large-Scale InfiniBand Networks.
Proceedings of the 17th IEEE Symposium on High Performance Interconnects, 2009

Demand-driven execution of static directed acyclic graphs using task parallelism.
Proceedings of the 16th International Conference on High Performance Computing, 2009

2008
Leveraging non-blocking collective communication in high-performance applications.
Proceedings of the 20th Annual ACM Symposium on Parallelism in Algorithms and Architectures (SPAA 2008), 2008

Communication Optimization for Medical Image Reconstruction Algorithms.
Proceedings of the Recent Advances in Parallel Virtual Machine and Message Passing Interface, 2008

Sparse Non-blocking Collectives in Quantum Mechanical Calculations.
Proceedings of the Recent Advances in Parallel Virtual Machine and Message Passing Interface, 2008

Accurately measuring collective operations at massive scale.
Proceedings of the 22nd IEEE International Symposium on Parallel and Distributed Processing, 2008

Optimizing non-blocking collective operations for infiniband.
Proceedings of the 22nd IEEE International Symposium on Parallel and Distributed Processing, 2008

Adaptive Routing Strategies for Modern High Performance Networks.
Proceedings of the 16th Annual IEEE Symposium on High Performance Interconnects (HOTI 2008), 2008

Multistage switches are not crossbars: Effects of static routing in high-performance networks.
Proceedings of the 2008 IEEE International Conference on Cluster Computing, 29 September, 2008

Message progression in parallel computing - to thread or not to thread?
Proceedings of the 2008 IEEE International Conference on Cluster Computing, 29 September, 2008

Overlapping Communication and Computation with High Level Communication Routines.
Proceedings of the 8th IEEE International Symposium on Cluster Computing and the Grid (CCGrid 2008), 2008

An Optimized ZGEMM Implementation for the Cell BE.
Proceedings of the 9th Workshop on Parallel Systems and Algorithms (PASA) held at the 21st Conference on the Architecture of Computing Systems (ARCS), 2008

2007
Optimizing a conjugate gradient solver with non-blocking collective operations.
Parallel Comput., 2007

Implementation and performance analysis of non-blocking collective operations for MPI.
Proceedings of the ACM/IEEE Conference on High Performance Networking and Computing, 2007

A Case for Standard Non-blocking Collective Operations.
Proceedings of the Recent Advances in Parallel Virtual Machine and Message Passing Interface, 14th European PVM/MPI User's Group Meeting, Paris, France, September 30, 2007

A practically constant-time MPI Broadcast Algorithm for large-scale InfiniBand Clusters with Multicast.
Proceedings of the 21st International Parallel and Distributed Processing Symposium (IPDPS 2007), 2007

Low-Overhead LogGP Parameter Assessment for Modern Interconnection Networks.
Proceedings of the 21st International Parallel and Distributed Processing Symposium (IPDPS 2007), 2007

Netgauge: A Network Performance Measurement Framework.
Proceedings of the High Performance Computing and Communications, 2007

2006
IRS - A Portable Interface for Reconfigurable Systems.
Proceedings of the Fifth International Conference on Parallel Computing in Electrical Engineering (PARELEC 2006), 2006

Assessing Single-Message and Multi-Node Communication Performance of InfiniBand.
Proceedings of the Fifth International Conference on Parallel Computing in Electrical Engineering (PARELEC 2006), 2006

A Case for Non-blocking Collective Operations.
Proceedings of the Frontiers of High Performance Computing and Networking, 2006

LogfP - a model for small messages in InfiniBand.
Proceedings of the 20th International Parallel and Distributed Processing Symposium (IPDPS 2006), 2006

Fast barrier synchronization for InfiniBand™.
Proceedings of the 20th International Parallel and Distributed Processing Symposium (IPDPS 2006), 2006

Analysis of the Memory Registration Process in the Mellanox InfiniBand Software Stack.
Proceedings of the Euro-Par 2006, Parallel Processing, 12th International Euro-Par Conference, Dresden, Germany, August 28, 2006

Adding Low-Cost Hardware Barrier Support to Small Commodity Clusters.
Proceedings of the ARCS 2006, 2006

2005
A Practical Approach to the Rating of Barrier Algorithms Using the LogP Model and Open MPI.
Proceedings of the 34th International Conference on Parallel Processing Workshops (ICPP 2005 Workshops), 2005

