R. Govindarajan

Orcid: 0000-0003-2517-9994

Affiliations:
  • ERNET, India


According to our database, R. Govindarajan authored at least 139 papers between 1986 and 2024.

Bibliography

2024
SilvanForge: A Schedule-Guided Retargetable Compiler for Decision Tree Inference.
Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles, 2024

Tile Size and Loop Order Selection using Machine Learning for Multi-/Many-Core Architectures.
Proceedings of the 38th ACM International Conference on Supercomputing, 2024

2023
Reduce, Reuse, and Adapt: Accelerating Graph Processing on GPUs.
Proceedings of the 30th IEEE International Conference on High Performance Computing, 2023

2022
Treebeard: An Optimizing Compiler for Decision Tree Based ML Inference.
Proceedings of the 55th IEEE/ACM International Symposium on Microarchitecture, 2022

2020
On odd harmonious labelling of even cycles with parallel chords and dragons with parallel chords.
Int. J. Comput. Aided Eng. Technol., 2020

2017
HAShCache: Heterogeneity-Aware Shared DRAMCache for Integrated Heterogeneous Systems.
ACM Trans. Archit. Code Optim., 2017

RLWS: A Reinforcement Learning based GPU Warp Scheduler.
CoRR, 2017

Taming warp divergence.
Proceedings of the 2017 International Symposium on Code Generation and Optimization, 2017

2016
MicroRefresh: Minimizing Refresh Overhead in DRAM Caches.
Proceedings of the Second International Symposium on Memory Systems, 2016

2015
Author Rebuttal to Rocha et al. "Comments on Minimizing Buffer Requirements under Rate-Optimal Schedule in Regular Dataflow Networks".
J. Signal Process. Syst., 2015

A Comprehensive Analytical Performance Model of DRAM Caches.
Proceedings of the 6th ACM/SPEC International Conference on Performance Engineering, Austin, TX, USA, January 31, 2015

PRO: Progress Aware GPU Warp Scheduling Algorithm.
Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium, 2015

Approximating flow-sensitive pointer analysis using frequent itemset mining.
Proceedings of the 13th Annual IEEE/ACM International Symposium on Code Generation and Optimization, 2015

2014
Compiler/Runtime Framework for Dynamic Dataflow Parallelization of Tiled Programs.
ACM Trans. Archit. Code Optim., 2014

ANATOMY: an analytical model of memory system performance.
Proceedings of the ACM SIGMETRICS / International Conference on Measurement and Modeling of Computer Systems, 2014

Bi-Modal DRAM Cache: Improving Hit Rate, Hit Latency and Bandwidth.
Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture, 2014

Fluidic Kernels: Cooperative Execution of OpenCL Programs on Multiple Heterogeneous Devices.
Proceedings of the 12th Annual IEEE/ACM International Symposium on Code Generation and Optimization, 2014

Taming Control Divergence in GPUs through Control Flow Linearization.
Proceedings of the Compiler Construction - 23rd International Conference, 2014

Preemptive thread block scheduling with online structural runtime prediction for concurrent GPGPU kernels.
Proceedings of the International Conference on Parallel Architectures and Compilation, 2014

2013
Fast Likelihood Computation in Speech Recognition using Matrices.
J. Signal Process. Syst., 2013

Runtime dependence computation and execution of loops on heterogeneous systems.
Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization, 2013

Improving GPGPU concurrency with elastic kernels.
Proceedings of the Architectural Support for Programming Languages and Operating Systems, 2013

Parallel flow-sensitive pointer analysis by graph-rewriting.
Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques, 2013

2012
On-chip memory architecture exploration framework for DSP processor-based embedded system on chip.
ACM Trans. Embed. Comput. Syst., 2012

Probabilistic Shared Cache Management (PriSM).
Proceedings of the 39th International Symposium on Computer Architecture (ISCA 2012), 2012

Multiple sub-row buffers in DRAM: unlocking performance and energy improvement opportunities.
Proceedings of the International Conference on Supercomputing, 2012

CUDA-For-Clusters: A System for Efficient Execution of CUDA Kernels on Multi-core Clusters.
Proceedings of the Euro-Par 2012 Parallel Processing - 18th International Conference, 2012

Reconciling transactional conflicts with compiler's help.
Proceedings of the 10th Annual IEEE/ACM International Symposium on Code Generation and Optimization, 2012

Fast and efficient automatic memory management for GPUs using compiler-assisted runtime coherence scheme.
Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, 2012

2011
Performance Oriented Prefetching Enhancements Using Commit Stalls.
J. Instr. Level Parallelism, 2011

Fast computation of Gaussian likelihoods using low-rank matrix approximations.
Proceedings of the IEEE Workshop on Signal Processing Systems, 2011

Automatic compilation of MATLAB programs for synergistic execution on heterogeneous processors.
Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation, 2011

Variable Granularity Access Tracking Scheme for Improving the Performance of Software Transactional Memory.
Proceedings of the 25th IEEE International Symposium on Parallel and Distributed Processing, 2011

NUcache: An efficient multicore cache organization based on Next-Use distance.
Proceedings of the 17th International Conference on High-Performance Computer Architecture (HPCA-17 2011), 2011

Extended histories: improving regularity and performance in correlation prefetchers.
Proceedings of the High Performance Embedded Architectures and Compilers, 2011

Prioritizing constraint evaluation for efficient points-to analysis.
Proceedings of the CGO 2011, 2011

Making STMs Cache Friendly with Compiler Transformations.
Proceedings of the 2011 International Conference on Parallel Architectures and Compilation Techniques, 2011

Row-Buffer Reorganization: Simultaneously Improving Performance and Reducing Energy in DRAMs.
Proceedings of the 2011 International Conference on Parallel Architectures and Compilation Techniques, 2011

2010
Points-to Analysis as a System of Linear Equations.
Proceedings of the Static Analysis - 17th International Symposium, 2010

Handling Conflicts with Compiler's Help in Software Transactional Memory Systems.
Proceedings of the 39th International Conference on Parallel Processing, 2010

Analyzing cache performance bottlenecks of STM applications and addressing them with compiler's help.
Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques, 2010

NUcache: a multicore cache organization based on next-use distance.
Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques, 2010

2009
A Novel Cache Architecture and Placement Framework for Packet Forwarding Engines.
IEEE Trans. Computers, 2009

Synergistic execution of stream programs on multicores with accelerators.
Proceedings of the 2009 ACM SIGPLAN/SIGBED Conference on Languages, Compilers, and Tools for Embedded Systems, 2009

Reducing Buffer Requirements in Core Routers Using Dynamic Buffering.
Proceedings of the 18th International Conference on Computer Communications and Networks, 2009

Software Pipelined Execution of Stream Programs on GPUs.
Proceedings of the CGO 2009, 2009

Scalable Context-Sensitive Points-to Analysis Using Multi-dimensional Bloom Filters.
Proceedings of the Programming Languages and Systems, 7th Asian Symposium, 2009

Region Based Structure Layout Optimization by Selective Data Copying.
Proceedings of the PACT 2009, 2009

2008
Impact of message compression on the scalability of an atmospheric modeling application on clusters.
Parallel Comput., 2008

Memory Architecture Exploration Framework for Cache Based Embedded SOC.
Proceedings of the 21st International Conference on VLSI Design (VLSI Design 2008), 2008

A systematic approach to synthesis of verification test-suites for modular SoC designs.
Proceedings of the 21st Annual IEEE International SoC Conference, SoCC 2008, 2008

Online unsupervised pattern discovery in speech using parallelization.
Proceedings of the 9th Annual Conference of the International Speech Communication Association, 2008

Focused prefetching: performance oriented prefetching based on commit stalls.
Proceedings of the 22nd Annual International Conference on Supercomputing, 2008

Improving Performance of Digest Caches in Network Processors.
Proceedings of the High Performance Computing, 2008

Comprehensive path-sensitive data-flow analysis.
Proceedings of the Sixth International Symposium on Code Generation and Optimization (CGO 2008), 2008

2007
Single-dimension software pipelining for multidimensional loops.
ACM Trans. Archit. Code Optim., 2007

FEADS: A Framework for Exploring the Application Design Space on Network Processors.
Int. J. Parallel Program., 2007

MAX: A Multi Objective Memory Architecture eXploration Framework for Embedded Systems-on-Chip.
Proceedings of the 20th International Conference on VLSI Design (VLSI Design 2007), 2007

A Petri Net Model for Evaluating Packet Buffering Strategies in a Network Processor.
Proceedings of the Fourth International Conference on the Quantitative Evaluation of Systems (QEST 2007), 2007

Emulating Optimal Replacement with a Shepherd Cache.
Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-40 2007), 2007

Packet Reordering in Network Processors.
Proceedings of the 21st International Parallel and Distributed Processing Symposium (IPDPS 2007), 2007

Compiler-Directed Dynamic Voltage Scaling Using Program Phases.
Proceedings of the High Performance Computing, 2007

An Array Allocation Scheme for Energy Reduction in Partitioned Memory Architectures.
Proceedings of the Compiler Construction, 16th International Conference, 2007

Register Allocation and Optimal Spill Code Scheduling in Software Pipelined Loops Using 0-1 Integer Linear Programming Formulation.
Proceedings of the Compiler Construction, 16th International Conference, 2007

MODLEX: A Multi Objective Data Layout EXploration Framework for Embedded Systems-on-Chip.
Proceedings of the 12th Conference on Asia South Pacific Design Automation, 2007

A Scalable Low Power Store Queue for Large Instruction Window Processors.
Proceedings of the 16th International Conference on Parallel Architectures and Compilation Techniques (PACT 2007), 2007

Dynamic Cache Placement with Two-level Mapping to Reduce Conflict Misses.
Proceedings of the 16th International Conference on Parallel Architectures and Compilation Techniques (PACT 2007), 2007

Advances in Software Pipelining.
The Compiler Design Handbook: Optimizations and Machine Code Generation, 2007

Instruction Scheduling.
The Compiler Design Handbook: Optimizations and Machine Code Generation, 2007

2006
Area and Power Reduction of Embedded DSP Systems using Instruction Compression and Re-configurable Encoding.
J. VLSI Signal Process., 2006

Exploiting programmable network interfaces for parallel query execution in workstation clusters.
Proceedings of the 20th International Parallel and Distributed Processing Symposium (IPDPS 2006), 2006

A scalable low power issue queue for large instruction window processors.
Proceedings of the 20th Annual International Conference on Supercomputing, 2006

Two-level mapping based cache index selection for packet forwarding engines.
Proceedings of the 15th International Conference on Parallel Architectures and Compilation Techniques (PACT 2006), 2006

2005
Improving power efficiency with compiler-assisted cache replacement.
J. Embed. Comput., 2005

Performance Modeling and Architecture Exploration of Network Processors.
Proceedings of the Second International Conference on the Quantitative Evaluation of Systems (QEST 2005), 2005

A heterogeneously segmented cache architecture for a packet forwarding engine.
Proceedings of the 19th Annual International Conference on Supercomputing, 2005

Offloading Bloom Filter Operations to Network Processor for Parallel Query Processing in Cluster of Workstations.
Proceedings of the High Performance Computing, 2005

2004
Performance analysis of methods that overcome false sharing effects in software DSMs.
J. Parallel Distributed Comput., 2004

CAS-DSM: A Compiler Assisted Software Distributed Shared Memory.
Int. J. Parallel Program., 2004

Single-Dimension Software Pipelining for Multi-Dimensional Loops.
Proceedings of the 2nd IEEE / ACM International Symposium on Code Generation and Optimization (CGO 2004), 2004

Code Generation for Single-Dimension Software Pipelining of Multi-Dimensional Loops.
Proceedings of the 2nd IEEE / ACM International Symposium on Code Generation and Optimization (CGO 2004), 2004

2003
Minimum Register Instruction Sequencing to Reduce Register Spills in Out-of-Order Issue Superscalar Architectures.
IEEE Trans. Computers, 2003

Optimal Code and Data Layout in Embedded Systems.
Proceedings of the 16th International Conference on VLSI Design (VLSI Design 2003), 2003

Unified Instruction Reordering and Algebraic Transformations for Minimum Cost Offset Assignment.
Proceedings of the Software and Compilers for Embedded Systems, 7th International Workshop, 2003

Compiler-Assisted Cache Replacement: Problem Formulation and Performance Evaluation.
Proceedings of the Languages and Compilers for Parallel Computing, 2003

An Executable Analytical Performance Evaluation Approach for Early Performance Prediction.
Proceedings of the 17th International Parallel and Distributed Processing Symposium (IPDPS 2003), 2003

Programming Models and System Software for Future High-End Computing Systems: Work-in-Progress.
Proceedings of the 17th International Parallel and Distributed Processing Symposium (IPDPS 2003), 2003

Exploiting Java-ILP on a Simultaneous Multi-Trace Instruction Issue (SMTI) Processor.
Proceedings of the 17th International Parallel and Distributed Processing Symposium (IPDPS 2003), 2003

An Efficient Web Cache Replacement Policy.
Proceedings of the High Performance Computing - HiPC 2003, 10th International Conference, 2003

2002
Minimizing Buffer Requirements under Rate-Optimal Schedule in Regular Dataflow Networks.
J. VLSI Signal Process., 2002

A Theory for Co-Scheduling Hardware and Software Pipelines in ASIPs and Embedded Processors.
Des. Autom. Embed. Syst., 2002

Power-Performance Trade-Offs for Energy-Efficient Architectures: A Quantitative Study.
Proceedings of the 20th International Conference on Computer Design (ICCD 2002), 2002

Dynamic Path Profile Aided Recompilation in a JAVA Just-In-Time Compiler.
Proceedings of the High Performance Computing, 2002

Instruction Scheduling.
The Compiler Design Handbook: Optimizations and Machine Code Generation, 2002

2001
Guest Editors' Introduction: Special Issue on Cluster and Network-Based Computing.
J. Parallel Distributed Comput., 2001

Minimum Register Instruction Sequence Problem: Revisiting Optimal Code Generation for DAGs.
Proceedings of the 15th International Parallel & Distributed Processing Symposium (IPDPS-01), 2001

Hidden Costs in Avoiding False Sharing in Software DSMs.
Proceedings of the High Performance Computing - HiPC 2001, 8th International Conference, 2001

2000
A Vectorizing Compiler for Multimedia Extensions.
Int. J. Parallel Program., 2000

Enhanced Co-Scheduling: A Software Pipelining Method Using Modulo-Scheduled Pipeline Theory.
Int. J. Parallel Program., 2000

A Theory for Software-Hardware Co-Scheduling for ASIPs and Embedded Processors.
Proceedings of the 12th IEEE International Conference on Application-Specific Systems, Architectures, and Processors, 2000

1999
Minimum Register Instruction Scheduling: A New Approach for Dynamic Instruction Issue Processors.
Proceedings of the Languages and Compilers for Parallel Computing, 1999

Resource usage models for instruction scheduling: two new models and a classification.
Proceedings of the 13th international conference on Supercomputing, 1999

Resource Usage Modelling for Software Pipelining.
Proceedings of the High Performance Computing, 1999

Efficient State-Diagram Construction Methods for Software Pipelining.
Proceedings of the Compiler Construction, 8th International Conference, 1999

Evaluating Register Allocation and Instruction Scheduling Techniques in Out-Of-Order Issue Processors.
Proceedings of the 1999 International Conference on Parallel Architectures and Compilation Techniques, 1999

1998
A Unified Framework for Instruction Scheduling and Mapping for Function Units with Structural Hazards.
J. Parallel Distributed Comput., 1998

Performance bounds for distributed memory multithreaded architectures.
Proceedings of the IEEE International Conference on Systems, Man and Cybernetics, 1998

An Enhanced Co-Scheduling Method Using Reduced MS-State Diagrams.
Proceedings of the 12th International Parallel Processing Symposium / 9th Symposium on Parallel and Distributed Processing (IPPS/SPDP '98), March 30, 1998

Register-Sensitive Software Pipelining.
Proceedings of the 12th International Parallel Processing Symposium / 9th Symposium on Parallel and Distributed Processing (IPPS/SPDP '98), March 30, 1998

Modulo-variable expansion sensitive scheduling.
Proceedings of the 5th International Conference On High Performance Computing, 1998

1997
Timed Petri net models of multithreaded multiprocessor architectures.
Proceedings of the Seventh International Workshop on Petri Nets and Performance Models, 1997

Distributed Shared Memory on IBM SP2.
Proceedings of the 1997 International Conference on Parallel and Distributed Systems (ICPADS '97), 1997

Classification and performance evaluation of simultaneous multithreaded architectures.
Proceedings of the Fourth International Conference on High-Performance Computing, 1997

A Register Pressure Sensitive Instruction Scheduler for Dynamic Issue Processors.
Proceedings of the 1997 Conference on Parallel Architectures and Compilation Techniques (PACT '97), 1997

1996
A Framework for Resource-Constrained Rate-Optimal Software Pipelining.
IEEE Trans. Parallel Distributed Syst., 1996

Co-Scheduling Hardware and Software Pipelines.
Proceedings of the Second International Symposium on High-Performance Computer Architecture, 1996

Buffer allocation in regular dataflow networks: an approach based on coloring circular-arc graphs.
Proceedings of the 3rd International Conference on High Performance Computing, 1996

1995
Rate-optimal schedule for multi-rate DSP computations.
J. VLSI Signal Process., 1995

Scheduling and Mapping: Software Pipelining in the Presence of Structural Hazards.
Proceedings of the ACM SIGPLAN'95 Conference on Programming Language Design and Implementation (PLDI), 1995

An Experimental Study of an ILP-based Exact Solution Method for Software Pipelining.
Proceedings of the Languages and Compilers for Parallel Computing, 1995

Design and Performance Evaluation of a Multithreaded Architecture.
Proceedings of the 1st IEEE Symposium on High-Performance Computer Architecture (HPCA 1995), 1995

1994
Performance of Interconnection Network in Multithreaded Architectures.
Proceedings of the PARLE '94: Parallel Architectures and Languages Europe, 1994

Minimizing register requirements under resource-constrained rate-optimal software pipelining.
Proceedings of the 27th Annual International Symposium on Microarchitecture, San Jose, California, USA, November 30, 1994

Minimizing memory requirements in rate-optimal schedules.
Proceedings of the International Conference on Application Specific Array Processors, 1994

1993
Exception Handlers in Functional Programming Languages.
IEEE Trans. Software Eng., 1993

Analysis of Multithreaded Multiprocessors with Distributed Shared Memory.
Proceedings of the Fifth IEEE Symposium on Parallel and Distributed Processing, 1993

A novel framework for multi-rate scheduling in DSP applications.
Proceedings of the International Conference on Application-Specific Array Processors, 1993

1992
Attempting guards in parallel: A data flow approach to execute generalized guarded commands.
Int. J. Parallel Program., 1992

SMALL: A Scalable Multithreaded Architecture to Exploit Large Locality.
Proceedings of the Fourth IEEE Symposium on Parallel and Distributed Processing, 1992

Exploiting instruction-level parallelism: the multithreaded approach.
Proceedings of the 25th Annual International Symposium on Microarchitecture, 1992

Performance Evaluation of Latency Tolerant Architectures.
Proceedings of the Computing and Information, 1992

Well-behaved dataflow programs for DSP computation.
Proceedings of the 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing, 1992

A Large Context Multithreaded Architecture.
Proceedings of the Parallel Processing: CONPAR 92, 1992

Software fault-tolerance in functional programming.
Proceedings of the Sixteenth Annual International Computer Software and Applications Conference, 1992

1991
Data Flow Implementation of Generalized Guarded Commands.
Proceedings of the PARLE '91: Parallel Architectures and Languages Europe, 1991

ParC project: practical constructs for parallel programming languages.
Proceedings of the Fifteenth Annual International Computer Software and Applications Conference, 1991

1990
Lenient Execution and Concurrent Execution of Re-Entrant Routines: Efficient Implementation in Data Flow Systems.
Comput. J., 1990

1989
PROMIDS: A PROtotype multi-rIng data flow system for functional programming languages.
Microprocessing and Microprogramming, 1989

1986
Design and Performance Evaluation of EXMAN: An EXtended MANchester Data Flow Computer.
IEEE Trans. Computers, 1986
