Xiaobing Feng

ORCID: 0000-0003-2909-7750

Affiliations:
  • Chinese Academy of Sciences, Institute of Computing Technology, State Key Lab of Computer Architecture, Beijing, China
  • University of Chinese Academy of Sciences, Beijing, China


According to our database, Xiaobing Feng authored at least 99 papers between 2004 and 2024.

Bibliography

2024
Fast Convolution Meets Low Precision: Exploring Efficient Quantized Winograd Convolution on Modern CPUs.
ACM Trans. Archit. Code Optim., March, 2024

A Tale of Two Paths: Toward a Hybrid Data Plane for Efficient Far-Memory Applications.
Proceedings of the 18th USENIX Symposium on Operating Systems Design and Implementation, 2024

Optimizing Dynamic-Shape Neural Networks on Accelerators via On-the-Fly Micro-Kernel Polymerization.
Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2024

Optimizing Deep Learning Inference via Global Analysis and Tensor Expressions.
Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2024

2023
Automatic Target Description File Generation.
J. Comput. Sci. Technol., December, 2023

Optimizing Deep Learning Inference via Global Analysis and Tensor Expressions.
Dataset, October, 2023

VTensor: Using Virtual Tensors to Build a Layout-Oblivious AI Programming Framework.
J. Comput. Sci. Technol., September, 2023

Facilitating hardware-aware neural architecture search with learning-based predictive models.
J. Syst. Archit., April, 2023

Portable and Scalable All-Electron Quantum Perturbation Simulations on Exascale Supercomputers.
Proceedings of the International Conference for High Performance Computing, 2023

Honeycomb: Secure and Efficient GPU Executions via Static Validation.
Proceedings of the 17th USENIX Symposium on Operating Systems Design and Implementation, 2023

SIRIUS: Harvesting Whole-Program Optimization Opportunities for DNNs.
Proceedings of the Sixth Conference on Machine Learning and Systems, 2023

OPTango: Multi-central Representation Learning against Innumerable Compiler Optimization for Binary Diffing.
Proceedings of the 34th IEEE International Symposium on Software Reliability Engineering, 2023

Occamy: Elastically Sharing a SIMD Co-processor across Multiple CPU Cores.
Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2023

2022
CloudRaid: Detecting Distributed Concurrency Bugs via Log Mining and Enhancement.
IEEE Trans. Software Eng., 2022

Scaling Poisson Solvers on Many Cores via MMEwald.
IEEE Trans. Parallel Distributed Syst., 2022

An Application-oblivious Memory Scheduling System for DNN Accelerators.
ACM Trans. Archit. Code Optim., 2022

Optimizing deep neural networks on intelligent edge accelerators via flexible-rate filter pruning.
J. Syst. Archit., 2022

2021
Unified Holistic Memory Management Supporting Multiple Big Data Processing Frameworks over Hybrid Memories.
ACM Trans. Comput. Syst., 2021

Compiler-assisted Operator Template Library for DNN Accelerators.
Int. J. Parallel Program., 2021

Pinpointing the Memory Behaviors of DNN Training.
Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, 2021

Understanding the Runtime Overheads of Deep Learning Inference on Edge Devices.
Proceedings of the 2021 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom), New York City, NY, USA, September 30, 2021

LoWino: Towards Efficient Low-Precision Winograd Convolutions on Modern CPUs.
Proceedings of the ICPP 2021: 50th International Conference on Parallel Processing, Lemont, IL, USA, August 9, 2021

Unleashing the Low-Precision Computation Potential of Tensor Cores on GPUs.
Proceedings of the IEEE/ACM International Symposium on Code Generation and Optimization, 2021

2020
ParaML: A Polyvalent Multicore Accelerator for Machine Learning.
IEEE Trans. Comput. Aided Des. Integr. Circuits Syst., 2020

Fusion-Catalyzed Pruning for Optimizing Deep Learning on Intelligent Edge Devices.
IEEE Trans. Comput. Aided Des. Integr. Circuits Syst., 2020

DNNTune: Automatic Benchmarking DNN Models for Mobile-cloud Computing.
ACM Trans. Archit. Code Optim., 2020

Referee: A Pattern-Guided Approach for Auto Design in Compiler-Based Analyzers.
Proceedings of the 27th IEEE International Conference on Software Analysis, 2020

Compiler-Assisted Operator Template Library for DNN Accelerators.
Proceedings of the Network and Parallel Computing, 2020

Characterizing the I/O Pipeline in the Deployment of CNNs on Commercial Accelerators.
Proceedings of the IEEE International Conference on Parallel & Distributed Processing with Applications, 2020

Lance: efficient low-precision quantized winograd convolution for neural networks based on graphics processing units.
Proceedings of the 2020 IEEE International Conference on Acoustics, 2020

Accelerating Deep Learning Inference with Cross-Layer Data Reuse on GPUs.
Proceedings of the Euro-Par 2020: Parallel Processing, 2020

VTensor: Using Virtual Tensors to Build a Layout-oblivious AI Programming Framework.
Proceedings of the PACT '20: International Conference on Parallel Architectures and Compilation Techniques, 2020

Bandwidth-Aware Loop Tiling for DMA-Supported Scratchpad Memory.
Proceedings of the PACT '20: International Conference on Parallel Architectures and Compilation Techniques, 2020

2019
Cacheap: Portable and Collaborative I/O Optimization for Graph Processing.
J. Comput. Sci. Technol., 2019

ElasticActor: An Actor System with Automatic Granularity Adjustment.
Int. J. Parallel Program., 2019

Understanding Node Change Bugs for Distributed Systems.
Proceedings of the 26th IEEE International Conference on Software Analysis, 2019

CrashTuner: detecting crash-recovery bugs in cloud systems via meta-info analysis.
Proceedings of the 27th ACM Symposium on Operating Systems Principles, 2019

Exploiting the input sparsity to accelerate deep neural networks: poster.
Proceedings of the 24th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2019

Panthera: holistic memory management for big data processing over hybrid memories.
Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation, 2019

Accelerating GPU Computing at Runtime with Binary Optimization.
Proceedings of the IEEE/ACM International Symposium on Code Generation and Optimization, 2019

PPOpenCL: a performance-portable OpenCL compiler with host and kernel thread code fusion.
Proceedings of the 28th International Conference on Compiler Construction, 2019

XDN: Towards Efficient Inference of Residual Neural Networks on Cambricon Chips.
Proceedings of the Benchmarking, Measuring, and Optimizing, 2019

Acorns: A Framework for Accelerating Deep Neural Networks with Input Sparsity.
Proceedings of the 28th International Conference on Parallel Architectures and Compilation Techniques, 2019

2018
Using Local Clocks to Reproduce Concurrency Bugs.
IEEE Trans. Software Eng., 2018

NVM Streaker: a fast and reconfigurable performance simulator for non-volatile memory-based memory architecture.
J. Supercomput., 2018

RARE: An Efficient Static Fault Detection Framework for Definition-Use Faults in Large Programs.
IEEE Access, 2018

CloudRaid: hunting concurrency bugs in the cloud via log-mining.
Proceedings of the 2018 ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2018

Lazygraph: lazy data coherency for replicas in distributed graph-parallel computation.
Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2018

On Retargeting the AI Programming Framework to New Hardwares.
Proceedings of the Network and Parallel Computing, 2018

Background Subtraction on Depth Videos with Convolutional Neural Networks.
Proceedings of the 2018 International Joint Conference on Neural Networks, 2018

Characterizing DNN Models for Edge-Cloud Computing.
Proceedings of the 2018 IEEE International Symposium on Workload Characterization, 2018

Revisiting Loop Tiling for Datacenters: Live and Let Live.
Proceedings of the 32nd International Conference on Supercomputing, 2018

Auto-tuning Neural Network Quantization Framework for Collaborative Inference Between the Cloud and Edge.
Proceedings of the Artificial Neural Networks and Machine Learning - ICANN 2018, 2018

Fast CNN Pruning via Redundancy-Aware Training.
Proceedings of the Artificial Neural Networks and Machine Learning - ICANN 2018, 2018

May-happen-in-parallel analysis with static vector clocks.
Proceedings of the 2018 International Symposium on Code Generation and Optimization, 2018

2017
Locating Software Faults Based on Minimum Debugging Frontier Set.
IEEE Trans. Software Eng., 2017

An Accelerator for High Efficient Vision Processing.
IEEE Trans. Comput. Aided Des. Integr. Circuits Syst., 2017

Parallel Incremental Frequent Itemset Mining for Large Data.
J. Comput. Sci. Technol., 2017

Two-Level Task Scheduling for Irregular Applications on GPU Platform.
Int. J. Parallel Program., 2017

2016
Predicting Cross-Core Performance Interference on Multicore Processors with Regression Analysis.
IEEE Trans. Parallel Distributed Syst., 2016

Pragma Directed Shared Memory Centric Optimizations on GPUs.
J. Comput. Sci. Technol., 2016

Articulation points guided redundancy elimination for betweenness centrality.
Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2016

Efficient Management for Hybrid Memory in Managed Language Runtime.
Proceedings of the Network and Parallel Computing, 2016

2015
WiseThrottling: a new asynchronous task scheduler for mitigating I/O bottleneck in large-scale datacenter servers.
J. Supercomput., 2015

Practical Iterative Optimization for the Data Center.
ACM Trans. Archit. Code Optim., 2015

ShiDianNao: shifting vision processing closer to the sensor.
Proceedings of the 42nd Annual International Symposium on Computer Architecture, 2015

ReCBuLC: Reproducing Concurrency Bugs Using Local Clocks.
Proceedings of the 37th IEEE/ACM International Conference on Software Engineering, 2015

Hadoop+: Modeling and Evaluating the Heterogeneity for MapReduce Applications in Heterogeneous Clusters.
Proceedings of the 29th ACM on International Conference on Supercomputing, 2015

PuDianNao: A Polyvalent Machine Learning Accelerator.
Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems, 2015

2014
Dynamic I/O-Aware Scheduling for Batch-Mode Applications on Chip Multiprocessor Systems of Cluster Platforms.
J. Comput. Sci. Technol., 2014

Group Orbit Optimization: A Unified Approach to Data Normalization.
CoRR, 2014

Concurrency bug localization using shared memory access pairs.
Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2014

Localization of concurrency bugs using shared memory access pairs.
Proceedings of the ACM/IEEE International Conference on Automated Software Engineering, 2014

A collaborative divide-and-conquer K-means clustering algorithm for processing large data.
Proceedings of the Computing Frontiers Conference, CF'14, 2014

2013
Layout-oblivious compiler optimization for matrix computations.
ACM Trans. Archit. Code Optim., 2013

Effective fault localization based on minimum debugging frontier set.
Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization, 2013

An empirical model for predicting cross-core performance interference on multicore processors.
Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques, 2013

2012
Extendable pattern-oriented optimization directives.
ACM Trans. Archit. Code Optim., 2012

A Hybrid Circular Queue Method for Iterative Stencil Computations on GPUs.
J. Comput. Sci. Technol., 2012

Can We Make It Faster? Efficient May-Happen-in-Parallel Analysis Revisited.
Proceedings of the 13th International Conference on Parallel and Distributed Computing, 2012

A Highly Parallel Reuse Distance Analysis Algorithm on GPUs.
Proceedings of the 26th IEEE International Parallel and Distributed Processing Symposium, 2012

Layout-oblivious optimization for matrix computations.
Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, 2012

Making it practical and effective: fast and precise may-happen-in-parallel analysis.
Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, 2012

2011
Dependence-based multi-level tracing and replay for wireless sensor networks debugging.
Proceedings of the ACM SIGPLAN/SIGBED 2011 conference on Languages, 2011

Automatic Library Generation for BLAS3 on GPUs.
Proceedings of the 25th IEEE International Symposium on Parallel and Distributed Processing, 2011

Parallelizing a machine translation decoder for multicore computer.
Proceedings of the Seventh International Conference on Natural Computation, 2011

2010
Landing Stencil Code on Godson-T.
J. Comput. Sci. Technol., 2010

Continuous speculative program parallelization in software.
Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2010

Software-Hardware Cooperative DRAM Bank Partitioning for Chip Multiprocessors.
Proceedings of the Network and Parallel Computing, IFIP International Conference, 2010

Level by level: making flow- and context-sensitive pointer analysis scalable for millions of lines of code.
Proceedings of the CGO 2010, 2010

An adaptive task creation strategy for work-stealing scheduling.
Proceedings of the CGO 2010, 2010

2009
PARBLO: Page-Allocation-Based DRAM Row Buffer Locality Optimization.
J. Comput. Sci. Technol., 2009

Detecting and Eliminating Potential Violations of Sequential Consistency for Concurrent C/C++ Programs.
Proceedings of the CGO 2009, 2009

2008
Exploiting idle register classes for fast spill destination.
Proceedings of the 22nd Annual International Conference on Supercomputing, 2008

Global Tiling for Communication Minimal Parallelization on Distributed Memory Systems.
Proceedings of the Euro-Par 2008, 2008

2006
Library Function Disposing Approach in Binary Translation.
J. Comput. Res. Dev., 2006

Global Partial Replicate Computation Partitioning.
J. Comput. Res. Dev., 2006

2005
Integrating Parallelizing Compilation Technologies for SMP Clusters.
J. Comput. Sci. Technol., 2005

2004
An Overview of the Open Research Compiler.
Proceedings of the Languages and Compilers for High Performance Computing, 2004

