Fazl Barez

According to our database, Fazl Barez authored at least 32 papers between 2021 and 2024.

Bibliography

2024
Best-of-N Jailbreaking.
CoRR, 2024

Jailbreak Defense in a Narrow Domain: Limitations of Existing Methods and a New Transcript-Classifier Approach.
CoRR, 2024

Enhancing Neural Network Interpretability with Feature-Aligned Sparse Autoencoders.
CoRR, 2024

PoisonBench: Assessing Large Language Model Vulnerability to Data Poisoning.
CoRR, 2024

Towards Interpreting Visual Information Processing in Vision-Language Models.
CoRR, 2024

Sparse Autoencoders Reveal Universal Feature Spaces Across Large Language Models.
CoRR, 2024

Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models.
CoRR, 2024

Risks and Opportunities of Open-Source Generative AI.
CoRR, 2024

Visualizing Neural Network Imagination.
CoRR, 2024

Near to Mid-term Risks and Opportunities of Open Source Generative AI.
CoRR, 2024

Increasing Trust in Language Models through the Reuse of Verified Circuits.
CoRR, 2024

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training.
CoRR, 2024

Value-Evolutionary-Based Reinforcement Learning.
Proceedings of the Forty-first International Conference on Machine Learning, 2024

Understanding Addition in Transformers.
Proceedings of the Twelfth International Conference on Learning Representations, 2024

Interpreting Context Look-ups in Transformers: Investigating Attention-MLP Interactions.
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024

Towards Interpretable Sequence Continuation: Analyzing Shared Circuits in Large Language Models.
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024

Large Language Models Relearn Removed Concepts.
Findings of the Association for Computational Linguistics, 2024

2023
Measuring Value Alignment.
CoRR, 2023

Locating Cross-Task Sequence Continuation Circuits in Transformers.
CoRR, 2023

Interpreting Reward Models in RLHF-Tuned Language Models Using Sparse Autoencoders.
CoRR, 2023

AI Systems of Concern.
CoRR, 2023

DeepDecipher: Accessing and Investigating Neuron Activation in Large Language Models.
CoRR, 2023

Neuron to Graph: Interpreting Language Model Neurons at Scale.
CoRR, 2023

N2G: A Scalable Approach for Quantifying Interpretable Neuron Representations in Large Language Models.
CoRR, 2023

System III: Learning with Domain Knowledge for Safety Constraints.
CoRR, 2023

Fairness in AI and Its Long-Term Implications on Society.
CoRR, 2023

Exploring the Advantages of Transformers for High-Frequency Trading.
CoRR, 2023

Benchmarking Specialized Databases for High-frequency Data.
CoRR, 2023

Detecting Edit Failures In Large Language Models: An Improved Specificity Benchmark.
Findings of the Association for Computational Linguistics: ACL 2023, 2023

The Larger they are, the Harder they Fail: Language Models do not Recognize Identifier Swaps in Python.
Findings of the Association for Computational Linguistics: ACL 2023, 2023

2021
ED2: An Environment Dynamics Decomposition Framework for World Model Construction.
CoRR, 2021

