Fazl Barez

According to our database, Fazl Barez authored at least 32 papers between 2021 and 2024.

Bibliography

2024
Best-of-N Jailbreaking.
CoRR, 2024

Jailbreak Defense in a Narrow Domain: Limitations of Existing Methods and a New Transcript-Classifier Approach.
CoRR, 2024

Enhancing Neural Network Interpretability with Feature-Aligned Sparse Autoencoders.
CoRR, 2024

PoisonBench: Assessing Large Language Model Vulnerability to Data Poisoning.
CoRR, 2024

Towards Interpreting Visual Information Processing in Vision-Language Models.
CoRR, 2024

Sparse Autoencoders Reveal Universal Feature Spaces Across Large Language Models.
CoRR, 2024

Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models.
CoRR, 2024

Risks and Opportunities of Open-Source Generative AI.
CoRR, 2024

Visualizing Neural Network Imagination.
CoRR, 2024

Near to Mid-term Risks and Opportunities of Open Source Generative AI.
CoRR, 2024

Increasing Trust in Language Models through the Reuse of Verified Circuits.
CoRR, 2024

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training.
CoRR, 2024

Value-Evolutionary-Based Reinforcement Learning.
Proceedings of the Forty-first International Conference on Machine Learning, 2024

Understanding Addition in Transformers.
Proceedings of the Twelfth International Conference on Learning Representations, 2024

Interpreting Context Look-ups in Transformers: Investigating Attention-MLP Interactions.
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024

Towards Interpretable Sequence Continuation: Analyzing Shared Circuits in Large Language Models.
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024

Large Language Models Relearn Removed Concepts.
Findings of the Association for Computational Linguistics, 2024

2023
Measuring Value Alignment.
CoRR, 2023

Locating Cross-Task Sequence Continuation Circuits in Transformers.
CoRR, 2023

Interpreting Reward Models in RLHF-Tuned Language Models Using Sparse Autoencoders.
CoRR, 2023

AI Systems of Concern.
CoRR, 2023

DeepDecipher: Accessing and Investigating Neuron Activation in Large Language Models.
CoRR, 2023

Neuron to Graph: Interpreting Language Model Neurons at Scale.
CoRR, 2023

N2G: A Scalable Approach for Quantifying Interpretable Neuron Representations in Large Language Models.
CoRR, 2023

System III: Learning with Domain Knowledge for Safety Constraints.
CoRR, 2023

Fairness in AI and Its Long-Term Implications on Society.
CoRR, 2023

Exploring the Advantages of Transformers for High-Frequency Trading.
CoRR, 2023

Benchmarking Specialized Databases for High-frequency Data.
CoRR, 2023

Detecting Edit Failures In Large Language Models: An Improved Specificity Benchmark.
Findings of the Association for Computational Linguistics: ACL 2023, 2023

The Larger they are, the Harder they Fail: Language Models do not Recognize Identifier Swaps in Python.
Findings of the Association for Computational Linguistics: ACL 2023, 2023

2021
ED2: An Environment Dynamics Decomposition Framework for World Model Construction.
CoRR, 2021

