Samuel Marks

According to our database, Samuel Marks authored at least 9 papers between 2023 and 2024.

Collaborative distances:
  • Dijkstra number of four.
  • Erdős number of four.

Bibliography

2024
Erasing Conceptual Knowledge from Language Models.
CoRR, 2024

The Quest for the Right Mediator: A History, Survey, and Theoretical Grounding of Causal Interpretability.
CoRR, 2024

Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models.
CoRR, 2024

NNsight and NDIF: Democratizing Access to Foundation Model Internals.
CoRR, 2024

Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from Disparate Training Data.
CoRR, 2024

Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models.
CoRR, 2024

Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models.
CoRR, 2024

2023
Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback.
Trans. Mach. Learn. Res., 2023

The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets.
CoRR, 2023

