Yeah, Anthropic has a whole interpretability research team for this:
https://www.anthropic.com/research/team/interpretability
https://www.anthropic.com/research/tracing-thoughts-language-model
And you’re exactly right. Models at this point are something like a trillion floats run through vectorized matrix math, and we don’t really know how all of that produces the output we see.


Breeding farms. That’s what the Dark Enlightenment movement wants.