Why is AI Safety Challenging?

Safety for weak capability?

We are postulating that in near future AI capabilities will grow so much, that it will be superior to human intelligence. In such a reality, enforcing some mechanism that allows AI to still work for the progress of humanity will be necessary. I have been reading the book “Superintelligence” and an opinion that exists among researchers and thinkers, is that we have once chance to build that level of AI that surpasses human intelligence and when we get there, if we don’t do it in a safe way, that could possibly mean in worst case end of humanity.

So why is safety so challenging? AI is not new, the field has been there since early 19th century, the recent hardware enhancement has made it possible to train some really large models progressing from simple linear regressions. In that process, we have built black box with billions of parameters that works for tasks like classification, prediction of next tokens and more.

We still at large don’t understand what the underlying features the model has learnt. If we don’t know a system, if we don’t understand it’s capabilities, how can we make sure it is safe?

If we can break down LLM === large set of rules or features learnt from data, that is any ML model can be decomposed into huge set of some mathematical relations, that can represent features needed to perform the task.

The problem is while in theory we understand or have some intuition that this is what is happening when we train a model with gradient descent, it is very difficult to decompose the LLM model into those features and to point out which features are stored in which part of the model. This is also challenging because of an idea of supercomposition, that is there are a lot more features than there are places to hold those features in the model and hence the model packs multiple features into single place, i.e neurons in cases of Neural networks.

Presence of supercomposition means, we cannot just map out which neurons or places activates in a network or stimulates for an input consisting particular feature. We can have the same neuron activating for multiple feature and we can have multiple neurons activating for single feature.

Safety for weak capability?#

Safety for weak capability?