Hey I've got a great analogy! Imagine ARM processors are multimedia devices that have just an MP3 decoder, and x86 processors are devices that have every multimedia decoder imaginable built in natively. You'd be fine on both as long as you're playing MP3 files, and maybe the ARM processor would even play them a bit faster, because all its optimization goes into maximizing performance for that one format. But then you suddenly get a bunch of FLAC files you'd like to play. The x86 processor plays them like it's nothing. The ARM processor is stuck: it has to spend a lot of time converting each of those FLAC files to MP3 first, and then play them as MP3 files, trying to mimic the lossless FLAC format as best it can without really being capable of it.
That's roughly how those processor technologies differ. In that world it'd be in Apple's interest to compare their processors to others exclusively on MP3 playback performance, and since, let's say, 80% of users buy their products just to listen to MP3s, that would work well enough. Run into any other workload and you end up maybe 10 times slower, get worse results, or can't handle it at all. Sometimes it's easy, akin to converting AAC to MP3, which just takes more time. And sometimes it's ridiculous gymnastics, like converting a video to MP3 plus a synced slideshow to create the illusion of video support that, in that imaginary world, the device completely lacks.
Following that analogy, you could add extra transistors to your ARM chips to natively support FLAC or video playback, but there are hundreds or thousands of other formats that x86 chips support natively, and you'd have to do the same for each of them. Then you realize that by the time you're done you'll end up with a much more bloated processor than Intel's or AMD's, because those two have had decades to optimize each of those decoders, squeezing every last bit of efficiency out of them over the years. Which is why everyone's just letting ARM be ARM: drawing the line at the bare necessities and accepting that it's going to be a much narrower architecture meant to do a few things well, staying small, light, and as efficient as possible at the few things it can do natively.
I think this illustrates the issue well, except replace MP3/FLAC with more complex stuff, such as "decoders" built to natively process specific complex algorithms (encryption, vector math, etc.). There is a huge array of complex instructions, each of which x86 can issue as a single operation (which, to be fair, doesn't always mean a single clock cycle). An ARM chip without a matching instruction may be able to get you the same result, but instead of one operation it has to break the work down into many simple steps, each taking its own clock cycles, so the whole thing takes a lot longer to complete.
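To make the "one operation versus many simple steps" point concrete, here's a toy sketch in Python. It is not a model of any real x86 or ARM chip: both "machines" and their function names (`add_vectors_wide`, `add_vectors_scalar`) are made up for illustration, and it only counts instructions issued, which is not the same thing as clock cycles.

```python
# Toy model only: counts "instructions issued", not real clock cycles.
# Both machines are hypothetical and do not represent actual x86 or ARM hardware.

def add_vectors_wide(a, b):
    """Hypothetical machine with a 4-lane vector add: one instruction per 4 elements."""
    assert len(a) == len(b)
    result, instructions = [], 0
    for i in range(0, len(a), 4):
        result.extend(x + y for x, y in zip(a[i:i + 4], b[i:i + 4]))
        instructions += 1  # one wide instruction covers up to 4 lanes
    return result, instructions


def add_vectors_scalar(a, b):
    """Hypothetical machine with only scalar adds: one instruction per element."""
    assert len(a) == len(b)
    result, instructions = [], 0
    for x, y in zip(a, b):
        result.append(x + y)
        instructions += 1  # one simple instruction per element
    return result, instructions


a = [1, 2, 3, 4, 5, 6, 7, 8]
b = [10, 20, 30, 40, 50, 60, 70, 80]
wide_result, wide_count = add_vectors_wide(a, b)
scalar_result, scalar_count = add_vectors_scalar(a, b)
print(wide_result == scalar_result, wide_count, scalar_count)  # True 2 8
```

Both versions produce the same answer; the only difference is how many instructions it took to get there, which is the gap the analogy is pointing at.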