Work in progress. Going to keep some links here for now.
Paid Medium article where the author covers the steps, but I want to do this without Hugging Face: https://towardsdatascience.com/how-to-train-a-bert-model-from-scratch-72cfce554fc6
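If I do skip Hugging Face, the core data-prep step is BERT's masked-language-model corruption scheme: select 15% of tokens, and of those replace 80% with [MASK], 10% with a random token, and leave 10% unchanged. A minimal PyTorch sketch of that step, with made-up token IDs and vocabulary size just for illustration:

```python
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    """BERT-style MLM corruption: pick ~15% of positions; of those,
    80% -> [MASK], 10% -> random token, 10% -> left unchanged."""
    input_ids = input_ids.clone()
    labels = input_ids.clone()

    # Select which positions the model must predict.
    selected = torch.bernoulli(torch.full(input_ids.shape, mlm_prob)).bool()
    labels[~selected] = -100  # conventional "ignore" index for the loss

    # 80% of selected positions become the [MASK] token.
    masked = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & selected
    input_ids[masked] = mask_token_id

    # Half of the remaining 20% become a random vocabulary token...
    randomized = (torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool()
                  & selected & ~masked)
    input_ids[randomized] = torch.randint(vocab_size, input_ids.shape)[randomized]

    # ...and the rest keep their original token (still predicted via labels).
    return input_ids, labels

# Toy usage: batch of 2 sequences, vocab of 100, [MASK] id = 3 (all made up).
ids = torch.randint(5, 100, (2, 12))
masked_ids, labels = mask_tokens(ids, mask_token_id=3, vocab_size=100)
```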
Under Prerequisites he has some useful breakdowns of the encoder, decoder, and multi-head attention: https://machinelearningmastery.com/implementing-the-transformer-decoder-from-scratch-in-tensorflow-and-keras/
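For my own reference while reading that breakdown: multi-head attention is scaled dot-product attention run over several subspaces in parallel. A minimal sketch (the article uses TensorFlow/Keras; this is PyTorch to match the snippet above, and the class and parameter names are mine):

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Scaled dot-product attention split across several heads."""
    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0
        self.d_head = d_model // n_heads
        self.n_heads = n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x, mask=None):
        B, T, _ = x.shape
        # Project, then reshape to (batch, heads, seq, d_head).
        q = self.q_proj(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        k = self.k_proj(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        v = self.v_proj(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        # Attention weights: softmax(QK^T / sqrt(d_head)).
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))
        attn = scores.softmax(dim=-1)
        # Weighted sum of values, then merge the heads back together.
        out = (attn @ v).transpose(1, 2).reshape(B, T, -1)
        return self.out_proj(out)
```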
Another similar example; at the end he demonstrates word prediction: https://thepythoncode.com/article/pretraining-bert-huggingface-transformers-in-python
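The word-prediction demo at the end boils down to: run the text through the trained MLM, take the logits at the [MASK] position, and rank the vocabulary. A sketch of that step without Hugging Face, where `model` (returns per-token logits) and `tokenizer` (with `encode`/`decode`) are hypothetical placeholders for whatever I end up training:

```python
import torch

def predict_masked_word(model, tokenizer, text, mask_token_id, top_k=5):
    """Fill in a [MASK] by ranking the model's output logits at the
    masked position. `model` and `tokenizer` are placeholders here."""
    input_ids = torch.tensor([tokenizer.encode(text)])
    # Find the masked position(s) in the first (only) sequence.
    mask_pos = (input_ids[0] == mask_token_id).nonzero(as_tuple=True)[0]
    with torch.no_grad():
        logits = model(input_ids)  # assumed shape: (1, seq_len, vocab_size)
    # Top-k vocabulary ids at the first masked position.
    top_ids = logits[0, mask_pos].topk(top_k, dim=-1).indices[0]
    return [tokenizer.decode([i]) for i in top_ids.tolist()]
```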