llama3 implementation, one matrix multiplication at a time