I had a go at implementing this. The key, as you've noticed, is to decompose the problem into steps where you're working with no more than a handful of values at once.
Don'write a big loop, instead break it up into multiple iterations over the data and use library words where possible.