@simon The split into 256 experts is interesting: with 8 active per token, I'm assuming the compute works out to ~20B params (plus the router model, I guess?), which is pretty lightweight for the performance in Aider. Having all the experts in memory is a very high bar though.
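
Rough back-of-envelope for the above, as a sketch only. The parameter counts here are made-up placeholders, not the model's real figures; I picked them so that 8 of 256 experts lands near the ~20B I guessed. `moe_footprint` and its arguments are hypothetical names, and it assumes all experts are the same size.

```python
def moe_footprint(expert_params_b, shared_params_b,
                  n_experts=256, k_active=8, bytes_per_param=1.0):
    """Estimate per-token active params vs. total resident params (in billions).

    expert_params_b: params across ALL experts (hypothetical input)
    shared_params_b: params every token uses (attention, router, embeddings)
    """
    per_expert_b = expert_params_b / n_experts
    # Compute per token only touches the shared part plus k routed experts.
    active_b = shared_params_b + k_active * per_expert_b
    # Memory has to hold every expert, since any of them may be routed to.
    total_b = shared_params_b + expert_params_b
    # 1B params at 1 byte each is roughly 1 GB (decimal).
    memory_gb = total_b * bytes_per_param
    return active_b, total_b, memory_gb


# Placeholder numbers, NOT the real model's sizes.
active_b, total_b, memory_gb = moe_footprint(expert_params_b=512.0,
                                             shared_params_b=4.0)
print(f"active per token: {active_b:.0f}B, resident: {total_b:.0f}B, "
      f"~{memory_gb:.0f} GB at 1 byte/param")
```

So compute scales with the 8 routed experts, but memory scales with all 256, which is where the high bar comes from.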

25 December at 23:09