@simon The split into 256 experts is interesting: with 8 experts active per token, I'm assuming the compute per token works out to ~20B params (plus the router, I guess?), which is pretty lightweight for the performance in Aider. Having all the experts in memory is a very high bar, though.
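
To put rough numbers on that, here's a back-of-the-envelope sketch. It takes my ~20B active-parameter guess at face value and assumes, purely for illustration, that all of those parameters sit in the 8 routed experts (ignoring attention, embeddings, the router, and any shared layers). The function name `moe_footprint` and the bytes-per-parameter choices are hypothetical, not anything from the model's actual config.

```python
def moe_footprint(num_experts, active_per_token, active_params_b, bytes_per_param):
    """Rough MoE estimate; ASSUMES every active parameter lives in a routed expert.

    active_params_b is in billions of parameters. Returns
    (params per expert in B, total expert params in B, weight memory in GB).
    """
    params_per_expert_b = active_params_b / active_per_token
    total_params_b = params_per_expert_b * num_experts
    # billions of params * bytes per param = gigabytes (1e9 bytes)
    memory_gb = total_params_b * bytes_per_param
    return params_per_expert_b, total_params_b, memory_gb


# 256 experts and 8 per token come from the comment; 20B is the guess above.
per_expert, total, mem_1byte = moe_footprint(256, 8, 20.0, 1)  # 1 byte/param (8-bit weights)
_, _, mem_2byte = moe_footprint(256, 8, 20.0, 2)               # 2 bytes/param (16-bit weights)

print(f"per expert: {per_expert}B, all experts: {total}B")
print(f"weights in memory: {mem_1byte} GB at 8-bit, {mem_2byte} GB at 16-bit")
```

Under those assumptions, each token touches only 8/256 (about 3%) of the expert weights, but every expert still has to be resident, which is where the memory bar comes from.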