so here's the solution: every signal in the MCA bus domain goes into a register clocked in that domain (the first "always" block).
then, *without any combinational logic* in between, the output of that register goes *directly* into another register (the second "always" block) clocked in the main clock domain.
(i have one more flip-flop in the main clock domain just for detecting the edge.)
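for anyone who wants to see the shape of it, here's a minimal sketch of that arrangement. this is not my actual source: the module name, port names, the WIDTH parameter, and the choice of bit 0 as the signal to edge-detect are all made up for illustration.

```verilog
// sketch only: all names and the width are hypothetical
module mca_sync #(
    parameter WIDTH = 8              // assumed number of MCA-side signals
) (
    input  wire             mca_clk,  // clock of the MCA bus domain (assumed name)
    input  wire             main_clk, // clock of the main domain (assumed name)
    input  wire [WIDTH-1:0] mca_in,   // raw signals from the MCA side
    output reg  [WIDTH-1:0] main_q,   // same signals, now in the main domain
    output wire             rise      // one-cycle pulse on a rising edge of bit 0
);
    reg [WIDTH-1:0] mca_q;  // register in the MCA domain
    reg             prev;   // extra flip-flop used only for edge detection

    // first "always" block: capture everything in the MCA domain
    always @(posedge mca_clk)
        mca_q <= mca_in;

    // second "always" block: register straight into the main domain,
    // with no combinational logic between mca_q and main_q
    always @(posedge main_clk)
        main_q <= mca_q;

    // remember last cycle's value so an edge can be detected
    always @(posedge main_clk)
        prev <= main_q[0];

    assign rise = main_q[0] & ~prev;
endmodule
```

the point of keeping the path between the two registers free of logic is that the second register only ever samples a clean flop output, never a glitchy combinational result.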
next step is to optimize the interface speed. right now it takes 25 µs to read a sector from the SD card but ~5 ms (ouch) to DMA it to the PC, so roughly 200x longer!
the bottleneck is mostly the Teensy-to-FPGA interface, which is async and simple: 4 address lines, 16 data lines, a read control line, and a write control line. everything else is done as a register in the 4-bit address space (so up to 16 registers), including a flag register for status and mailbox sync bits.
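a rough sketch of what the FPGA side of an interface like this could look like is below. it's a guess at the general shape, not my real code: the strobe polarity (active high), the plain 16-entry register array, and all names are assumptions, and it relies on the Teensy holding address and data stable for the whole write strobe, with the strobe lasting several FPGA clocks.

```verilog
// sketch only: polarity, names, and register layout are hypothetical
module teensy_regs (
    input  wire        clk,   // FPGA main clock
    input  wire [3:0]  addr,  // 4 address lines from the Teensy
    inout  wire [15:0] data,  // 16 shared data lines
    input  wire        rd,    // read strobe, assumed active high
    input  wire        wr     // write strobe, assumed active high
);
    reg [15:0] regs [0:15];   // one 16-bit register per 4-bit address

    // bring the async write strobe into the clk domain:
    // two flops to synchronize, a third to compare against
    reg [2:0] wr_sync;
    always @(posedge clk)
        wr_sync <= {wr_sync[1:0], wr};

    wire wr_rise = wr_sync[1] & ~wr_sync[2];

    // capture the write once the synchronized strobe rises;
    // addr and data are assumed to have settled by then
    always @(posedge clk)
        if (wr_rise)
            regs[addr] <= data;

    // drive the bus only while the Teensy is reading
    assign data = rd ? regs[addr] : 16'hzzzz;
endmodule
```

in a real design the flag register would be wired to status and mailbox bits rather than being a plain read/write entry in the array.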