Thus, what gets shifted into the B register also gets shifted right back into the A register...done.
Sounds like 16 D Flip-Flops and I don't think it even needs any gates, just 8 clock pulses.
You connect each Q output (Q1 through Q15) to the next D input (D2 through D16, respectively), which chains all 16 flip-flops together.
This ignores how the data gets into the A register (the first 8 flip-flops) in the first place, but it can be shifted in at D1. That then requires a few gates to allow either:
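If you want to convince yourself before wiring anything, here is a minimal sketch of that chain in Python. The function name, the list layout, and the sample bit pattern are just my assumptions for illustration; the Q8-to-D1 recirculation is the connection described at the top.

```python
def clock_pulse(q, d1):
    """One clock edge: every flip-flop latches the value on its D input.

    q  -- list of 16 bits, q[0] is Q1 ... q[15] is Q16
    d1 -- bit presented to D1 on this pulse
    """
    # D2..D16 are wired to Q1..Q15, so the new state is d1 followed by Q1..Q15.
    return [d1] + q[:-1]


# Register A = flip-flops 1..8, register B = flip-flops 9..16.
# The starting pattern in A is an arbitrary example; B starts cleared.
state = [1, 0, 1, 1, 0, 0, 1, 0] + [0] * 8

for _ in range(8):
    # Q8 feeds D9 through the chain and is also fed back to D1.
    state = clock_pulse(state, state[7])

reg_a, reg_b = state[:8], state[8:]
print("A:", reg_a)
print("B:", reg_b)
```

After the 8 pulses both halves should hold the same pattern, since every bit that leaves Q8 goes to D9 and D1 at once.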
1 - Input bits to be loaded into Register "A", serially or
2 - The bits from Q8 to go into D1 when transferring.
This sounds like a single-pole, double-throw switch function built with gates. With a control input HIGH, the initial data input is loaded into register "A". With that same control input LOW, the data from Q8 is loaded into register "A" instead - for the transfer operation. I'll leave that 'switch' part to you. I'd have to think too hard. (;-)
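For what it's worth, one common way to build that switch is a 2-to-1 selector: D1 = (LOAD AND DATA) OR (NOT LOAD AND Q8). Below is a rough Python sketch of it, not a finished design. The names select_d1, load, and data_in and the sample bits are my own assumptions.

```python
def select_d1(load, data_in, q8):
    """2-to-1 selector from one NOT, two ANDs, and one OR (all bits are 0 or 1)."""
    return (load & data_in) | ((load ^ 1) & q8)


def clock_pulse(q, d1):
    """One clock edge for the 16-stage chain: D1 takes d1, D2..D16 take Q1..Q15."""
    return [d1] + q[:-1]


state = [0] * 16                      # all flip-flops cleared (assumed start)
incoming = [0, 1, 0, 0, 1, 1, 0, 1]   # arbitrary serial input, first bit first

# Control HIGH: 8 pulses shift the input bits into register A through D1.
for bit in incoming:
    state = clock_pulse(state, select_d1(1, bit, state[7]))
loaded_a = state[:8]

# Control LOW: 8 pulses copy A into B while Q8 recirculates into D1.
# The data input is ignored here, so it is simply held at 0.
for _ in range(8):
    state = clock_pulse(state, select_d1(0, 0, state[7]))

print("A after load:    ", loaded_a)
print("A after transfer:", state[:8])
print("B after transfer:", state[8:])
```

Note that the first bit shifted in ends up in flip-flop 8, so register A holds the input in reverse order when read from Q1 to Q8. Also, B keeps shifting during the load, so whatever was in A beforehand gets pushed into B.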
I hope this helps 'cuz it seems pretty simple to me and I can't think of a simpler way to explain it. You can email me if you think it needs further explanation.
Hope this helps (and that I minded my D's and Q's). (;-)