The biggest issue I see with your current plan is that long runs of 0s or 1s (multiple 0x00 or 0xFF bytes) will be difficult to differentiate. Here are some other ideas for consideration.
Bitbanged UART TX
Bit-banging a UART TX line isn't terribly hard. Pick a slower baud rate that evenly divides your system clock. You'll still be transmitting bytes in a similar fashion to your code sample, just with a set delay between each bit and a leading start bit and trailing stop bit for each byte. This has the advantage that the line can be connected directly to a computer to receive all the data. Alternatively, newer logic analyzers can usually decode the transmission directly into bytes without you needing to do anything by hand.
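A minimal sketch of the idea, assuming an idle-high line and three hypothetical helpers (tx_high(), tx_low(), and delay_bit(), the last tuned to one bit period, e.g. about 104 us at 9600 baud):

```c
#include <stdint.h>

/* Hypothetical helpers: map these to your GPIO write and a
 * one-bit-period delay for your chosen baud rate. */
extern void tx_high(void);
extern void tx_low(void);
extern void delay_bit(void);

/* 8N1 framing: start bit low, eight data bits LSB first, stop bit high. */
void uart_tx_byte(uint8_t b)
{
    tx_low();                    /* start bit (line idles high) */
    delay_bit();
    for (uint8_t i = 0; i < 8u; i++) {
        if (b & 1u) tx_high();   /* data bits, LSB first */
        else        tx_low();
        b >>= 1;
        delay_bit();
    }
    tx_high();                   /* stop bit */
    delay_bit();
}
```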
UNI/O
Microchip has a single-line communication scheme called UNI/O. It runs over a range of clock rates: the master basically toggles the line at a given rate a few times, and then all communication takes place at that rate.
A bit value is then transmitted by a rising or falling edge in the middle of the clock period: a high-to-low transition is a zero bit, and a low-to-high transition is a one bit. You can read more about UNI/O in Microchip's documentation.
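To make the encoding concrete, here is a rough sketch of transmitting one bit this way; set_line() and delay_half_bit() are hypothetical helpers, and the rest of a real UNI/O transaction (start header, acknowledge bits) is not shown:

```c
/* Hypothetical helpers for the GPIO write and half-bit-period delay. */
extern void set_line(int level);
extern void delay_half_bit(void);

/* Manchester-style encoding per the description above: the edge at
 * mid-period carries the value (high-to-low = 0, low-to-high = 1). */
void unio_tx_bit(int bit)
{
    set_line(bit ? 0 : 1);   /* first half holds the complement...        */
    delay_half_bit();
    set_line(bit ? 1 : 0);   /* ...so the mid-period edge encodes the bit */
    delay_half_bit();
}
```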
I've worked on AVRs as well as ARM Cortex-M3/M4/R4-based MCUs. I think I can offer some general advice. This will assume you're programming in C, not assembly.
The CPU is actually the easy part. The basic C data types will be different sizes, but you're using uint8/16/32_t anyway, right? :-) And now all integer types should be reasonably fast, with 32-bit (int) being the fastest. You probably don't have an FPU, so continue to avoid floats and doubles.
First, work on your understanding of the system-level architecture. This means IOs, clocking, memory, resets, and interrupts. Also, you need to get used to the idea of memory-mapped peripherals. On AVR you can avoid thinking about that because the registers have unique names with unique global variables defined for them. On more complex systems, it's common to refer to registers by a base address and an offset. It all boils down to pointer arithmetic. If you're not comfortable with pointers, start learning now.
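As an illustration of that base-plus-offset pattern, here is a hedged sketch with made-up addresses and a made-up status bit; the real values come from your MCU's reference manual:

```c
#include <stdint.h>

/* Hypothetical UART at an invented base address, just to show the
 * base + offset pattern; real addresses and bit positions come
 * from the reference manual. */
#define UART0_BASE   0x40001000u
#define UART_SR_OFF  0x00u            /* status register offset */
#define UART_DR_OFF  0x04u            /* data register offset   */

/* One register access = one volatile pointer dereference. */
#define REG32(base, off) (*(volatile uint32_t *)((base) + (off)))

void uart_poll_send(uint8_t byte)
{
    while ((REG32(UART0_BASE, UART_SR_OFF) & 0x1u) == 0u)
        ;                                   /* spin until TX-ready bit sets */
    REG32(UART0_BASE, UART_DR_OFF) = byte;  /* memory-mapped write          */
}
```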
For IOs, figure out how the peripheral muxing is handled. Is there a central mux control to select which pins are peripheral signals and which are GPIOs? Or do you set pins to peripheral mode using the peripheral registers? And of course you'll need to know how to configure GPIOs as inputs and outputs, and enable open-drain mode and pull-ups/downs. External interrupts usually fall into this category as well. GPIOs are pretty generic, so your experience should serve you well here.
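On register-based parts, GPIO configuration often looks something like the sketch below; the struct layout, base address, and pin assignment here are all invented for the example:

```c
#include <stdint.h>

/* Invented register layout in the style of many Cortex-M parts:
 * 2 bits per pin in MODE (00 input, 01 output, 10 peripheral). */
typedef struct {
    volatile uint32_t MODE;
    volatile uint32_t PULL;   /* 00 none, 01 pull-up, 10 pull-down */
    volatile uint32_t OUT;
} gpio_t;

#define GPIOA   ((gpio_t *)0x48000000u)   /* hypothetical base address */
#define LED_PIN 5u

void led_pin_init(void)
{
    /* Set pin 5 to output mode without disturbing the other pins. */
    GPIOA->MODE = (GPIOA->MODE & ~(0x3u << (LED_PIN * 2u)))
                               |  (0x1u << (LED_PIN * 2u));
    GPIOA->OUT |= (1u << LED_PIN);        /* drive it high */
}
```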
Clocking boils down to a few things. You start with a clock source, typically a crystal or an internal RC oscillator. This is used to create one or more system-level clock domains. Higher-speed chips will use a PLL, which you can think of as a frequency multiplier. There will also be clock dividers at various points. The key things to consider are what your CPU clock frequency should be and what bit rates you need for your communication peripherals. Usually this is pretty flexible. When you get more advanced, you can learn about things like low-power modes, which are usually based on clock gating.
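A quick worked example of the kind of arithmetic involved, with assumed numbers (a 48 MHz peripheral clock and a 16x-oversampling UART, neither taken from the original answer):

```c
#include <stdint.h>

/* Assumed for illustration: 48 MHz peripheral clock, 16x-oversampling
 * UART, 115200 baud target. */
#define PCLK_HZ 48000000u
#define BAUD    115200u

/* 48e6 / (16 * 115200) = 26.04..., so the divider is 26, giving an
 * actual rate of ~115385 baud, about 0.16% fast -- well within spec. */
static const uint32_t uart_div = PCLK_HZ / (16u * BAUD);
```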
Memory means flash and RAM. If you have enough RAM, it's often faster to keep your program there during early development so you don't have to program the flash over and over. The big issue here is memory management. Your vendor should provide sample linker scripts, but you might need to allocate more memory to code, constants, global variables, or the stack depending on the nature of your program. More advanced topics include code security and run-time flash programming.
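One pattern worth knowing here, sketched below in GCC syntax: placing a function in a RAM section. This assumes your linker script defines a ".ramfunc" output section in RAM and that the startup code copies it out of flash.

```c
#include <stdint.h>

/* GCC-specific sketch; ".ramfunc" is whatever section name your
 * linker script places in RAM. Running from RAM matters for
 * run-time flash programming, where code must not execute from
 * the bank being erased. */
__attribute__((section(".ramfunc"), noinline))
void flash_erase_sector(uint32_t sector)
{
    (void)sector;
    /* ... poke the flash controller registers here ... */
}
```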
Resets are pretty straightforward. Usually you only have to look out for the watchdog timer, which may be enabled by default. Resets are more important during debugging when you run the same code over and over. It's easy to miss a bug due to sequencing issues that way.
There are two things you need to know about interrupts -- how you enable and disable them, and how you configure the interrupt vectors. AVR-GCC does the latter for you with the ISR() macros, but on other architectures you might have to write a function address to a register manually.
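For instance, where AVR-GCC's ISR() macro fills in the vector table for you, some parts expect you to install the handler yourself. A hedged sketch with a made-up table location and IRQ slot:

```c
#include <stdint.h>

typedef void (*isr_t)(void);

/* Hypothetical: a vector table relocated to RAM, with the timer
 * interrupt in slot 18. Check the reference manual for the real
 * table location and IRQ numbering. */
#define VECTOR_TABLE   ((volatile isr_t *)0x20000000u)
#define TIMER0_IRQ_NUM 18u

static void timer0_handler(void)
{
    /* clear the interrupt flag, do the work */
}

void install_timer0_isr(void)
{
    VECTOR_TABLE[TIMER0_IRQ_NUM] = timer0_handler;  /* write the function address */
}
```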
Microcontroller peripherals are usually independent of each other, so you can learn them one at a time. It might help to pick one peripheral and use it to learn part of the system-level stuff. Comm peripherals and PWMs are good for clocking and IOs, and timers are good for interrupts.
Don't be intimidated by the level of complexity. Those "basic" microcontrollers have already taught you much of what you need to know. Please let me know if you need me to clarify anything.
Best Answer
There will be a delay when accessing those peripherals' registers, since you have to wait for the bus (AHB).
Most of the time they just run at full speed, unless the energy budget requires otherwise.
That depends on which peripherals you have enabled. Some peripherals are power-hungry and some are not, depending on their complexity.
The datasheet will have a table with each peripheral's µA/MHz rating.
For example, the STM32F072 datasheet lists such per-peripheral consumption figures. Looking at those numbers might lead you to decide, when you're only using a timer to output a 100 Hz PWM, to run that bus at 1 MHz instead of the 48 MHz going to the core itself, and maybe not to use TIM1 or TIM2.
However, this affects all peripherals on that bus, including CAN or UART, and, depending on the chip's complexity, the memories.
There isn't much to calculate. You will have sysclk from the PLL, and from then on there are only dividers. Find the clock tree in the reference manual, and play around with the clocks page in STM32CubeMX.
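As a worked example with assumed numbers (not from the answer): a 48 MHz sysclk, a bus prescaler bringing the timer clock down to 1 MHz, and timer dividers producing the 100 Hz PWM mentioned above.

```c
#include <stdint.h>

/* Assumed clock tree, for illustration only. */
#define SYSCLK_HZ 48000000u
#define BUS_DIV   48u
#define TIM_CLK   (SYSCLK_HZ / BUS_DIV)   /* 1 MHz timer clock */

/* STM32-style timers divide by (PSC + 1) and (ARR + 1). */
#define PSC 9u     /* counter clock = 1 MHz / (9 + 1)     = 100 kHz */
#define ARR 999u   /* PWM frequency = 100 kHz / (999 + 1) = 100 Hz  */
```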