The second one is often screwed flat on a PCB, with the transistor's legs at 90 degree angles, and then of course you put the transistor on the narrow side.
But otherwise it doesn't matter. The heatsink will receive its heat at the same place and the fins don't really care whether the heat entered from the left or the right.
You can use many (but not all) types of BJT and get good results. You should not use general-purpose diodes or rectifiers (1N400x, 1N4148, 1N914), RF BJTs, OC71 germanium transistors, or massive 2N3055 power transistors for this kind of \$\Delta V_{BE}\$ circuit.
The measurement principle here is to measure the difference in the diode-connected-transistor forward drop at two currents perhaps a decade apart, which is far more predictable than simple Vbe measurement. The difference has a well-defined behavior and unadjusted error can be less than 1°C, even for random (suitable) transistors. That's impossible with a simple Vbe measurement, and of course we always want to avoid individual calibration.
The trade-off is more complexity (it's all on one chip, so not your problem) and about 1/10 the voltage sensitivity: the difference rises by roughly +200 µV/°C, compared with the -2 mV/°C of a single Vbe that everyone knows. That small signal requires auto-zero circuitry on the chip.
A diode-connected transistor behaves much more like an ideal diode than, say, a 1N4148. In particular, the ideality factor \$n\$ is very close to 1 (typically something like 1.004 for a 2N3904) rather than somewhere between 1 and 2. For this reason you'll also find diode-connected transistors used in log and antilog circuits.
\$\Delta V_{BE} = n \frac{kT}{q} \ln\left(\frac{I_{HIGH}}{I_{LOW}}\right)\$
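If \$n = 1.0\$, \$kT/q = T \times 8.62 \times 10^{-5}\$ V (with \$T\$ in kelvin) and \$I_{HIGH}/I_{LOW} = 10\$, then
\$\Delta V_{BE} \approx 198\ \mu\text{V/K} \times T\$, that is, a slope of about 198 µV/°C, or roughly 59 mV at 25 °C.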
Using an ordinary diode, whose \$n\$ can sit well above 1, will give you an error of 50-100% in absolute temperature.
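To make the arithmetic concrete, here is a minimal Python sketch of the formula above. The function names and the list of ideality factors tried are my own choices for illustration, not anything taken from a particular sensor chip; the sketch assumes the reader-side conversion always uses \$n = 1\$ and simply shows how far off the reported temperature lands when the real junction has a different \$n\$.

```python
import math

K_OVER_Q = 8.617333e-5  # Boltzmann constant / electron charge, in V/K


def delta_vbe(temp_k, ratio=10.0, n=1.0):
    """Delta-Vbe in volts for a junction at temp_k kelvin.

    ratio is I_HIGH / I_LOW and n is the ideality factor.
    """
    return n * K_OVER_Q * temp_k * math.log(ratio)


def temp_from_delta_vbe(dvbe, ratio=10.0, n_assumed=1.0):
    """Invert the formula: absolute temperature implied by a measured Delta-Vbe."""
    return dvbe / (n_assumed * K_OVER_Q * math.log(ratio))


if __name__ == "__main__":
    # Slope for a 10:1 current ratio and n = 1 (the ~198 uV/K figure).
    print(f"slope: {delta_vbe(1.0) * 1e6:.1f} uV/K")

    t_true = 298.15  # 25 degC, an arbitrary example temperature
    # Illustrative ideality factors: ideal, 2N3904-like, and two diode-like values.
    for n_actual in (1.0, 1.004, 1.5, 2.0):
        measured = delta_vbe(t_true, n=n_actual)
        t_reported = temp_from_delta_vbe(measured)  # converter assumes n = 1
        err_pct = 100.0 * (t_reported - t_true) / t_true
        print(f"n = {n_actual}: reported {t_reported:.1f} K ({err_pct:+.1f}% of absolute T)")
```

Because the reported temperature scales directly with \$n\$, a junction with \$n\$ between 1.5 and 2 reads 50-100% high in kelvin, while \$n\$ = 1.004 shifts the reading by only about 0.4%.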
The other factor that affects accuracy of this kind of circuit is the base resistance. To minimize this error, use a medium power transistor such as a 2N4401 or 2N4403 or 2SC1815 or C8050 etc. etc. (PNP or NPN will both work since it's a 2-lead connection). Silicon types only, of course. You could use a higher power transistor if you want a tab to bolt down, but leakage may begin to affect the measurement at very high temperatures.
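As a rough illustration of why resistance matters, the sketch below uses a simple lumped model that I am assuming here rather than taking from any datasheet: the resistance in series with the junction adds \$R_s (I_{HIGH} - I_{LOW})\$ to the measured difference, and for a diode-connected transistor the base resistance appears divided by \$\beta + 1\$ because only the base current flows through it. All component values and function names are hypothetical.

```python
import math

K_OVER_Q = 8.617333e-5  # Boltzmann constant / electron charge, in V/K


def effective_series_resistance(r_base, r_emitter, beta):
    """Lumped series resistance of a diode-connected BJT (assumed simple model).

    Only the base current, I / (beta + 1), flows through r_base.
    """
    return r_emitter + r_base / (beta + 1.0)


def series_resistance_error_k(r_series, i_low, i_high, n=1.0):
    """Temperature error in kelvin caused by r_series ohms in series with the junction."""
    extra_v = r_series * (i_high - i_low)            # unwanted extra voltage difference
    slope = n * K_OVER_Q * math.log(i_high / i_low)  # sensor slope in V/K
    return extra_v / slope


if __name__ == "__main__":
    # Hypothetical values: 100 ohm base resistance, 0.5 ohm emitter resistance, beta = 199.
    r_s = effective_series_resistance(r_base=100.0, r_emitter=0.5, beta=199.0)
    err = series_resistance_error_k(r_s, i_low=10e-6, i_high=100e-6)
    print(f"effective series resistance: {r_s:.2f} ohm, error: {err:.2f} K")
```

With these made-up numbers, 1 Ω of effective series resistance and a 10 µA / 100 µA current pair cost roughly half a degree, which is why a transistor with low base resistance is the better choice.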
Best Answer
The Sziklai topology places the output BJTs within a local NFB loop. It's enough that the quiescent current is approximately 20 times less sensitive to the output BJT temperature changes. Partly because of this, you do NOT need to include the power output BJTs on a shared and monitored heat sink.
It's the driver BJTs that need to be monitored for temperature and used to adjust the \$V_\text{BE}\$ multiplier. And that's handy because their heat sink can be quite small and therefore also the thermal time constants will be quite short (which is a positive thing.)
You can (mostly) ignore the output BJT temperatures (though you still need to allow them to dissipate properly, of course.) It would be detrimental to allow their dissipation to affect the \$V_\text{BE}\$ multiplier. It's the base-emitter junctions of the driver BJTs that need to be tracked and not that of the output BJTs.
Compared with the Darlington arrangement, the Sziklai arrangement lets the driver BJTs undergo somewhat wider temperature variations as the output power demands change. So it's a little more important to do good thermal tracking with the \$V_\text{BE}\$ multiplier. But it should only be observing the driver BJTs (often by putting the \$V_\text{BE}\$ multiplier BJT(s) on the same heat sink.) If you keep that heat sink small (so its thermal mass is low), the "system" should respond more quickly to changes. You don't want the output BJTs' dissipation messing that up. Just give the output BJTs their own heat sinks and keep them away from the driver BJTs, where possible.