I think implementing the CAN protocol in firmware only will be difficult and will take a while to get right. It is not a good idea.
However, your prices are high. I just checked, and a dsPIC 33FJ64GP802 in QFN package sells for 3.68 USD on microchipdirect for 1000 pieces. The price will be lower for real production volumes.
The hardware CAN peripheral does some real things for you, and the price increment for it is nowhere near what you are claiming.
Added:
Since you seem determined to try the firmware route, here are some of the obvious problems that come to mind. There will most likely be other problems that haven't occurred to me yet.
You want to do CAN at 20 kbit/s. That's a very slow rate for CAN, which goes up to 1 Mbit/s over at least 10s of meters. To give you one datapoint, the NMEA 2000 shipboard signalling standard is layered on CAN at 250 kbit/s, and that's meant to go from one end of a large ship to the other.
You may think that one interrupt per bit is enough, and that you can do everything required in that interrupt. That won't work, because there are several things going on within each CAN bit time. Two in particular need to be done at the sub-bit level: detecting a collision, and adjusting the bit rate on the fly.
There are two signalling states on a CAN bus, recessive and dominant. Recessive is what happens when nothing is driving the bus. A normal CAN bus, as implemented by common chips like the MCP2551, should have 120 Ω terminators at both ends, hence a total of 60 Ω pulling the two differential lines together passively. The dominant state is when the lines are actively pulled apart, somewhere around 900 mV from the recessive state if I remember right. Basically, this is like an open-collector bus, except that it's implemented with a differential pair. The bus is in the recessive state when CANH-CANL < 900 mV and dominant when CANH-CANL > 900 mV. The dominant state signals 0, and the recessive state 1.
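As a rough sketch of that receive decision, using the single 900 mV threshold described above (real receivers use hysteresis with two thresholds, and the names here are made up for illustration):

    // Sketch of the bus state decision described above, using the single
    // 900 mV threshold from the text. Real transceivers add hysteresis,
    // but this captures the idea. Dominant signals 0, recessive 1.
    typedef enum { BUS_DOMINANT = 0, BUS_RECESSIVE = 1 } bus_state_t;

    bus_state_t bus_state(float canh_volts, float canl_volts) {
        return ((canh_volts - canl_volts) > 0.9f) ? BUS_DOMINANT : BUS_RECESSIVE;
    }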
Whenever a node "writes" a 1 to the bus (lets it go), it checks to see if some other node is writing a 0. If you find the bus in the dominant state (0) while the bit you're sending is a 1, that means someone else is sending too. Collisions only matter when the two senders disagree, and the rule is that the one sending the recessive state backs off and aborts its message. The node sending the dominant state doesn't even know this happened. This is how arbitration works on a CAN bus.
The CAN bus arbitration rules mean you have to be watching the bus partway thru every bit you are sending as a 1 to make sure someone else isn't sending a 0. This check is usually done about 2/3 of the way into the bit, and is the fundamental limitation on CAN bus length. The slower the bit rate, the more time there is for the worst case propagation from one end of the bus to the other, and therefore the longer the bus can be. This check must be done in every bit where you think you own the bus and are sending a 1.
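To make the per-bit burden concrete, here is a minimal sketch of what a firmware-only transmitter would have to do at each sample point. The pin-access helpers and the retransmit hook are assumptions, not real library calls:

    #include <stdbool.h>
    #include <stdint.h>

    // Hypothetical helpers: read the RX pin from the CAN transceiver
    // (1 = recessive) and drive the TX pin. Names are illustrative.
    extern uint8_t read_rx_pin(void);
    extern void write_tx_pin(uint8_t level);
    extern void schedule_retransmit(void);

    static bool transmitting;   // true while we believe we own the bus
    static uint8_t current_bit; // the bit value being driven this bit time

    // Called by a timer interrupt about 2/3 of the way into each bit.
    void sample_point_check(void) {
        if (transmitting && current_bit == 1 && read_rx_pin() == 0) {
            // We drove recessive but the bus reads dominant: another node
            // is sending a 0. We lost arbitration; release the bus.
            transmitting = false;
            write_tx_pin(1);        // let the bus go recessive
            schedule_retransmit();  // try the whole message again later
        }
    }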
Another problem is bit rate adjustment. All nodes on a bus must agree on the bit rate, more closely than with RS-232. To prevent small clock differences from accumulating into significant errors, each node must be able to do a bit that is a little longer or shorter than its nominal. In hardware, this is implemented by running a clock somewhere around 9x to 20x faster than the bit rate. The cycles of this fast clock are called time quanta. There are ways to detect that the start of new bits is wandering with respect to where you think they should be. Hardware implementations then add or skip one time quantum in a bit to re-sync. There are other ways you could implement this, as long as you can adjust to small differences in phase between your expected bit times and the actual measured bit times.
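A minimal sketch of the time-quanta resync idea, assuming some separate edge-capture logic has already measured the phase error (everything here is illustrative):

    #include <stdint.h>

    #define TQ_PER_BIT 9  // nominal time quanta per bit, the minimum case

    // Set by assumed edge-capture logic: negative if the observed bit
    // edge came early, positive if late, relative to the expected start.
    static int8_t phase_error_tq;

    // Returns how many time-quanta interrupts to count for the next bit,
    // stretching or shrinking it by at most one quantum to re-sync.
    uint8_t quanta_for_next_bit(void) {
        uint8_t n = TQ_PER_BIT;
        if (phase_error_tq > 0) n++;  // edges arriving late: lengthen the bit
        if (phase_error_tq < 0) n--;  // edges arriving early: shorten the bit
        phase_error_tq = 0;
        return n;
    }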
Either way, these mechanisms require various things to be done at various times within a bit. This sort of timing will get very tricky in firmware, or will require the bus to be run very slowly. Let's say you implement a time quanta system in firmware at 20 kbit/s. At the minimum of 9 time quanta per bit, that would require a 180 kHz interrupt. That's certainly possible with something like a dsPIC 33F, but will eat up a significant fraction of the processor. At the max instruction rate of 40 MHz, you get 222 instruction cycles per interrupt. The checking shouldn't take all of that, but probably 50-100 cycles, meaning 25-50% of the processor will be used for CAN, and that it will need to preempt everything else that is running. That rules out many applications these processors often run, like pulse-by-pulse control of a switching power supply or motor driver. The 50-100 cycle latency on every other interrupt would be a complete show stopper for many of the things I've done with chips like this.
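The arithmetic above, expressed as compile-time constants in case you want to play with the numbers (the values are the ones from this paragraph, not from any datasheet):

    // Interrupt budget from the paragraph above: 20 kbit/s, 9 time quanta
    // per bit, 40 MIPS dsPIC 33F.
    #define BIT_RATE_HZ     20000UL
    #define TQ_PER_BIT      9UL
    #define INSTR_RATE_HZ   40000000UL

    #define TQ_IRQ_RATE_HZ  (BIT_RATE_HZ * TQ_PER_BIT)        // 180 kHz interrupt
    #define CYCLES_PER_IRQ  (INSTR_RATE_HZ / TQ_IRQ_RATE_HZ)  // 222 cycles each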
So you're going to spend the money to do CAN somehow: if not on the dedicated hardware peripheral intended for that purpose, then on a larger processor to handle the significant firmware overhead, plus dealing with the unpredictable and possibly large interrupt latency for everything else.
Then there is the up-front engineering. The CAN peripheral just works. From your comment, it seems the incremental cost of this peripheral is $0.56. That seems like a bargain to me. Unless you've got a very high volume product, there is no way you're going to get back the considerable time and expense it will take to implement CAN in firmware only. If your volumes are that high, the prices we've been mentioning aren't realistic anyway, and the differential to add the CAN hardware will be lower.
I really don't see this making sense.
Background Information
I have used CAN a few times now for multiple devices distributed over a physically small area, like within a few 10s of meters. In each case, the CAN bus was internal to the system and we could specify exactly what the protocol over CAN would be. None of these systems had to, for example, interface with OBDII, NMEA 2000, etc, where a specific protocol was already defined. One case was a large industrial machine that required lots of distributed sensors and actuators. The outside world interface just dealt with the overall operation of the machine. How the controller got the sensor information and caused the actuators to do stuff was an internal implementation choice that we happened to use CAN for. In another case, a company needed a good way for their customers to control multiple (up to a few dozen) of the gizmos they make within a single larger system. In this case we specified CAN as one communication means and documented the protocol. This protocol would be implemented by the controller of this system, but not surfaced to the end customer, who bought this system as a whole and communicated with it thru different means at a higher level.
The EmCan solution
I have converged on a way of dealing with this over several implementations. I am now in the middle of two more such implementations, and this time I decided to use the previous experience to create a formal spec for an infrastructure layer immediately above CAN. CAN is a well designed protocol as far as it goes, and is directly implemented in a number of microcontrollers nowadays. It seems a natural way to connect multiple little devices over a limited physical distance, as long as the data rate isn't too high. Basically, it can do everything you probably would have used RS-485 for 20 years ago, except that more protocol layers are specified, the specification makes sense, and hardware implementations are available built into low cost microcontrollers.
The result of this is what I call EmCan (EMbed CAN). I am slowly filling out the formal protocol specification as I migrate code from the previous implementations, generalize the concepts a bit, and make re-usable firmware modules where the EmCan protocol code can be used without change across a variety of projects. I'm not really ready to officially publish the spec yet and provide the reference implementations, but you can look at what is there to see where things are heading. The current document is a work in progress, as it itself says.
So far I have PIC 18 and dsPIC 33 implementations of the EmCan device side, a stripped down host implementation for PIC 18, and a fuller implementation (more things handled locally) for the dsPIC 33. Everything documented in the current version is implemented and seems to be working. I am working on the byte stream interface right now. I did this before in one of the previous systems, but there it was more tied into the application and not a nice separable layer like it is in EmCan.
The issue with a switched load
I think trying to switch the CAN bus with FETs or analog switches is a really bad idea. The main reason for the bit rate versus length tradeoff is not the total resistance of the cable, but the round trip propagation time. Look at how CAN detects collisions, and you will see this mechanism assumes signal propagation from one end to the other within a fraction of a bit time. The CAN bus needs to be kept a proper transmission line. For most implementations, such as when using the common MCP2551 bus driver, the characteristic impedance should be close to 120 Ω. That means a 120 Ω resistor at each end of the bus, so any point on the bus sees a 60 Ω load.
How EmCan fixes this
EmCan solves the node address problem without requiring special hardware. For details, see the EmCan spec, but basically, each node has a globally unique 7 byte ID. Each node periodically requests a bus address, sending this ID. The collision detection mechanism guarantees that the bus master sees only one of these requests, even if multiple nodes send an address request at the same time. The bus master sends an address assignment message that includes the 7 byte ID and the assigned address, so at most one node is assigned a new address at a time.
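To illustrate the shape of that exchange (the field layout and framing here are made up for illustration; see the EmCan spec for the real messages):

    #include <stdint.h>

    // Illustrative only; not the actual EmCan wire format. Note that the
    // 7-byte unique ID plus a 1-byte address conveniently fills one
    // 8-byte CAN data field.
    typedef struct {
        uint8_t unique_id[7];  // node's globally unique ID
    } addr_request_t;          // node -> bus master, asking for an address

    typedef struct {
        uint8_t unique_id[7];  // echoed ID; only the matching node accepts
        uint8_t assigned_addr; // the bus address granted to that one node
    } addr_assign_t;           // bus master -> nodes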
If you are interested in this concept and are willing to discuss details of your system, talk to me. My biggest fear right now is specifying something that will be awkward later or prohibit some usage I hadn't considered. Having another implementation in progress as the spec is being finalized would be good for spec development, and it would test out the reference implementation if you plan to implement it on Microchip PICs.
Best Answer
The stubs that attach intermediary nodes to the main network bus should be short so as not to disrupt the end-to-end characteristic impedance of the main network bus cable. This avoids unwanted data signal reflections. The main cable should be terminated at both physical ends, and the intermediary nodes teeing off that cable should not be terminated.
Using the same cable for the short stubs is immaterial; the important thing is that those lengths teeing off the main cable are short relative to the wavelength of the highest frequencies used in the data transmission.
For instance, if transmitting at 2 Mbps, you could argue that the maximum useful frequency in that transmission (from Fourier analysis) might be 7 MHz or 9 MHz.
9 MHz has a wavelength of 33.33 metres in free space, and probably about 25 metres in a decent cable. The rule of thumb used to decide whether a cable needs termination is one tenth of the shortest useful signal wavelength, i.e. 2.5 metres here. But, given that there may be a multitude of these intermediary connections (all contributing a slight mismatch), play it safe and make the teed-off length no more than 0.5 metres.
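A quick calculator for the rule of thumb above, with a 0.75 velocity factor standing in for "a decent cable" (that factor is my assumption):

    #include <stdio.h>

    int main(void) {
        double f_max_hz = 9e6;                 // highest useful frequency
        double v = 0.75 * 3.0e8;               // speed in cable, m/s (assumed VF)
        double wavelength_m = v / f_max_hz;    // about 25 m
        double stub_limit_m = wavelength_m / 10.0;  // rule of thumb: lambda/10
        printf("wavelength %.1f m, stub limit %.2f m\n",
               wavelength_m, stub_limit_m);
        return 0;
    }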