I can think of three ways to do this. I'll start from the most expensive and complicated and work my way down.
- All Logic ICs: Use a parallel-load 9-bit up/down binary counter (such as 3x 74HC191) and a 3-digit resettable BCD counter (such as 3x 74HC162). Load the binary counter with your 9-bit number and clear the BCD counter to 0. Clock both counters, the binary one counting down and the BCD one counting up, until the binary counter reaches zero; the BCD counter then holds the decimal value. Load the output of the BCD counter into a latch, and then feed it to a BCD-to-7-segment decoder. Repeat.
  - Pros:
    - Meets the requirement of using only standard logic ICs
    - Requires no programming
  - Cons:
    - Expensive (the counter chips are about $1-2 each)
    - Prone to wiring failure
    - Requires a clock signal of about 31 kHz for a 60 Hz refresh (up to 512 clock cycles per conversion)
    - Requires 8-12 logic ICs (I'll draw a schematic later)
- ROM Look-Up Table: Use a small parallel asynchronous flash memory (such as this one), wire all data inputs to its address lines, and program it so that each address holds the BCD output for 2 digits. Feed the result into a BCD-to-7-segment decoder. Alternatively, use one ROM per decimal digit, with each ROM programmed to drive the 7-segment pins for that digit directly.
  - Pros:
    - Fairly cheap price per digit
    - Scales better
    - Fast (but speed doesn't really matter)
    - Simple to wire/design
  - Cons:
    - Each ROM requires different programming
    - Need to buy a much larger ROM than needed
    - Programming requires computer automation
- Microcontroller: A simple microcontroller with enough inputs and outputs can convert the binary number into BCD and then encode it into 7-segment control signals. The cheapest solution (my Digi-Key search picked out this PIC) will multiplex the output digits. You may need transistors to drive the common anode/cathode of your 7-segment displays, but those can be cheap transistors.
  - Pros:
    - The cheapest solution at $1.50 - $2.00 total
    - Simple wiring
    - Ease of implementing the Binary->BCD algorithm in software
    - Easy to add more functionality
    - Least expensive system for driving the display
  - Cons:
    - A chip programmer is required
    - You need to write software (not just a hardware problem)
    - The cheapest solution requires digit multiplexing, which is more complicated than direct drive
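For the microcontroller option, one common way to do the Binary->BCD step is the shift-and-add-3 ("double dabble") method. The sketch below is in Python purely for illustration (real firmware for a PIC would more likely be C or assembly). The function names, the 9-bit input width carried over from the first option, and the segment bit order (bit 0 = segment a, 1 = segment on) are all assumptions, not details of any particular part.

```python
# Segment patterns for digits 0-9.
# Assumed bit order: 0b0gfedcba, with 1 meaning "segment on".
SEGMENTS = [0x3F, 0x06, 0x5B, 0x4F, 0x66, 0x6D, 0x7D, 0x07, 0x7F, 0x6F]


def binary_to_bcd(value, digits=3):
    """Convert a 9-bit value to packed BCD using shift-and-add-3."""
    if not 0 <= value <= 511:
        raise ValueError("expected a 9-bit value (0..511)")
    bcd = 0
    for bit in range(8, -1, -1):  # feed the input in MSB first
        # Before each shift, add 3 to every BCD digit that is 5 or more,
        # so that doubling it carries correctly into the next digit.
        for d in range(digits):
            if ((bcd >> (4 * d)) & 0xF) >= 5:
                bcd += 3 << (4 * d)
        bcd = (bcd << 1) | ((value >> bit) & 1)
    return bcd


def to_segments(value):
    """Return the segment patterns for hundreds, tens, and ones."""
    bcd = binary_to_bcd(value)
    return [SEGMENTS[(bcd >> shift) & 0xF] for shift in (8, 4, 0)]
```

With multiplexing, the firmware would output one of these three patterns at a time while enabling the matching digit's common pin.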
I'll draw up some schematics later
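As a rough illustration of the "computer automation" needed for the ROM option, the following Python sketch generates the table contents for a 9-bit input. It assumes byte-wide ROMs, a two-ROM split for the BCD variant (tens and ones in one, hundreds in the other), and a segment bit order of `0b0gfedcba` with 1 = on; the function names are made up for this example.

```python
# Segment patterns for digits 0-9 (assumed bit order 0b0gfedcba, 1 = on).
SEGMENTS = bytes([0x3F, 0x06, 0x5B, 0x4F, 0x66, 0x6D, 0x7D, 0x07, 0x7F, 0x6F])
ADDRESS_BITS = 9  # the binary input drives the ROM address lines


def bcd_rom_images():
    """Return (low, high) images: low = tens and ones, high = hundreds."""
    low = bytearray()
    high = bytearray()
    for value in range(1 << ADDRESS_BITS):
        hundreds, rest = divmod(value, 100)
        tens, ones = divmod(rest, 10)
        low.append((tens << 4) | ones)  # two BCD digits per byte
        high.append(hundreds)
    return bytes(low), bytes(high)


def segment_rom_image(place):
    """Image for one ROM per digit; place is 1, 10, or 100."""
    return bytes(
        SEGMENTS[(value // place) % 10] for value in range(1 << ADDRESS_BITS)
    )
```

Each returned `bytes` object can be written to a binary file and handed to whatever programmer software the chosen memory needs. Unused higher address lines on a larger ROM would be tied to a fixed level.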
Lookup tables have been mentioned in comments. There are two approaches.
Fast
Create a 256-byte table in which each entry holds the square root of its index. This is fast because the argument is used directly as the index, so a single lookup returns the result. The drawback is that it needs a long table with lots of duplicate values.
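A minimal sketch of the fast table, written in Python for illustration. On a small microcontroller the table would more likely be a constant array in program memory. The names `SQRT_TABLE` and `fast_sqrt` are made up for this example, and rounding to the nearest integer is assumed.

```python
import math

# 256 entries: the index is the 8-bit argument,
# the value is its square root rounded to the nearest integer.
SQRT_TABLE = bytes(round(math.sqrt(x)) for x in range(256))


def fast_sqrt(x):
    """Single lookup: the argument is the table index."""
    return SQRT_TABLE[x]
```

Replacing `round(math.sqrt(x))` with `math.isqrt(x)` would give a table of truncated roots instead.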
Compact
As said, an 8-bit integer can only have values 0 through 255, and the corresponding square roots are 0 through 16 (rounded). Construct a 16-entry table (zero-based) in which the n-th entry is the maximum argument for which the square root is n. Arguments above the last entry (240) have a square root of 16. The table would look like this:
0
2
6
12
20
etc.
You walk through the table and stop at the first value that is greater than or equal to your argument; the index of that entry is the square root. Example: square root of 18
set index to 0
value[0] = 0, is less than 18, go to the next entry
value[1] = 2, is less than 18, go to the next entry
value[2] = 6, is less than 18, go to the next entry
value[3] = 12, is less than 18, go to the next entry
value[4] = 20, is greater than or equal to 18, so sqrt(18) = 4
While the fast lookup table has a fixed execution time (just one lookup), here the execution time is longer for larger arguments.
For both methods, you can choose between a rounded and a truncated square root by choosing different values for the table.
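The compact walk, and the choice between rounded and truncated results, can be sketched as follows (Python for illustration; the names are hypothetical). The rounded limits follow the table above, n*n + n. The truncated limits, (n + 1)^2 - 1, are derived here rather than taken from the text.

```python
# Entry n holds the largest argument whose square root is n.
ROUNDED_LIMITS = [n * n + n for n in range(16)]           # 0, 2, 6, 12, 20, ..., 240
TRUNCATED_LIMITS = [(n + 1) ** 2 - 1 for n in range(16)]  # 0, 3, 8, 15, 24, ..., 255


def table_sqrt(x, limits):
    """Walk the table and return the index of the first entry >= x."""
    if not 0 <= x <= 255:
        raise ValueError("expected an 8-bit argument")
    for n, limit in enumerate(limits):
        if limit >= x:
            return n
    # Only reached with the rounded table, for arguments 241..255.
    return len(limits)
```

For example, `table_sqrt(15, ROUNDED_LIMITS)` gives 4, while `table_sqrt(15, TRUNCATED_LIMITS)` gives 3.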
Best Answer
Probably one of the cheapest and certainly one of the fastest options would be to use a CPLD such as this Xilinx one. The 64-pin package should just do, but I suggest the 100-pin package to leave some room. It offers pin-to-pin delays as low as 5 ns, is a single-chip solution, and costs only $2.86 in quantity 1. You can decide on the output function (count the ones, count the zeros, priority encoder, inverted-input priority encoder, or whatever) at any time and reprogram it.
Unlike most FPGAs, these CPLD parts contain internal non-volatile flash memory, so you don't need an external program storage component.
Unfortunately, the learning curve is a bit steep: you need the required 7 GB (or whatever) of (free) software, you have to learn an HDL (Verilog or VHDL) or the schematic method to describe the functionality, and you need a JTAG-USB programmer (cheap clones are available). Once you have learned it, though, you'll probably find many other applications.
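Before writing the HDL, it can help to have a small behavioural model of the candidate output functions to check a design against. The sketch below is such a model in Python, not CPLD code. The function names and the `width` parameter are assumptions. "Inverted-input priority encoder" is interpreted here as a priority encoder for active-low inputs, with the highest-numbered input taking priority.

```python
def count_ones(value, width):
    """Number of inputs that are 1."""
    return bin(value & ((1 << width) - 1)).count("1")


def count_zeros(value, width):
    """Number of inputs that are 0."""
    return width - count_ones(value, width)


def priority_encode(value, width):
    """Index of the highest input that is 1, or None if none is set."""
    for bit in range(width - 1, -1, -1):
        if (value >> bit) & 1:
            return bit
    return None


def inverted_priority_encode(value, width):
    """Priority encoder for active-low inputs: invert, then encode."""
    return priority_encode(~value & ((1 << width) - 1), width)
```

For a 4-bit input of `0b1011`, this model gives 3 ones, 1 zero, a priority-encoder output of 3, and an inverted-input priority-encoder output of 2.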