C++ – an efficient data structure for syntax highlighting in text editors

Architecturecdata structureseditor

I'm creating a very small text editor in C++ with the ncurses library. So far, it works great. I have implemented the Gap Buffer data structure to make the editing more efficient than a line-based buffer. I have considered the Ropes data structure, but it seemed very complex for this.

However, I'm now trying to implement syntax highlighting and I can't think of a decent way of doing this. Should I highlight the entire file, or just the current view (not highlighting the non-viewable parts of the file)?

Should I give in and use hopes, storing the micro-buffers with style data (bold, color, etc.)?

Best Answer

I don't think you should have syntax highlighting information directly in your text buffer. Instead, I would add additional data structures for the display code.

Here's why:

Once you're providing functionality like selections etc, you'll probably need an anchor concept (a steady pointer to a specific location in the buffer, even when characters are inserted or deleted before that location - see below for implementation idea). If you're dealing with longer texts, you might also need these anchors to provide a fast index into your buffer for line beginnings (since you're using a gap buffer, you need some way to translate line numbers into buffer positions).

What I'm getting at is that you need several supporting data structures besides the actual text buffer in order to provide fast standard editor commands and display the currently visible buffer. Despite your decision to use a gap buffer, editors are line-based and you'll need to support that somehow. So why not write a line-based syntax highlighter which will take the text buffer, an anchor into it (which should ideally be the beginning of a line) and a highlighting state and output a list of ("text fragment", "style information") pairs up to the end of the line, which your display code will use to actually display this line? If it's not fast enough to do on the fly, you can create a cache of these lists indexed by the line number. You could probaby do it for the whole file at once, too, but I suspect that performance will suffer if you do that every time you move around in your file.

You'll need the highlighter state as an input because of multi-line tokens such as long comments or strings. Since usually an editor will display a file from the beginning when it is first opened, you can start with an "all clear" state, call the highlighter code to spit out the first line's highlighting information, and keep the highlighter state at the end of the line (eg, "currently in a multi-line comment" or "all-clear") as the state of the highlighter at the beginning of the next line.

So, basically, I'd suggest implementing anchors and then keeping around an index to quickly translate line numbers into text buffer anchors. Then I'd augment this index to not only provide anchors into the text buffer but also a cache of syntax highlighter output for that line and highlighter state at the the beginning of this line. So if the user is editing a line, you don't have to restart the highlighter at the very top of the buffer; you simply restart highlighting at the beginning of the line; and if you're clever, you can also avoid having to highlight the whole rest of the buffer (I'm thinking you stop highlighting as soon as you reach either the bottom of the visible area or a line with the same state your highlighter is actually in, but flag the following highlighter states as "possibly outdated" if the highlighter state doesn't match the cached one for the next line.

I think you can get away with only highlighting very small sections of your buffer this way, usually just a single line, except for editor commands like "jump to the end of the file", in which case you'll need to syntax-highlight the whole buffer in order to determine the state the highlighter should be in at then end of the file.

Edit: How to implement anchors

To implement anchors, you can use events and listeners. Whenever the buffer gets a command to insert or delete text, it notifies all it's listeners that these events are about to happen. Anchors then subscribe to these events as listeners, and update themselves based on the kind of event. For example, if three characters are inserted before the anchor position, you need to add 3 to the anchor position; if characters are inserted after the anchor position, no action is needed, etc. So, basically, an anchor is an object that keeps track of a buffer offset and subscribes to change events from the gap buffer.