One name for this kind of requirement is "Unit of Order".
For example, WebLogic JMS provides this feature: conceptually, you attach a UOO identifier to a message when you put it on a queue. WebLogic JMS guarantees that messages with the same UOO are processed in chronological order, while messages under different UOOs can be processed in parallel.
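The same idea can be sketched in plain Java without any messaging product: route every message whose key (the "unit of order") is the same to the same single-threaded executor, so messages within a key stay in order while different keys run in parallel. This is only an illustration of the concept, not WebLogic's actual API; the class and method names here are invented for the sketch.

```java
import java.util.List;
import java.util.concurrent.*;

// Conceptual "Unit of Order" sketch: messages sharing a key are processed
// sequentially, while messages with different keys proceed in parallel.
// (Not WebLogic's API; names are hypothetical.)
public class UnitOfOrderDemo {
    private final ExecutorService[] lanes;

    public UnitOfOrderDemo(int parallelism) {
        lanes = new ExecutorService[parallelism];
        for (int i = 0; i < parallelism; i++) {
            // One single-threaded executor per "lane" preserves FIFO order.
            lanes[i] = Executors.newSingleThreadExecutor();
        }
    }

    public void submit(String unitOfOrder, Runnable task) {
        // Every message with the same UOO hashes to the same lane,
        // so those messages execute in submission order.
        int lane = Math.floorMod(unitOfOrder.hashCode(), lanes.length);
        lanes[lane].execute(task);
    }

    public void shutdown() throws InterruptedException {
        for (ExecutorService lane : lanes) {
            lane.shutdown();
            lane.awaitTermination(5, TimeUnit.SECONDS);
        }
    }

    public static void main(String[] args) throws Exception {
        UnitOfOrderDemo demo = new UnitOfOrderDemo(4);
        List<Integer> orderA = new CopyOnWriteArrayList<>();
        for (int i = 0; i < 5; i++) {
            final int n = i;
            demo.submit("order-A", () -> orderA.add(n));
        }
        demo.shutdown();
        System.out.println(orderA);  // always [0, 1, 2, 3, 4]
    }
}
```

Real brokers do essentially this at scale, with persistence and failover on top.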
It all depends on the platform or technology you use to implement it; I believe there are many ways to achieve a similar effect.
I hear threads are expensive. While that isn't an issue yet, I don't want to get into the habit of spawning a new thread for everything if there's a better way.
Threads have overhead. The longer a thread lives, the less of a concern the overhead is. For example, let us assume a thread has a ton of overhead and requires 100ms to spin up. However, the thread lives for an hour. That is a tiny fraction of its life, so who cares? If you can bury the thread initialization in application startup when users expect delays anyway, all the better. If the thread lived for less than a second, then the overhead is a bit much and maybe you should consider alternatives.
That being said, do not spawn threads to do nothing. Each thread should have a purpose.
I'm not sure if I should sacrifice "separation of concerns" to save using threads. In theory, the message receive/broadcast jobs could be combined; although they're separate duties.
In my experience, multithreaded applications need multiple threads. Combining concerns, such as the listener and the broadcaster in this case, will increase complexity to the point of being unmanageable. It will create bugs. The key question is: where do the threads block? If you have a listener that blocks on accepting a socket, it makes no sense for it to also handle requests. One of its concerns will suffer: requests will be delayed, or new connections might fail. Just create more threads.
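To make the blocking point concrete, here is a minimal sketch of the split: one thread does nothing but block on accept(), and hands each connection to a worker pool. The echo protocol and class names are invented for the example; the main method exercises the server once over loopback.

```java
import java.io.*;
import java.net.*;
import java.util.concurrent.*;

// Sketch: one thread blocks on accept(), a pool handles requests,
// so neither concern starves the other. (Hypothetical echo protocol.)
public class Listener {
    public static void main(String[] args) throws Exception {
        ExecutorService workers = Executors.newFixedThreadPool(4);
        try (ServerSocket server = new ServerSocket(0)) {  // OS-assigned port
            Thread acceptor = new Thread(() -> {
                while (!server.isClosed()) {
                    try {
                        Socket client = server.accept();     // blocks here only
                        workers.execute(() -> handle(client)); // work goes to pool
                    } catch (IOException e) {
                        break;  // server socket closed; shut down
                    }
                }
            });
            acceptor.start();

            // Exercise it once from the same process:
            try (Socket s = new Socket("localhost", server.getLocalPort());
                 PrintWriter out = new PrintWriter(s.getOutputStream(), true);
                 BufferedReader in = new BufferedReader(
                         new InputStreamReader(s.getInputStream()))) {
                out.println("hello");
                System.out.println(in.readLine());
            }
        }
        workers.shutdown();
    }

    private static void handle(Socket client) {
        try (BufferedReader in = new BufferedReader(
                     new InputStreamReader(client.getInputStream()));
             PrintWriter out = new PrintWriter(client.getOutputStream(), true)) {
            out.println("echo: " + in.readLine());
        } catch (IOException ignored) { }
    }
}
```

Notice that a slow handler delays only its own worker thread; the acceptor keeps accepting.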
Note 1: you should be using thread primitives, not raw threads. Since you tagged this question java I will point to the specific class: ExecutorService. The Executors class provides several factory methods to make it easy to create them for various tasks. The basic idea is you submit tasks (as Runnable or Callable instances, getting back a Future for each result) and let the framework manage the threads.
This has two primary advantages:
Managing threads can be hard. While it appears easy, there is a lot of boilerplate logic that is easy to screw up, resulting in weird, difficult to replicate (and fix) bugs. Push this responsibility into a robust framework that is well-tested and used in millions of other programs. If there were bugs then Oracle (or Microsoft for .NET, or whatever other vendor) would have found them by now.
Your program's purpose is not "to manage threads"; it is "to run a chat server." Let the framework handle the threading concerns, and focus on the chat server aspect. By focusing on the tasks and not the threads, your code is more concise, expressive, and clear.
I wrote up an answer to this old Stack Overflow question that shows a use case for this. I do not want to derail the focus of this answer too much, so you can read more there, as well as by searching both Programmers and SO: Creating an unknown amount of threads in Java?
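To show how little code the task-oriented style needs, here is a minimal ExecutorService sketch: a Callable submitted for a result, and a Runnable submitted fire-and-forget, with the pool managing the threads.

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: describe tasks, let ExecutorService manage the threads.
public class Tasks {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);

        // A Callable returns a value; submit() hands back a Future for it.
        Future<Integer> sum = pool.submit(() -> 1 + 2 + 3);

        // A Runnable returns nothing; useful for fire-and-forget work.
        AtomicInteger counter = new AtomicInteger();
        pool.execute(counter::incrementAndGet);

        System.out.println("sum = " + sum.get());  // get() blocks until done

        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
        System.out.println("counter = " + counter.get());
    }
}
```

No thread lifecycle code anywhere: creation, reuse, and teardown all live in the framework.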
Note 2: given that this appears to be a desktop/server application as opposed to embedded, there is not much need to be concerned with spawning threads. Modern CPUs (amd64, Intel Core) have multiple cores on-die, typically between two and six. Some models support simultaneous multithreading (hyper-threading), adding more logical cores that can execute code concurrently. Add in time slicing with today's high clock speeds, and there is no need to concern oneself with spawning too many threads as long as the threads are reasonable. In the case of your question, the threads appear to be performing small bits of work and possibly blocking on I/O. You are not spawning a million threads for digital video encoding, for example.
Even embedded architectures, where speed and concurrency are not as strong as on desktop and server CPUs, should have no problem handling a few threads. My smartphone and tablet both run many applications in the background, with zero performance problems even with many applications and threads running.
Note 3: if you are interested in learning more about the tradeoffs of a single-threaded algorithm vs. splitting it into pieces and using concurrency, there are a few topics worth exploring. Given the problem described in your question this is not an immediate need, but could make for some interesting and educational reading.
MapReduce is an algorithm for splitting a task into pieces that do not rely on each other, performing those tasks, and joining the results together.
Parallelism in databases involves multiple threads querying data and joining the results. There has been a lot of research into the tradeoffs of using multiple threads for a single database query and a lot of information is out there. Outside of schema/query optimization this is very useful information from a more theoretical perspective.
Best Answer
There are different ways to do it, but if you are inclined to stick with POCO, you may want to look at the macchina.io (OSP portion) WebEvent implementation - it is essentially a pub/sub messaging framework. There's more there than what you need but it's relatively simple and architecturally you should be able to quickly tailor it to your needs. I have used it in production for many years and it works well; it will also be ported in an OSP-independent form to Poco for one of the next releases.
A client can be either (1) a WebSocket endpoint or (2) an in-process observer; either can send data (i.e. post events) and/or subscribe (i.e. receive notifications) to one or more subjects (topics). You'll probably need many in-process observers and one remote endpoint.
The framework runs in two threads handling:
Main queue - responsible for dispatching subscribe/unsubscribe request events from clients.
Worker queue - responsible for dispatching the data events (messages).
Each queue is serviced by its own thread, and there is a dotted-notation naming scheme for subject names; see here for details. Note that the documentation only mentions WebSockets, but naming works exactly the same for in-process observers, and you may want or need a different naming scheme.
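The two-queue design described above is language-agnostic, so here is a minimal sketch of it in Java (to match the earlier, java-tagged answer) rather than C++. The actual macchina.io WebEvent API is different and far more capable; all class and method names here are invented: one single-threaded executor serializes subscribe/unsubscribe requests, another dispatches data events, so the two concerns never contend.

```java
import java.util.*;
import java.util.concurrent.*;
import java.util.function.Consumer;

// Hypothetical sketch of the main-queue/worker-queue pub/sub split.
// (Not the macchina.io API; names are invented for illustration.)
public class MiniPubSub {
    private final ExecutorService mainQueue = Executors.newSingleThreadExecutor();
    private final ExecutorService workerQueue = Executors.newSingleThreadExecutor();
    private final Map<String, List<Consumer<String>>> subscribers = new HashMap<>();

    public void subscribe(String subject, Consumer<String> observer) {
        // Subscription changes are serialized on the main queue's thread,
        // so the subscriber map needs no extra locking.
        mainQueue.execute(() -> subscribers
                .computeIfAbsent(subject, s -> new ArrayList<>()).add(observer));
    }

    public void publish(String subject, String message) {
        mainQueue.execute(() -> {   // snapshot subscribers on the main queue...
            List<Consumer<String>> subs = new ArrayList<>(
                    subscribers.getOrDefault(subject, Collections.emptyList()));
            workerQueue.execute(() ->            // ...dispatch on the worker queue
                    subs.forEach(s -> s.accept(message)));
        });
    }

    public void shutdown() throws InterruptedException {
        mainQueue.shutdown();
        mainQueue.awaitTermination(5, TimeUnit.SECONDS);
        workerQueue.shutdown();
        workerQueue.awaitTermination(5, TimeUnit.SECONDS);
    }

    public static void main(String[] args) throws Exception {
        MiniPubSub bus = new MiniPubSub();
        List<String> received = new CopyOnWriteArrayList<>();
        bus.subscribe("chat.room1", received::add);   // dotted subject name
        bus.publish("chat.room1", "hello");
        bus.publish("chat.room2", "nobody listening");
        bus.shutdown();
        System.out.println(received);  // [hello]
    }
}
```

A slow observer delays only the worker queue, never the handling of subscribe/unsubscribe requests, which is the point of the split.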