Are there concurrent programming techniques and practices that one should no longer use? I'd say yes.
One early concurrent programming technique that seems rare nowadays is interrupt-driven programming. This is how UNIX worked in the 1970s. See the Lions Commentary on UNIX or Bach's Design of the UNIX Operating System. Briefly, the technique is to disable (mask) interrupts temporarily while manipulating a data structure, and then to restore them afterward. The BSD spl(9) man page has an example of this style of coding. Note that the interrupts are hardware-oriented, and the code embodies an implicit relationship between the kind of hardware interrupt and the data structures associated with that hardware. For example, code that manipulates disk I/O buffers needs to mask interrupts from the disk controller hardware while working with those buffers.
This style of programming was employed by operating systems on uniprocessor hardware. It was much rarer for applications to deal with interrupts. Some OSes had software interrupts, and I think people tried to build threading or coroutine systems on top of them, but this wasn't very widespread. (Certainly not in the UNIX world.) I suspect that interrupt-style programming is confined today to small embedded systems or real-time systems.
Semaphores are an advance over interrupts because they are software constructs (not related to hardware), they provide abstractions over hardware facilities, and they enable multithreading and multiprocessing. The main problem is that they are unstructured. The programmer is responsible for maintaining the relationship between each semaphore and the data structures it protects, globally across the entire program. For this reason I think bare semaphores are rarely used today.
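To make that lack of structure concrete, here's a minimal sketch in Java (the class and names are mine, not from any particular codebase): the semaphore and the list it guards are entirely separate objects, and nothing but programmer discipline connects them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Semaphore;

// Guarding shared data with a bare semaphore. Nothing in the program ties
// the semaphore to the list it protects; that pairing exists only in the
// programmer's head, and every piece of code touching the list must honor it.
public class SemaphoreGuard {
    static final Semaphore mutex = new Semaphore(1);      // binary semaphore
    static final List<Integer> sharedList = new ArrayList<>();

    static void add(int value) throws InterruptedException {
        mutex.acquire();            // must remember to acquire *this* semaphore...
        try {
            sharedList.add(value);  // ...before touching *this* data structure
        } finally {
            mutex.release();        // ...and to release it on every exit path
        }
    }

    public static void main(String[] args) throws Exception {
        Runnable work = () -> {
            try {
                for (int i = 0; i < 1000; i++) add(i);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        Thread t1 = new Thread(work), t2 = new Thread(work);
        t1.start(); t2.start();
        t1.join(); t2.join();
        System.out.println(sharedList.size());  // prints 2000: no lost updates
    }
}
```

Forget the acquire in just one call site anywhere in the program and the protection silently evaporates — which is exactly the global-reasoning burden described above.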
Another small step forward is the monitor, which encapsulates the concurrency control mechanisms (locks and conditions) together with the data they protect. This was carried over into the Mesa system and from there into Java. (If you read the Mesa paper, you can see that Java's monitor locks and conditions are copied almost verbatim from Mesa.) Monitors are helpful in that a sufficiently careful and diligent programmer can write concurrent programs safely using only local reasoning about the code and data within the monitor.
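Here's a rough sketch of a classic monitor in Java — a bounded buffer (the names are illustrative). Note the while loops around wait(): that's Mesa's "a signal is only a hint" semantics carried straight into Java, where a woken thread must re-check its condition.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// A monitor: the lock (synchronized) and condition queue (wait/notifyAll)
// live inside the object, next to the data they protect, so correctness
// can be argued locally within this class.
public class BoundedBuffer {
    private final Deque<Integer> items = new ArrayDeque<>();
    private final int capacity;

    public BoundedBuffer(int capacity) { this.capacity = capacity; }

    public synchronized void put(int x) throws InterruptedException {
        while (items.size() == capacity)   // Mesa semantics: re-check after wakeup
            wait();
        items.addLast(x);
        notifyAll();                       // wake any waiting consumers
    }

    public synchronized int take() throws InterruptedException {
        while (items.isEmpty())
            wait();
        int x = items.removeFirst();
        notifyAll();                       // wake any waiting producers
        return x;
    }
}
```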
There are additional library constructs, such as those in Java's java.util.concurrent package, which includes a variety of highly concurrent data structures and thread pooling constructs. These can be combined with additional techniques such as thread confinement and effective immutability. See Java Concurrency in Practice by Goetz et al. for further discussion. Unfortunately, many programmers still roll their own data structures with locks and conditions, when they really ought to be using something like ConcurrentHashMap, where the heavy lifting has already been done by the library authors.
All of the approaches above share a significant characteristic: multiple threads of control that interact over globally shared, mutable state. The problem is that programming in this style remains highly error-prone. It's quite easy for a small mistake to go unnoticed, resulting in misbehavior that is hard to reproduce and diagnose. It may be that no programmer is "sufficiently careful and diligent" to develop large systems in this fashion; at the very least, very few are. So I'd say that multi-threaded programming with shared, mutable state should be avoided if at all possible.
Unfortunately it's not entirely clear whether it can be avoided in all cases. A lot of programming is still done in this fashion. It would be nice to see this supplanted by something else. Answers from Jarrod Roberson and davidk01 point to techniques such as immutable data, functional programming, STM, and message-passing. There is much to recommend them, and all are being actively developed. But I don't think they've fully replaced good old-fashioned shared mutable state just yet.
EDIT: Here are my responses to the specific questions at the end.
I don't know much about OpenMP. My impression is that it can be very effective for highly parallel problems such as numeric simulations. But it doesn't seem general-purpose. The semaphore constructs seem pretty low-level and require the programmer to maintain the relationship between semaphores and shared data structures, with all the problems I described above.
If you have a parallel algorithm that uses semaphores, I don't know of any general techniques to transform it. You might be able to refactor it into objects and then build some abstractions around it. But if you want to use something like message-passing, I think you really need to reconceptualize the entire problem.
Will my application magically see and use multiple cores when run on a multi-core processor (because everything is managed either by the operating system or by the standard thread library), or do I have to modify my code to be aware of the multiple cores?
Simple answer: Yes, it will usually be managed by the operating system or threading library.
The threading subsystem in the operating system will assign threads to processors on a priority basis (your option 1). In other words, when a thread has finished executing for its time allocation or blocks, the scheduler looks for the next highest priority thread and assigns that to the CPU. The details vary from operating system to operating system.
That said, options 2 (managed by the programming language) and 3 (explicitly) exist. For example, the Task Parallel Library and async/await in recent versions of .NET give the developer a much easier way to write parallelizable code (i.e., code that can run concurrently with itself). Functional programming languages lend themselves to parallelization because they avoid shared mutable state, and some runtimes will run different parts of a program in parallel where possible.
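A Java analog of option 2 (a sketch, with made-up names): with a parallel stream, the runtime decides how to split the work across cores, and the code only declares that the reduction is safe to parallelize.

```java
import java.util.stream.LongStream;

// Language/runtime-managed parallelism: the common fork/join pool fans the
// range out over the available cores; no explicit thread code is written.
public class ParallelSum {
    static long sumTo(long n) {
        return LongStream.rangeClosed(1, n)
                         .parallel()   // runtime chooses the split and the threads
                         .sum();
    }

    public static void main(String[] args) {
        System.out.println(sumTo(1_000_000));  // prints 500000500000
    }
}
```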
As for option 3 (explicitly), Windows allows you to set thread affinity (specifying which processors a thread may run on). However, this is usually unnecessary in all but the most response-time-critical systems. Effective thread-to-processor allocation is highly hardware-dependent and very sensitive to the other applications running concurrently.
If you want to experiment, create a long running, CPU intensive task like generating a list of prime numbers or creating a Mandelbrot set. Now create two threads in your favorite library and run both threads on a multi-processor machine (in other words, just about anything released in the last few years). Both tasks should complete in roughly the same time because they are run in parallel.
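Here is a sketch of that experiment in Java (the prime-counting bound is arbitrary): two identical CPU-bound tasks run on separate threads, and on a machine with two free cores the pair should finish in roughly the wall-clock time of one.

```java
// Two identical CPU-bound tasks (naive trial-division prime counting) run on
// separate threads; the OS scheduler places them on different cores.
public class ParallelPrimes {
    static int countPrimes(int limit) {
        int count = 0;
        for (int n = 2; n < limit; n++) {
            boolean prime = true;
            for (int d = 2; (long) d * d <= n; d++)
                if (n % d == 0) { prime = false; break; }
            if (prime) count++;
        }
        return count;
    }

    public static void main(String[] args) throws InterruptedException {
        Runnable task = () -> countPrimes(2_000_000);
        long start = System.nanoTime();
        Thread t1 = new Thread(task), t2 = new Thread(task);
        t1.start(); t2.start();
        t1.join(); t2.join();
        // On a multi-core machine this should be close to the single-task time,
        // not double it.
        System.out.printf("two tasks took %d ms%n",
                          (System.nanoTime() - start) / 1_000_000);
    }
}
```

To complete the experiment, time countPrimes once on its own and compare it with the two-thread run.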
If you use C++11, threading is part of the standard library, and components are thread-safe where it makes sense: for example, the reference counting in smart pointers such as std::shared_ptr is thread-safe, though access to the pointed-to object is not (and collections generally require you to lock them yourself).
If you are using Boost, have a look at Boost.Thread. It is the basis for what was standardized in C++11 (many of C++11's new library features come from Boost).