C++ – Running a process multiple times at the same time

Tags: c, multithreading, Ubuntu

I have a C++ program using the OpenCV library which takes an image as input and performs pose estimation, color detection, and PHOG. When I run this program from the command line it takes around 4-5 seconds to complete and uses around 60% CPU. When I try to run the same program from two different command-line windows at the same time, each process takes around 10-15 seconds to finish, both finishing at almost the same time, and the CPU usage reaches 100%.

I have a website which calls this C++ executable using an exec() call. So when two users upload an image at the same time, it takes more time, just as I described above for the command line. Is this because the C++ program involves heavy computation, so that once the CPU reaches 100% it slows down? But I read that the CPU reaching 100% is not a bad thing, as the computer is simply using its full capacity to run the program. So is this caused by my C++ program, or is it something to do with my server (computer) settings? It is probably not an Apache problem, because the slowdown also happens when I run the processes from the command line. I am using a quad-core processor, and all four cores reach 100% when I run the same process twice at the same time, so I think the work is being distributed across all the cores. I have a few more questions:

1) Can this be solved by using multithreading in my C++ code? I am not using it at the moment, but will multithreading make the C++ code more computationally expensive and increase the CPU usage (if that is the problem)?

2) What can be the reason for the slowdown? Are the processes in a queue, with each process being run for only a certain amount of time before the CPU switches to the other?

3) If this is because the program involves heavy computation, will it help if I change some functions to their OpenCV GPU equivalents?

4) Is there a way I can solve this problem? Any ideas or tips?

I have inserted the output of top when running one process and when running the same process twice at the same time:

Version5 is the process, running it once:
[top screenshot: one Version5 process running]
Two Version5 running at the same time:
[top screenshot: two Version5 processes running at the same time]

Thanks in advance.

Best Answer

The numbers you are reporting didn't surprise me once I considered what the program seems to be doing. It is computationally intensive with a lot of memory data fetches. File I/O would be slow in comparison, and I don't sense that it is an issue (yet).

Looking at your numbers, I'm guessing you expected that two runs of a 4-5 second job would take 8-10 seconds in total. I'd say that realistically the best you could expect is 9-12 seconds; nothing is ideal. By my 'mud math' that still leaves about 3 seconds of CPU time more than anticipated. That's a lot of CPU.

I'd target memory access contention as my first point of investigation. Cores on the same chip share the memory bus, and probably the on-chip cache(s) too. Two processes like this will both want to 'fill' the data cache. For example, P01 might do some calculations and use 50% of the data cache; then along comes P02, which wants 50% of that resource too. But there won't be 50% available, because the operating system needs some, and there are other (important) processes like your web server, etc.

Anyway, the expected outcome is that P02 will evict a significant chunk of P01's cached data, meaning P01 will find it has to go out to DRAM for some of the data it wants for subsequent calculations. If I'm at all close, you will get better overall performance by queuing your transformations and running them in sequence. You can test that in the command shell by putting P01 and P02 in a script and timing it:

    #!/bin/bash
    # run the two transformations back to back and report the elapsed time
    time ( ./P01 ; ./P02 )

If the elapsed time comes out under 12 seconds, then some resource contention is involved (not necessarily the memory cache). In that case, I recommend using some kind of queue.
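The crudest 'queue' is an exclusive lock that serializes the runs. This is only a sketch, assuming a Linux host (the question says Ubuntu); the wrapper name and the lock-file path are made up for illustration. The website keeps calling exec() as before, but points it at this wrapper instead of the binary directly:

    // runlocked.cpp -- serialize invocations of the transform binary.
    // Build: g++ runlocked.cpp -o runlocked
    // Usage: ./runlocked ./Version5 image.jpg
    #include <cstdio>
    #include <fcntl.h>
    #include <sys/file.h>
    #include <unistd.h>

    int main(int argc, char* argv[])
    {
        if (argc < 2) {
            std::fprintf(stderr, "usage: %s command [args...]\n", argv[0]);
            return 1;
        }

        // One well-known lock file; every caller must use the same path.
        int fd = open("/tmp/transform.lock", O_CREAT | O_RDWR, 0666);
        if (fd < 0) { perror("open"); return 1; }

        // Blocks until any other instance has finished, so at most one
        // transformation runs at a time. The lock is held across exec
        // and released automatically when the job exits.
        if (flock(fd, LOCK_EX) != 0) { perror("flock"); return 1; }

        execvp(argv[1], &argv[1]);   // replace ourselves with the real job
        perror("execvp");            // only reached if exec failed
        return 1;
    }

With that in place, a second upload simply blocks until the first run finishes, which gives the queued timing described next.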

At present, the (best case) elapsed time for both clients is 10 seconds each, because both are slowed down. If you queue, client one waits 5 seconds and client two waits 10 seconds, so client two is no worse off. Of course, that is not really scalable. However, a transformation queue is scalable if you can enlist one or more back-end machines to process jobs from the queue.
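As a sketch of what such a queue could look like: a single worker process drains a drop directory, one image at a time. The spool paths and the Version5 command line here are assumptions for illustration, not taken from your setup:

    // queue_worker.cpp -- sketch of a single-consumer transformation queue.
    // Assumes the web front end saves each uploaded image into the "inbox"
    // directory, and that both directories already exist.
    // Build: g++ -std=c++17 queue_worker.cpp -o queue_worker
    #include <chrono>
    #include <cstdlib>
    #include <filesystem>
    #include <string>
    #include <thread>
    #include <vector>

    namespace fs = std::filesystem;

    int main()
    {
        const fs::path inbox{"/var/spool/transform/in"};    // hypothetical paths
        const fs::path done{"/var/spool/transform/done"};

        for (;;) {
            // Snapshot pending jobs so we don't iterate a changing directory.
            std::vector<fs::path> jobs;
            for (const auto& entry : fs::directory_iterator(inbox))
                jobs.push_back(entry.path());

            for (const auto& job : jobs) {
                // Run the existing binary on one image at a time
                // (no shell escaping here; this is a sketch only).
                std::string cmd = "./Version5 " + job.string();
                std::system(cmd.c_str());
                fs::rename(job, done / job.filename());     // mark as finished
            }

            if (jobs.empty())   // nothing queued; don't spin the CPU
                std::this_thread::sleep_for(std::chrono::seconds(1));
        }
    }

Scaling out is then a matter of pointing workers on other machines at a shared spool directory.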

Since the transformation is queued, the user can do something else while it runs. That can hold off the resource contention issue, at least temporarily.

Your other job is to find out which resource is being thrashed. That may not be easy, depending on your machine and the low-level tools available (on Linux, perf stat can count cache misses, for example). I'd also check the source code algorithm and the optimisations your C++ build uses for the graphics processing. Some "speed-ups" work fine for single-threaded apps by turning off things that you might want in multi-threaded situations.

Since your machine is a server, it may not have a GPU at all. It might be worth considering some specialised iron to process this stuff. It usually comes down to how much you want the performance and how much you are willing to spend (time, thinking, money, resources) to get it.

Other things to look at are alternative processing libraries that might be faster or have more suitable configuration options. Can you re-structure your code to take better advantage of the library's way of doing things?

If you can identify one or more bottlenecks in your transformation code -- this can be done with a C++ profiler -- then you may be able to run the individual transforms in a staggered fashion so that overall throughput is improved. That's a big "if", but worth mentioning.
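If a full profiler feels like overkill, even crude std::chrono timing around each stage will show where the seconds go. A minimal sketch, assuming the three stages the question names (pose estimation, color detection, PHOG) are separate calls; the function names in the comments are hypothetical:

    #include <chrono>
    #include <iostream>

    int main()
    {
        using clock = std::chrono::steady_clock;
        auto ms = [](clock::time_point a, clock::time_point b) {
            return std::chrono::duration_cast<std::chrono::milliseconds>(b - a).count();
        };

        auto t0 = clock::now();
        // runPoseEstimation(img);    // hypothetical stage names
        auto t1 = clock::now();
        // runColorDetection(img);
        auto t2 = clock::now();
        // runPhog(img);
        auto t3 = clock::now();

        std::cout << "pose:  " << ms(t0, t1) << " ms\n"
                  << "color: " << ms(t1, t2) << " ms\n"
                  << "phog:  " << ms(t2, t3) << " ms\n";
    }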

FIRST, though, before committing more than another day to this: check with your 'business process owner' and find out exactly what the "OK" number of seconds is. If they say 17 seconds, you are looking good; just queue things up and let the user get on with something else on the page.

I'll be interested to know whether the two processes run in series are quicker than the two in parallel.

Good luck