|
It is quite clear that multi-threaded programs are non trivial to design/implement without a certain skills and/or experience. Indeed, both the concept of parallelism behind and also running time considerations (architecture and also what has been done by the compiler!) have to be well understood. Due to the emergence of multi-cores machines (now become the standard), any programmer will end by deadling with this sort of parallel programming. Most of them will just jump to it after a quick look at some tutorials or using (and relying on) relevant frameworks and tools. At this point of the current note, I will refrain from devopping technical programming details. I would rather suggest some good starting point material
About timing
Well, the main thing we need to do in order to evaluate the impact of our multithread effort is measuring the execution time of our inner-concurrent program. This measure depend on the purpose of this information, which could be
- profiling for performance improving
- profiling for synchronization design
- estimating the total parallel execution time
- ...
Programmer are very keen (I can understand it) to see how much they have improved the execution time of the program, compare to its original (sequential) version. Sometime, this is done just to be convince that multithread is really rewarding (generally the first step for beginner or for people who do not really believe in it) or to make sure we are in a valid concurrency context (architecture, system, OS settings and/or restrictions, compiler capability, appropriate library/interface, available ressources,...). Whatever the case, we need to be aware of what we are really measuring. Keep in mind that the total execution time of a program (from the time it has been launched by the OS to the time it definitely completes) is a sum of different timings (not always exclusive)
- calculations
- IO (memory, disk, ...)
- inter-process communication
- user-synchronization overhead
- system-synchronization overhead (in case of ressource true/false sharing)
- processes/thread creations and releases
- voluntary/involuntary idle times
All the events related to the above timings are correlated. Thus, breaking this chain is the goal of a good parallel programming achievement. A common mistake when trying to measure the impact of a multithreaded code is to use a cpu-timing routine (cloc() in C or cpu_time() in matlab for instance)
program()t0 = clock(); launch thread1; waitfor thread1; launch thread2; waitfor thread2; t1 = clock(); total_time= (t1 - t0)/CLOCKS_PER_SEC; | end | program()t0 = clock(); launch thread1; launch thread2; waitfor thread1; waitfor thread2; t1 = clock(); total_time= (t1 - t0)/CLOCKS_PER_SEC; | end |
|
| Sequential execution of threads (1) | Parallel execution of threads (2) |
Assuming that we are running the same (logical) program in both cases, we should expect to have (2) running faster that (1) if executed on two CPUs or two COREs. However, the way we have measured our timing will display what will appear as a «surprising result». We will have for instance 1.256 seconds for (1) and 1.298 seconds for (2) (they will be more or less the same !!!). At this point, it is customary to blame our machine, compiler, OS, threads library, ..., one of them could be cause. Before suspecting the code or the system, first note that clock() is measuring the TOTAL cpu-time, which remains the same. It is like having two people doing the job of one, the global effort is at least the same. Moreover, clock() or whatever cpu-timing will not count idle time (sleep(n)) and system work. So, what you really want to (or should) measure is the wall clock time, (time(NULL), tic, toc , ...). However, using the two kind of measures could give some indication about blocking sequences within your code.
A good way to go deeper in timing and code inspection is to use specialized profiler tools like VTune or Intel Thread Profiler.
Sharing
I need to talk about sharing because of its direct impact on timings. Concurrent threads always share some ressources, and any sharing should be carefully handled by the programmer as this is one of the main source of performance/scalability degradation. Just imagine what happens when a tow-way road becomes one-way even on a very short portion and then become two-way again. There are two kinds of sharing. One is logical, in the sense that it comes from the program structure and is explicitly expressed by the programmer (mutual exclusions and/or synchronization statements). The other one is internal to the system and handled at running time (an intermediate skill in computer architecture is required to overcome this). The so-called false sharing belongs to the later, and is very hindering as it can cause the multi-threaded program running worse than sequentially (some suggested links [1], [2], [3], [4] ).
|