Multi-core programming presents different challenges than traditional parallel computing. In this article we will explore a programming paradigm called ‘the dispatcher’ and its implementation in a multi-core environment.
This post presents the subject and discusses design considerations, together with a few illustrative sketches; fuller code examples will be presented in a later post.
Multi-Core vs. Multi-Processor
Multi-core environments tend to be a bit different from multi-processor ones. Here are the two major processor-specific factors we address in this article:
Cache – Multi-core CPUs usually have smaller L1 caches, and the L2 cache is shared between cores. This calls for small, specialized code: large code will cause many instruction cache misses and degrade performance.
Bus – The inter-core bus is an important factor. If the bus is slow, data transfer between cores becomes a major bottleneck.
The Dispatcher
The dispatcher model assumes an incoming queue of messages (or tasks) that need processing. Each processing unit (from here on referred to simply as a ‘core’) can work in parallel with the others. Usually, once processing is complete, the message is forwarded to its destination. Prime examples of the dispatcher model are packet processing devices (routers, firewalls, etc.) and graphics processing units.
As an analogy, we can picture the dispatcher as a restaurant kitchen, where waiters bring in orders and the kitchen team prepares the dishes and hands them back to the waiters for serving.
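To make the model concrete, here is a minimal sketch, in C with POSIX threads, of a message type and a thread-safe input queue. All the type and field names are illustrative assumptions rather than a specific API; the two models discussed below differ only in what each core does after popping a message.

```c
#include <pthread.h>
#include <stddef.h>

/* Illustrative message and queue types; the names are assumptions
 * made for this sketch, not part of any specific API. */
typedef struct message {
    void  *payload;          /* pointer to the data to process     */
    size_t len;              /* payload length in bytes            */
    struct message *next;    /* intrusive link for the input queue */
} message_t;

typedef struct {
    message_t      *head, *tail;
    pthread_mutex_t lock;      /* protects head and tail           */
    pthread_cond_t  nonempty;  /* signaled when a message arrives  */
} msg_queue_t;

static msg_queue_t input_queue = {
    .lock     = PTHREAD_MUTEX_INITIALIZER,
    .nonempty = PTHREAD_COND_INITIALIZER,
};

/* Enqueue at the tail and wake one waiting core. */
static void queue_push(msg_queue_t *q, message_t *m)
{
    m->next = NULL;
    pthread_mutex_lock(&q->lock);
    if (q->tail)
        q->tail->next = m;
    else
        q->head = m;
    q->tail = m;
    pthread_cond_signal(&q->nonempty);
    pthread_mutex_unlock(&q->lock);
}

/* Dequeue from the head; blocks until a message is available. */
static message_t *queue_pop(msg_queue_t *q)
{
    pthread_mutex_lock(&q->lock);
    while (q->head == NULL)
        pthread_cond_wait(&q->nonempty, &q->lock);
    message_t *m = q->head;
    q->head = m->next;
    if (q->head == NULL)
        q->tail = NULL;
    pthread_mutex_unlock(&q->lock);
    return m;
}
```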
Pipeline vs. Run-to-Completion
There are two major models for the dispatcher. The pipeline model assigns a specific task to each processing unit, whereas in the run-to-completion model each processing unit handles a single message from start to end.
The Pipeline Model
We would like to implement a pipeline in our restaurant. We teach each person a specific task: one will be in charge of sauces, another of garnish, and so on. When an order arrives, the plate goes from one cook to another, each performing their part of the work. Finally, the plate is returned to the counter to be picked up by the waiter.
In the packet processing world, we can use one core to route packets, another to enforce an ACL (access control list), and so on.
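A pipeline stage, built on the illustrative queue above, can be sketched as a loop that pops a message pointer from its own inbound queue, performs its single task, and pushes the pointer onward. The stage structure, the task names, and deliver() are all invented for this example:

```c
/* Hypothetical final hand-off once the last stage is done. */
void deliver(message_t *m);

/* One pipeline stage: it performs exactly one specialized task and
 * forwards only the message pointer, never the payload itself. */
typedef struct stage {
    msg_queue_t   in;                  /* this stage's inbound queue  */
    struct stage *next;                /* next stage, NULL if last    */
    void        (*task)(message_t *);  /* e.g. route, enforce an ACL  */
} stage_t;

static void *stage_loop(void *arg)
{
    stage_t *s = arg;
    for (;;) {
        message_t *m = queue_pop(&s->in);   /* wait for work          */
        s->task(m);                         /* the one trained task   */
        if (s->next)
            queue_push(&s->next->in, m);    /* pass the pointer along */
        else
            deliver(m);                     /* end of the pipeline    */
    }
    return NULL;
}
```

Note that only the pointer crosses cores here; this anticipates the rule of thumb discussed under the limitations below.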
Analysis
The pipeline model has several advantages:
- Specialization – Each processing unit specializes in a specific task. In the restaurant analogy this means less training: we train each person only for their own task. In the multi-core model it means that less code runs on each core. When the code is small, it can be optimized to fit in the instruction cache, increasing overall performance.
- Flexibility – The pipeline model is very flexible: if we see that a certain task slows down the entire process, we can assign another core to that task. For example, if all the dishes wait a long time for garnish, we move a person from another task over to garnish.
Note: Reassignment is very expensive to do at runtime, so it must not be done too often or we lose other benefits (see strong affiliation below).
- Shared Data Locks – This model usually needs only a few locks on shared data structures, since not all the cores access all the data structures.
The pipeline model has several limitations:
- Data duplication – Data must travel between different processing units. If the data is large, this will clog the bus, and messages will spend most of their time waiting on the bus.
An important rule of thumb in the pipeline model is to transfer as little data as possible, preferably just a pointer, between the cores.
- Strong affiliation – Since each task is assigned to a specific processing unit, the code is said to be affiliated with that core. If we decide at runtime to change the task of a core, we lose the entire instruction and data cache for a significant amount of time.
In the restaurant analogy, this is similar to a person who was trained to cook fish and now must be retrained to prepare sauces.
- Message Locks – Since we need to exchange data between cores, if we write to the message itself, we will almost certainly need to lock it in transit. This means multiple lock operations per message.
- Robustness – What happens if one of the processing units gets stuck? Without supervision, this will cause the entire process to fail. If data gets corrupted in one core, the entire message is corrupted. The pipeline therefore requires a strong watchdog to act when something goes wrong. See the control point section below.
- Unfair Work Division – Let’s assume that in our kitchen one person is in charge of fish and another of desserts. On a slow fish day, our fish cook is mostly idle. In a multi-core system this means that some cores might idle while the system runs at full load. This can be handled by dynamically reassigning cores to tasks, but as explained earlier, at some cost.
Control Point
The pipeline model requires a strong controlling process to make sure nothing goes wrong. The control process will usually have a core dedicated to the task, or even an entire dedicated CPU for extra robustness (in case an entire processor needs to be restarted).
Some of the control point roles are:
- Message Handling Time Limit – Putting an upper limit on the amount of time a message can spend in a core is usually a good idea. This helps detect deadlocks and non-uniform performance (a sketch follows this list).
- Core Reassignment – A watchdog must be prepared to remove a core from the pipeline or change its task. This helps deal with major faults and keeps the task division fair.
- Command and Control Central Point – The control point is the central place where all control and configuration commands are processed. It is usually a bad idea to give user commands directly to processing cores: user commands can be erroneous and cause system instability. The control point must ensure that commands are safe and track them through to complete execution. If a control command fails or causes instability, the watchdog must re-stabilize the system and notify the user of the error.
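As an illustration of the time-limit role, a watchdog can poll per-core timestamps that each worker publishes when it starts a message. Everything in this sketch (the field names, the 100 ms budget, the recover_core() hook) is a made-up example rather than a prescribed design:

```c
#include <stdatomic.h>
#include <time.h>
#include <unistd.h>

#define NUM_CORES     8
#define MSG_BUDGET_NS 100000000LL  /* illustrative 100 ms per message */

/* Each worker stores the time it started its current message
 * (0 means the core is idle). */
static _Atomic long long start_ns[NUM_CORES];

/* Hypothetical recovery hook: restart the core or reassign its task. */
void recover_core(int core);

static long long now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

static void *watchdog_loop(void *arg)
{
    (void)arg;
    for (;;) {
        long long now = now_ns();
        for (int c = 0; c < NUM_CORES; c++) {
            long long started = atomic_load(&start_ns[c]);
            if (started != 0 && now - started > MSG_BUDGET_NS)
                recover_core(c);  /* exceeded its per-message budget */
        }
        usleep(10000);            /* poll every 10 ms                */
    }
    return NULL;
}
```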
The Run-to-Completion Model
Let’s assume that in our restaurant we choose a different model: every person handles a dish from beginning to end. Everyone is trained to perform all the tasks involved, and from the moment an order arrives, one person prepares it without interruption until it is finished.
This is very similar to the thread-pool model, but here we have a guarantee that a dedicated core runs each message from beginning to end, uninterrupted.
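With the same illustrative queue from earlier, a run-to-completion worker is simply a loop that pops a message from the single shared queue and carries it through every step itself. The step functions are placeholders for whatever processing the system actually performs:

```c
/* Placeholder processing steps; in a packet processing device these
 * might be parsing, routing and ACL enforcement. */
void parse(message_t *m);
void route(message_t *m);
void enforce_acl(message_t *m);
void deliver(message_t *m);

/* Every core runs this same loop and handles each message from
 * start to finish, with no hand-off to another core. */
static void *worker_loop(void *arg)
{
    msg_queue_t *in = arg;          /* the single shared input queue */
    for (;;) {
        message_t *m = queue_pop(in);
        parse(m);
        route(m);
        enforce_acl(m);
        deliver(m);
    }
    return NULL;
}
```

Scaling up is then just a matter of starting another worker_loop thread on another core.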
Analysis
Let’s go over the advantages of the run-to-completion model:
- Independence – Every processing unit is independent and no data is transferred between cores. Since there are no interruptions, it is easy to measure how long each processing unit takes to complete a task, which makes real-time guarantees possible.
- Scalability – Adding more processing units is easy. Since all the cores are symmetric, adding a core simply adds another worker to the pool.
- Message Locks – The message does not need to be locked. From start to end it is accessed only by a single core.
The model has several shortcomings:
- Large code – All the processing units run all the tasks, so every core needs to run a lot of code. If the instruction cache cannot hold it, the result is many cache misses and a performance penalty.
- Shared Data Locks – We will almost always need to share resources between all cores. When these resources are modified, they need to be locked, causing a performance penalty.
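For example, a shared routing table could be protected with a reader-writer lock, letting many cores read concurrently while serializing the rarer modifications. This is a rough sketch of one option, not the only way to handle shared state:

```c
#include <pthread.h>

/* A lock around a shared resource that all cores consult. */
static pthread_rwlock_t table_lock = PTHREAD_RWLOCK_INITIALIZER;

void lookup_route(message_t *m)
{
    pthread_rwlock_rdlock(&table_lock);  /* many readers in parallel */
    /* ... consult the routing table for m ... */
    pthread_rwlock_unlock(&table_lock);
}

void update_route(void)
{
    pthread_rwlock_wrlock(&table_lock);  /* exclusive: readers drain */
    /* ... modify the routing table ... */
    pthread_rwlock_unlock(&table_lock);
}
```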
Control Point
The control point in the run-to-completion model should perform tasks similar to those in the previous model. Controlling the cores is usually easier, since there is no difference between the cores and there are fewer scenarios to deal with.
Common Issues
- Core affiliation – In both models it is imperative that code keeps running on the same core, both to benefit from the processor cache and to maintain better control over the process. If the controller does not know that a specific task runs on core X, it will have a hard time tracking its status (see the first sketch after this list).
- The Input Queue – The design of the input queue has a major influence on the overall process. The input queue is usually designed as a FIFO, or as a priority queue if QoS is required. The queue is the single entry point to the system, so it creates a natural bottleneck: an inefficient queue will limit the number of messages entering the system before a single processing instruction is executed.
- Bus efficiency – Inter-core and I/O buses can create a bottleneck if the cores transfer data between them; once again, data should be moved as little as possible during processing. Fast buses can sometimes compensate for little or no data cache.
- Instruction and data cache – Each processing unit usually has far less cache than a full-fledged CPU. Code that runs on each core should be optimized for as many cache hits as possible, or performance will suffer. Measuring the performance of the instruction and data cache is an important, system-dependent task.
- I/O and memory allocation – I/O operations and memory allocations are problematic for two reasons.
First, and most obviously, a processing unit that spends a lot of time waiting for I/O or memory is idling.
Second, and no less important, is the real-time aspect: I/O access and memory managers are not deterministic. We would like to measure the processing time of each message as accurately as possible, and I/O interferes with that goal.
As usual, it is recommended to pre-allocate all the memory required for message processing and to avoid any I/O operations while a message is being processed (see the second sketch after this list).
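Two of the issues above lend themselves to short sketches. On Linux, for instance, core affiliation can be enforced by pinning a thread to a core with pthread_setaffinity_np (other operating systems expose different affinity APIs):

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to a single core so its instruction and
 * data caches stay warm and the control point always knows where
 * the task runs. Linux-specific. */
static int pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}
```

And pre-allocation can be as simple as carving all message_t objects (from the first sketch) out of a static pool at startup. The free list below is deliberately simplistic; as written it is not thread-safe, and a real system would use a per-core pool or a lock:

```c
#define POOL_SIZE 4096

/* All messages are allocated up front; the fast path never calls
 * malloc and never blocks on the memory manager. */
static message_t  pool[POOL_SIZE];
static message_t *free_list;

static void pool_init(void)
{
    for (int i = 0; i < POOL_SIZE - 1; i++)
        pool[i].next = &pool[i + 1];
    pool[POOL_SIZE - 1].next = NULL;
    free_list = pool;
}

static message_t *msg_alloc(void)   /* NOT thread-safe as written */
{
    message_t *m = free_list;
    if (m)
        free_list = m->next;
    return m;
}

static void msg_free(message_t *m)  /* NOT thread-safe as written */
{
    m->next = free_list;
    free_list = m;
}
```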
Summary
When designing a complete system, you will probably need to mix and match the two models to get the best performance. Depending on the implementation, some subsystems may have run-to-completion properties, while others use the pipeline model.
It is a good idea to profile your requirements and split the work at hand into micro-tasks. Once you have defined all the tasks and the relationships between them, a decision can be made.
If you are bound to a specific processor architecture and OS, it is imperative to research all of the processor’s advantages and shortcomings to reach the best decision. On multi-platform systems, on the other hand, you must decide on and enforce a set of basic requirements, and be flexible about the ones you cannot control (like strong affiliation, real-time scheduler priority or fast locks, which are OS/hardware specific).
Pay special attention to the control point. Do not settle for designing only the data handling path; the control point is just as important!