by Eugene Loh, Sun Microsystems
This note explores memory profiling, e.g., as discussed in an [e-mail thread|http://www.open-mpi.org/community/lists/users/2009/04/8920.php]. The user observed large (several-Gbyte) memory imbalances among processes in an MPI job and wondered whether that memory was being allocated in some unexpected way by the user application or by the MPI implementation in response to resource congestion. The user application sends work among its processes in irregular ways. The user hypothesized that one or more processes might be falling behind in their workload, causing large backlogs of incoming messages and problems for the MPI implementation.
Here, we see how to confirm or disprove that hypothesis using Sun tools: [Sun Studio|http://developers.sun.com/sunstudio/] compilers and tools, as well as [Sun HPC ClusterTools|http://www.sun.com/software/products/clustertools/index.xml]. These tools are available for free download on Linux and Solaris systems and on x86 and SPARC processors.
For purposes of illustration, our *[sample program]* has rank 0 send a million short messages to rank 1, which receives them. The twist is that we can force the receiver to fall behind by having it sleep for a few seconds before starting to receive. This should mimic what might be happening in the actual user program, where the receiver might be falling far behind and producing a large backlog of sent-but-not-received messages. The program also allocates some memory in a function {{foo1()}}.
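The effect that the sample program is meant to provoke can be illustrated with a toy model. The sketch below is not the sample program and involves no MPI. It simply counts sent-but-not-received messages tick by tick. The function name {{peak_backlog}}, its parameters, and the rates used in the example are hypothetical, and the model ignores the sender stalls that a real MPI implementation introduces.

```python
def peak_backlog(total_msgs, send_rate, recv_rate, recv_delay_ticks):
    """Return the largest number of sent-but-not-received messages.

    Toy model (hypothetical, no MPI): each tick the sender emits up to
    send_rate messages, and the receiver, once recv_delay_ticks have
    passed, drains up to recv_rate messages.
    """
    if recv_rate <= 0:
        raise ValueError("recv_rate must be positive or the loop never ends")
    sent = 0
    received = 0
    peak = 0
    tick = 0
    while received < total_msgs:
        # Sender side: keep sending until every message has been issued.
        if sent < total_msgs:
            sent += min(send_rate, total_msgs - sent)
        # Receiver side: idle (the "sleep") until the delay has elapsed.
        if tick >= recv_delay_ticks:
            received += min(recv_rate, sent - received)
        peak = max(peak, sent - received)
        tick += 1
    return peak


if __name__ == "__main__":
    print(peak_backlog(1000, 10, 10, 0))  # receiver is ready
    print(peak_backlog(1000, 10, 10, 5))  # receiver falls behind
```

In this model a receiver that keeps pace leaves no backlog, while a delayed receiver never catches up until the sender finishes. That is the qualitative behavior the sleep in the sample program is intended to reproduce.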
This *[run script]* shows how to run the program, asking Sun Studio to trace MPI messages and heap allocations, and how to invoke the GUI-based Performance Analyzer and its command-line equivalent {{er_print}}.
Here is the heap-allocation information (bytes allocated and number of allocations) for the top routines, for both cases.
{code}
               receiver is ready      receiver falls behind
              -------------------     ---------------------
                 #bytes   #allocs        #bytes   #allocs
MPI_Init      414444765     16251     414444765     16251
foo1          134217728         2     134217728         2
MPI_Send        6743210       563     492279586     58725
MPI_Recv        5325230       138      71696450      2385
MPI_Finalize      11598       153         11598       153
{code}
We can clearly distinguish between memory allocation by the user program and memory allocation within the MPI implementation. Most of the functions show the same allocation activity, regardless of whether the receiver is ready or falls behind. The memory allocation by the message-passing calls, however, skyrockets when the receiver falls behind, with most of the allocation on the send side. Notice that:
* There is a lot of memory being allocated by {{MPI_Init()}}. That is not our focus here, but let us briefly explain it.
\\
Using either {{er_print}} or the Analyzer, we can see that about a third of this memory is being allocated when {{MPI_Init()}} calls {{vt_open()}}. This is because we're using VampirTrace to trace MPI messages. If we run without message tracing, we can confirm that that memory is not being allocated.
\\
Almost all the rest is the large (128-Mbyte) shared memory area that is being used by the MPI implementation for messages sent between the two on-node processes. That area is being counted twice since it is mapped into each process address space.
* Another 128 Mbytes are being allocated by the user program in function {{foo1()}}.
* Most significantly, there is a huge increase in the amount of memory allocated during {{MPI_Send()}} and {{MPI_Recv()}} when the receiver falls behind the sender.
For this sample program, we have seen how to find our key answer: considerable memory is being allocated by the MPI implementation when a backlog of messages accumulates. This memory consumption is distinguishable from what the user program allocates.
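The figures in the table above can be cross-checked with a little arithmetic. The sketch below copies the byte counts from the table and computes how much each routine's allocation grows when the receiver falls behind. It also subtracts the doubly counted 128-Mbyte shared memory area from the {{MPI_Init()}} total. The variable and function names are illustrative only, and the remainder is just a consistency check against the "about a third" attributed to {{vt_open()}}, not a measurement of it.

```python
# Allocation totals in bytes, copied from the table above.
ready = {
    "MPI_Init": 414444765,
    "foo1": 134217728,
    "MPI_Send": 6743210,
    "MPI_Recv": 5325230,
    "MPI_Finalize": 11598,
}
behind = {
    "MPI_Init": 414444765,
    "foo1": 134217728,
    "MPI_Send": 492279586,
    "MPI_Recv": 71696450,
    "MPI_Finalize": 11598,
}

MBYTE = 1024 * 1024


def growth(name):
    """Ratio of bytes allocated (receiver falls behind / receiver is ready)."""
    return behind[name] / ready[name]


# foo1() allocates exactly 128 Mbytes in both runs.
foo1_mbytes = behind["foo1"] / MBYTE

# MPI_Init(): remove the 128-Mbyte shared area, counted once per process.
shared_counted_twice = 2 * 128 * MBYTE
init_remainder = ready["MPI_Init"] - shared_counted_twice
init_remainder_fraction = init_remainder / ready["MPI_Init"]

if __name__ == "__main__":
    for name in ready:
        print(f"{name:13s} grows by a factor of {growth(name):6.2f}")
    print(f"foo1 allocates {foo1_mbytes:.0f} Mbytes")
    print(f"MPI_Init remainder: {init_remainder_fraction:.0%} of its total")
```

The message-passing calls are the only routines whose ratio differs from 1: roughly 73 for {{MPI_Send()}} and roughly 13 for {{MPI_Recv()}}. The {{MPI_Init()}} remainder comes to about 35% of its total, in line with the share described for VampirTrace.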
The Performance Analyzer gives us additional insights into what's going on. Here, we see message lines on a timeline display as well as a plot of the number of in-flight messages as a function of elapsed time. When the receiver is ready, there is a steady backlog of about 1300 messages during the message-passing period. In contrast, when the receiver falls behind, the backlog quickly grows to 85,000 messages, with relief occurring only because the sender stalls periodically.
(*{_}You might have to widen the browser window{_}*.)
|| || receiver is ready || receiver falls behind ||
| {color:#ff6600}{*}timeline{*}{color} | !nosleep-timeline.gif! | !sleep-timeline.gif! |
| {color:#ff6600}{*}# of in-flight messages{*}{color} {color:#ff6600}{*}vs. elapsed time{*}{color} | !nosleep-nmsgs.gif! | !sleep-nmsgs.gif! |
The timeline views are the opening, default views when the MPI Performance Analyzer is started.
In contrast, we must generate the in-flight-message plots ourselves. Note that the MPI Performance Analyzer allows you to generate rather arbitrary plots with a handful of mouse clicks. We typically look at data in two dimensions. So you simply need to specify what X and Y axes you want and what you want to see as a function of X and Y. The tool will make reasonable choices for the things you don't specify: whether axes are continuous or discrete, how to use labels and colors, what the title should say, etc. The steps are:
* Go to the tabs above the timeline and click on *MPI Chart*:
\\
!tabs.gif!
* Then go to the right-hand chart controls.
\\
!chart-controls.gif!
\\
Here, specify that you want to look at:
** *Data Type: Messages*
** *Chart: X Histogram*. Because we asked for an "X Histogram", our dropdown list for Y axis is greyed out.
** *X Axis: Time (range)*. The "Time (range)" axis is special. It means that whatever is being studied (function calls, messages, etc.) is shown for the entire time it existed. For example, if a message was sent at time t0 and received at time t1, its data is attributed to all times within that range.
** *Metric: 1* and *Operator: Sum*. The "Sum(1)" metric is simply a counting metric. In this case, as indicated by the chart title, we are simply looking at the "Number of Messages".
\\
Then, click on *Redraw* to generate one of the plots shown above.
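To make the "Time (range)" axis and the "Sum(1)" metric concrete, here is a minimal sketch of the same counting done by hand. It is not how the Analyzer is implemented. The function name {{in_flight_histogram}}, the fixed-width binning, and the sample send and receive times are all assumptions made for illustration.

```python
def in_flight_histogram(messages, bin_width, num_bins):
    """Count how many messages are in flight during each time bin.

    messages is an iterable of (t_send, t_recv) pairs. Each message
    contributes 1 to every bin its lifetime touches, which mirrors a
    "Time (range)" X axis combined with the Sum(1) counting metric.
    """
    counts = [0] * num_bins
    for t_send, t_recv in messages:
        first = max(0, int(t_send // bin_width))
        last = min(num_bins - 1, int(t_recv // bin_width))
        for b in range(first, last + 1):
            counts[b] += 1  # Sum(1): each message simply counts as one
    return counts


if __name__ == "__main__":
    # Hypothetical (t_send, t_recv) pairs in seconds.
    sample = [(0.0, 0.5), (0.2, 2.1), (1.0, 1.4), (1.5, 3.9)]
    print(in_flight_histogram(sample, bin_width=1.0, num_bins=4))
```

A message that is received long after it was sent shows up in many bins, so a growing backlog appears as a rising curve in the plot.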