{anchor:ggvnp}
h1. A Tutorial For the Performance Analyzer's MPI Features
You can use the Performance Analyzer to examine Message Passing Interface (MPI) applications to answer the following questions:
* Would tuning the MPI code produce significant performance improvement?
* Is the MPI performance characterized by synchronization or data transfer?
* Does the program contain load imbalances?
* How long is one iteration of program execution?
* How long does it take for program performance to equilibrate?
* What are the message-passing patterns in program execution?
* Which are most important: long or short messages?
* Do processes that send messages synchronize with processes that receive messages?
The preceding list is too broad to address in this single tutorial. The goal of this tutorial is to guide you through some basic new features of the Performance Analyzer including the following:{anchor:ggvkw}
* The MPI Timeline, which graphically displays the MPI activity that occurred during an application's run.
* The MPI Charts, which generate scatter plots and histograms to visualize the performance data of MPI functions and MPI messages.
* The MPI data-zooming and data-filtering controls, which you can use to broaden or narrow your view of the data in the MPI Timeline and MPI Charts.
The MPI Timeline presents the data from a run of the test program as a timeline. Initially, your view of the timeline encompasses the run from beginning to end with all MPI functions and MPI messages represented graphically in a condensed form. You'll learn how to expand this presentation and move down from a complete view to a tightly focused view which can be as granular as a single function. The MPI Timeline offers many different ways to zoom, pan, and examine the data, together with MPI Charts. The MPI Charts enable you to plot statistical data about the functions and messages in graphical charts, to help you see what is happening in the run.
The tutorial is designed to be followed from beginning to end to show you how to use the new MPI features, and covers the following topics:
* Setting Up for the Tutorial
* Collecting Data on the ring_c Example
* Opening the Experiment
* Navigating the MPI Timeline by Zooming and Panning
* Viewing Message Details
* Viewing Function Details and Application Source Code
* Filtering Data in the MPI Tabs
* Using the Filter Stack
* Using MPI Charts
* Varying the MPI Chart Controls
\\
{anchor:ggvlo}
h2. Setting Up for the Tutorial
The Performance Analyzer works with the ClusterTools 7 and ClusterTools 8 Early Access software, which are advanced previews of upcoming HPC ClusterTools development. HPC ClusterTools is an integrated toolkit for creating and tuning MPI applications that run on high-performance clusters of Sun systems. This tutorial explains how to use the Performance Analyzer on an example MPI application called {{ring_c}}, which is included with the Sun HPC ClusterTools 8.0 software.
You must already have a cluster configured and functioning for this tutorial.
Follow the steps below to get started.
*1.* If you do not already have the Sun Studio Express July 2008 release installed, you can get it at [http://developers.sun.com/sunstudio/downloads/express.jsp].
Install the Express release according to the instructions.
*2.* Download the ClusterTools 8.0 Early Access 2 release at [http://www.sun.com/software/products/clustertools/early_access.xml].
*3.* Install the ClusterTools software as described in the Quick Installation Guide, which is available in the sun-hpc-ct-8.0-docs.tar.gz documentation tar-file on the Sun Download Center page.
*4.* Add the _SUNSTUDIO_INSTALLDIR_/bin directory and the _CLUSTERTOOLS_INSTALLDIR_/bin directory to your path.
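As a minimal sketch of this step, assuming a Bourne-compatible shell ({{sh}}, {{bash}}, or {{ksh}}): the two directory values below are placeholders, not the real install locations, so substitute the paths used on your system. The loop only reports which of the commands used in this tutorial can be found afterward.

```shell
# Placeholder install locations (assumptions); substitute your own.
SUNSTUDIO_INSTALLDIR="${SUNSTUDIO_INSTALLDIR:-/opt/sunstudio}"
CLUSTERTOOLS_INSTALLDIR="${CLUSTERTOOLS_INSTALLDIR:-/opt/clustertools}"

# Put both bin directories at the front of the search path.
PATH="$SUNSTUDIO_INSTALLDIR/bin:$CLUSTERTOOLS_INSTALLDIR/bin:$PATH"
export PATH

# Report whether each tool used in this tutorial can now be found.
for tool in collect analyzer mpicc mpirun; do
    if command -v "$tool" >/dev/null 2>&1; then
        echo "found:   $tool -> $(command -v "$tool")"
    else
        echo "missing: $tool (check the install directories above)"
    fi
done
```

If any tool is reported as missing, recheck the two directory values before continuing.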
*5.* Copy the _CLUSTERTOOLS_INSTALLDIR_/examples directory into a directory to which you have write access. This directory must be visible from all the cluster nodes.
*6.* Change directory to your newly copied {{examples}} directory.
*7.* Type *make ring_c* to build the example. The second line shown is the compile command that {{make}} runs:
{code}
make ring_c
mpicc -g -o ring_c ring_c.c
{code}
The program is compiled with the {{\-g}} option, which allows the Performance Analyzer data collector to map MPI events to source code.
The {{ring_c}} program simply passes a message from process to process in a ring, then terminates.
*8.* Run the {{ring_c}} example with {{mpirun}} to make sure it works correctly.
This example shows how to run the program on a two-node cluster; each node handles up to 32 threads. The node names are specified in a host file, along with the number of slots that are to be used on each node. We have chosen to use 25 processes, and specify one slot on each host. You should specify a number of processes and slots that is appropriate for your system. See the {{mpirun}}(1) man page for more information about specifying hosts and slots. Note that you can also run this command on a standalone host that isn't part of a cluster, but the results might be less educational.
The host file for this example is called {{clusterhosts}} and contains the following content:
{code}
hostA slots=1
hostB slots=1
{code}
You must be able to run commands on each host through a remote shell (ssh or rsh) without being prompted to log in. By default, {{mpirun}} uses ssh.
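The host file and the remote-shell requirement can be sketched as follows, assuming a Bourne-compatible shell and OpenSSH. {{hostA}} and {{hostB}} are the example node names from this tutorial, and {{can_ssh_without_prompt}} is a hypothetical helper name, not part of ClusterTools.

```shell
# hostA and hostB are the example node names; replace them with your own.
HOSTS="hostA hostB"
HOSTFILE="clusterhosts"

# Write one "name slots=1" line per node.
: > "$HOSTFILE"
for host in $HOSTS; do
    printf '%s slots=1\n' "$host" >> "$HOSTFILE"
done

# Succeeds only if ssh can run a command on the host without prompting.
# BatchMode=yes makes ssh fail instead of asking for a password.
can_ssh_without_prompt() {
    ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" true >/dev/null 2>&1
}
```

The block defines the check without running it. To test your nodes, call {{can_ssh_without_prompt}} once for each name in the host file and look into any host for which it fails.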
{code}
% mpirun -np 25 --hostfile clusterhosts ring_c
Process 0 sending 10 to 1, tag 201 (25 processes in ring)
Process 0 sent to 1
Process 0 decremented value: 9
Process 0 decremented value: 8
Process 0 decremented value: 7
Process 0 decremented value: 6
Process 0 decremented value: 5
Process 0 decremented value: 4
Process 0 decremented value: 3
Process 0 decremented value: 2
Process 0 decremented value: 1
Process 0 decremented value: 0
Process 0 exiting
Process 1 exiting
Process 2 exiting
Process 3 exiting
Process 4 exiting
Process 5 exiting
Process 6 exiting
Process 7 exiting
Process 8 exiting
Process 9 exiting
Process 10 exiting
Process 11 exiting
Process 12 exiting
Process 13 exiting
Process 14 exiting
Process 15 exiting
Process 16 exiting
Process 17 exiting
Process 18 exiting
Process 19 exiting
Process 20 exiting
Process 21 exiting
Process 22 exiting
Process 23 exiting
Process 24 exiting
{code}
\\
\\
If running this command produces similar output, you are ready to collect data on the example application, as shown in the next section.
\\
If {{mpirun}} has problems using {{ssh}}, try adding the option {{\-mca plm_rsh_agent rsh}} to the {{mpirun}} command so that it uses {{rsh}} instead:
{code}
% mpirun -np 25 --hostfile clusterhosts -mca plm_rsh_agent rsh -- ring_c
{code}
\\
h2. Collecting Data on the {{ring_c}} Example
*1.* {{cd}} to the directory where your example binaries and source code are located. This directory must be visible from all the cluster nodes.
*2.* Run the following command:
{code}
% collect -M on -p off mpirun -np 25 --hostfile clusterhosts -- ring_c
{code}
\\
The {{\-M on}} option tells {{collect}} that it is running an MPI program, and the {{\-p off}} option turns off clock-based profiling to simplify the data collection. See the {{collect}}(1) man page for more information. The {{collect}} command might take a few moments to run, and the output should be the same as in the test run through the {{mpirun}} command.
The {{\-np 25}} option specifies 25 processes on the cluster, and {{\--hostfile clusterhosts}} indicates that the node names and the number of slots that are to be used on each node are specified in a host file called {{clusterhosts}}. We have chosen to use 25 processes on two hosts, and specify one slot on each host. You should specify a number of processes and slots that is appropriate for your system.
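When you pick values for your own system, it can help to compare the slot total in the host file with the value passed to {{\-np}}. The sketch below does that with {{awk}}; {{total_slots}} and {{report_slots}} are hypothetical helper names, and the code assumes each host line uses the {{slots=N}} form from the {{clusterhosts}} example.

```shell
total_slots() {
    # Sum every slots=N entry in the host file named by $1.
    awk '{
        for (i = 2; i <= NF; i++)
            if ($i ~ /^slots=[0-9]+$/) {
                split($i, kv, "=")
                total += kv[2]
            }
    } END { print total + 0 }' "$1"
}

report_slots() {
    # $1 = process count given to -np, $2 = host file
    slots=$(total_slots "$2")
    if [ "$1" -gt "$slots" ]; then
        echo "$1 processes share $slots slots"
    else
        echo "$1 processes fit in $slots slots"
    fi
}
```

For the tutorial's values, {{report_slots 25 clusterhosts}} prints {{25 processes share 2 slots}}.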
*3.* List the contents of the newly created {{test.1.er}} directory and make sure the dates on the files reflect the latest execution. If they do, you ran the command successfully and are ready to run the Performance Analyzer on {{ring_c}}. The integer in {{test.1.er}} increments for each {{collect}} command you run, so the rest of this tutorial refers to this name generically as {{test.*.er}}.
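Because the integer increments with every run, a small helper can pick out the most recent experiment. This is a sketch for a Bourne-compatible shell; {{latest_experiment}} is a hypothetical name, and it relies only on the {{test.N.er}} naming described above.

```shell
latest_experiment() {
    # Print the highest-numbered test.N.er directory under $1 (default: .).
    dir="${1:-.}"
    best=""
    best_n=-1
    for d in "$dir"/test.*.er; do
        [ -d "$d" ] || continue
        n="${d##*/test.}"          # strip everything up to "test."
        n="${n%.er}"               # strip the ".er" suffix
        case "$n" in
            ''|*[!0-9]*) continue ;;   # skip names whose middle part is not a number
        esac
        if [ "$n" -gt "$best_n" ]; then
            best_n="$n"
            best="$d"
        fi
    done
    if [ -n "$best" ]; then
        printf '%s\n' "$best"
    else
        return 1
    fi
}
```

The comparison is numeric, so {{test.10.er}} is treated as newer than {{test.2.er}}. You can pass the printed directory to {{ls \-l}} to check the file dates.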
{anchor:gentextid-179}
h2. Opening the Experiment
*1.* {{cd}} to the directory which contains the {{ring_c.c}} source file, the {{ring_c}} executable, and the {{test.*.er}} directory.
*2.* Start the Performance Analyzer from the command line:
{code}
% <SUNSTUDIO_INSTALLDIR>/bin/analyzer
{code}
\\
The Performance Analyzer opens a file browser for you to find and open an experiment. If it does not, choose File > Open Experiment.
*3.* Find the {{test.*.er}} experiment that you just created and open it.
The Performance Analyzer window should look similar to that below.
!1-ring_timeline.gif|alt="Performance Analyzer window with MPI tabs"!
The experiment opens on the MPI Timeline tab. The MPI Chart tab is next to it. In the right panel you can see the MPI Chart Controls and MPI Timeline Controls tabs.
{info:title=Note - }If you do not see the MPI Timeline and MPI Chart tabs, check for an old {{.er.rc}} file from a previous Sun Studio release in your home directory. If such a file is present, delete it and restart the Performance Analyzer.
{info}
The MPI Timeline shows a view of the data over time as the program was run through the collector. The horizontal axis shows elapsed time. At the bottom, the horizontal axis shows "relative time" with the origin at the left edge of the display. At the top, the horizontal axis shows "absolute time" where the origin is the start of the data. The vertical axis shows MPI process rank. Therefore, for each MPI process you can look horizontally to see what the process is doing as a function of elapsed time.
In this initial view of the timeline, you can answer the question "What is the time scale of program execution?" In the screen capture, the whole run takes about 5 seconds, but only the interval from about 3.90 to 4.05 seconds is the steady state of the application program. The {{collect}} tool uses {{MPI_Init}} and {{MPI_Finalize}} to set up and terminate data collection.
h2. Navigating the MPI Timeline By Zooming and Panning
*1.* Click the MPI Timeline tab if it's not already selected.
*2.* Zoom in on the data by clicking and dragging from left to right on any process row, as shown by the directional arrow in the graphic below. When you release the mouse button, the area inside the box automatically expands to reveal a zoomed-in view.
!1.zoomdrag.gif!
An alternative to clicking and dragging is to use the zooming slider controls in the top left of the timeline:
!MPI_timeline_zoom_controls.gif|alt="Zooming controls in the MPI Timeline"!
Use the *horizontal slider* to change the time scale. You'll see progressively smaller chunks of time, while still showing all the processes, as you zoom.
Use the *vertical slider* to zoom in on the MPI processes.
Click the zoom-undo button in the MPI Timeline Controls shown below to go back to the previous level of zooming:
!zoom_undo.gif!
Click the zoom-undo button a second time to return to the first zoom.
*3.* Pan across the data by sliding the scroll bars located at the bottom and the right of the timeline.
Alternatively you can toggle between a pointer that zooms and a pointer that pans by clicking the hand icon in the MPI Timeline Controls:
!pan-zoom-toggle.gif|alt="Zoom/pan toggle button"!
When the pointer is a hand, you can drag across the MPI Timeline to pan horizontally.
h2. Viewing Message Details
*1.* Reset the view to the original, maximum, zoomed-out setting by clicking the zoom-reset button, which is located to the top left of the zoom sliders:
!MPI_timeline_zoom_controls_reset_edge.gif|alt="Zoom Reset button"!
*2.* Zoom in on the activity area by dragging on the area horizontally with the mouse so it looks similar to what you see here:
!2-ring-timeline-zoom.gif|alt="MPI Timeline zoomed in"!
In the zoomed-in timeline, you can now see that the steady-state portion of the program execution appears to run from 3.93 seconds to 4.03 seconds.
You can also see that MPI functions are color coded. The black lines drawn between events represent point-to-point messages exchanged by the MPI processes.
With this view of the timeline, you can answer the question: "How long is one iteration before the pattern repeats?" The answer is roughly 10 milliseconds. Look at the relative time scale at the bottom to see how often the loop seems to repeat.
*3.* Click one of the black message lines.
The line turns red and details about the message are displayed in the right-hand panel MPI Timeline Controls tab:
!4.details.gif!
*4.* In the MPI Timeline Controls tab, find the Messages slider, then click and drag it to 0% as shown:
!5-ring-timeline-message-slider.gif|alt="MPI Timeline with Message Slider"!
The Messages slider controls the number of message lines displayed on the screen. At 0%, only functions are displayed in the MPI Timeline. In this simple example, 100% of the messages can be displayed. However, in complex applications, if all messages were displayed, the volume of messages could be very high, overwhelming the tool with large data volume and making the screen too cluttered to be usable. Select a lower percentage of messages to reduce the volume of messages shown in the timeline. The tool adjusts default levels for the message volume so the screen is readable and the tool remains responsive. If fewer than 100% of the messages are shown, the messages used are those messages that are most "costly" in terms of the total time used in the message's send and receive functions.
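The selection rule described above can be illustrated with a short sketch. It is not the tool's implementation; it only shows the idea of keeping the most costly share of messages. The input format (one message per line, an identifier followed by its cost) and the name {{top_costly_messages}} are assumptions made for the example.

```shell
top_costly_messages() {
    # $1 = percentage of messages to keep (0-100)
    # $2 = file with lines of the form: message_id cost
    total=$(awk 'END { print NR }' "$2")
    keep=$(( (total * $1 + 99) / 100 ))      # round up to a whole message
    [ "$keep" -gt 0 ] || return 0            # 0% shows no messages
    sort -k2,2nr "$2" | head -n "$keep"      # highest cost first
}
```

With four messages and a setting of 50, the two messages with the highest cost are printed; with a setting of 0, nothing is printed, which corresponds to a timeline that shows only functions.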
*5.* Set the Messages slider back to 100%.
h2. Viewing Function Details and Application Source Code
*1.* Click on one of the {{MPI_Recv}} function events in the MPI Timeline.
The function is highlighted in yellow, and details about the function are displayed in the MPI Timeline Controls tab on the right:
!6.highlight.gif!
*2.* In the MPI Timeline Controls tab, click the button labeled *Show Call Stack if available*.
After a few moments, the call stack for the highlighted state should be shown in the MPI Timeline Controls tab:
!6.5-ring-callstack.gif|alt="MPI Timeline with Call Stack for Selected Event"!
*3.* When the Call Stack for Selected Event is displayed in the MPI Timeline Controls tab, click on {{main + 0x00000198, line 53 in "ring_c.c"}}.
*4.* Click the Source tab in the main Performance Analyzer panel.
If you get a message such as "Object file (unknown) not readable", make sure you selected the stack frame {{main + 0x00000198, line 53 in "ring_c.c"}}.
{info:title=Note - }Source is only visible when the source is in the same location it was in when the program was run through the collector, or when it can be found in the {{$expts}} path as set in {{.er.rc}} or in View > Set Data Presentation > Search Path. Source also needs to be compiled with {{\-g}}.
If the source code is not visible, you might not have started the Performance Analyzer from the directory containing the {{ring_c}} binary and source code. If this is the case, quit the Performance Analyzer, {{cd}} to the directory containing {{ring_c}}, and restart.
{info}
When the source becomes visible, you should see the following:
!7.marked.gif!
The source should show where {{main()}} calls {{MPI_Recv()}}. As you can see, {{MPI_Recv()}} is called from line 53 in the source. The green bar highlights metrics with high values. 274 receives are associated with line 53. If you look further down, you can see 274 sends are associated with {{MPI_Send}} on line 60.
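The count of 274 can be cross-checked by hand. The sketch below replays the token's path around the ring. It assumes, based on the program output shown earlier rather than on the source listing, that rank 0 injects the value 10, decrements it each time the token returns, and that every rank leaves its loop after forwarding a 0. Only receives inside the loop are counted, and {{count_loop_receives}} is a hypothetical helper name.

```shell
count_loop_receives() {
    # $1 = number of processes, $2 = starting token value
    nprocs=$1
    value=$2
    rank=1        # rank 0 has already sent the token to rank 1
    exited=0
    receives=0
    while [ "$exited" -lt "$nprocs" ]; do
        receives=$((receives + 1))         # this rank receives the token
        if [ "$rank" -eq 0 ]; then
            value=$((value - 1))           # only rank 0 decrements
        fi
        if [ "$value" -eq 0 ]; then
            exited=$((exited + 1))         # forward the 0, then leave the loop
        fi
        rank=$(( (rank + 1) % nprocs ))    # token moves to the next rank
    done
    echo "$receives"
}

count_loop_receives 25 10    # prints 274
```

Under these assumptions, rank 0 receives the token 10 times inside the loop and each of the other 24 ranks receives it 11 times, so the total is 10 + 24 x 11 = 274, which matches the value in the Source tab. Each in-loop receive is followed by one send, which accounts for the same number of sends.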
*5.* Click the Functions tab in the main Performance Analyzer panel.
The Functions tab shows the same MPI Send and MPI Receive metrics in columns on the left side of a table. You can sort the table by clicking in the column headers.
!8.marked.gif|alt="Functions tab with MPI metrics"!
*6.* Click the MPI Timeline tab to return to the MPI timeline.
Do not click the regular Timeline tab because it does not apply to MPI programs.
h2. Filtering Data in the MPI Tabs
The filtering facility lets you select different views of the collected messaging data. You can undo and redo the filters using the filtering controls in either the MPI Chart Controls or the MPI Timeline Controls.
!filter_controls__timeline_labels.gif|alt="Filtering Controls"!
The first control filters the data by removing everything that is not currently in view.
The second control is the Filter Undo button which provides an associated drop down list for removing filters. Clicking this button removes the last filter applied. Clicking the down arrow presents a list of the filters applied, in the order they were applied, with the most recent at the top of the list. When you select a filter in this list, the selected filter and all filters above it on the list are removed.
The third control is the Filter Redo button, and it also has an associated drop down list for reapplying filters. Clicking the button reapplies the last filter that you removed. Clicking the down arrow opens a list of all the filters that have been removed, in the order in which they were removed. When you select a filter in this list, the selected filter and all filters above it on the list are reapplied.
You can redo and undo the filters by using the arrows, similar to going backward and forward in a web browser. You can even remove and apply more than one filter in one click by using the down arrows next to the filter buttons.
The following steps explain how to use a filter to focus on the steady state portion of the program by filtering out the {{MPI_Init}} and {{MPI_Finalize}} functions.
*1.* Zoom in on the area of absolute time t=3.93 to 4.03 by dragging as shown below:
!9a.drag.gif!
*2.* Click the filter button in the MPI Timeline Controls:
!filter_icon.gif|alt="Filter button in MPI Timeline Controls"!
It may look like nothing happened because the filtering is not evident until you change your view by zooming out or by looking at a chart.
*3.* Click the Zoom Undo button to go back out to the previous zoom.
!zoom_undo.gif!
The display now shows {{Uninstrumented}} in place of the {{MPI_Init}} and {{MPI_Finalize}} functions. White areas labeled as {{Uninstrumented}} indicate that there is no MPI data collected for that area or the data has been filtered out.
!10-ring-timeline-uninstrumented.gif|alt="Filtered Data Shown as Uninstrumented in MPI Timeline"!
h2. Using the Filter Stack
*1.* Drag vertically until you have zoomed in far enough to see a single {{MPI_Send}} function event. You might first have to drag horizontally to zoom in close enough to see some {{MPI_Send}} function events.
!30b.gif!
*2.* Click the filter button one time to filter out all data except {{MPI_Send}} data.
!filter_icon.gif|alt="Filter button"!
*3.* Click the MPI Timeline tab again and click the zoom reset button.
!MPI_timeline_zoom_controls_reset_edge.gif!\\
The MPI Timeline might appear to show everything as Uninstrumented, but there is a hidden {{MPI_Send}} function.
!31b.gif!
*4.* There's always at least one transition where the label of a function starts, so zoom in on the beginning of the Uninstrumented states on the right side of the timeline until you see the hidden {{MPI_Send}} state.
!32b.gif!
Now suppose you want to go back and undo some of the filtering you have done.
*5.* Click the filter Undo drop down button to reveal a list of applied filters.
!filter_controls_undo_arrow.gif!
This list lets you choose which filters to remove. It works like a stack: if you select {{No filters applied}}, everything on top of it will be taken off which means there will be no filters applied. You should see something like the following in the list of filters:
| {{Timeline(Time(range)3398546253123,350489981242),Process(0,24)}} |
| {{No filters applied}} |
*6.* Select the top filter from the filter undo drop down list ({{Timeline(Time(range)...)}}).
The timeline should now look similar to the following:
!33b.gif!
*7.* Reset the zoom to confirm that your original filter is still in effect.
!MPI_timeline_zoom_controls_reset_edge.gif!
The timeline should look similar to the following:
!reset_zoom_after_undo_filter.gif!
h2. Using MPI Charts
Now you can explore the MPI Chart features with your filtered data. There are two types of data that you can view in chart form: Functions and Messages. In the following chart, we'll get an overview of which functions took the most time.
*1.* Click the MPI Chart tab to see a chart similar to the following.
!11.marked.gif!
The MPI Chart tab opens with a chart that shows the sum of the durations of the functions as they ran in all the processes. The vertical color scale to the right of the chart ranges from 0.01 seconds to 2.89 seconds. The {{MPI_Send}} and {{Application}} functions took almost no time, whereas the {{MPI_Recv}} function took the full 2.89 seconds.
*2.* Click on the red bar for the MPI_Recv function.
The exact value of the red bar is displayed in the MPI Chart Controls tab. 2.892717584 seconds were spent in this function across all process ranks.
*3.* Click near the {{Application}} and {{MPI_Send}} chart bars to see their values.
In this particular application, every process waits until the token has passed to every other process. As a result, 2.89 seconds were spent in {{MPI_Recv}} and only 0.03 seconds in {{Application}}, a state that represents time between MPI functions. All processes wait an equal amount of time, but any delay in the delivery of the token affects the whole application.
h2. Varying the MPI Chart Controls
This section shows how to use the MPI Chart controls in different ways to visualize the data. Depending on the program you are analyzing, some forms of charting are more useful than others. In this particular program, {{ring_c}}, we are focusing on message latency.
See [MPIAnalyzer:MPI Chart Controls and Attributes] for information about the chart attributes you can set.
\\
h3. Make a chart that shows where messages are being sent
*1.* Create a chart to look at messages by making the following selections in the MPI Chart Controls tab:
| Data Type: | Messages |
| Chart: | 2-D Chart |
| X Axis: | Send Process |
| Y Axis: | Receive Process |
| Metric: | Duration |
| Operator: | Maximum |
*2.* Click Redraw to draw a new chart:
!13b.highlight.gif!
This chart shows that Process 0 sends only to Process 1, Process 1 sends only to Process 2, and so on. The color of each box is set by the selected metric (Duration) and operator (Maximum). Because this chart's Data Type is Messages, the color represents the maximum duration of the messages between that pair of processes, that is, the length of the longest message line in the time dimension.
Interestingly, the color key shows the range of message durations is from 0.3 msec to 9.7 msec. The messages that took the longest to arrive were sent from P14 to P15.
\\
*3.* Click on the square at Send Process = P14 and Receive Process = P15.
The details in the MPI Chart Controls tab show that the longest message from P14 to P15 took 9.724002 msec.
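To make the roles of Metric and Operator concrete, the following sketch groups message records by sender and receiver and keeps the largest duration in each group, which is the kind of value a box in this chart represents. It is an illustration only: the input format (send rank, receive rank, duration in msec on each line) and the name {{max_duration_by_pair}} are assumptions, and the Performance Analyzer does not export data in this form.

```shell
max_duration_by_pair() {
    # $1 = file with lines of the form: send_rank receive_rank duration_msec
    awk '{
        key = $1 " " $2
        d = $3 + 0                              # force a numeric comparison
        if (!(key in max) || d > max[key]) max[key] = d
    } END {
        for (key in max) print key, max[key]
    }' "$1" | sort -k1,1n -k2,2n
}
```

Each output line names one sender and receiver pair with its maximum duration. Replacing the comparison with a running total would give the Sum operator instead.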
\\
h3. Make a chart to show which ranks waited longest to receive a message
*1.* Make the following selections in the MPI Chart Controls tab:
| Data Type: | Messages |
| Chart: | Y Histogram |
| Y Axis: | Receive Process |
| Metric: | Duration |
| Operator: | Maximum |
*2.* Click Redraw to draw a new chart:
!14b.gif!
The chart above shows that the P15 rank waited the longest to receive a message, at 7.67 msec.
Other processes with lengthy waits are P13, P7, and P5.
\\
*3.* To show when and where these large delays occurred, select a 2-D chart with time range on the X Axis:
| Data Type: | Messages |
| Chart: | 2-D Chart |
| X Axis: | Time range |
| Y Axis: | Receive Process |
| Metric: | Duration |
| Operator: | Maximum |
\\
*4.* Click Redraw.
!16c.gif!
For the P15 rank, click the red line and check the details in the right panel. You can see that the delay occurred at 3.980501280 seconds.
*5.* To show a histogram for when these long duration messages occurred:
| Data Type: | Messages |
| Chart: | X Histogram |
| X Axis: | Receive Time |
| Metric: | Duration |
| Operator: | Maximum |
*6.* Click Redraw.
!17b.gif!
The slowest message was received at 3.981 seconds.
h3. Look for an effect of the slow messages on time spent in MPI functions
To see the effect of the long duration messages, we will create a graph that shows duration of functions vs time.
*1.* Make the following selections in the MPI Chart Controls tab and click Redraw:
| Data Type: | Functions |
| Chart: | 2-D Chart |
| X Axis: | Exit Time |
| Y Axis: | Duration |
| Metric: | Duration |
| Operator: | Maximum |
The resulting graph shows clear regions of functions of long duration, especially some functions ending at around t=3.995, which last 20.69 msec.
!19b.gif!
*2.* Isolate these long duration functions by dragging a box around them to zoom:
!19b.zoomdrag.gif!\\
*3.* Click the filter button.
The resulting image shows two dots near 3.995 and duration 20.7 msec:
!20b.filter.gif!
*4.* Click the MPI Timeline tab.
You can now identify the high duration functions on the MPI Timeline. They are the result of messages with slow delivery times.
!21b.gif!
This simple example showed the basics of how to examine the relationships between MPI functions and messages.
This concludes the basic tutorial for the Performance Analyzer's MPI features.
h1. A Tutorial For the Performance Analyzer's MPI Features
You can use the Performance Analyzer to examine Message Passing Interface (MPI) applications to answer the following questions:
* Would tuning the MPI code produce significant performance improvement?
* Is the MPI performance characterized by synchronization or data transfer?
* Does the program contain load imbalances?
* How long is one iteration of program execution?
* How long does it take for program performance to equilibrate?
* What are the message-passing patterns in program execution?
* Which are most important: long or short messages?
* Do processes that send messages synchronize with processes that receive messages?
The preceding list is too broad to address in this single tutorial. The goal of this tutorial is to guide you through some basic new features of the Performance Analyzer including the following:{anchor:ggvkw}
* The MPI Timeline which graphically displays the MPI activity that occurred during an application's run.
* The MPI Charts which generates scatter plots and histograms to visualize the performance data of MPI functions and MPI messages.
* The MPI data-zooming and data-filtering controls which you can use to broaden or narrow your view of the data in the MPI Timeline and MPI Charts.
The MPI Timeline presents the data from a run of the test program as a timeline. Initially, your view of the timeline encompasses the run from beginning to end with all MPI functions and MPI messages represented graphically in a condensed form. You'll learn how to expand this presentation and move down from a complete view to a tightly focused view which can be as granular as a single function. The MPI Timeline offers many different ways to zoom, pan, and examine the data, together with MPI Charts. The MPI Charts enable you to plot statistical data about the functions and messages in graphical charts, to help you see what is happening in the run.
The tutorial is designed to be followed from beginning to end to show you how to use the new MPI features, and covers the following topics:
* Setting Up for the Tutorial
* Collecting Data on the ring_c Example
* Opening the Experiment
* Navigating the MPI Timeline by Zooming and Panning
* Viewing Message Details
* Viewing Function Details and Application Source Code
* Filtering Data in the MPI Tabs
* Using the Filter Stack
* Using MPI Charts
* Varying the MPI Chart Controls
\\
{anchor:ggvlo}
h2. Setting Up for the Tutorial
The Performance Analyzer works with the ClusterTools 7 and ClusterTools 8 Early Access software which are advanced previews of upcoming HPC ClusterTools development. HPC ClusterTools is an integrated toolkit for creating and tuning MPI applications that run on high performance clusters of Sun systems. This tutorial explains how to use the Performance Analyzer on an example MPI application called {{ring_c}} which is included with the Sun HPC ClusterTools 8.0 software.
You must already have a cluster configured and functioning for this tutorial.
Follow the steps below to get started.
*1.* If you do not already have the Sun Studio Express July 2008 release installed, you can get it at [http://developers.sun.com/sunstudio/downloads/express.jsp].
Install the Express release according to the instructions.
*2.* Download the ClusterTools 8.0 Early Access 2 release at [http://www.sun.com/software/products/clustertools/early_access.xml].
*3.* Install the ClusterTools software as described in the Quick Installation Guide, which is available in the sun-hpc-ct-8.0-docs.tar.gz documentation tar-file on the Sun Download Center page.
*4.* Add the _SUNSTUDIO_INSTALLDIR_/bin directory and the _CLUSTERTOOLS_INSTALLDIR_/bin directory to your path.
*5.* Copy the _CLUSTERTOOLS_INSTALLDIR_/examples directory into a directory to which you have write access. This directory must be visible from all the cluster nodes.
*6.* Change directory to your newly copied {{examples}} directory.
*7.* Type *make* to build the example.
{code}
make ring_c
mpicc -g -o ring_c ring_c.c
{code}
The program is compiled with the {{\-g}} option which allows the Performance-Analyzer data-collector to map MPI events to source code.
The {{ring_c}} program simply passes a message from process to process in a ring, then terminates.
*8.* Run the {{ring_c}} example with {{mpirun}} to make sure it works correctly.
This example shows how to run the program on a two-node cluster; each node handles up to 32 threads. The node names are specified in a host file, along with the number of slots that are to be used on each node. We have chosen to use 25 processes, and specify one slot on each host. You should specify a number of processes and slots that is appropriate for your system. See the {{mpirun}}(1) man page for more information about specifying hosts and slots. Note that you can also run this command on a standalone host that isn't part of a cluster, but the results might be less educational.
The host file for this example is called {{clusterhosts}} and contains the following content:
{code}
hostA slots=1
hostB slots=1
{code}
You must have permission to use a remote shell (ssh/rsh) to each host without logging into the hosts. By default, {{mpirun}} uses ssh.
{code}
% mpirun -np 25 --hostfile clusterhosts ring_c
Process 0 sending 10 to 1, tag 201 (25 processes in ring)
Process 0 sent to 1
Process 0 decremented value: 9
Process 0 decremented value: 8
Process 0 decremented value: 7
Process 0 decremented value: 6
Process 0 decremented value: 5
Process 0 decremented value: 4
Process 0 decremented value: 3
Process 0 decremented value: 2
Process 0 decremented value: 1
Process 0 decremented value: 0
Process 0 exiting
Process 1 exiting
Process 2 exiting
Process 3 exiting
Process 4 exiting
Process 5 exiting
Process 6 exiting
Process 7 exiting
Process 8 exiting
Process 9 exiting
Process 10 exiting
Process 11 exiting
Process 12 exiting
Process 13 exiting
Process 14 exiting
Process 15 exiting
Process 16 exiting
Process 17 exiting
Process 18 exiting
Process 19 exiting
Process 20 exiting
Process 21 exiting
Process 22 exiting
Process 23 exiting
Process 24 exiting
{code}
\\
\\
Run this command and if you get similar output, you are ready to collect data on an example application as shown in the next section.
\\
If you have problems with {{mpirun}} specifying {{ssh}}, try using the option {{\-mca plm_rsh_agent rsh}} to the {{mpirun}} command to specify the {{rsh}} command:
{code}
% mpirun -np 25 --hostfile clusterhosts -mca plm_rsh_agent rsh -- ring_c
{code}
\\
h2. Collecting Data on the {{ring_c}} Example
*1.* {{cd}} to the directory where your example binaries and source code are located. This directory must be visible from all the cluster nodes.
*2.* Run the following command:
{code}
% collect -M on -p off mpirun -np 25 --hostfile clusterhosts -- ring_c
{code}
\\
The {{\-M on}} option indicates that the collect is running on an MPI program, and the {{\-p off}} option turns off clock-based profiling to simplify the data collection. See the collect(1) man page for more information. The collect command might take a few moments to run and the output should be the same as the test run through the {{mpirun}} command.
The {{\-np 25}} option specifies 25 processes on the cluster, and {{\--hostfile clusterhosts}} indicates that the node names and the number of slots that are to be used on each node are specified in a host file called {{clusterhosts}}. We have chosen to use 25 processes on two hosts, and specify one slot on each host. You should specify a number of processes and slots that is appropriate for your system.
*3.* List the contents of the newly created {{test.1.er}} directory and make sure the date on the files reflects the latest execution. If it does, you ran the command successfully and are ready to run the Performance Analyzer on {{ring_c}}. The integer in {{test.1.er}} increments for each {{collect}} command you run, so the rest of this tutorial refers to this name generically as {{test.*.er}}.
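One quick way to confirm which experiment is the newest is to sort the experiment directories by modification time. The following is a minimal sketch, assuming the default {{test.*.er}} naming described above. The function name {{latest_experiment}} is hypothetical.

```shell
# Print the most recently modified experiment directory matching
# test.*.er in the current directory. Returns non-zero if there is none.
latest_experiment() {
    ls -dt test.*.er 2>/dev/null | head -n 1 | grep .
}
```

For example, {{ls \-l "$(latest_experiment)"}} lists the files of the newest experiment with their dates.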
{anchor:gentextid-179}
h2. Opening the Experiment
*1.* {{cd}} to the directory which contains the {{ring_c.c}} source file, the {{ring_c}} executable, and the {{test.*.er}} directory.
*2.* Start the Performance Analyzer from the command line:
{code}
% <SUNSTUDIO_INSTALLDIR>/bin/analyzer
{code}
\\
The Performance Analyzer opens a file browser for you to find and open an experiment. If it does not, choose File > Open Experiment.
*3.* Find the {{test.*.er}} experiment that you just created and open it.
The Performance Analyzer window should look similar to that below.
!1-ring_timeline.gif|alt="Performance Analyzer window with MPI tabs"!
The experiment opens on the MPI Timeline tab. The MPI Chart tab is next to it. In the right panel you can see the MPI Chart Controls and MPI Timeline Controls tabs.
{info:title=Note - }If you do not see the MPI Timeline and MPI Chart tabs, check for an old {{.er.rc}} file from a previous Sun Studio release in your home directory. If such a file is present, delete it and restart the Performance Analyzer.
{info}
The MPI Timeline shows a view of the data over time as the program was run through the collector. The horizontal axis shows elapsed time. At the bottom, the horizontal axis shows "relative time" with the origin at the left edge of the display. At the top, the horizontal axis shows "absolute time" where the origin is the start of the data. The vertical axis shows MPI process rank. Therefore, for each MPI process you can look horizontally to see what the process is doing as a function of elapsed time.
In this initial view of the timeline, you can answer the question "What is the time scale of program execution?" In the screen capture, the whole run takes about 5 seconds, but the steady state of the application program, its actual run time, spans only from about 3.90 to 4.05 seconds. The {{collect}} tool uses {{MPI_Init}} and {{MPI_Finalize}} to set up and terminate data collection.
h2. Navigating the MPI Timeline By Zooming and Panning
*1.* Click the MPI Timeline tab if it's not already selected.
*2.* Zoom in on the data by clicking and dragging from left to right on any process row, as shown by the directional arrow in the graphic below. When you release the mouse button, the area inside the box automatically expands to reveal a zoomed-in view.
!1.zoomdrag.gif!
An alternative to clicking and dragging is to use the zooming slider controls in the top left of the timeline:
!MPI_timeline_zoom_controls.gif|alt="Zooming controls in the MPI Timeline"!
Use the *horizontal slider* to change the time scale. As you zoom, you see progressively smaller intervals of time while all the processes remain visible.
Use the *vertical slider* to zoom in on the MPI processes.
Click the zoom-undo button in the MPI Timeline Controls shown below to go back to the previous level of zooming:
!zoom_undo.gif!
Click the zoom-undo button a second time to return to the first zoom.
*3.* Pan across the data by sliding the scroll bars located at the bottom and the right of the timeline.
Alternatively you can toggle between a pointer that zooms and a pointer that pans by clicking the hand icon in the MPI Timeline Controls:
!pan-zoom-toggle.gif|alt="Zoom/pan toggle button"!
When the pointer is a hand, you can drag across the MPI Timeline to pan horizontally.
h2. Viewing Message Details
*1.* Reset the view to the original, maximum, zoomed-out setting by clicking the zoom-reset button, which is located to the top left of the zoom sliders:
!MPI_timeline_zoom_controls_reset_edge.gif|alt="Zoom Reset button"!
*2.* Zoom in on the activity area by dragging on the area horizontally with the mouse so it looks similar to what you see here:
!2-ring-timeline-zoom.gif|alt="MPI Timeline zoomed in"!
In the zoomed-in timeline, you can now see that the steady state portion of the program execution appears to run from 3.93 seconds to 4.03 seconds.
You can also see that MPI functions are color coded. The black lines drawn between events represent point-to-point messages exchanged by the MPI processes.
With this view of the timeline, you can answer the question: "How long is one iteration before the pattern repeats?" The answer is roughly 10 milliseconds. Look at the relative time scale at the bottom to see how often the loop seems to repeat.
*3.* Click one of the black message lines.
The line turns red and details about the message are displayed in the right-hand panel MPI Timeline Controls tab:
!4.details.gif!
*4.* In the MPI Timeline Controls tab, find the Messages slider, then click and drag it to 0% as shown:
!5-ring-timeline-message-slider.gif|alt="MPI Timeline with Message Slider"!
The Messages slider controls the number of message lines displayed on the screen. At 0%, only functions are displayed in the MPI Timeline. In this simple example, 100% of the messages can be displayed. In complex applications, however, displaying every message could overwhelm the tool with data and make the screen too cluttered to be usable. Select a lower percentage to reduce the number of messages shown in the timeline. The tool adjusts the default level for the message volume so that the screen is readable and the tool remains responsive. If fewer than 100% of the messages are shown, the messages shown are those that are most "costly" in terms of the total time used in the message's send and receive functions.
*5.* Set the Messages slider back to 100%.
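The idea of showing only the most costly messages can be illustrated outside the tool. The following sketch is not the Performance Analyzer's implementation. It assumes a hypothetical text file with one message per line, holding an identifier, the time spent in the send function, and the time spent in the receive function. It keeps the requested percentage of messages with the largest combined time. The function name {{top_messages}} is hypothetical.

```shell
# top_messages FILE PERCENT
# FILE has lines of the form: id send_time recv_time
# Prints the ids of the PERCENT most costly messages, most costly first,
# where cost = send_time + recv_time.
top_messages() {
    total=$(awk 'NF >= 3' "$1" | wc -l)
    keep=$(( total * $2 / 100 ))
    # At 0% (or when the share rounds down to nothing), show no messages.
    [ "$keep" -gt 0 ] || return 0
    awk 'NF >= 3 { print $2 + $3, $1 }' "$1" |
        sort -k1,1nr |
        head -n "$keep" |
        awk '{ print $2 }'
}
```

With four messages in the file, {{top_messages messages.txt 50}} would print the two ids with the largest combined send and receive time, and a percentage of 0 prints nothing, which mirrors the slider at 0%.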
h2. Viewing Function Details and Application Source Code
*1.* Click on one of the {{MPI_Recv}} function events in the MPI Timeline.
The function is highlighted in yellow, and details about the function are displayed in the MPI Timeline Controls tab on the right:
!6.highlight.gif!
*2.* In the MPI Timeline Controls tab, click the button labeled *Show Call Stack if available*.
After a few moments, the call stack for the highlighted state should be shown in the MPI Timeline Controls tab:
!6.5-ring-callstack.gif|alt="MPI Timeline with Call Stack for Selected Event"!
*3.* When the Call Stack for Selected Event is displayed in the MPI Timeline Controls tab, click on {{main + 0x00000198, line 53 in "ring_c.c"}}.
*4.* Click the Source tab in the main Performance Analyzer panel.
If you get a message such as "Object file (unknown) not readable", make sure you selected the stack frame {{main + 0x00000198, line 53 in "ring_c.c"}}.
{info:title=Note - }Source is only visible when the source is in the same location it was in when the program was run through the collector, or when it can be found in the {{$expts}} path as set in {{.er.rc}} or in View > Set Data Presentation > Search Path. Source also needs to be compiled with {{\-g}}.
If the source code is not visible, you may not have started Analyzer from the directory containing the {{ring_c}} binary and source code. If this is the case, quit the Performance Analyzer and restart after you {{cd}} to the directory containing {{ring_c}}.
{info}
When the source becomes visible, you should see the following:
!7.marked.gif!
The source should show where {{main()}} calls {{MPI_Recv()}}. As you can see, {{MPI_Recv()}} is called from line 53 in the source. The green bar highlights metrics with high values. 274 receives are associated with line 53. If you look further down, you can see 274 sends are associated with {{MPI_Send}} on line 60.
*5.* Click the Functions tab in the main Performance Analyzer panel.
The Functions tab shows the same MPI Send and MPI Receive metrics in columns on the left side of a table. You can sort the table by clicking in the column headers.
!8.marked.gif|alt="Functions tab with MPI metrics"!
*6.* Click the MPI Timeline tab to return to the MPI timeline.
Do not click the regular Timeline tab because it does not apply to MPI programs.
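The counts of 274 shown in the Source tab can be cross-checked by walking the token around the ring by hand. The following is a minimal sketch under stated assumptions that are inferred from the sample output rather than copied from {{ring_c.c}}: rank 0 sends the starting value once before its loop, every rank then loops receiving and forwarding the token, rank 0 decrements the token each time it receives it, each rank leaves the loop after forwarding a 0, and rank 0 absorbs the last message with one receive outside the loop. The function name {{count_loop_calls}} is hypothetical.

```shell
# count_loop_calls NP START
# Walk a token around a ring of NP ranks and count the receive-and-forward
# pairs made inside each rank's loop, under the assumptions listed above.
count_loop_calls() {
    np=$1
    token=$2
    [ "$np" -ge 1 ] && [ "$token" -ge 1 ] || return 1
    rank=$(( 1 % np ))   # the initial send delivers the token to rank 1
    calls=0
    exited=0
    while [ "$exited" -lt "$np" ]; do
        # Only rank 0 decrements the token when it receives it.
        if [ "$rank" -eq 0 ]; then
            token=$(( token - 1 ))
        fi
        calls=$(( calls + 1 ))      # one in-loop receive plus one in-loop send
        # A rank leaves its loop right after forwarding a 0.
        if [ "$token" -eq 0 ]; then
            exited=$(( exited + 1 ))
        fi
        rank=$(( (rank + 1) % np ))
    done
    echo "$calls"
}
```

Running {{count_loop_calls 25 10}} prints 274, which matches the counts that the Source tab associates with lines 53 and 60. Under these assumptions, rank 0's initial send and final receive happen outside the loop, which is why the total is 274 rather than 275.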
h2. Filtering Data in the MPI Tabs
The filtering facility lets you select different views of the collected messaging data. You can undo and redo the filters using the filtering controls in either the MPI Chart Controls or the MPI Timeline Controls.
!filter_controls__timeline_labels.gif|alt="Filtering Controls"!
The first control filters the data by removing everything that is not currently in view.
The second control is the Filter Undo button which provides an associated drop down list for removing filters. Clicking this button removes the last filter applied. Clicking the down arrow presents a list of the filters applied, in the order they were applied, with the most recent at the top of the list. When you select a filter in this list, the selected filter and all filters above it on the list are removed.
The third control is the Filter Redo button, and it also has an associated drop down list for reapplying filters. Clicking the button reapplies the last filter that you removed. Clicking the down arrow opens a list of all the filters that have been removed, in the order in which they were removed. When you select a filter in this list, the selected filter and all filters above it on the list are reapplied.
You can redo and undo the filters by using the arrows, similar to going backward and forward in a web browser. You can even remove and apply more than one filter in one click by using the down arrows next to the filter buttons.
The following steps explain how to use a filter to focus on the steady state portion of the program by filtering out the {{MPI_Init}} and {{MPI_Finalize}} functions.
*1.* Zoom in on the area of absolute time t=3.93 to 4.03 by dragging as shown below:
!9a.drag.gif!
*2.* Click the filter button in the MPI Timeline Controls:
!filter_icon.gif|alt="Filter button in MPI Timeline Controls"!
It may look like nothing happened because the filtering is not evident until you change your view by zooming out or by looking at a chart.
*3.* Click the Zoom Undo button to go back out to the previous zoom.
!zoom_undo.gif!
The display now shows {{Uninstrumented}} in place of the {{MPI_Init}} and {{MPI_Finalize}} functions. White areas labeled as {{Uninstrumented}} indicate that there is no MPI data collected for that area or the data has been filtered out.
!10-ring-timeline-uninstrumented.gif|alt="Filtered Data Shown as Uninstrumented in MPI Timeline"!
h2. Using the Filter Stack
*1.* Drag vertically until you have zoomed in far enough to see a single {{MPI_Send}} function event. You may first have to drag horizontally to zoom in close enough to see some {{MPI_Send}} events.
!30b.gif!
*2.* Click the filter button one time to filter out all data except {{MPI_Send}} data.
!filter_icon.gif|alt="Filter button"!
*3.* Click the MPI Timeline tab again and click the zoom reset button.
!MPI_timeline_zoom_controls_reset_edge.gif!\\
The MPI Timeline might appear to show everything as Uninstrumented, but there is a hidden {{MPI_Send}} function.
!31b.gif!
*4.* There is always at least one transition where the label of a function starts, so zoom in on the beginning of the Uninstrumented states on the right side of the timeline until you see the hidden {{MPI_Send}} state.
!32b.gif!
Now suppose you want to go back and undo some of the filtering you have done.
*5.* Click the filter Undo drop down button to reveal a list of applied filters.
!filter_controls_undo_arrow.gif!
This list lets you choose which filters to remove. It works like a stack: if you select {{No filters applied}}, everything on top of it will be taken off which means there will be no filters applied. You should see something like the following in the list of filters:
| {{Timeline(Time(range)3398546253123,350489981242),Process(0,24)}} |
| {{No filters applied}} |
*6.* Select the top filter from the filter undo drop down list ({{Timeline(Time(range)...)}}).
The timeline should now look similar to the following:
!33b.gif!
*7.* Reset the zoom to confirm that your original filter is still in effect.
!MPI_timeline_zoom_controls_reset_edge.gif!
The timeline should look similar to the following:
!reset_zoom_after_undo_filter.gif!
h2. Using MPI Charts
Now you can explore the MPI Chart features with your filtered data. There are two types of data that you can view in chart form: Functions and Messages. In the following chart, we'll get an overview of which functions took the most time.
*1.* Click the MPI Chart tab to see a chart similar to the following.
!11.marked.gif!
The MPI Chart tab opens with a chart that shows the sum of the durations of the functions as they ran in all the processes. The vertical color scale to the right of the chart ranges from 0.01 seconds to 2.89 seconds. The {{MPI_Send}} and {{Application}} functions took almost no time, whereas the {{MPI_Recv}} function took the full 2.89 seconds.
*2.* Click on the red bar for the MPI_Recv function.
The exact value of the red bar is displayed in the MPI Chart Controls tab. 2.892717584 seconds were spent in this function across all process ranks.
*3.* Click near the {{Application}} and {{MPI_Send}} chart bars to see their values.
In this particular application, every process waits until the token has passed to every other process. As a result, 2.89 seconds were spent in {{MPI_Recv}} and only 0.03 seconds in {{Application}}, a state that represents time between MPI functions. All processes wait an equal amount, but any delay in the delivery of the token affects the whole application.
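Because the chart sums durations across all process ranks, dividing by the number of ranks gives a rough per-rank figure. The following is a small arithmetic sketch using the values quoted above. It assumes the total is spread evenly across the 25 ranks, and the function name {{per_rank_seconds}} is hypothetical.

```shell
# per_rank_seconds TOTAL_SECONDS NRANKS
# Average time per rank, assuming the total is spread evenly across ranks.
per_rank_seconds() {
    awk -v total="$1" -v n="$2" 'BEGIN { printf "%.4f\n", total / n }'
}
```

Running {{per_rank_seconds 2.892717584 25}} prints 0.1157, so each rank spent about 0.12 seconds in {{MPI_Recv}} on average. That is of the same order as the steady state window of roughly 0.1 seconds selected by the filter, which fits the picture of every rank waiting for the token most of the time.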
h2. Varying the MPI Chart Controls
This section shows how to use the MPI Chart controls in different ways to visualize the data. Depending on the program you are analyzing, some forms of charting are more useful than others. In this particular program, {{ring_c}}, we are focusing on message latency.
See [MPIAnalyzer:MPI Chart Controls and Attributes] for information about the chart attributes you can set.
\\
h3. Make a chart that shows where messages are being sent
*1.* Create a chart to look at messages by making the following selections in the MPI Chart Controls tab:
| Data Type: | Messages |
| Chart: | 2-D Chart |
| X Axis: | Send Process |
| Y Axis: | Receive Process |
| Metric: | Duration |
| Operator: | Maximum |
*2.* Click Redraw to draw a new chart:
!13b.highlight.gif!
This chart shows that Process 0 sends only to Process 1, Process 1 sends only to Process 2, and so on. The color of each box is set by the metric selected (Duration) and the operator (Maximum). Since this graph's Data Type is Messages and the operator is Maximum, each box shows the longest duration among the messages for that sender and receiver pair, where duration is the length of a message line in the time dimension.
Interestingly, the color key shows the range of message durations is from 0.3 msec to 9.7 msec. The messages that took the longest to arrive were sent from P14 to P15.
\\
*3.* Click on the square at Send Process = P14 and Receive Process = P15.
The details in the MPI Chart Controls tab show that the longest message from P14 to P15 took 9.724002 msec.
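To make the Maximum operator concrete, the following sketch computes the same kind of summary by hand: for each sender and receiver pair, it keeps the largest message duration. It is an illustration, not the tool's implementation, and it assumes a hypothetical text file with one message per line in the form {{send_rank recv_rank duration_msec}}. The function name {{max_by_pair}} is hypothetical.

```shell
# max_by_pair FILE
# FILE has lines of the form: send_rank recv_rank duration_msec
# Prints "send_rank recv_rank max_duration" for each pair, sorted by rank.
max_by_pair() {
    awk 'NF >= 3 {
             key = $1 " " $2
             # Keep the largest duration seen so far for this pair.
             if (!(key in max) || $3 + 0 > max[key]) max[key] = $3 + 0
         }
         END { for (key in max) print key, max[key] }' "$1" |
        sort -k1,1n -k2,2n
}
```

Each output line corresponds to one colored box in the 2-D chart. Replacing the comparison with a running sum would give the Sum operator instead.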
\\
h3. Make a chart to show which ranks waited longest to receive a message
*1.* Make the following selections in the MPI Chart Controls tab:
| Data Type: | Messages |
| Chart: | Y Histogram |
| Y Axis: | Receive Process |
| Metric: | Duration |
| Operator: | Maximum |
*2.* Click Redraw to draw a new chart:
!14b.gif!
The chart above shows that the P15 rank waited the longest to receive a message, at 7.67 msec.
Other processes with lengthy waits are P13, P7, and P5.
\\
*3.* To show when and where these large delays occurred, select a 2-D chart with time range on the X Axis:
| Data Type: | Messages |
| Chart: | 2-D Chart |
| X Axis: | Time range |
| Y Axis: | Receive Process |
| Metric: | Duration |
| Operator: | Maximum |
\\
*4.* Click Redraw.
!16c.gif!
For the P15 rank, click the red line and check the details in the right panel. You can see that the delay occurred at 3.980501280 seconds.
*5.* To show a histogram for when these long duration messages occurred:
| Data Type: | Messages |
| Chart: | X Histogram |
| X Axis: | Receive Time |
| Metric: | Duration |
| Operator: | Maximum |
*6.* Click Redraw.
!17b.gif!
The slowest message was received at 3.981 seconds.
h3. Look for an effect of the slow messages on time spent in MPI functions
To see the effect of the long duration messages, we will create a graph that shows duration of functions vs time.
*1.* Make the following selections in the MPI Chart Controls tab and click Redraw:
| Data Type: | Functions |
| Chart: | 2-D Chart |
| X Axis: | Exit Time |
| Y Axis: | Duration |
| Metric: | Duration |
| Operator: | Maximum |
The resulting graph shows clear regions of functions of long duration, especially for some functions ending at around t=3.995, which last 20.69 msec.
!19b.gif!
*2.* Isolate these long duration functions by dragging a box around them to zoom:
!19b.zoomdrag.gif!\\
*3.* Click the filter button.
The resulting image shows two dots near 3.995 and duration 20.7 msec:
!20b.filter.gif!
*4.* Click the MPI Timeline tab.
You can now identify the high duration functions on the MPI Timeline. They are the result of messages with slow delivery times.
!21b.gif!
This simple example showed the basics of how to examine the relationships between MPI functions and messages.
This concludes the basic tutorial for the Performance Analyzer's MPI features.