## **Programming Models**

**Facilitators** 

Paul Henning (LANL), Sadaf Alam (CSCS), Jonathan Carter (LBL)

Technical Committee Report to the Hybrid Multicore Consortium

First HMC Roadmap Workshop, January 19-22, San Francisco







### BREAKOUT PARTICIPANTS

- Sadaf Alam, CSCS (alam@cscs.ch)
- Jonathan Carter, LBL (itcarter@lbl.gov)
- Alan Coppola, OptNgn (ajjc@optngn.com)
- Richard Graham, ORNL (rlgraham@ornl.gov)
- Paul Henning, LANL (phenning@lanl.gov)
- Wen-mei Hwu, UIUC (hwu@crhc.uiuc.edu)
- John Levesque, ORNL (levesque@ornl.gov)
- Al McPherson, LANL (mcpherson@lanl.gov)
- Piyush Mehrota, NASA Ames (piyush.mehrota@nasa.gov)
- Dave Norton, PGI (dave.norton@pgroup.com)
- Philip Roth, ORNL (rothpc@ornl.gov)
- Sonia Sachs (soniasachs@gmail.com)
- John Shalf, LBL (jshalf@lbl.gov)
- Aniruddha Shet, ORNL (shetag@ornl.gov)
- John Thorp, LANL (thorp@lanl.gov)
- Kathy Yelick, LBL (yelick@eecs.berkeley.edu)
- Shujia Zhou, NASA (shujia.zhou@nasa.gov)









## CHARGE TO BREAKOUT SESSIONS

- Goal of Roadmap:
  - Identify technologies that need to be developed to make next generation, large-scale, accelerator-based systems "production ready"
  - Provide community input needed to prioritize and support activities
- Focus is near term, while keeping an eye toward the long term (avoid box canyons)
- Work with the other TCs to support the overall co-design of applications, architectures, programming, and performance and to build ties with and provide feedback to vendors.
- Develop strategies for early and broader access to these accelerator-based or future hybrid multicore systems.









## CHARGE TO PROGRAMMING MODELS

- Identify and report on programming models for developing applications on large-scale (accelerator-based) hybrid computer systems in the near term and in the future.
- Identify the types and degrees of parallelism provided by hybrid cores, and define key architectural metrics for this class of hybrid machine.









# SUMMARY OF PROGRAMMING MODELS TC

- Areas of interest:
  - Code and performance portability
  - Developer productivity: tools, programming for "mere mortals"
  - Data layout & motion, multiple disjoint address spaces, SIMD length, etc.
- Relation to other TCs
  - Relation to applications: algorithm design/selection
  - Relation to architectures: design roadmaps
  - Relation to performance: data motion costs, system modeling









#### Review of Grading Criteria

| Urgency                                                | Duration                                                   | Responsive                                                                      | Applicability                                                         | Timeline                                        |
|--------------------------------------------------------|------------------------------------------------------------|---------------------------------------------------------------------------------|-----------------------------------------------------------------------|-------------------------------------------------|
| <b>Critical</b><br>Needed as soon<br>as possible       | <b>Long</b><br>Applicable for<br>the foreseeable<br>future | <b>High</b><br>Additional<br>funding would<br>enable<br>significant<br>progress | <b>Broad</b><br>Applicable<br>beyond HPC                              | <b>Immediate</b><br>Results within<br>1-2 years |
| <b>Important</b><br>Needs to be done<br>within 3 years | <b>Medium</b><br>Will be<br>applicable for<br>Exascale     | <b>Moderate</b><br>Additional<br>funding would<br>enable progress               | <b>HPC</b><br>Applicable to all<br>of HPC                             | <b>Soon</b><br>Results within<br>2-5 years      |
| <b>Useful</b><br>Needed after 3<br>years               | <b>Near</b><br>Only applicable<br>for immediate<br>systems | <b>Low</b><br>Additional<br>funding will not<br>help very much                  | <b>Narrow</b><br>Only applicable<br>to Hybrid<br>Multicore<br>systems | <b>Eventually</b><br>Results after 5<br>years   |









# HMC Programming: Best Practices and Knowledge Transfer

- Description
  - Provide independent assessment of technologies.
  - Match algorithms to hardware.
  - Influence future investments
- Notes from Discussion
  - Reference implementations
  - Best practices
  - White papers & books
  - Benchmark suites
  - Illustrate range of available technology options

- Relations to other TCs
  - Applications: collaborate on design of architecture-aware algorithms
  - Libraries: preserve best practices, but algorithms should be revisited!
  - Architecture: co-design
- Related Projects
  - CUDA Zone, motifs, MAGMA project

| Urgency   | Duration                             | Responsive          | Applicability       | Timeline  |
|-----------|--------------------------------------|---------------------|---------------------|-----------|
| Important | Medium                               | High                | Narrow (a plus!)    | Immediate |

### **Transition Tools**

- Description
  - Tools to facilitate refactoring existing code bases to new programming paradigms.
  - Tools for identifying acceleration opportunities.
  - Choosing the right hardware for the application.
- Notes from Discussion
  - Language interoperability is crucial

- Relations to other TCs
  - Applications: requirements
  - Performance: modeling of systems
- **Related Projects** 
  - Compiler directives (e.g. OpenMP)
  - Language translation (e.g. C-to-CUDA, C-to-FPGA)
  - Performance analysis & modeling tool extensions (e.g. ROSE, TAU)
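
As a minimal sketch of what "identifying acceleration opportunities" can mean in practice, the example below profiles a small driver and reports the function with the largest own time, which is the usual first candidate for offload. It uses Python's standard `cProfile` as a stand-in for the HPC tools listed above; `stencil_sweep`, `write_checkpoint`, `time_step`, and `hottest_function` are hypothetical names invented for illustration and are not part of any project named in this report.

```python
import cProfile
import pstats


def stencil_sweep(grid):
    """Hypothetical compute-heavy kernel: three-point average over interior points."""
    out = [0.0] * (len(grid) - 2)
    for i in range(1, len(grid) - 1):
        out[i - 1] = (grid[i - 1] + grid[i] + grid[i + 1]) / 3.0
    return out


def write_checkpoint(grid):
    """Hypothetical cheap bookkeeping step."""
    return len(grid)


def time_step(grid, sweeps=50):
    """Hypothetical driver: repeated sweeps with fixed boundary values."""
    for _ in range(sweeps):
        grid = [grid[0]] + stencil_sweep(grid) + [grid[-1]]
    write_checkpoint(grid)
    return grid


def hottest_function(func, *args):
    """Profile func(*args) and return the name of the function with the most own time."""
    profiler = cProfile.Profile()
    profiler.runcall(func, *args)
    stats = pstats.Stats(profiler)
    # stats.stats maps (file, line, name) -> (cc, nc, tottime, cumtime, callers)
    (_, _, name), _ = max(stats.stats.items(), key=lambda item: item[1][2])
    return name
```

Calling `hottest_function(time_step, grid)` on a grid of a few thousand points would be expected to single out the sweep kernel rather than the driver or the checkpoint step. A real transition tool would additionally need to weigh data-movement cost before recommending offload.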

| Urgency  | Duration | Responsive | Applicability | Timeline |
|----------|----------|------------|---------------|----------|
| Critical | Medium   | High       | HPC           | Soon     |

#### Debugging and Performance Support

- Description
  - Capability to access debugging and performance data on HMC hardware and runtime
  - Correlating data from heterogeneous hardware components
  - Bridging the semantic gap between low-level data and high-level programming models
- Notes from Discussion
  - Goal: Uniform interface between tools and architectural features for portability

- Relations to other TCs
  - Architecture: collaboration on two-way exchange of information on debugging and performance
  - Performance: analysis tools
- Related Projects
  - Consumers: NVIDIA Nexus, vampir, oprofile, TAU, TotalView, Allinea DDT, Charm++
  - PAPI



#### HMC and Non-HMC Performance Portability

- Description
  - Single code base for performance on multiple architectures.
  - Addressing explicitly managed memory hierarchies
- Notes from Discussion
  - What are the implications of maintaining multiple code bases (V&V, feature creep, etc.)?
  - What breadth of application space?

- Relations to other TCs
  - Applications: what is "acceptable" performance, when needed?
  - Architecture: compatibility or general-purpose feature additions
- **Related Projects** 
  - MCUDA, OpenCL, CUDA-Fortran
  - Autotuning
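
One way a single code base can adapt to different architectures is autotuning: the kernel exposes a tunable parameter and a small driver times each candidate on the target machine. The sketch below illustrates the idea in plain Python under simplifying assumptions. `blocked_sum` and `autotune` are hypothetical names, and the block size stands in for parameters such as tile size or SIMD width on real hardware.

```python
import time


def blocked_sum(data, block):
    """Sum data in chunks of `block` elements.

    The result does not depend on `block`, up to floating-point rounding.
    """
    total = 0.0
    for start in range(0, len(data), block):
        total += sum(data[start:start + block])
    return total


def autotune(kernel, data, candidates, repeats=3):
    """Time kernel(data, p) for each candidate p; return (fastest p, all timings)."""
    timings = {}
    for p in candidates:
        best = float("inf")
        for _ in range(repeats):
            t0 = time.perf_counter()
            kernel(data, p)
            best = min(best, time.perf_counter() - t0)
        timings[p] = best  # keep the best of several runs to reduce noise
    return min(timings, key=timings.get), timings
```

For example, `autotune(blocked_sum, data, [16, 256, 4096])` returns whichever block size ran fastest on the current machine, so the winner may differ from one platform to another while the source stays unchanged.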



#### **Expressive Programming Environments**

- Description
  - Reduce effort to utilize accelerator hardware
  - Capture developer's <u>intent</u> in a more declarative way, then develop back-ends for HMC
- Notes from Discussion

- Relations to other TCs
  - Applications: co-design of declarative programming environments
- **Related Projects**
  - Thrust
  - MATLAB
  - Python (Copperhead, SciPy)
  - Domain specific languages
  - HPCS Languages
  - FPGA Workflow (LabVIEW, C2H, MATLAB-to-FPGA)
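
The idea of capturing intent declaratively and then supplying back-ends can be sketched as follows. The developer states what to compute (a map followed by a reduction), and interchangeable back-ends decide how to execute it. All names here (`MapReduce`, `run_serial`, `run_chunked`, `sum_of_squares`) are hypothetical, and the chunked back-end is only a stand-in for an accelerator back-end. It assumes the combiner is associative and that `identity` is its neutral element.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Callable


@dataclass(frozen=True)
class MapReduce:
    """Declarative description: apply `mapper` to each element, combine with `combiner`."""
    mapper: Callable[[float], float]
    combiner: Callable[[float, float], float]
    identity: float


def run_serial(op, data):
    """Reference back-end: a plain sequential loop."""
    acc = op.identity
    for x in data:
        acc = op.combiner(acc, op.mapper(x))
    return acc


def run_chunked(op, data, chunk=4):
    """Stand-in for a parallel back-end: reduce independent chunks, then combine partials."""
    partials = [
        reduce(op.combiner, map(op.mapper, data[i:i + chunk]), op.identity)
        for i in range(0, len(data), chunk)
    ]
    return reduce(op.combiner, partials, op.identity)


# The application states its intent once; either back-end can execute it.
sum_of_squares = MapReduce(mapper=lambda x: x * x,
                           combiner=lambda a, b: a + b,
                           identity=0.0)
```

Because the description carries no loop order, a back-end is free to reorder or partition the work. That freedom is what libraries such as Thrust rely on when targeting accelerator hardware.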

| Urgency | Duration                | Responsive | Applicability | Timeline   |
|---------|-------------------------|------------|---------------|------------|
| Useful  | Long                    | Moderate   | Broad         | Eventually |

## BREAKOUT SUMMARY

| Topic                                   | Urgency   | Duration | Responsive | Applicability | Timeline   |
|-----------------------------------------|-----------|----------|------------|---------------|------------|
| HMC Programming: Best Practices         | Important | Medium   | High       | Narrow        | Immediate  |
| Transition Tools                        | Critical  | Medium   | High       | HPC           | Soon       |
| Debugging and Performance Support       | Important | Long     | High       | Broad         | Soon       |
| HMC & non-HMC Performance Portability   | Important | Long     | Moderate   | Broad         | Eventually |
| Expressive Programming Environments     | Useful    | Long     | Moderate   | Broad         | Eventually |






#### NOTES AND RECOMMENDATIONS

- Testbeds: a large variety of small systems to test cross-platform applicability
- Clusters: useful to evaluate programming models (e.g. PGAS), but only up to a point
- Stability of development and execution environments
- Cross-cutting collaboration is critical







