National Energy Research Scientific Computing Center
  A DOE Office of Science User Facility
  at Lawrence Berkeley National Laboratory
 

Analytics: Data Analysis and Data Mining



Data analysis is a broad term whose meaning may vary from one project to another. It may encompass anything from simple post-processing of observed or simulated data to more advanced mathematical and statistical computations to machine learning and data mining algorithms.

This page describes the tools and services that NERSC provides for a wide variety of data analysis needs. Most of these tools would likely be used in conjunction with tools for data management and visualization.

This document is organized into the three categories listed below. Examples of collaborations between NERSC users and the Analytics Team are listed under Case Studies. Also available on a separate page: links to documentation and reference information for data analysis software that is available on NERSC platforms.

For questions or help with data analysis, contact the NERSC Analytics Team at consult@nersc.gov.




Eye of a hurricane generated by a high-resolution CCSM CAM climate simulation, showing anomalous values in multiple variables: sea level pressure, total precipitable water, and surface friction velocity. (Data courtesy of Michael Wehner.)

Data Analysis Categories

  • Data Processing: e.g., removing noise or corrupt data, converting formats, normalizing, aggregating, merging data from different sources, selecting data subsets.
  • Mathematical Computation: e.g., statistical tests, optimization, filtering.
  • Machine Learning and Data Mining: e.g., dimensionality reduction, classification, clustering, time series analysis, pattern recognition, outlier detection.


Data Processing

Data processing refers to the general manipulation of data that is a necessary precursor to further analysis or visualization. For common tasks like merging and splitting files or launching a series of command-line programs, scripting languages such as Python, Perl, and Tcl/Tk are versatile choices: they are easily interleaved with UNIX-style commands, may be integrated with C/C++ libraries, and run on a variety of platforms. As scripting languages, they enable quick implementation, debugging, and testing without the need to worry about compilation or memory management.
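
As a minimal sketch of the kind of scripted data processing described above, the following Python example merges records from two CSV sources, drops corrupt rows, and selects a subset. The column names, the sample values, and the function names are hypothetical and chosen only for illustration.

```python
import csv
import io


def read_clean_rows(text):
    """Parse CSV text and skip rows whose 'value' field is not a number."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        try:
            row["value"] = float(row["value"])
        except (TypeError, ValueError):
            continue  # corrupt or missing value: drop the row
        rows.append(row)
    return rows


def merge_and_select(sources, minimum):
    """Merge cleaned rows from several sources and keep values >= minimum."""
    merged = []
    for text in sources:
        merged.extend(read_clean_rows(text))
    return [row for row in merged if row["value"] >= minimum]


# Hypothetical sample data standing in for the contents of two input files.
SOURCE_A = "station,value\nA1,3.5\nA2,bad\nA3,7.0\n"
SOURCE_B = "station,value\nB1,1.25\nB2,9.5\n"

if __name__ == "__main__":
    for row in merge_and_select([SOURCE_A, SOURCE_B], minimum=3.0):
        print(row["station"], row["value"])
```

In practice the strings would be replaced by the contents of real files, and the cleaning rule would depend on what counts as corrupt data in a given project.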

If you wish to build a visual interface for your data analysis steps, Tcl/Tk includes an easy-to-learn, high-level toolkit for quickly building graphical user interfaces (GUIs). For more information on scripting languages, see Comparing Python to Other Languages, Perl Versus, and Dynamic Languages vs. System Programming Languages.

When it is advantageous to distribute processing over multiple nodes, the MPI message passing library may be called from Fortran, C, or C++ code to pass data between them.

NERSC Supported Tools for Data Processing

     Perl      Perl at NERSC     Documentation     Tutorial     Further Reading
     Python      Python at NERSC     Documentation     Tutorial        
     Tcl/Tk      Tcl/Tk at NERSC     Documentation     Tutorial     Further Reading



Mathematical Computation

Depending on the type and scale of mathematical computation needed, available NERSC tools include low-level C/C++ libraries, downloadable packages for higher-level scripting languages, and packages that integrate graphics with built-in functions for numerical and/or symbolic mathematical computations.

NERSC provides a large number of math libraries that can be linked to user code. These libraries include both serial and parallel implementations of linear solvers, FFTs, PDE solvers, and basic linear algebra. In addition, the GNU Scientific Library (GSL) includes mathematical routines for linear algebra, sorting, FFTs, linear solvers, random number generation, numerical differentiation, and least-squares fitting.
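
To make the least-squares fitting mentioned above concrete, here is a minimal pure-Python sketch of an ordinary least-squares fit of a straight line y = a + b x. It only illustrates what such a library routine computes; it is not GSL's interface, and the function name fit_line and the sample data are assumptions.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations in x, and of cross deviations in x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


if __name__ == "__main__":
    # Hypothetical data lying exactly on y = 1 + 2x.
    intercept, slope = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
    print(intercept, slope)
```

For large data sets or ill-conditioned problems, a tested library routine is preferable to a hand-written loop like this one.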

There are many Python packages available for advanced mathematical and scientific computations. The NumPy package for array and vector manipulation is installed on several NERSC platforms, and many other Python packages are freely available for download. To download packages not installed at the global level on NERSC machines, browse or search the Python Cheese Shop or the Vaults of Parnassus. Packages may be installed in a user's local directory; that directory must then be added to the user's PYTHONPATH environment variable before the packages can be imported into the user's Python code.
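
The following minimal sketch shows how a directory listed in PYTHONPATH becomes importable. The module name localpkg is hypothetical, and a temporary directory stands in for a user's local install location.

```python
import os
import subprocess
import sys
import tempfile


def import_with_pythonpath(module_dir, module_name):
    """Run a child Python with module_dir on PYTHONPATH and import module_name."""
    env = dict(os.environ)
    existing = env.get("PYTHONPATH")
    # Prepend the local directory, keeping any entries that are already set.
    env["PYTHONPATH"] = (
        module_dir if not existing else module_dir + os.pathsep + existing
    )
    code = f"import {module_name}; print({module_name}.GREETING)"
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        capture_output=True,
        text=True,
    )
    return result.returncode, result.stdout.strip()


def demo():
    """Create a throwaway module (hypothetical name 'localpkg') and import it."""
    with tempfile.TemporaryDirectory() as local_dir:
        with open(os.path.join(local_dir, "localpkg.py"), "w") as handle:
            handle.write("GREETING = 'imported from a local directory'\n")
        return import_with_pythonpath(local_dir, "localpkg")


if __name__ == "__main__":
    print(demo())
```

In everyday use the variable is normally set once in the shell or in a login script rather than from Python; the child process here only makes the effect easy to observe.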

On some NERSC platforms, it is not necessary to use the module command to use Python. However, to access the NumPy module, it is necessary to load Python using the module command, e.g., module load python.
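
A script can check whether NumPy is importable before relying on it, which gives a clearer message than a bare import failure. This is a minimal sketch; the function name module_available and the wording of the hint are assumptions.

```python
import importlib.util


def module_available(name):
    """Return True if a top-level module called name can be found for import."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    if module_available("numpy"):
        print("NumPy is available")
    else:
        # Hint based on the note above; the exact command may vary by platform.
        print("NumPy not found; try 'module load python' before starting Python")
```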

Among the high-level, interactive applications with both command-line and graphical interfaces are MATLAB, IDL, Maple, and Mathematica. MATLAB and IDL are procedural programming languages with relatively simple syntax. Both include a wide variety of mathematical routines that can be used interactively, used as modules in interpreted or compiled programs, or integrated with external languages such as C, C++, Fortran, and Java. Maple and Mathematica differ in that they perform symbolic as well as numeric computations, so functional forms of expressions may be simplified, combined, and manipulated without numerical error, and later instantiated for numerical computations. Links to more information are provided at the end of this section.

MATLAB is an interactive tool for linear algebra, statistics, Fourier analysis, filtering, optimization, and numerical integration. It includes 2D and 3D graphics functions for visualizing data, as well as tools for building custom GUIs. The NERSC installation includes the following MATLAB toolboxes:

  • Optimization Toolbox and Library for unconstrained and constrained minimization, quadratic and linear programming, nonlinear least-squares and curve fitting, solving nonlinear systems of equations, and sparse and structured large-scale problems;
  • PDE Toolbox for solving 2D PDEs using FEM, adaptive meshing, and boundary condition specification;
  • Splines Toolbox for B-spline and piecewise-polynomial function approximation;
  • Statistics Toolbox for computing and fitting probability distributions, analysis of variance, hypothesis testing, linear and nonlinear regression, and parameter estimation; and
  • Signal Processing Toolbox for analog and digital signal processing, linear systems modeling, digital filter design, frequency-domain analysis, and spectral estimation.
Many additional MATLAB toolboxes are freely available at MATLAB's Central File Exchange. (See the next section for information on MATLAB's machine learning capabilities.)

IDL, together with its associated graphical interface IDLDE, is an interactive application used for data analysis, visualization, and cross-platform application development. It handles a variety of file formats, including many image formats, as well as netCDF, HDF, and HDF-EOS. IDL includes routines for curve and surface fitting, differentiation and integration, linear algebra (including LAPACK and Numerical Recipes routines), optimization, hypothesis testing, correlation analysis, gridding, and interpolation. It features a comprehensive image and signal processing library and 3D graphics and rendering capabilities.

Maple, together with its associated graphical version xmaple, is a tool for the manipulation of symbolic-algebraic expressions, arbitrary-precision numerics, and 2D/3D graphics. It computes both symbolic and numeric solutions to linear algebraic and differential equations, and includes linear solvers, matrix factorization, vector calculus capabilities, and statistical analysis routines. In addition, abstract and combinatorial mathematical concepts may be easily represented, and a comprehensive graphical interface is built in for plotting, visualization, and creating final documents.

Mathematica performs symbolic manipulation of algebraic equations, integrals, differential equations and other mathematical expressions, as well as numeric evaluation. It represents all expressions symbolically, in order to perform error-free computations, and permits automatic or manual selection of algorithms to instantiate expressions and find numerical solutions. The NERSC installation includes the Combinatorica extension for combinatorics and graph theory. See the Mathematica FAQ for more information and a discussion on which types of computational tasks Mathematica is appropriate for. (Mathematica's machine learning functionality is described in the next section.)
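
The difference between exact and floating-point evaluation can be illustrated with a small Python sketch. It uses the standard library's fractions module for exact rational arithmetic, which is only a limited stand-in for the symbolic engines in Maple and Mathematica; the function names are assumptions.

```python
from fractions import Fraction


def sum_tenths_float(count):
    """Add 0.1 to itself count times using binary floating point."""
    total = 0.0
    for _ in range(count):
        total += 0.1
    return total


def sum_tenths_exact(count):
    """Add 1/10 to itself count times using exact rational arithmetic."""
    total = Fraction(0)
    for _ in range(count):
        total += Fraction(1, 10)
    return total


if __name__ == "__main__":
    exact = sum_tenths_exact(10)
    # The exact sum equals 1; the floating-point sum carries rounding error.
    print("exact sum equals 1:", exact == 1)
    print("float sum equals 1:", sum_tenths_float(10) == 1.0)
    # Convert to a floating-point number only at the end, when one is needed.
    print("exact sum as float:", float(exact))
```

Keeping values exact until the final step mirrors, in a very small way, the idea of manipulating expressions symbolically and instantiating them numerically afterwards.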

For a comparative review of Maple, Mathematica, and MATLAB, see the Computing in Science & Engineering Technology Review, "3Ms for Instruction".

NERSC Supported Tools for Mathematical Computation

     GNU Scientific Library (GSL)      List of Routines     Documentation
     IDL      IDL at NERSC     Documentation
     Maple      Maple at NERSC     Documentation
     Mathematica      Mathematica at NERSC     Documentation
     MATLAB      MATLAB at NERSC     Documentation
     Python NumPy      Overview     Documentation
            See note in text above regarding how to access the NumPy module on NERSC platforms.

Tools with Limited Support at NERSC

     Perl and Python are installed on many NERSC platforms. Users may install Perl modules and Python packages locally. Perl modules are available from the Comprehensive Perl Archive Network (CPAN) website. The Python SciPy library was designed to work with NumPy arrays and provides routines for numerical integration and optimization. To find additional Python packages, browse by topic, e.g., 'Science/Engineering' or 'Mathematics', at the Python Cheese Shop or the Vaults of Parnassus.

IDL Extensions Not Available at NERSC

     IDL Wavelet Toolkit
     IDL Dataminer Option



Machine Learning and Data Mining

Many scientific data analysis tasks, such as feature detection, may be modeled as machine learning problems, particularly when there is noise and uncertainty in the data or when the desired features are not easily described and hence cannot be detected by deterministic methods. A brief introduction to the types of problems and algorithms that fall into the category of machine learning and data mining is available on a separate page.

NERSC's MATLAB installation includes both the Statistics Toolbox and the Neural Network Toolbox for designing, implementing, and visualizing supervised and unsupervised neural networks, which are useful for pattern recognition, nonlinear system identification, and control. Many additional MATLAB implementations of a wide variety of machine learning methods are freely available at MATLAB's Central File Exchange (listed under the statistics and probability category).

Included in Mathematica's built-in Statistics Library is a Cluster Analysis tool that performs either hierarchical or k-means clustering with a variety of similarity functions.
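
For readers unfamiliar with k-means, the following is a minimal pure-Python sketch of the standard (Lloyd's) iteration using squared Euclidean distance. It is not Mathematica's implementation; the function names, the choice of initial centers, and the sample points are assumptions for illustration.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centers):
    """Return, for each point, the index of its nearest center."""
    return [
        min(range(len(centers)), key=lambda i: squared_distance(p, centers[i]))
        for p in points
    ]


def update(points, labels, centers):
    """Move each center to the mean of the points assigned to it."""
    new_centers = []
    for i, old in enumerate(centers):
        members = [p for p, label in zip(points, labels) if label == i]
        if not members:
            new_centers.append(old)  # an empty cluster keeps its old center
            continue
        new_centers.append(
            tuple(sum(m[d] for m in members) / len(members) for d in range(len(old)))
        )
    return new_centers


def kmeans(points, initial_centers, max_iterations=100):
    """Alternate assignment and update steps until the labels stop changing."""
    centers = [tuple(c) for c in initial_centers]
    labels = assign(points, centers)
    for _ in range(max_iterations):
        centers = update(points, labels, centers)
        new_labels = assign(points, centers)
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centers


if __name__ == "__main__":
    # Hypothetical 2D points forming two well-separated groups.
    pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    print(kmeans(pts, initial_centers=[(0, 0), (10, 10)]))
```

The result of k-means depends on the initial centers, so real applications typically run it several times with different starting points or use a more careful initialization.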

While Python and Perl modules may be available for certain machine learning algorithms, it often may be easier to perform a general web search for software libraries in any common language that implement a particular algorithm you are interested in (e.g., search for "+k-means +clustering +code"). For most well-known algorithms, it is possible to freely download and use C/C++, R, or MATLAB code that can be called from a user's own code.

Also useful, though currently not available on NERSC platforms, is the open-source R Project, which provides a wide variety of statistical and graphical techniques (e.g., linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering). It is free and easy to install on multiple platforms, and there are numerous user-contributed R packages that implement mathematical and machine learning methods.

NERSC Supported Tools for Machine Learning and Data Mining

     Mathematica      Mathematica at NERSC     Documentation
     MATLAB      MATLAB at NERSC     Documentation

Tools with Limited Support at NERSC

     MATLAB is installed on many NERSC platforms. Users may download MATLAB files from the Central File Exchange.

Not Available at NERSC

Further Reading


Case Studies



Page last modified: Fri, 21 Mar 2008 18:02:38 GMT
Page URL: http://www.nersc.gov/nusers/analytics/analysis/
Web contact: webmaster@nersc.gov
Computing questions: consult@nersc.gov
