Analytics: Data Analysis and Data Mining
Data analysis is a broad term whose meaning may vary from one
project to another. It may encompass anything from simple
post-processing of observed or simulated data to more advanced
mathematical and statistical computations to machine learning
and data mining algorithms.
This page provides information on tools and services provided by
NERSC for a wide variety of data analysis needs. Most of these
tools are likely to be used in conjunction with tools for
data management and visualization.
This document is organized into the three
categories listed below.
Examples of collaborations between NERSC users and the Analytics Team
are listed under Case Studies.
Also available on a separate page: links to documentation and reference
information for data analysis software that is available on NERSC platforms.
For questions or help with data analysis, contact the
NERSC Analytics Team at
consult@nersc.gov.
Eye of hurricane generated by high-resolution CCSM CAM climate simulation,
showing anomalous values in multiple variables: sea level pressure,
total precipitable water, and surface friction velocity.
(Data courtesy of Michael Wehner.)
Data Analysis Categories
- Data Processing:
e.g., removing noise or corrupt data, converting
formats, normalizing, aggregating, merging data from different
sources, selecting data subsets.
- Mathematical Computation:
e.g., statistical tests, optimization, filtering.
- Machine Learning and Data Mining:
e.g., dimensionality reduction, classification, clustering, time
series analysis, pattern recognition, outlier detection.
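As a small illustration of the Data Processing category, the sketch below drops corrupt readings and normalizes what remains. It uses only the Python standard library; the function name clean_and_normalize and the sample readings are hypothetical, and treating None, NaN, and infinite values as "corrupt" is an assumption made for the example.

```python
import math
import statistics


def clean_and_normalize(values):
    """Drop missing or non-finite readings, then rescale to zero mean and unit variance."""
    # Keep only real, finite numbers; None, NaN, and inf are treated as corrupt.
    clean = [float(v) for v in values
             if isinstance(v, (int, float)) and math.isfinite(v)]
    if len(clean) < 2:
        raise ValueError("need at least two valid values to normalize")
    mean = statistics.fmean(clean)
    stdev = statistics.pstdev(clean)  # population standard deviation
    if stdev == 0:
        # All values identical: every normalized value is zero.
        return [0.0 for _ in clean]
    return [(v - mean) / stdev for v in clean]


# Made-up readings containing two corrupt entries.
readings = [2.0, 4.0, float("nan"), None, 6.0]
normalized = clean_and_normalize(readings)
```

Whether to discard, flag, or interpolate corrupt values depends on the project; discarding is simply the shortest choice to show here.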
Data Processing
Data processing refers to general manipulation of data that is a
necessary precursor to further analysis or visualization of the
data. For common tasks like merging and splitting files or launching
a series of command-line programs, scripting languages such as Python, Perl, and Tcl/Tk are versatile: they are
easily interleaved with UNIX-style commands, can be integrated with
C/C++ libraries, and run on a variety of platforms. As scripting
languages, they enable quick implementation and debugging/testing
without worrying about compilation or memory management.
If you wish to build a visual interface for your data analysis steps,
Tcl/Tk includes an easy-to-learn, high-level toolkit for quickly
building graphical user interfaces (GUIs). For more information on
scripting languages, see Comparing
Python to Other Languages, Perl Versus, and
Dynamic Languages
vs. System Programming Languages.
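The file-merging task mentioned above can be sketched in a few lines of Python. This is a minimal example, assuming the input files are CSV files that share the same header row; the function name merge_csv and the file names are hypothetical and chosen only for illustration.

```python
import csv
import tempfile
from pathlib import Path


def merge_csv(paths, out_path):
    """Concatenate CSV files that share a header; return the number of data rows written."""
    rows_written = 0
    header = None
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        for path in paths:
            with open(path, newline="") as src:
                reader = csv.reader(src)
                file_header = next(reader, None)
                if file_header is None:
                    continue  # skip empty files
                if header is None:
                    # The first non-empty file defines the header of the output.
                    header = file_header
                    writer.writerow(header)
                elif file_header != header:
                    raise ValueError(f"header mismatch in {path}")
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written


# Demonstration with two small, made-up files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    a = Path(tmp) / "run_a.csv"
    b = Path(tmp) / "run_b.csv"
    a.write_text("step,value\n1,0.5\n2,0.7\n")
    b.write_text("step,value\n3,0.9\n")
    merged = Path(tmp) / "merged.csv"
    n_rows = merge_csv([a, b], merged)
    merged_lines = merged.read_text().splitlines()
```

Rows are streamed one at a time rather than loaded into memory, which keeps the approach usable for files larger than available RAM.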
When it is advantageous to distribute processing over multiple nodes,
the MPI message passing library may be
invoked from Fortran, C, or C++ code to pass data between them.
NERSC Supported Tools for Data Processing
Mathematical Computation
Depending on the type and scale of mathematical computation needed,
available NERSC tools include low-level C/C++ libraries, downloadable packages
for higher-level scripting languages, and packages that integrate graphics
with built-in functions for numerical and/or symbolic mathematical
computations.
NERSC provides a large number of
math libraries
that can be linked to user code. These libraries include both serial and
parallel implementations of linear solvers, FFTs, PDE solvers, and
basic linear algebra. In addition, the
GNU Scientific Library
(GSL) includes mathematical routines for linear algebra, sorting,
FFTs, linear solvers, random number generation, numerical
differentiation, and least-squares fitting.
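To make the idea of least-squares fitting concrete, here is a plain-Python sketch of an ordinary least-squares fit of a straight line. It is not GSL's interface and does not call any of the libraries named above; the function name fit_line and the sample data are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("xs and ys must have the same length, at least 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up points lying exactly on y = 2x + 1.
slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
```

For large or ill-conditioned problems, a compiled library routine is likely to be both faster and numerically safer than this direct formula.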
There are many Python packages available for advanced
mathematical and scientific computations.
The NumPy package for array and vector
manipulation is installed on several NERSC platforms, and many
other Python packages are freely available for download. To download packages
not installed at the global level on NERSC machines, browse or search the Python
Cheese Shop or the Vaults
of Parnassus. Packages may be installed in a user's local
directory, which must then be added to the user's PYTHONPATH
environment variable so that the packages can be imported into the user's Python code.
On some NERSC platforms, it is not necessary to use the module command
to use Python. However, to access the NumPy module,
it is necessary to
load Python using the module command, e.g., module load python.
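The effect of PYTHONPATH can be illustrated without installing anything. Directories listed in PYTHONPATH are placed on sys.path when the interpreter starts, so the sketch below adds a directory to sys.path directly and then imports a module from it. The module name mylocalpkg and its contents are hypothetical stand-ins for a locally installed package.

```python
import importlib
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as local_dir:
    # Stand-in for a package installed under the user's own directory.
    (Path(local_dir) / "mylocalpkg.py").write_text("ANSWER = 42\n")

    # Entries from PYTHONPATH end up on sys.path at startup; adding the
    # directory here has the same effect for the running interpreter.
    sys.path.insert(0, local_dir)
    importlib.invalidate_caches()
    try:
        mylocalpkg = importlib.import_module("mylocalpkg")
        answer = mylocalpkg.ANSWER
    finally:
        # Restore sys.path so the temporary directory is not left behind.
        sys.path.remove(local_dir)
```

In everyday use the directory would instead be set once in the shell startup file through the PYTHONPATH variable, so that every Python session picks it up.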
Among the high-level, interactive applications with both command-line
and graphical interfaces are MATLAB, IDL, Maple,
and Mathematica.
MATLAB and IDL are procedural programming languages with a relatively
simple syntax and include a wide variety of mathematical routines that
can be used interactively, used as modules in interpreted or compiled
programs, or integrated with external languages such as C, C++,
Fortran, Java, and others. Maple and Mathematica differ in that they
perform symbolic as well as numeric computations, so functional forms
of expressions may be simplified, combined, and manipulated without
numerical error, and later instantiated for numerical
computations. Links to more information are provided at the end of
this section.
MATLAB is an
interactive tool for linear algebra, statistics, Fourier analysis,
filtering, optimization, and numerical integration. It includes 2D
and 3D graphics functions for visualizing data, as well as tools for
building custom GUIs. The NERSC installation includes the following MATLAB toolboxes:
- Optimization
Toolbox and Library for unconstrained and constrained
minimization, quadratic and linear programming, nonlinear
least-squares and curve fitting, solving nonlinear systems of
equations, and sparse and structured large-scale problems;
- PDE Toolbox for
solving 2D PDEs using FEM, adaptive meshing, and boundary condition
specification;
- Splines Toolbox
for piecewise-polynomial function approximation using B-splines;
- Statistics
Toolbox for computing and fitting probability distributions,
analysis of variance, hypothesis testing, linear and nonlinear
regression and parameter estimation; and
- Signal Processing
Toolbox for analog and digital signal processing, linear systems
modeling, digital filter design, frequency-domain analysis, and
spectral estimation.
Many additional MATLAB toolboxes are freely
available at MATLAB's
Central
File Exchange. (See the next section for information
on MATLAB's machine learning capabilities.)
IDL, together with its associated
graphical interface IDLDE, is an interactive application used for
data analysis, visualization, and cross-platform application development.
It handles a variety of
file
formats, including many image formats, as well as netCDF, HDF, and
HDF-EOS. IDL includes routines for curve and surface-fitting, differentiation
and integration, linear algebra (including LAPACK and Numerical Recipes),
optimization, hypothesis testing, correlation analysis, gridding and
interpolation. It also features a comprehensive image and signal
processing library and 3D graphics and rendering capabilities.
Maple,
together with its associated graphical version xmaple, is a tool for the
manipulation of symbolic-algebraic expressions, arbitrary-precision
numerics, and 2D/3D graphics. It computes both symbolic and numeric
solutions to linear algebraic and differential equations, and includes
linear solvers, matrix factorization, vector calculus capabilities,
and statistical analysis routines. In addition, abstract and
combinatorial mathematical concepts may be easily represented, and a
comprehensive graphical interface is built in for plotting,
visualization, and creating final documents.
Mathematica
performs symbolic manipulation of algebraic equations, integrals, differential
equations and other mathematical expressions, as well as numeric
evaluation. It represents all expressions symbolically, in order to
perform error-free computations, and permits automatic or manual
selection of algorithms to instantiate expressions and find numerical
solutions. The NERSC installation includes the Combinatorica extension for
combinatorics and graph theory. See the Mathematica FAQ
for more information and a discussion on which types of computational tasks
Mathematica is appropriate for. (Mathematica's machine
learning functionality is described in the next
section.)
For a comparative review of Maple, Mathematica, and MATLAB, see the Computing
in Science & Engineering Technology Review,
"3Ms for Instruction".
NERSC Supported Tools for Mathematical Computation
Tools with Limited Support at NERSC
Perl and Python are installed on many NERSC platforms. Users may install Perl
modules and Python packages locally.
Perl modules are available from the Comprehensive Perl Archive Network
(CPAN) Web site.
The Python SciPy
library was designed to work with NumPy arrays and provides routines for numerical
integration and optimization.
To find additional Python packages, browse by topic, e.g., 'Science/Engineering'
or 'Mathematics', at the
Python Cheese Shop
or the Vaults of Parnassus.
IDL Extensions Not Available at NERSC
Machine Learning and Data Mining
Many scientific data analysis tasks,
such as feature detection, may be modeled as machine learning problems,
particularly when there is noise and uncertainty in the data or when
the desired features are not easily described and hence cannot be
detected by deterministic methods.
A brief introduction to the types of problems and algorithms that fall into the
category of machine learning and data mining is available on a separate
page.
NERSC's MATLAB installation includes both the
Statistics
Toolbox and the Neural Network Toolbox for designing, implementing, and visualizing
supervised and unsupervised neural networks, which are useful for pattern
recognition, nonlinear system identification, and control.
Many additional MATLAB implementations of a wide variety of machine
learning methods are freely available at MATLAB's
Central
File Exchange (listed under the statistics and probability
category).
Included in Mathematica's built-in Statistics Library
is a Cluster Analysis
tool that performs either hierarchical or k-means clustering with a variety of
similarity functions.
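For readers unfamiliar with k-means clustering, the following plain-Python sketch shows the basic iteration (Lloyd's algorithm) with Euclidean distance. It is not Mathematica's implementation; the function name kmeans, the sample points, and the choice to use the first k points as initial centers are all assumptions made to keep the example short and reproducible.

```python
import math


def kmeans(points, k, iterations=20):
    """Lloyd's k-means on tuples of floats, using Euclidean distance.

    Initial centers are simply the first k points (an assumption made for
    reproducibility; real tools usually choose them more carefully).
    """
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    centers = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members)))
            else:
                new_centers.append(centers[c])  # keep the center of an empty cluster
        if new_centers == centers:
            break  # converged: centers no longer move
        centers = new_centers
    return labels, centers


# Two well-separated, made-up groups of 2D points.
data = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0),
        (1.0, 0.0), (10.0, 11.0), (11.0, 10.0)]
labels, centers = kmeans(data, k=2)
```

The result depends on the initial centers, which is one reason established tools offer several initialization strategies and similarity functions.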
While Python and Perl modules may be available for certain machine
learning algorithms, it often may be easier to perform a general web
search for software libraries in any common language that implement a
particular algorithm you are interested in (e.g., search for "+k-means
+clustering +code"). For most well-known algorithms, it is
possible to freely download and use C/C++, R, or MATLAB code that can
be called from a user's own code.
Also useful, though currently not available on NERSC platforms, is the
open-source R Project, which
provides a wide variety of statistical and graphical techniques
(e.g., linear and nonlinear modeling, classical statistical tests,
time-series analysis, classification, clustering). It is free and easy
to install on multiple platforms, and there are numerous user-contributed
R packages
that implement mathematical and machine learning methods.
NERSC Supported Tools for Machine Learning and Data Mining
Tools with Limited Support at NERSC
MATLAB is installed on many NERSC platforms. Users may download
MATLAB files from the
Central File Exchange.
Not Available at NERSC
Further Reading
Case Studies