Project Title:
Automated Information Mining of Large Software Collections for the Extraction of Reusable Code
Perigee West Company
P.O. Box 1292
La Jolla, CA 92038-1292
93-1 06.10 9738__ AMOUNT REQUESTED $69,743
Abstract:
There exist today countless gigabytes of source code of widely
varying quality scattered across many corporate, government and
academic ftp sites throughout the world. Through the INTERNET a
skilled programmer can peruse an ftp site in France, for example,
as easily as a system in the next room. User-friendly interfaces
such as GOPHER and WAIS reduce the skills required for navigating
the INTERNET to find such resources. Huge collections of tar'd and
compressed source code can be acquired from across the world in
just moments of data transfer. Commercially available CD-ROM sources
provide hundreds of megabytes of source code for as little as $35.
Identifying and extracting reusable code modules for incorporation
into an existing code library system from such enormous collections
can be a daunting task. We are proposing the development of a
Software Information Mining Tool which can analyze large
collections of software utilizing vector-space and latent semantic
analysis approaches to text retrieval. Unlike previously proposed
systems, the proposed tool will focus on the problem of latent
semantic analysis of free-form text embedded in the source code
itself, augmented by a heuristic analysis of the structure of the
collection.
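
As a minimal sketch of the vector-space half of this approach (an
illustration under stated assumptions, not the design of the proposed
tool), the following Python fragment extracts the free-form comment
text from each file in a tiny collection, weights the terms by TF-IDF,
and ranks files against a query by cosine similarity. The names
COLLECTION, comment_terms, build_index and search are hypothetical, the
sample files are invented for the example, and only C-style comments
are recognized. The latent semantic analysis step, which would apply a
truncated singular value decomposition to the term-document matrix
before ranking, is omitted here.

```python
import math
import re
from collections import Counter

# Hypothetical miniature collection: file name -> source text.
COLLECTION = {
    "sort.c": "/* quicksort: sort an array of integers in place */\n"
              "void qs(int *a, int n);",
    "matrix.c": "/* multiply two dense matrices */\n"
                "// matrix product helper\n"
                "void mm(void);",
    "list.c": "/* insert a node into a linked list */\n"
              "void ins(void);",
}

# Naive C-style comment matcher (assumption: it ignores string literals
# that happen to contain comment delimiters).
COMMENT_RE = re.compile(r"/\*.*?\*/|//[^\n]*", re.DOTALL)
STOPWORDS = {"a", "an", "the", "of", "in", "into", "to", "two"}


def comment_terms(source):
    """Return lowercase words found in the comments of one source file."""
    text = " ".join(COMMENT_RE.findall(source)).lower()
    return [w for w in re.findall(r"[a-z]+", text) if w not in STOPWORDS]


def build_index(collection):
    """Build one sparse TF-IDF vector per file, plus the IDF table."""
    counts = {name: Counter(comment_terms(src))
              for name, src in collection.items()}
    n_docs = len(counts)
    doc_freq = Counter(t for c in counts.values() for t in c)
    idf = {t: math.log(n_docs / df) + 1.0 for t, df in doc_freq.items()}
    vectors = {name: {t: tf * idf[t] for t, tf in c.items()}
               for name, c in counts.items()}
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def search(query, vectors, idf):
    """Rank files by similarity of their comment text to the query."""
    words = re.findall(r"[a-z]+", query.lower())
    q_counts = Counter(w for w in words if w in idf)
    q = {t: tf * idf[t] for t, tf in q_counts.items()}
    ranked = sorted(((cosine(q, vec), name)
                     for name, vec in vectors.items()), reverse=True)
    return [(name, round(score, 3)) for score, name in ranked if score > 0]


if __name__ == "__main__":
    vectors, idf = build_index(COLLECTION)
    for name, score in search("sort integers", vectors, idf):
        print(name, score)
```

Because the ranking uses only comment text, a file whose comments share
no terms with the query is never returned. That limitation is the
motivation for adding a latent semantic step, which can relate
documents that use different but co-occurring vocabulary.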
A commercial tool for automating the identification, classification
and extraction of reusable source code from large existing
collections, which are worldwide in distribution, will result in
significant cost savings in the development of new software systems.
Our market research has indicated a significant commercial market
for the proposed capability.
Keywords: software re-use, information retrieval, latent semantic analysis