Skip to content
master
Go to file
Code

README.md

Introduction

The University of Nebraska-Lincoln's (UNL) Aida digital libraries research team and the Library of Congress (LC) collaborated on a "summer of machine learning" in 2019 to explore machine learning techniques for extending the accessibility of digital collections. The UNL team developed a number of prototype explorations over multiple iterations to investigate a range of questions and issues related to the digital materials, the LC's collections, and to machine learning practices in cultural heritage organizations. The UNL team employed a variety of machine learning approaches such as back-propagation neural network-based classifiers and deep learning approaches, including convolutional neural networks. More specifically, these projects involve VGG16, ResNeXt, dhSegment, and a fusion network combining ResNeXt and U-Net.

This repository includes the code developed and used across the team's explorations.

Getting Started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.

Prerequisites

For Exploration - Document Segmentation, the required software systems and libraries are:

  • Anaconda >= 4.3
  • Python >= 3.6
  • TensorFlow 1.13
  • CUDA 10.0 [if training on GPU]
  • imageio >= 2.5
  • pandas >= 0.24.2
  • shapely >= 1.6.4
  • scikit-learn >= 0.20.3
  • scikit-image >= 0.15.0
  • opencv-python >= 4.0.1
  • tqdm >= 4.31.1
  • sacred 0.7.4
  • requests >= 2.21.0
  • click >= 7.0

For Exploration - Graphic Element Classification and Text Extraction and Exploration - Digitization Type Differentiation, the required software systems and libraries are:

  • Python 3.7
  • MXNet 1.5
  • CUDA 10.0 [if training on GPU]
  • Matplotlib 3.1.1
  • opencv-python 4.1
  • numpy 1.17

For Exploration - Document Type Classification, the required software systems and libraries are:

  • Anaconda >= 4.3
  • Python >= 3.6
  • TensorFlow 1.13
  • CUDA 10.0 [if training on GPU]
  • opencv-python >= 4.0.1
  • numpy >= 1.16.2
  • scikit-learn >= 0.20.3
  • scikit-image >= 0.15.0
  • matplotlib >= 1.4.3
  • pandas >= 0.24.2
  • seaborn 0.9.0

For Exploration - Document Image Quality Assessment, the required software systems and libraries are:

  • Python 3.7
  • scipy 1.3.1
  • opencv-python 4.1
  • skimage 0.15

Installing

Step-by-step instructions on how to install required software systems and libraries for each project

For Exploration - Document Segmentation

  1. Download Python 3.6 from https://www.python.org/downloads/
  2. Download CUDA 10.0 from https://developer.nvidia.com/cuda-toolkit-archive
  3. Install Anaconda or Miniconda (installation procedure)
  4. Open Terminal (for MacOS), Command-Line (for Windows)
  5. Go to the codebase/Exploration - Document Segmentation folder
  6. Create a virtual environment and activate it
conda create -n segmentation python=3.6
source activate segmentation
  1. Install packages
python setup.py install

For Exploration - Graphic Element Classification and Text Extraction and Exploration - Digitization Type Differentiation

  1. Download Python 3.7 from https://www.python.org/downloads/
  2. Download CUDA 10.0 from https://developer.nvidia.com/cuda-toolkit-archive
  3. Install downloaded installation file
  4. Open Terminal (for macOS), Command-Line (for Windows)
  5. Install MXNet
pip install 'mxnet-cu100==1.5.1'
  1. Install Matplotlib
python -m pip install -U 'matplotlib==3.1.1'
  1. Install opencv-python
pip install 'opencv-python==4.1'
  1. Install numpy
pip install 'numpy==1.17'

For Exploration - Document Type Classification

  1. Download Python 3.6 from https://www.python.org/downloads/
  2. Download CUDA 10.0 from https://developer.nvidia.com/cuda-toolkit-archive
  3. Install Anaconda or Miniconda (installation procedure)
  4. Open Terminal (for MacOS), Command-Line (for Windows)
  5. Go to the codebase/Exploration - Digitization Type Differentiation folder
  6. Create a virtual environment and activate it
conda create -n classification python=3.6
source activate classification
  1. Install packages
python setup.py install

For Exploration - Document Image Quality Assessment

  1. Download Python 3.7 from https://www.python.org/downloads/
  2. Install downloaded installation file
  3. Open Terminal (for macOS), Command-Line (for Windows)
  4. Install scipy
pip install 'scipy==1.3.1'
  1. Install opencv-python
pip install 'opencv-python==4.1'
  1. Install skimage
pip install 'scikit-image==0.15'

Running the demonstrations

Exploration - Document Segmentation:

  1. Download all files in demo/Exploration - Document Segmentation https://git.unl.edu/unl_loc_summer_collab/codebase/tree/master/demo/Exploration%20-%20Document%20Segmentation
  2. Install required softwares and libraries
  3. Download one of the following dataset: (1) https://git.unl.edu/unl_loc_summer_collab/labeled_data/tree/master/ENP_500 or (2) https://git.unl.edu/unl_loc_summer_collab/labeled_data/tree/master/difficulty_collection, for segmentation or clustering task, respectively
  4. Copy the downlaoded folder to the downloaded 'Exploration - Document Segmentation' folder
  5. Run one of the following command, depending on the purpose
# Activate virtual environment
source activate segmentation
# For segmentation task
python demo_segmentation.py
# For clustering task
python demo_clustering.py

Exploration - Graphic Element Classification and Text Extraction:

  1. Download all files in demo/Exploration - Graphic Element Classification and Text Extraction https://git.unl.edu/unl_loc_summer_collab/codebase/tree/master/demo/Exploration%20-%20Graphic%20Element%20Classification%20and%20Text%20Extraction
  2. Install the required software and libraries
  3. Download dataset from https://git.unl.edu/unl_loc_summer_collab/labeled_data/tree/master/BeyondWord_orginal_resolution
  4. Copy the downloaded folder to the downloaded 'Exploration - Graphic Element Classification and Text Extraction' folder
  5. Run the evaluation script
python eval.py

Exploration - Document Type Classification:

  1. Download all files in demo/Exploration - Document Type Classification https://git.unl.edu/unl_loc_summer_collab/codebase/tree/master/demo/Exploration%20-%20Document%20Type%20Classification
  2. Install required softwares and libraries
  3. Download dataset from https://git.unl.edu/unl_loc_summer_collab/labeled_data/tree/master/suffrage_1002
  4. Copy the downlaoded folder to the downloaded 'Exploration - Document Type Classification' folder
  5. Run the demonstration script
# Activate virtual environment
source activate Exploration - Document Type Classification
# Run demonstration
python demo_classification.py

Exploration - Digitization Type Differentiation:

  1. Download all files in [demo/Exploration - Digitization Type Differentiation] https://git.unl.edu/unl_loc_summer_collab/codebase/tree/master/demo/Exploration%20-%20Digitization%20Type%20Differentiation
  2. Install the required software and libraries
  3. Download dataset from https://git.unl.edu/unl_loc_summer_collab/labeled_data/tree/master/micrpfilm_scanning
  4. Copy the downloaded folder to the downloaded 'Exploration - Digitization Type Differentiation' folder
  5. Run the evaluation script
python eval.py

Breaking down into end-to-end tests

Please read the README file inside each project folder for a description of each end-to-end test.

Built With

  • Python - The programming language
  • CUDA Toolkit - Enable GPU for model training
  • MXNet - Deep learning framework
  • TensorFLow - Deep learning framework
  • Matlab - Math, graphics, programming platform

Contributing

--- Inputs Technique Output Reports
Exploration - Document Segmentation (segmentation) ENP_500 (European historical newspaper)
Beyond Words
U-Net 5 class pixel-level segmented image Progress report - Chulwoo Pack - 07312019.pdf
Progress report - Chulwoo Pack - 08052019.pdf
Exploration - Document Clustering (clustering) ENP_500 (European historical newspaper) t-SNE Clustered manifold Progress report - Chulwoo Pack - 09232019.pdf
Exploration - Graphic Element Classification and Text Extraction ENP_500 (European historical newspaper)
Beyond Words
U-NeXt Predicted region segmentation Progress report - Yi Liu - 07302019.pdf
Exploration - Document Type Classification suffrage_1002 (LoC Suffrage campaign) U-Net Type of document image: handwritten, typed, and mixed Progress report - Chulwoo Pack - 08132919
Progress report - Chulwoo Pack - 08202019.pdf
Exploration - Document Image Quality Assessment Civil War Campaign DIQA Four quality scores Progress report - Yi Liu - 08122019.pdf
Progress report - Yi Liu - 09052019.pdf
Exploration - Document Image Quality Assessment difficulty_collection (LoC Manuscript/Mixed material) U-Net, DIQA visual difficulty correlation Progress report - Chulwoo Pack - 10312019.pdf
Exploration - Digitization Type Differentiation Civil War Campaign ResNeXt Classify micrpfilm or scanning Progress report - Yi Liu - 09052019.pdf
Progress report - Yi Liu - 10292019.pdf

Authors

  • Yi Liu - research associate and developer
  • Chulwoo (Mike) Pack - research associate and developer
  • Elizabeth Lorang - senior adviser
  • Leen-Kiat Soh - senior adviser
  • Ashlyn Stewart - research assistant

License

This project is licensed under the GPL License - see the LICENSE file for details

About

No description, website, or topics provided.

Resources

License

Releases

No releases published

Packages

No packages published
You can’t perform that action at this time.