Abnormity Monitoring for National Nuclear Incidents in the United States

Project primarily funded by NSF (National Science Foundation)

MOTIVATION

This project is part of the Nuclear Incident Visualizer at the School of Computing and Information, University of Pittsburgh. The visualizer aims to display incidents and raise timely alerts when suspicious incidents are reported from different data sources. However, it is hard to define "suspicious" without human judgment, so the system cannot yet raise alarms automatically.

The United States Nuclear Regulatory Commission (NRC) records and updates nuclear incidents daily on its website. Each incident has an emergency class and a short event text describing it. In this project we train a classifier on the event texts to predict whether an incident is an emergency. Ideally, this model would issue an alert or notification as soon as the description of an incident is available.

Crawled data from the NRC website with Python

Cleaned the dataset to extract clean event text

Tokenized the text into a corpus and a document-term matrix to generate features

Selected features by TF-IDF score, BNS score, and chi-square statistic

Applied SVM and one-class SVM models to classify the emergency level of incidents

Integrated the machine classification results into the visualization system (in progress)
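The tokenization step above can be illustrated with a minimal sketch. It assumes simple lowercase alphabetic tokens and raw term counts; the project's actual tokenizer, stop-word handling, and weighting are not described here, and the function names and sample texts below are made up for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def document_term_matrix(documents):
    """Return (vocabulary, matrix); matrix[i][j] counts term j in document i."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix

# Made-up event texts, not taken from the NRC data.
docs = [
    "Reactor scram due to turbine trip",
    "Lost source reported; source recovered",
]
vocab, dtm = document_term_matrix(docs)
```

Each row of `dtm` is one event text and each column is one vocabulary term, which is the shape the later feature-selection and classification steps operate on.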

ACHIEVEMENT FLOW
DATASET

NRC FEED DATA

  • NRC provides a daily RSS feed on its website listing current nuclear incidents

  • Time Period: 12/28/04 -- 01/02/17

  • Events count: 10152

  • Emergency type distribution:

    • Non-Emergency: 9786

    • Emergency: 366

  • Columns: id, event_number, event_type, emergency_class, notification_date, notification_time, event_date, event_time, state, event_text

  • Events labeled “emergency” are sparse (ratio: 366 / 10152 ≈ 3.61%)

  • Most events are not necessarily relevant to incident monitoring
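The class imbalance can be checked directly from the counts listed above. The sketch below rebuilds a label list from those two counts only (it does not load the real feed) and uses a hypothetical helper name.

```python
from collections import Counter

def class_distribution(labels):
    """Map each label to (count, share of all events)."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in counts.items()}

# Label list reconstructed from the counts reported for the NRC feed.
labels = ["Non-Emergency"] * 9786 + ["Emergency"] * 366
dist = class_distribution(labels)

for label, (n, share) in dist.items():
    print(f"{label}: {n} events ({share:.2%})")
```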

NRC FEED DATA (WITH HUMAN ANNOTATION)

  • To address the problems with the NRC data noted above, a team from GSPIA annotated a random sample of the NRC data

  • Events count: 1000

  • Label distribution:

    • Non-relevant: 850

    • Relevant: 150

  • Positive label ratio: 15%

CLASSIFICATION

FEATURE SELECTION

  • The corpus contains 10393 features in total

  • TF-IDF (Term Frequency–Inverse Document Frequency)

  • BNS (Bi-Normal Separation)

  • For each method, the top 1000 features are used for classification
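A minimal sketch of score-based feature selection follows. BNS is computed as |Φ⁻¹(tpr) − Φ⁻¹(fpr)|, with both rates clipped away from 0 and 1; the clipping constant 0.0005 is an assumption. How the project turned TF-IDF into a single per-term score is not stated, so summing tf × idf over documents is only one possible choice. All function names and the toy matrix are illustrative.

```python
import math
from statistics import NormalDist

def bns_score(tp, fp, pos, neg, eps=0.0005):
    """Bi-Normal Separation for one term; eps is an assumed clipping value."""
    inv = NormalDist().inv_cdf
    tpr = min(max(tp / pos, eps), 1 - eps)
    fpr = min(max(fp / neg, eps), 1 - eps)
    return abs(inv(tpr) - inv(fpr))

def bns_for_terms(matrix, labels):
    """BNS for every column of a document-term matrix (labels are 0 or 1)."""
    pos = sum(labels)
    neg = len(labels) - pos
    scores = []
    for j in range(len(matrix[0])):
        tp = sum(1 for row, y in zip(matrix, labels) if y == 1 and row[j] > 0)
        fp = sum(1 for row, y in zip(matrix, labels) if y == 0 and row[j] > 0)
        scores.append(bns_score(tp, fp, pos, neg))
    return scores

def tfidf_scores(matrix):
    """Per-term sum of tf * idf over documents (an assumed aggregation)."""
    n_docs = len(matrix)
    scores = []
    for j in range(len(matrix[0])):
        df = sum(1 for row in matrix if row[j] > 0)
        idf = math.log(n_docs / df) if df else 0.0
        scores.append(sum(row[j] for row in matrix) * idf)
    return scores

def top_k(scores, k):
    """Indices of the k highest-scoring features."""
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]

# Toy document-term matrix: 4 documents, 3 terms; first two documents are positive.
matrix = [[2, 1, 0], [1, 1, 0], [0, 1, 1], [0, 1, 0]]
labels = [1, 1, 0, 0]
bns = bns_for_terms(matrix, labels)
tfidf = tfidf_scores(matrix)
selected = top_k(bns, 2)
```

In the toy data the middle term occurs in every document, so both scores rank it last. With the real corpus, `top_k(scores, 1000)` would correspond to the top-1000 selection described above.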

CLASSIFIERS

  • SVM

  • One-class SVM
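The kernel, library, and hyperparameters used for these classifiers are not given. As a dependency-free illustration of the two-class case only, the sketch below trains a linear SVM by subgradient descent on the L2-regularized hinge loss. The learning rate, regularization strength, epoch count, and toy data are all assumptions, and the one-class variant is not shown.

```python
def train_linear_svm(X, y, lam=0.01, eta=0.1, epochs=200):
    """Subgradient descent on the regularized hinge loss; y values are -1 or +1."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            # Gradient step for the L2 regularizer: shrink the weights.
            w = [(1 - eta * lam) * wj for wj in w]
            if margin < 1:
                # Point is inside the margin or misclassified: move toward it.
                w = [wj + eta * yi * xj for wj, xj in zip(w, xi)]
                b += eta * yi
    return w, b

def predict(w, b, x):
    """Return +1 or -1 depending on the side of the separating hyperplane."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if score >= 0 else -1

# Toy, linearly separable data (not NRC features).
X = [[2.0, 0.0], [3.0, 0.5], [-2.0, 0.0], [-3.0, -0.5]]
y = [1, 1, -1, -1]
w, b = train_linear_svm(X, y)
predictions = [predict(w, b, x) for x in X]
```

For real experiments a maintained SVM implementation would be the more likely choice; this sketch only shows what the decision function and the hinge-loss update look like.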

EVALUATION METRICS

  • 10-fold cross-validation

  • Error, accuracy, precision, recall, F-score, and AUC (with an emphasis on AUC because the classes are imbalanced)
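The metrics above can be sketched in a few lines. The fold splitter below uses contiguous, unshuffled folds, which is an assumption (the project may have shuffled or stratified), and AUC is computed as the share of positive/negative pairs ranked correctly, with ties counted as one half. The labels and scores are made up.

```python
def k_fold_indices(n, k=10):
    """Yield (train, test) index lists for k contiguous folds over range(n)."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size

def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

def auc(y_true, scores):
    """Share of (positive, negative) pairs where the positive scores higher."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels and classifier scores for illustration.
y_true = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
area = auc(y_true, scores)
folds = list(k_fold_indices(1000, k=10))
```

The emphasis on AUC follows from the label counts: a classifier that always predicts "Non-Emergency" on the NRC feed would reach 9786 / 10152, or about 96.4% accuracy, while detecting no emergencies at all.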

PERFORMANCE

© by YUE SU