Abnormity Monitoring for National Nuclear Incidents in the United States
Project primarily funded by NSF (National Science Foundation)

MOTIVATION
This project is part of the Nuclear Incident Visualizer at the School of Computing and Information, University of Pittsburgh, which aims to visualize suspicious incidents and issue timely alerts when they are reported from different data sources. However, it is hard to define "suspicious" without human effort, so the system cannot raise alarms automatically.
The United States Nuclear Regulatory Commission (NRC) records and updates nuclear incidents daily on its website. Each incident carries an emergency class and a short event text describing it. In this project we train a classifier on these event texts to predict whether an incident is an emergency. Ideally, the model can then issue an alert or notification as soon as the description of an incident arrives.
- Crawled data from the NRC website with Python
- Cleaned the dataset to obtain clear event text
- Tokenized the text into a corpus and document-term matrix to generate features
- Selected features by TF-IDF score, BNS score, and chi-square
- Applied SVM and one-class SVM models to classify incident emergency level
- Integrated the machine classification results into the visualization system (in progress)
ACHIEVEMENT FLOW
DATASET
NRC FEED DATA
- NRC provides a daily RSS feed on its website listing current nuclear incidents
- Time period: 12/28/04 -- 01/02/17
- Event count: 10152
- Emergency type distribution:
  - Non-Emergency: 9786
  - Emergency: 366
- Columns: id, event_number, event_type, emergency_class, notification_date, notification_time, event_date, event_time, state, event_text
- Events labeled "emergency" are sparse (ratio: 3.06%)
- Most events are not necessarily relevant
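The feed-crawling step can be sketched as a small parser that turns each RSS item into an incident record. This is a minimal sketch: the sample XML below is a structural stand-in, not real NRC data, and the field names (`title`, `description`) are assumptions about the feed layout rather than the NRC's actual schema.

```python
import xml.etree.ElementTree as ET

def parse_nrc_feed(xml_text):
    """Parse an RSS document into a list of incident records.

    Each <item> is reduced to its title, link, and description; the
    description is the short event text used later as classifier input.
    """
    root = ET.fromstring(xml_text)
    records = []
    for item in root.iter("item"):
        records.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "event_text": item.findtext("description", default=""),
        })
    return records

# Tiny stand-in for the real feed (structure only, invented content).
sample = """<rss version="2.0"><channel>
<item>
<title>Event 52013 - Non-Emergency</title>
<link>https://www.nrc.gov/</link>
<description>LICENSEE REPORTED A DEGRADED SAFETY SYSTEM</description>
</item>
</channel></rss>"""

records = parse_nrc_feed(sample)
```

In the real pipeline the XML string would come from an HTTP fetch of the feed URL, and each record would then be cleaned and written to the dataset with the columns listed above.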
NRC FEED DATA (WITH HUMAN ANNOTATION)
- To address the problems with the NRC data mentioned above, we had a team from GSPIA annotate a random sample of the NRC data
- Event count: 1000
- Label distribution:
  - Non-relevant: 850
  - Relevant: 150
- Positive label ratio: 15%
CLASSIFICATION
FEATURE SELECTION
- There are 10393 features in the corpus in total
- TF-IDF (Term Frequency–Inverse Document Frequency)
- BNS (Bi-Normal Separation)
- For each method, we select the top 1000 features for classification
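The BNS scoring step can be sketched as follows. BNS scores a term as |F⁻¹(tpr) − F⁻¹(fpr)|, where F⁻¹ is the inverse standard normal CDF, tpr is the fraction of positive documents containing the term, and fpr the fraction of negatives. The toy documents, labels, and the clipping constant below are illustrative assumptions, not the project's data or settings.

```python
from statistics import NormalDist

def bns_scores(docs, labels):
    """Bi-Normal Separation scores for binary term-presence features.

    docs: list of documents, each a set of terms; labels: list of 0/1.
    Rates are clipped away from 0 and 1 so the inverse normal CDF
    stays finite (a common practical adjustment).
    """
    inv = NormalDist().inv_cdf
    pos = sum(labels)
    neg = len(labels) - pos
    eps = 0.0005
    scores = {}
    for term in set().union(*docs):
        tp = sum(1 for d, y in zip(docs, labels) if y == 1 and term in d)
        fp = sum(1 for d, y in zip(docs, labels) if y == 0 and term in d)
        tpr = min(max(tp / pos, eps), 1 - eps)
        fpr = min(max(fp / neg, eps), 1 - eps)
        scores[term] = abs(inv(tpr) - inv(fpr))
    return scores

# Toy corpus: "fire"/"routine" separate the classes, "leak" does not.
docs = [{"fire", "alarm"}, {"fire", "leak"}, {"routine", "test"}, {"routine", "leak"}]
labels = [1, 1, 0, 0]
s = bns_scores(docs, labels)
top = sorted(s, key=s.get, reverse=True)[:2]
```

Taking the 1000 highest-scoring terms (here, the top 2 of the toy vocabulary) yields the reduced feature set; the same top-k selection applies to the TF-IDF and chi-square scores.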
CLASSIFIERS
- SVM
- One-class SVM
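The two classifiers can be sketched with scikit-learn on synthetic data. Everything below (the feature matrix, kernel choices, the `nu` value) is an illustrative assumption, not the project's actual configuration; the real inputs would be document-term rows over the selected top-1000 features.

```python
import numpy as np
from sklearn.svm import SVC, OneClassSVM

# Synthetic stand-in for the feature matrix, with the minority class rare.
rng = np.random.default_rng(0)
X_pos = rng.normal(loc=1.0, scale=0.3, size=(20, 5))    # "emergency"-like rows
X_neg = rng.normal(loc=-1.0, scale=0.3, size=(200, 5))  # "non-emergency"-like rows
X = np.vstack([X_pos, X_neg])
y = np.array([1] * 20 + [0] * 200)

# Supervised SVM; class_weight="balanced" counters the skewed label ratio.
svm = SVC(kernel="linear", class_weight="balanced")
svm.fit(X, y)
preds = svm.predict(X)

# One-class SVM: fit on the majority (non-emergency) class only, so that
# points flagged as outliers (-1) become candidate emergencies.
ocsvm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale")
ocsvm.fit(X_neg)
outliers = ocsvm.predict(X_pos)  # -1 marks an outlier w.r.t. non-emergency data
```

The one-class variant is attractive here because emergency labels are so sparse: it needs only the abundant non-emergency examples at training time.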
EVALUATION METRICS
- 10-fold cross-validation
- Error, accuracy, precision, recall, F-score, and AUC (with an emphasis on AUC because of the class imbalance)
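The evaluation loop can be sketched with scikit-learn's cross-validation utilities. The synthetic data below roughly mimics the ~3% positive rate; the classifier settings are the same illustrative assumptions as above, not the project's tuned configuration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.svm import SVC

# Synthetic imbalanced data standing in for the labeled NRC corpus.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(1.0, 1.0, size=(30, 5)),
               rng.normal(-1.0, 1.0, size=(300, 5))])
y = np.array([1] * 30 + [0] * 300)

# Stratified folds keep the positive ratio stable in every split,
# which matters when positives are this rare.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_validate(
    SVC(kernel="linear", class_weight="balanced"),
    X, y, cv=cv,
    scoring=["accuracy", "precision", "recall", "f1", "roc_auc"],
)
mean_auc = scores["test_roc_auc"].mean()
```

AUC is reported from the SVM's decision function per fold; unlike accuracy, it is not inflated by always predicting the majority class, which is why it is the headline metric here.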
PERFORMANCE

