The AI Method Evaluation Ontology models assessments of an AI Method, such as its accuracy or F1 score.
An assessment activity is performed to evaluate the performance of an AI Method, using a metric that defines how the assessment should be carried out. The outcome of the assessment is captured in the Result concept. The metric measures some aspect of the AI Method, such as accuracy, precision, recall, or completeness. The assessment is performed by an Agent, either software or human.
The PROV-O properties prov:startedAtTime and prov:endedAtTime are used to record when the assessment took place.
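A minimal instance sketch in Turtle may help illustrate this pattern. The ex: names are hypothetical, and the use of prov:used, prov:generated, and prov:wasAssociatedWith to link the metric, result, and agent is an assumption based on the description above, not normative usage defined by this ontology; only prov:startedAtTime and prov:endedAtTime are stated explicitly.

```turtle
@prefix aieval: <http://www.w3id.org/iSeeOnto/aimodelevaluation#> .
@prefix prov:   <http://www.w3.org/ns/prov#> .
@prefix xsd:    <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:     <http://example.org/> .

ex:assessment1 a aieval:AIModelAssessment ;
    prov:used              ex:accuracyMetric ;  # metric guiding the assessment
    prov:wasAssociatedWith ex:evaluator ;       # software or human agent
    prov:generated         ex:result1 ;         # the Result of the assessment
    prov:startedAtTime     "2023-01-01T09:00:00Z"^^xsd:dateTime ;
    prov:endedAtTime       "2023-01-01T09:05:00Z"^^xsd:dateTime .

ex:accuracyMetric a aieval:Accuracy .
```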
This pattern is based on Qual-O, defined in: C. Baillie, P. Edwards, and E. Pignotti. 2015. QUAL: A Provenance-Aware Quality Model. Journal of Data and Information Quality 5(3), Article 12 (February 2015). DOI: https://doi.org/10.1145/2700413
This ontology was created as part of the iSee project (https://isee4xai.com), which received funding from EPSRC under grant number EP/V061755/1. iSee is part of the CHIST-ERA pathfinder programme for European coordinated research on future and emerging information and communication technologies.
| Prefix | Namespace |
| ------ | --------- |
| aieval | <https://www.w3id.org/iSeeOnto/aimethodevaluation> |
| schema | <http://schema.org> |
| owl | <http://www.w3.org/2002/07/owl> |
| xsd | <http://www.w3.org/2001/XMLSchema> |
| skos | <http://www.w3.org/2004/02/skos/core> |
| rdfs | <http://www.w3.org/2000/01/rdf-schema> |
| cito | <http://purl.org/spar/cito> |
| prov-o | <http://www.w3.org/TR/prov-o> |
| terms | <http://purl.org/dc/terms> |
| xml | <http://www.w3.org/XML/1998/namespace> |
| vann | <http://purl.org/vocab/vann> |
| aimodel | <http://www.w3id.org/iSeeOnto/aimodel> |
| prov | <http://www.w3.org/ns/prov> |
| foaf | <http://xmlns.com/foaf/0.1> |
| void | <http://rdfs.org/ns/void> |
| resource | <http://semanticscience.org/resource> |
| Qual-O | <http://sensornet.abdn.ac.uk/onts/Qual-O> |
| protege | <http://protege.stanford.edu/plugins/owl/protege> |
| cpannotationschema | <http://www.ontologydesignpatterns.org/schemas/cpannotationschema.owl> |
| eo | <https://purl.org/heals/eo> |
| core | <http://purl.org/vocab/frbr/core> |
| rdf | <http://www.w3.org/1999/02/22-rdf-syntax-ns> |
| aieval | <http://www.w3id.org/iSeeOnto/aimodelevaluation> |
| obo | <http://purl.obolibrary.org/obo> |
| dc | <http://purl.org/dc/elements/1.1> |
This ontology has the following classes and properties.
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#AIModelAssessment
The activity performed to assess an AI Model, guided by a metric, to generate a Result. The assessment can be associated with the agent (e.g. a User) that performed it.
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#AIModelAssessmentDimension
The dimension, such as Accuracy, Precision, Recall, etc., that an evaluation assessed.
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#AIModelAssessmentMetric
The criteria used to guide the assessment of an AI Model and determine the result.
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#Accuracy
Accuracy is how close a given set of measurements (observations or readings) is to the true value.
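For a binary classifier, this is commonly computed from the confusion-matrix counts (an assumption about usage, since the entry above is stated in terms of measurements generally):

\[ \mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]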
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#Adjusted_Rand_Index
The adjusted Rand index is the corrected-for-chance version of the Rand index.
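In standard notation, with RI the Rand index and E[RI] its expected value under random labelling:

\[ \mathrm{ARI} = \frac{\mathrm{RI} - \mathbb{E}[\mathrm{RI}]}{\max(\mathrm{RI}) - \mathbb{E}[\mathrm{RI}]} \]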
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#AU-ROC
The area under the receiver operating characteristic (ROC) curve. The ROC curve is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied; the area under it summarizes that ability as a single value.
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#BLEU
BLEU (bilingual evaluation understudy) is an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another.
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#Brier_Score
The Brier Score is a strictly proper score function or strictly proper scoring rule that measures the accuracy of probabilistic predictions. For unidimensional predictions, it is strictly equivalent to the mean squared error as applied to predicted probabilities.
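For N probabilistic forecasts \(f_t\) of binary outcomes \(o_t \in \{0, 1\}\), the standard definition is:

\[ \mathrm{BS} = \frac{1}{N}\sum_{t=1}^{N}(f_t - o_t)^2 \]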
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#Calinski-Harabasz_Index
The index is the ratio of the sum of between-clusters dispersion and of within-cluster dispersion for all clusters (where dispersion is defined as the sum of distances squared).
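With \(B_k\) and \(W_k\) the between- and within-cluster dispersion matrices, k clusters, and n points, this is usually written:

\[ \mathrm{CH} = \frac{\operatorname{tr}(B_k)/(k-1)}{\operatorname{tr}(W_k)/(n-k)} \]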
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#Cohens_Kappa_Coefficient
Cohen's kappa coefficient (κ, lowercase Greek kappa) is a statistic that is used to measure inter-rater reliability (and also intra-rater reliability) for qualitative (categorical) items.
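With \(p_o\) the observed agreement between raters and \(p_e\) the agreement expected by chance:

\[ \kappa = \frac{p_o - p_e}{1 - p_e} \]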
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#Coverage
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#Data_Quality
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#Davies-Bouldin_Index
The Davies–Bouldin index (DBI), introduced by David L. Davies and Donald W. Bouldin in 1979, is a metric for evaluating clustering algorithms. This is an internal evaluation scheme, where the validation of how well the clustering has been done is made using quantities and features inherent to the dataset.
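With \(c_i\) the centroid of cluster i, \(\sigma_i\) the average distance of its members to \(c_i\), and \(d(c_i, c_j)\) the distance between centroids:

\[ \mathrm{DBI} = \frac{1}{k}\sum_{i=1}^{k}\max_{j \neq i}\frac{\sigma_i + \sigma_j}{d(c_i, c_j)} \]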
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#Dice_Index
The Sørensen–Dice coefficient is a statistic used to gauge the similarity of two samples.
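For two sets A and B:

\[ \mathrm{DSC} = \frac{2\,|A \cap B|}{|A| + |B|} \]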
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#Discounted_cumulative_gain
Discounted cumulative gain (DCG) is a measure of ranking quality.
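With \(rel_i\) the graded relevance of the result at rank i, the gain accumulated up to rank p is:

\[ \mathrm{DCG}_p = \sum_{i=1}^{p}\frac{rel_i}{\log_2(i+1)} \]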
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#Diversity
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#Dunn_Index
The Dunn index (DI) (introduced by J. C. Dunn in 1974) is a metric for evaluating clustering algorithms.
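With \(\delta(C_i, C_j)\) the inter-cluster distance and \(\Delta_m\) the diameter of cluster m:

\[ \mathrm{DI} = \frac{\min_{1 \le i < j \le k}\delta(C_i, C_j)}{\max_{1 \le m \le k}\Delta_m} \]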
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#F1-score_(macro)
F-score or F-measure (macro) is a measure of a test's accuracy calculated from macro-averaging (taking all classes as equally important) the precision and recall of the test.
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#F1-score_(micro)
F-score or F-measure (micro) is a measure of a test's accuracy calculated from micro-averaging (biased by class frequency) the precision and recall of the test.
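Both variants combine precision P and recall R through the same harmonic mean; macro-averaging averages the per-class scores, while micro-averaging pools the per-class counts first:

\[ F_1 = \frac{2PR}{P + R}, \qquad F_1^{\text{macro}} = \frac{1}{C}\sum_{c=1}^{C}F_1^{(c)} \]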
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#Fowlkes–Mallows_index
The Fowlkes–Mallows index is an external evaluation method that is used to determine the similarity between two clusterings (clusters obtained after a clustering algorithm), and also a metric to measure confusion matrices.
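Counting pairs of points, with TP the pairs grouped together in both clusterings, FP those together only in the first, and FN those together only in the second:

\[ \mathrm{FM} = \sqrt{\frac{TP}{TP + FP}\cdot\frac{TP}{TP + FN}} \]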
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#Hamming_Loss
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#Hopkins_statistic
The Hopkins statistic (introduced by Brian Hopkins and John Gordon Skellam) is a way of measuring the cluster tendency of a data set.
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#Inference_Speed
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#Jaccard_Score
The Jaccard index, also known as the Jaccard similarity coefficient, is a statistic used for gauging the similarity and diversity of sample sets.
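For finite sets A and B, it is the ratio of intersection to union:

\[ J(A, B) = \frac{|A \cap B|}{|A \cup B|} \]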
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#Mathews_Correlation_Coefficient
Matthews correlation coefficient (MCC) is used as a measure of the quality of binary (two-class) classifications.
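From the binary confusion matrix:

\[ \mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}} \]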
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#Mean_Absolute_Error
Mean absolute error (MAE) is a measure of errors between paired observations expressing the same phenomenon as the sum of absolute errors divided by the sample size.
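For n paired observations \(y_i\) and predictions \(\hat{y}_i\):

\[ \mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right| \]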
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#Mean_Squared_Error
Mean squared error (MSE) or mean squared deviation (MSD) of an estimator (i.e. of a procedure for estimating an unobserved quantity) measures the average of the squares of the errors, that is, the average squared difference between the estimated values and the actual value.
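Using the same notation as for MAE:

\[ \mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 \]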
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#METEOR
METEOR (Metric for Evaluation of Translation with Explicit ORdering) is a metric for the evaluation of machine translation output based on the harmonic mean of unigram precision and recall, with recall weighted higher than precision.
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#Mutual_Information
The mutual information (MI) of two random variables is a measure of the mutual dependence between the two variables.
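For discrete random variables X and Y with joint distribution p(x, y):

\[ I(X;Y) = \sum_{x}\sum_{y} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)} \]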
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#Network_Usage
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#NIST
NIST is a method based on the BLEU metric for evaluating the quality of text which has been translated using machine translation.
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#PredictivePerformance
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#Perplexity
Perplexity is a measurement of how well a probability distribution or probability model predicts a sample.
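For a model q evaluated on a held-out sample \(x_1, \dots, x_N\), a common formulation is the exponentiated average negative log-likelihood:

\[ \mathrm{PP}(q) = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\ln q(x_i)\right) \]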
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#Precision
Precision (positive predictive value) is the fraction of relevant instances among the retrieved instances.
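In confusion-matrix terms:

\[ \mathrm{Precision} = \frac{TP}{TP + FP} \]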
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#Purity
Purity is a measure of the extent to which clusters contain a single class.
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#R_squared
R² (coefficient of determination) is the proportion of the variation in the dependent variable that is predictable from the independent variable(s).
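With the numerator the residual sum of squares and the denominator the total sum of squares:

\[ R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2} \]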
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#Rand_Index
The Rand index or Rand measure (named after William M. Rand) in statistics, and in particular in data clustering, is a measure of the similarity between two data clusterings.
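With a the number of pairs placed in the same cluster in both clusterings and b the number placed in different clusters in both, out of n points:

\[ \mathrm{RI} = \frac{a + b}{\binom{n}{2}} \]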
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#Recall
Recall (sensitivity) is the fraction of relevant instances that were retrieved.
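In confusion-matrix terms:

\[ \mathrm{Recall} = \frac{TP}{TP + FN} \]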
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#Recommender_persistence
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#Robustness
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#Root_Mean_Squared_Error
The root-mean-square deviation (RMSD) or root-mean-square error (RMSE) is a frequently used measure of the differences between values (sample or population values) predicted by a model or an estimator and the values observed.
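It is the square root of the mean squared error:

\[ \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2} \]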
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#ROUGE
ROUGE, or Recall-Oriented Understudy for Gisting Evaluation, is a set of metrics and a software package used for evaluating automatic summarization and machine translation software in natural language processing.
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#Serendipity
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#Silhouette_Score
Silhouette refers to a method of interpretation and validation of consistency within clusters of data. The technique provides a succinct graphical representation of how well each object has been classified.
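For an object i, with a(i) its mean distance to the other members of its own cluster and b(i) its mean distance to the members of the nearest other cluster:

\[ s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}} \]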
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#Speed
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#Stability
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#Training_Speed
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#True_Negative_Rate
Specificity (true negative rate) refers to the probability of a negative test, conditioned on truly being negative.
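Equivalently, in confusion-matrix terms:

\[ \mathrm{TNR} = \frac{TN}{TN + FP} \]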
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#WER
Word error rate (WER) is a common metric of the performance of a speech recognition or machine translation system.
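With S substitutions, D deletions, and I insertions against a reference of N words:

\[ \mathrm{WER} = \frac{S + D + I}{N} \]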
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#Youden's_J_statistic
Youden's J statistic (also called Youden's index) is a single statistic that captures the performance of a dichotomous diagnostic test.
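It combines the two conditional rates:

\[ J = \text{sensitivity} + \text{specificity} - 1 \]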
The authors would like to thank Silvio Peroni for developing LODE, a Live OWL Documentation Environment, which is used for representing the Cross Referencing Section of this document, and Daniel Garijo for developing Widoco, the program used to create the template used in this documentation.