The AI Method Evaluation Ontology models assessments of an AI Method, such as its accuracy or F1 score.
An assessment activity is performed to evaluate the performance of an AI Method, using a metric that defines how the assessment should be carried out. The outcome of the assessment is captured in the Result concept. The metric measures some aspect of the AI Method, such as accuracy, precision, recall, or completeness. The assessment is performed by an Agent, either software or human.
The PROV-O properties prov:startedAtTime and prov:endedAtTime are used to record when the assessment took place.
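A minimal instance sketch in Turtle may help illustrate this pattern. The ex: names are hypothetical, and the use of prov:used, prov:generated, and prov:wasAssociatedWith to link the metric, result, and agent is an assumption based on the description above, not normative usage defined by this ontology; only prov:startedAtTime and prov:endedAtTime are stated explicitly.

```turtle
@prefix aieval: <http://www.w3id.org/iSeeOnto/aimodelevaluation#> .
@prefix prov:   <http://www.w3.org/ns/prov#> .
@prefix xsd:    <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:     <http://example.org/> .

ex:assessment1 a aieval:AIModelAssessment ;
    prov:used              ex:accuracyMetric ;  # metric guiding the assessment
    prov:wasAssociatedWith ex:evaluator ;       # software or human agent
    prov:generated         ex:result1 ;         # the Result of the assessment
    prov:startedAtTime     "2023-01-01T09:00:00Z"^^xsd:dateTime ;
    prov:endedAtTime       "2023-01-01T09:05:00Z"^^xsd:dateTime .

ex:accuracyMetric a aieval:Accuracy .
```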
This pattern is based on Qual-O, defined in: C. Baillie, P. Edwards, and E. Pignotti. 2015. QUAL: A Provenance-Aware Quality Model. Journal of Data and Information Quality 5(3), Article 12 (February 2015). DOI: https://doi.org/10.1145/2700413
This ontology was created as part of the iSee project (https://isee4xai.com), which received funding from EPSRC under grant number EP/V061755/1. iSee is part of the CHIST-ERA pathfinder programme for European coordinated research on future and emerging information and communication technologies.
| Prefix | Namespace |
| ------ | --------- |
| aieval | <https://www.w3id.org/iSeeOnto/aimethodevaluation> |
| schema | <http://schema.org> |
| owl | <http://www.w3.org/2002/07/owl> |
| xsd | <http://www.w3.org/2001/XMLSchema> |
| skos | <http://www.w3.org/2004/02/skos/core> |
| rdfs | <http://www.w3.org/2000/01/rdf-schema> |
| cito | <http://purl.org/spar/cito> |
| prov-o | <http://www.w3.org/TR/prov-o> |
| terms | <http://purl.org/dc/terms> |
| xml | <http://www.w3.org/XML/1998/namespace> |
| vann | <http://purl.org/vocab/vann> |
| aimodel | <http://www.w3id.org/iSeeOnto/aimodel> |
| prov | <http://www.w3.org/ns/prov> |
| foaf | <http://xmlns.com/foaf/0.1> |
| void | <http://rdfs.org/ns/void> |
| resource | <http://semanticscience.org/resource> |
| Qual-O | <http://sensornet.abdn.ac.uk/onts/Qual-O> |
| protege | <http://protege.stanford.edu/plugins/owl/protege> |
| cpannotationschema | <http://www.ontologydesignpatterns.org/schemas/cpannotationschema.owl> |
| eo | <https://purl.org/heals/eo> |
| core | <http://purl.org/vocab/frbr/core> |
| rdf | <http://www.w3.org/1999/02/22-rdf-syntax-ns> |
| aieval | <http://www.w3id.org/iSeeOnto/aimodelevaluation> |
| obo | <http://purl.obolibrary.org/obo> |
| dc | <http://purl.org/dc/elements/1.1> |
This ontology has the following classes and properties.
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#AIModelAssessment
The activity performed to assess an AI Model, guided by a metric, to generate a Result. The assessment can be associated with the agent (e.g. a User) that performed it.
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#AIModelAssessmentDimension
The dimension, such as Accuracy, Precision, Recall, etc., that an evaluation assessed.
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#AIModelAssessmentMetric
The criteria used to guide the assessment of an AI Model and determine the result.
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#Accuracy
Accuracy is how close a given set of measurements (observations or readings) is to the true value.
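For a binary classifier, this is commonly computed from the confusion-matrix counts (an assumption about usage, since the entry above is stated in terms of measurements generally):

\[ \mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]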
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#Adjusted_Rand_Index
The adjusted Rand index is the corrected-for-chance version of the Rand index.
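In standard notation, with RI the Rand index and E[RI] its expected value under random labelling:

\[ \mathrm{ARI} = \frac{\mathrm{RI} - \mathbb{E}[\mathrm{RI}]}{\max(\mathrm{RI}) - \mathbb{E}[\mathrm{RI}]} \]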
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#AU-ROC
The area under the receiver operating characteristic (ROC) curve. The ROC curve is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied; the area under it summarizes that ability as a single value.
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#BLEU
BLEU (bilingual evaluation understudy) is an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another.
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#Brier_Score
The Brier Score is a strictly proper score function or strictly proper scoring rule that measures the accuracy of probabilistic predictions. For unidimensional predictions, it is strictly equivalent to the mean squared error as applied to predicted probabilities.
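For N probabilistic forecasts \(f_t\) of binary outcomes \(o_t \in \{0, 1\}\), the standard definition is:

\[ \mathrm{BS} = \frac{1}{N}\sum_{t=1}^{N}(f_t - o_t)^2 \]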
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#Calinski-Harabasz_Index
The index is the ratio of the sum of between-clusters dispersion and of within-cluster dispersion for all clusters (where dispersion is defined as the sum of distances squared).
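With \(B_k\) and \(W_k\) the between- and within-cluster dispersion matrices, k clusters, and n points, this is usually written:

\[ \mathrm{CH} = \frac{\operatorname{tr}(B_k)/(k-1)}{\operatorname{tr}(W_k)/(n-k)} \]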
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#Cohens_Kappa_Coefficient
Cohen's kappa coefficient (κ, lowercase Greek kappa) is a statistic that is used to measure inter-rater reliability (and also intra-rater reliability) for qualitative (categorical) items.
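With \(p_o\) the observed agreement between raters and \(p_e\) the agreement expected by chance:

\[ \kappa = \frac{p_o - p_e}{1 - p_e} \]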
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#Coverage
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#Data_Quality
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#Davies-Bouldin_Index
The Davies–Bouldin index (DBI), introduced by David L. Davies and Donald W. Bouldin in 1979, is a metric for evaluating clustering algorithms. This is an internal evaluation scheme, where the validation of how well the clustering has been done is made using quantities and features inherent to the dataset.
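With \(c_i\) the centroid of cluster i, \(\sigma_i\) the average distance of its members to \(c_i\), and \(d(c_i, c_j)\) the distance between centroids:

\[ \mathrm{DBI} = \frac{1}{k}\sum_{i=1}^{k}\max_{j \neq i}\frac{\sigma_i + \sigma_j}{d(c_i, c_j)} \]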
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#Dice_Index
The Sørensen–Dice coefficient is a statistic used to gauge the similarity of two samples.
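For two sets A and B:

\[ \mathrm{DSC} = \frac{2\,|A \cap B|}{|A| + |B|} \]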
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#Discounted_cumulative_gain
Discounted cumulative gain (DCG) is a measure of ranking quality.
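With \(rel_i\) the graded relevance of the result at rank i, the gain accumulated up to rank p is:

\[ \mathrm{DCG}_p = \sum_{i=1}^{p}\frac{rel_i}{\log_2(i+1)} \]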
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#Diversity
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#Dunn_Index
The Dunn index (DI) (introduced by J. C. Dunn in 1974) is a metric for evaluating clustering algorithms.
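With \(\delta(C_i, C_j)\) the inter-cluster distance and \(\Delta_m\) the diameter of cluster m:

\[ \mathrm{DI} = \frac{\min_{1 \le i < j \le k}\delta(C_i, C_j)}{\max_{1 \le m \le k}\Delta_m} \]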
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#F1-score_(macro)
F-score or F-measure (macro) is a measure of a test's accuracy calculated from macro-averaging (taking all classes as equally important) the precision and recall of the test.
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#F1-score_(micro)
F-score or F-measure (micro) is a measure of a test's accuracy calculated from micro-averaging (biased by class frequency) the precision and recall of the test.
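Both variants combine precision P and recall R through the same harmonic mean; macro-averaging averages the per-class scores, while micro-averaging pools the per-class counts first:

\[ F_1 = \frac{2PR}{P + R}, \qquad F_1^{\text{macro}} = \frac{1}{C}\sum_{c=1}^{C}F_1^{(c)} \]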
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#Fowlkes–Mallows_index
The Fowlkes–Mallows index is an external evaluation method that is used to determine the similarity between two clusterings (clusters obtained after a clustering algorithm), and also a metric to measure confusion matrices.
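Counting pairs of points, with TP the pairs grouped together in both clusterings, FP those together only in the first, and FN those together only in the second:

\[ \mathrm{FM} = \sqrt{\frac{TP}{TP + FP}\cdot\frac{TP}{TP + FN}} \]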
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#Hamming_Loss
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#Hopkins_statistic
The Hopkins statistic (introduced by Brian Hopkins and John Gordon Skellam) is a way of measuring the cluster tendency of a data set.
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#Inference_Speed
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#Jaccard_Score
The Jaccard index, also known as the Jaccard similarity coefficient, is a statistic used for gauging the similarity and diversity of sample sets.
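For finite sets A and B, it is the ratio of intersection to union:

\[ J(A, B) = \frac{|A \cap B|}{|A \cup B|} \]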
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#Mathews_Correlation_Coefficient
Matthews correlation coefficient (MCC) is used as a measure of the quality of binary (two-class) classifications.
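From the binary confusion matrix:

\[ \mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}} \]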
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#Mean_Absolute_Error
Mean absolute error (MAE) is a measure of errors between paired observations expressing the same phenomenon as the sum of absolute errors divided by the sample size.
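For n paired observations \(y_i\) and predictions \(\hat{y}_i\):

\[ \mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right| \]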
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#Mean_Squared_Error
Mean squared error (MSE) or mean squared deviation (MSD) of an estimator (i.e. of a procedure for estimating an unobserved quantity) measures the average of the squares of the errors, that is, the average squared difference between the estimated values and the actual value.
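Using the same notation as for MAE:

\[ \mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 \]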
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#METEOR
METEOR (Metric for Evaluation of Translation with Explicit ORdering) is a metric for the evaluation of machine translation output based on the harmonic mean of unigram precision and recall, with recall weighted higher than precision.
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#Mutual_Information
The mutual information (MI) of two random variables is a measure of the mutual dependence between the two variables.
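For discrete random variables X and Y with joint distribution p(x, y):

\[ I(X;Y) = \sum_{x}\sum_{y} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)} \]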
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#Network_Usage
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#NIST
NIST is a method based on the BLEU metric for evaluating the quality of text which has been translated using machine translation.
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#PredictivePerformance
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#Perplexity
Perplexity is a measurement of how well a probability distribution or probability model predicts a sample.
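For a model q evaluated on a held-out sample \(x_1, \dots, x_N\), a common formulation is the exponentiated average negative log-likelihood:

\[ \mathrm{PP}(q) = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\ln q(x_i)\right) \]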
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#Precision
Precision (positive predictive value) is the fraction of relevant instances among the retrieved instances.
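In confusion-matrix terms:

\[ \mathrm{Precision} = \frac{TP}{TP + FP} \]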
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#Purity
Purity is a measure of the extent to which clusters contain a single class.
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#R_squared
R² (coefficient of determination) is the proportion of the variation in the dependent variable that is predictable from the independent variable(s).
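With the numerator the residual sum of squares and the denominator the total sum of squares:

\[ R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2} \]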
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#Rand_Index
The Rand index or Rand measure (named after William M. Rand) in statistics, and in particular in data clustering, is a measure of the similarity between two data clusterings.
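With a the number of pairs placed in the same cluster in both clusterings and b the number placed in different clusters in both, out of n points:

\[ \mathrm{RI} = \frac{a + b}{\binom{n}{2}} \]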
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#Recall
Recall (sensitivity) is the fraction of relevant instances that were retrieved.
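In confusion-matrix terms:

\[ \mathrm{Recall} = \frac{TP}{TP + FN} \]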
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#Recommender_persistence
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#Robustness
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#Root_Mean_Squared_Error
The root-mean-square deviation (RMSD) or root-mean-square error (RMSE) is a frequently used measure of the differences between values (sample or population values) predicted by a model or an estimator and the values observed.
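It is the square root of the mean squared error:

\[ \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2} \]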
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#ROUGE
ROUGE, or Recall-Oriented Understudy for Gisting Evaluation, is a set of metrics and a software package used for evaluating automatic summarization and machine translation software in natural language processing.
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#Serendipity
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#Silhouette_Score
Silhouette refers to a method of interpretation and validation of consistency within clusters of data. The technique provides a succinct graphical representation of how well each object has been classified.
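For an object i, with a(i) its mean distance to the other members of its own cluster and b(i) its mean distance to the members of the nearest other cluster:

\[ s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}} \]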
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#Speed
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#Stability
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#Training_Speed
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#True_Negative_Rate
Specificity (true negative rate) refers to the probability of a negative test, conditioned on truly being negative.
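Equivalently, in confusion-matrix terms:

\[ \mathrm{TNR} = \frac{TN}{TN + FP} \]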
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#WER
Word error rate (WER) is a common metric of the performance of a speech recognition or machine translation system.
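With S substitutions, D deletions, and I insertions against a reference of N words:

\[ \mathrm{WER} = \frac{S + D + I}{N} \]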
IRI: http://www.w3id.org/iSeeOnto/aimodelevaluation#Youden's_J_statistic
Youden's J statistic (also called Youden's index) is a single statistic that captures the performance of a dichotomous diagnostic test.
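It combines the two conditional rates:

\[ J = \text{sensitivity} + \text{specificity} - 1 \]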
The authors would like to thank Silvio Peroni for developing LODE, a Live OWL Documentation Environment, which is used for representing the Cross Referencing Section of this document, and Daniel Garijo for developing Widoco, the program used to create the template used in this documentation.