MEL: Metadata Extractor & Loader

What is MEL?

Implements a set of classes and functions to extract metadata (and textual content) from various file formats, as JSON objects.

Core Features

Comprehensive metadata extraction support of various file types/formats.
File attributes extraction: general metadata.
Filetype structure and properties: specific metadata.
Structured data associated to the file: associated metadata.
Content extraction: raw text and binary-to-text conversion.
Analysis of the textual content: pattern matching and keyword extraction.
Input: a file; output: a JSON file with the metadata sets and content.
It can store the result in a document store (by default, CouchDB).
Integrated with The NLP-NER Toolkit for Named Entity Recognition tasks.

Supported File Types/Formats:

.pdf: uses Tesseract-OCR and pdftotext tools.
.docx, .pptx: MSO "core properties".
.doc, .xls, .ppt, .vsd, .mpp: (generalized to OLE 2 files).
.msg: uses Win32 MAPI (Messaging API for Windows) + (OLE 2 file).
.docm: uses a C# (.NET) converter.
.xlsx
.csv
.rtf
.txt | .xml, .html, .htm, .json: (processed as raw text files)
.zip
Images: .jpg, .png, etc.

MEL Architecture

Processing Model: the methods implemented in MEL are generic and can be applied to extract the content and metadata of all supported file types/formats.

Overview Demo

ISWC 2021 - Posters & Demos Track | KGC Tutorial 2022: demo video

TNNT: The NLP-NER Toolkit

What is TNNT?

Implements a pipeline task to automate the extraction of categorised named entities from the unstructured information encoded in the source documents, using diverse Natural Language Processing (NLP) models and tools. TNNT is integrated with MEL.

Core Features

Implements 21 models from 9 NLP tools.
Capability of processing sequentially several blocks of models based on the input settings.
Keeps general processing stats of the models processed.
Generates an integrated summary of all recognised entities from all the processed models.
The results are generated in JSON files (one for each processed model):
- Each model generates the list of categories of the identified entities.
- For each recognised entity, the toolkit retrieves its context information: start index in the document text and sentence.
Hybrid processing data flow supported, either from/to the document store (CouchDB) or via direct processing from files.
All textual content extracted by MEL (with many supported file types/formats such as PDF, DOCX, MSG, and TXT) is processable for the NLP/NER Toolkit.
A built-in RESTful API that provides basic functions to browse the JSON file results and expand/complement/co-relate the NER results by performing other NLP tasks, such as part-of-speech tagging, dependency parsing, co-reference resolution.

Supported NLP-NER tools and models

NLTK.
Stanford NER tagger: class_3, class_4, class_7.
Stanza.
spaCy: en_core_web_sm, en_core_web_md, en_core_web_lg
Allen NLP: ELMo_NER, fine-grained_NER.
Deep Pavlov: standard_onto, bert_onto, standard_conll2003, bert_conll2003.
Polyglot.
Flair: standard, ontonotes, fast, fast_ontonotes, pooled.
Google BERT.

TNNT Architecture

Processing Model: TNNT has been fully integrated with MEL. MEL settings establish the way how TNNT will process some specific block of NER models for the input dataset (either from content stored on CouchDB or from a direct document processing immediately after the metadata extraction). The following diagram presents the toolkit’s processing model: the first two blocks are orchestrated by MEL.

Recognised Categories

From the implemented models, the toolkit can recognised entities from the following categories:

PERSON: People, including fictional.
NORP: Nationalities or religious or political groups.
FAC: Buildings, airports, highways, bridges, etc.
ORG: Companies, agencies, institutions, etc.
GPE: Countries, cities, states.
LOC: Non-GPE locations, mountain ranges, bodies of water.
PRODUCT: Objects, vehicles, foods, etc. (Not services.)
EVENT: Named hurricanes, battles, wars, sports events, etc.
WORK_OF_ART: Titles of books, songs, etc.
LAW: Named documents made into laws.
LANGUAGE: Any named language.
DATE: Absolute or relative dates or periods.
TIME: Times smaller than a day.
PERCENT: Percentage, including “%“.
MONEY: Monetary values, including unit.
QUANTITY: Measurements, as of weight or distance.
ORDINAL: "first", "second", etc.
CARDINAL: Numerals that do not fall under another type.

Overview Demo

K-CAP 2021 | KGC Tutorial 2022: demo video

Contacts

Sergio J. Rodríguez Méndez <Sergio.RodriguezMendez [at] anu.edu.au>

License

{MEL+TNNT} is publicly available under an MIT license, as specified in the LICENSE file.

Name		Name	Last commit message	Last commit date
Latest commit History 47 Commits
code		code
demo		demo
docs		docs
examples		examples
LICENSE		LICENSE
README.md		README.md
desktop.ini		desktop.ini

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

code

code

demo

demo

docs

docs

examples

examples

LICENSE

LICENSE

README.md

README.md

desktop.ini

desktop.ini

Repository files navigation

MEL: Metadata Extractor & Loader

What is MEL?

Core Features

Supported File Types/Formats:

MEL Architecture

Overview Demo

TNNT: The NLP-NER Toolkit

What is TNNT?

Core Features

Supported NLP-NER tools and models

TNNT Architecture

Recognised Categories

Overview Demo

Contacts

License

About

Releases

Packages

Contributors 2

Languages

Navigation Menu

License

KGCP/MEL-TNNT

Folders and files

Latest commit

History

Repository files navigation

MEL: Metadata Extractor & Loader

What is MEL?

Core Features

Supported File Types/Formats:

MEL Architecture

Overview Demo

TNNT: The NLP-NER Toolkit

What is TNNT?

Core Features

Supported NLP-NER tools and models

TNNT Architecture

Recognised Categories

Overview Demo

Contacts

License

About

Topics

Resources

License

Stars

Watchers

Forks

Languages