Ontology learning is the automatic or semi-automatic creation of ontologies, including extracting the corresponding domain's terms from natural language text. As building ontologies manually is extremely labor-intensive and time-consuming, there is great motivation to automate the process.
During semantic annotation, natural language text is augmented with metadata (often represented in RDFa) that makes the semantics of the contained terms machine-understandable. In this process, which is generally semi-automatic, knowledge is extracted in the sense that a link is established between lexical terms and, for example, concepts from ontologies. This captures which meaning of a term was intended in the processed context, so the meaning of the text is grounded in machine-readable data over which inferences can be drawn. Semantic annotation is typically split into the following two subtasks.
At the terminology extraction level, lexical terms are extracted from the text. For this purpose, a tokenizer first determines the word boundaries and resolves abbreviations. Afterwards, terms that correspond to a concept are extracted from the text with the help of a domain-specific lexicon, so that they can be linked during entity linking.
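The terminology extraction step can be illustrated with a minimal sketch. It assumes a regular-expression tokenizer, a small abbreviation table, and a greedy longest-match lookup against a domain lexicon; the names `ABBREVIATIONS`, `LEXICON`, `tokenize`, and `extract_terms` and all sample entries are hypothetical and do not come from any of the tools discussed here.

```python
import re

# Hypothetical abbreviation table and domain lexicon (illustrative entries only).
ABBREVIATIONS = {"nyc": "new york city"}
LEXICON = {"new york city", "new york", "subway"}
MAX_TERM_LEN = max(len(term.split()) for term in LEXICON)


def tokenize(text):
    """Split text into lowercase word tokens and expand known abbreviations."""
    tokens = []
    for token in re.findall(r"\w+", text.lower()):
        # An abbreviation may expand to several tokens.
        tokens.extend(ABBREVIATIONS.get(token, token).split())
    return tokens


def extract_terms(text):
    """Return lexicon terms found in the text, preferring the longest match."""
    tokens = tokenize(text)
    terms = []
    i = 0
    while i < len(tokens):
        # Try the longest n-gram starting at position i first.
        for n in range(min(MAX_TERM_LEN, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in LEXICON:
                terms.append(candidate)
                i += n
                break
        else:
            # No lexicon term starts here; move on to the next token.
            i += 1
    return terms


print(extract_terms("The NYC subway is crowded."))
```

Longest-match lookup is only one possible design; real systems often add part-of-speech filtering or statistical term weighting, which this sketch leaves out.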
In entity linking, a link is established between the lexical terms extracted from the source text and the concepts of an ontology or knowledge base such as DBpedia. For this, candidate concepts matching the several meanings of a term are detected with the help of a lexicon. Finally, the context of each term is analyzed to determine the most appropriate disambiguation and to assign the term to the correct concept.
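The two steps of entity linking, candidate detection and context-based disambiguation, can be sketched as follows. This is a simplified illustration that scores each candidate by keyword overlap with the surrounding text; the `CANDIDATES` table, its keyword sets, and the function `link_entity` are assumptions made for the example, and the DBpedia-style URIs serve only as sample identifiers.

```python
import re

# Hypothetical candidate lexicon: surface term -> {concept URI: context keywords}.
CANDIDATES = {
    "jaguar": {
        "http://dbpedia.org/resource/Jaguar": {"cat", "animal", "jungle", "prey"},
        "http://dbpedia.org/resource/Jaguar_Cars": {"car", "engine", "drive", "dealer"},
    },
}


def link_entity(term, text):
    """Return the candidate concept whose keywords overlap the context most.

    Returns None if the term is unknown or no candidate shares a keyword
    with the context.
    """
    candidates = CANDIDATES.get(term.lower())
    if not candidates:
        return None
    context = set(re.findall(r"\w+", text.lower()))
    best_uri, best_score = None, 0
    for uri, keywords in candidates.items():
        score = len(keywords & context)  # number of shared context words
        if score > best_score:
            best_uri, best_score = uri, score
    return best_uri


print(link_entity("Jaguar", "The jaguar stalked its prey in the jungle."))
print(link_entity("Jaguar", "She bought a Jaguar from the car dealer."))
```

Keyword overlap stands in here for the statistical or machine-learning disambiguation models that production systems typically use.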
The following criteria can be used to categorize tools that extract knowledge from natural language text.
|Source||Which input formats can be processed by the tool (e.g. plain text, HTML or PDF)?|
|Access Paradigm||Can the tool query the data source, or does it require a whole dump for the extraction process?|
|Data Synchronization||Is the result of the extraction process synchronized with the source?|
|Uses Output Ontology||Does the tool link the result with an ontology?|
|Mapping Automation||How automated is the extraction process (manual, semi-automatic or automatic)?|
|Requires Ontology||Does the tool need an ontology for the extraction?|
|Uses GUI||Does the tool offer a graphical user interface?|
|Approach||Which approach (IE, OBIE, OL or SA) is used by the tool?|
|Extracted Entities||Which types of entities (e.g. named entities, concepts or relationships) can be extracted by the tool?|
|Applied Techniques||Which techniques are applied (e.g. NLP, statistical methods, clustering or machine learning)?|
|Output Model||Which model is used to represent the result of the tool (e.g. RDF or OWL)?|
|Supported Domains||Which domains are supported (e.g. economy or biology)?|
|Supported Languages||Which languages can be processed (e.g. English or German)?|
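As a rough illustration, the criteria above can be modeled as a record type so that tool descriptions become filterable data. The class `ToolProfile`, its field names, and the helper `by_approach` are hypothetical; the sample values are those listed for Dandelion API in the tool comparison table of this article.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ToolProfile:
    """One tool described by the categorization criteria (hypothetical schema)."""
    name: str
    source: Tuple[str, ...]
    access_paradigm: str
    data_synchronization: bool
    uses_output_ontology: bool
    mapping_automation: str
    requires_ontology: bool
    uses_gui: bool
    approach: Tuple[str, ...]
    extracted_entities: Tuple[str, ...]
    applied_techniques: Tuple[str, ...]
    output_model: Tuple[str, ...]
    supported_domains: str
    supported_languages: Tuple[str, ...]


# Values as listed for Dandelion API in the comparison table.
DANDELION = ToolProfile(
    name="Dandelion API",
    source=("plain text", "HTML", "URL"),
    access_paradigm="REST",
    data_synchronization=False,
    uses_output_ontology=False,
    mapping_automation="automatic",
    requires_ontology=False,
    uses_gui=True,
    approach=("SA",),
    extracted_entities=("named entities", "concepts"),
    applied_techniques=("statistical methods",),
    output_model=("JSON",),
    supported_domains="domain-independent",
    supported_languages=("multilingual",),
)


def by_approach(tools: List[ToolProfile], approach: str) -> List[str]:
    """Return the names of all tools that use the given approach."""
    return [tool.name for tool in tools if approach in tool.approach]


print(by_approach([DANDELION], "SA"))
```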
The following table characterizes some tools for knowledge extraction from natural language sources.
|Name||Source||Access Paradigm||Data Synchronization||Uses Output Ontology||Mapping Automation||Requires Ontology||Uses GUI||Approach||Extracted Entities||Applied Techniques||Output Model||Supported Domains||Supported Languages|
|AeroText ||plain text, HTML, XML, SGML||dump||no||yes||automatic||yes||yes||IE||named entities, relationships, events||linguistic rules||proprietary||domain-independent||English, Spanish, Arabic, Chinese, Indonesian|
|AlchemyAPI ||plain text, HTML||automatic||yes||SA||multilingual|
|ANNIE ||plain text||dump||yes||yes||IE||finite state algorithms||multilingual|
|ASIUM ||plain text||dump||semi-automatic||yes||OL||concepts, concept hierarchy||NLP, clustering|
|Attensity Exhaustive Extraction ||automatic||IE||named entities, relationships, events||NLP|
|Dandelion API||plain text, HTML, URL||REST||no||no||automatic||no||yes||SA||named entities, concepts||statistical methods||JSON||domain-independent||multilingual|
|DBpedia Spotlight ||plain text, HTML||dump, SPARQL||yes||yes||automatic||no||yes||SA||annotation to each word, annotation to non-stopwords||NLP, statistical methods, machine learning||RDFa||domain-independent||English|
|EntityClassifier.eu ||plain text, HTML||dump||yes||yes||automatic||no||yes||IE, OL, SA||annotation to each word, annotation to non-stopwords||rule-based grammar||XML||domain-independent||English, German, Dutch|
|K-Extractor||plain text, HTML, XML, PDF, MS Office, e-mail||dump, SPARQL||yes||yes||automatic||no||yes||IE, OL, SA||concepts, named entities, instances, concept hierarchy, generic relationships, user-defined relationships, events, modality, tense, entity linking, event linking, sentiment||NLP, machine learning, heuristic rules||RDF, OWL, proprietary XML||domain-independent||English, Spanish|
|iDocument ||HTML, PDF, DOC||SPARQL||yes||yes||OBIE||instances, property values||NLP||personal, business|
|NetOwl Extractor ||plain text, HTML, XML, SGML, PDF, MS Office||dump||no||yes||automatic||yes||yes||IE||named entities, relationships, events||NLP||XML, JSON, RDF-OWL, others||multiple domains||English, Arabic, Chinese (Simplified and Traditional), French, Korean, Persian (Farsi and Dari), Russian, Spanish|
|OntoGen ||semi-automatic||yes||OL||concepts, concept hierarchy, non-taxonomic relations, instances||NLP, machine learning, clustering|
|OntoLearn ||plain text, HTML||dump||no||yes||automatic||yes||no||OL||concepts, concept hierarchy, instances||NLP, statistical methods||proprietary||domain-independent||English|
|OntoLearn Reloaded||plain text, HTML||dump||no||yes||automatic||yes||no||OL||concepts, concept hierarchy, instances||NLP, statistical methods||proprietary||domain-independent||English|
|OntoSyphon ||HTML, PDF, DOC||dump, search engine queries||no||yes||automatic||yes||no||OBIE||concepts, relations, instances||NLP, statistical methods||RDF||domain-independent||English|
|ontoX ||plain text||dump||no||yes||semi-automatic||yes||no||OBIE||instances, datatype property values||heuristic-based methods||proprietary||domain-independent||language-independent|
|OpenCalais||plain text, HTML, XML||dump||no||yes||automatic||yes||no||SA||annotation to entities, annotation to events, annotation to facts||NLP, machine learning||RDF||domain-independent||English, French, Spanish|
|PoolParty Extractor ||plain text, HTML, DOC, ODT||dump||no||yes||automatic||yes||yes||OBIE||named entities, concepts, relations, concepts that categorize the text, enrichments||NLP, machine learning, statistical methods||RDF, OWL||domain-independent||English, German, Spanish, French|
|Rosoka||plain text, HTML, XML, SGML, PDF, MS Office||dump||yes||yes||automatic||no||yes||IE||named entity extraction, entity resolution, relationship extraction, attributes, concepts, multi-vector sentiment analysis, geotagging, language identification, machine learning||NLP||XML, JSON, POJO||multiple domains||multilingual (200+ languages)|
|SCOOBIE||plain text, HTML||dump||no||yes||automatic||no||no||OBIE||instances, property values, RDFS types||NLP, machine learning||RDF, RDFa||domain-independent||English, German|
|SemTag ||HTML||dump||no||yes||automatic||yes||no||SA||machine learning||database record||domain-independent||language-independent|
|smart FIX||plain text, HTML, PDF, DOC, e-mail||dump||yes||no||automatic||no||yes||OBIE||named entities||NLP, machine learning||proprietary||domain-independent||English, German, French, Dutch, Polish|
|Text2Onto ||plain text, HTML, PDF||dump||yes||no||semi-automatic||yes||yes||OL||concepts, concept hierarchy, non-taxonomic relations, instances, axioms||NLP, statistical methods, machine learning, rule-based methods||OWL||domain-independent||English, German, Spanish|
|Text-To-Onto ||plain text, HTML, PDF, PostScript||dump||semi-automatic||yes||yes||OL||concepts, concept hierarchy, non-taxonomic relations, lexical entities referring to concepts, lexical entities referring to relations||NLP, machine learning, clustering, statistical methods||German|
|ThatNeedle||plain text||dump||automatic||no||concepts, relations, hierarchy||NLP, proprietary||JSON||multiple domains||English|
|The Wiki Machine ||plain text, HTML, PDF, DOC||dump||no||yes||automatic||yes||yes||SA||annotation to proper nouns, annotation to common nouns||machine learning||RDFa||domain-independent||English, German, Spanish, French, Portuguese, Italian, Russian|
|ThingFinder ||IE||named entities, relationships, events||multilingual|
Knowledge discovery describes the process of automatically searching large volumes of data for patterns that can be considered knowledge about the data. It is often described as deriving knowledge from the input data. Knowledge discovery developed out of the data mining domain and is closely related to it in terms of both methodology and terminology.
The most well-known branch of data mining is knowledge discovery, also known as knowledge discovery in databases (KDD). Like many other forms of knowledge discovery, it creates abstractions of the input data. The knowledge obtained through the process may become additional data that can be used for further analysis and discovery. Often the outcomes of knowledge discovery are not actionable; actionable knowledge discovery, also known as domain driven data mining, aims to discover and deliver actionable knowledge and insights.
Another promising application of knowledge discovery is in the area of software modernization, weakness discovery and compliance, which involves understanding existing software artifacts. This process is related to the concept of reverse engineering. Usually the knowledge obtained from existing software is presented in the form of models against which specific queries can be made when necessary. An entity relationship model is a frequent format for representing knowledge obtained from existing software. The Object Management Group (OMG) developed the Knowledge Discovery Metamodel (KDM) specification, which defines an ontology for software assets and their relationships for the purpose of performing knowledge discovery on existing code. Knowledge discovery from existing software systems, also known as software mining, is closely related to data mining, since existing software artifacts contain enormous value for risk management and business value, which is key for the evaluation and evolution of software systems. Instead of mining individual data sets, software mining focuses on metadata, such as process flows (e.g. data flows, control flows, and call maps), architecture, database schemas, and business rules, terms, and processes.
Wikipedia has a Linked Data twin called DBpedia. DBpedia has the same structured information as Wikipedia – but translated into a machine-readable format.