Article provided by Wikipedia

Main article: Ontology learning

Ontology learning is the automatic or semi-automatic creation of ontologies, including extracting the corresponding domain's terms from natural language text. Because building ontologies manually is extremely labor-intensive and time-consuming, there is strong motivation to automate the process.

Semantic annotation (SA)

During semantic annotation,[12] natural language text is augmented with metadata (often represented in RDFa) that makes the semantics of the contained terms machine-understandable. In this process, which is generally semi-automatic, knowledge is extracted in the sense that a link is established between lexical terms and, for example, concepts from ontologies. This determines which meaning of a term was intended in the processed context, and the meaning of the text is thereby grounded in machine-readable data that supports inference. Semantic annotation is typically split into the following two subtasks.

  1. Terminology extraction
  2. Entity linking

At the terminology extraction level, lexical terms are extracted from the text. For this purpose, a tokenizer first determines the word boundaries and resolves abbreviations. Terms that correspond to a concept are then extracted with the help of a domain-specific lexicon, so that they can be linked during entity linking.
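The terminology extraction step can be sketched roughly as follows. This is a minimal illustration, not the method of any particular tool: the abbreviation table, the lexicon entries and the greedy longest-match strategy are all assumptions chosen for the example.

```python
import re

# Hypothetical abbreviation table and domain lexicon (illustrative only).
ABBREVIATIONS = {"e.g.": "for example", "approx.": "approximately"}
LEXICON = {"semantic web", "ontology", "knowledge base"}
MAX_TERM_LEN = 3  # upper bound on term length, in tokens


def tokenize(text):
    """Expand known abbreviations, then split into lowercase word tokens."""
    for short, full in ABBREVIATIONS.items():
        text = text.replace(short, full)
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


def extract_terms(text):
    """Greedy longest-match lookup of token n-grams in the lexicon."""
    tokens = tokenize(text)
    terms, i = [], 0
    while i < len(tokens):
        for n in range(min(MAX_TERM_LEN, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in LEXICON:
                terms.append(candidate)
                i += n  # skip past the matched term
                break
        else:
            i += 1  # no lexicon entry starts at this token
    return terms
```

Calling `extract_terms("An ontology, e.g. for the Semantic Web, is stored in a knowledge base.")` returns the three lexicon terms in text order. A real tokenizer would need far more careful abbreviation and sentence-boundary handling.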

In entity linking,[13] a link is established between the lexical terms extracted from the source text and the concepts of an ontology or knowledge base such as DBpedia. For this, candidate concepts matching the several meanings of a term are detected with the help of a lexicon. Finally, the context of each term is analyzed to determine the most appropriate disambiguation and to assign the term to the correct concept.
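A toy version of candidate detection and context-based disambiguation might look like the sketch below. The candidate lexicon, its cue words and the overlap score are assumptions made for illustration; production linkers use much richer context models. The IRIs follow DBpedia's naming style and serve only as examples, and the `annotate` helper is a hypothetical way to emit an RDFa-style annotation for the linked term.

```python
import html
import re

# Hypothetical candidate lexicon: surface term -> {concept IRI: context cue words}.
CANDIDATES = {
    "jaguar": {
        "http://dbpedia.org/resource/Jaguar": {"cat", "animal", "jungle", "prey"},
        "http://dbpedia.org/resource/Jaguar_Cars": {"car", "engine", "drive", "speed"},
    },
}


def link_entity(term, sentence):
    """Return the candidate concept whose cue words overlap the context most."""
    candidates = CANDIDATES.get(term.lower())
    if not candidates:
        return None  # term is unknown to the lexicon
    context = set(re.findall(r"[a-z]+", sentence.lower())) - {term.lower()}
    scores = {iri: len(cues & context) for iri, cues in candidates.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None  # no evidence: leave unlinked


def annotate(term, iri):
    """Wrap the term in an RDFa-style span that points at the linked concept."""
    return (f'<span about="{html.escape(iri)}" property="rdfs:label">'
            f'{html.escape(term)}</span>')
```

For the sentence "The jaguar stalked its prey in the jungle", the animal concept wins because two of its cue words appear in the context. With no overlapping cue words the term is left unlinked rather than guessed.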


The following criteria can be used to categorize tools that extract knowledge from natural language text.

Source: Which input formats can be processed by the tool (e.g. plain text, HTML or PDF)?
Access Paradigm: Can the tool query the data source, or does it require a whole dump for the extraction process?
Data Synchronization: Is the result of the extraction process synchronized with the source?
Uses Output Ontology: Does the tool link the result with an ontology?
Mapping Automation: How automated is the extraction process (manual, semi-automatic or automatic)?
Requires Ontology: Does the tool need an ontology for the extraction?
Uses GUI: Does the tool offer a graphical user interface?
Approach: Which approach (IE, OBIE, OL or SA) is used by the tool?
Extracted Entities: Which types of entities (e.g. named entities, concepts or relationships) can be extracted by the tool?
Applied Techniques: Which techniques are applied (e.g. NLP, statistical methods, clustering or machine learning)?
Output Model: Which model is used to represent the result of the tool (e.g. RDF or OWL)?
Supported Domains: Which domains are supported (e.g. economy or biology)?
Supported Languages: Which languages can be processed (e.g. English or German)?
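One way to make such criteria queryable is to record them as a small data structure. The sketch below is only an illustration: the class and field names are invented here, only a few of the criteria are modeled, and the three sample entries copy values from the table in this article.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToolProfile:
    """A tool described by a few of the criteria above (names are illustrative)."""
    name: str
    sources: set = field(default_factory=set)        # Source
    automation: str = ""                             # Mapping Automation
    requires_ontology: Optional[bool] = None         # Requires Ontology; None = not stated
    approaches: set = field(default_factory=set)     # Approach: IE, OBIE, OL or SA
    output_models: set = field(default_factory=set)  # Output Model


def select(tools, approach=None, source=None):
    """Return the names of tools matching every criterion that was given."""
    return [
        t.name for t in tools
        if (approach is None or approach in t.approaches)
        and (source is None or source in t.sources)
    ]


TOOLS = [
    ToolProfile("OntoLearn", {"plain text", "HTML"}, "automatic", True, {"OL"}, {"proprietary"}),
    ToolProfile("OntoSyphon", {"HTML", "PDF", "DOC"}, "automatic", True, {"OBIE"}, {"RDF"}),
    ToolProfile("SCOOBIE", {"plain text", "HTML"}, "automatic", False, {"OBIE"}, {"RDF", "RDFa"}),
]
```

With these entries, `select(TOOLS, approach="OBIE")` picks out OntoSyphon and SCOOBIE. Using `None` for unstated values keeps "unknown" distinct from "no", which matters because several tools in the table have blank cells.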

The following table characterizes some tools for Knowledge Extraction from natural language sources.

Name Source Access Paradigm Data Synchronization Uses Output Ontology Mapping Automation Requires Ontology Uses GUI Approach Extracted Entities Applied Techniques Output Model Supported Domains Supported Languages
AeroText [14] plain text, HTML, XML, SGML dump no yes automatic yes yes IE named entities, relationships, events linguistic rules proprietary domain-independent English, Spanish, Arabic, Chinese, Indonesian
AlchemyAPI [15] plain text, HTML automatic yes SA multilingual
ANNIE [16] plain text dump yes yes IE finite state algorithms multilingual
ASIUM [17] plain text dump semi-automatic yes OL concepts, concept hierarchy NLP, clustering
Attensity Exhaustive Extraction [18] automatic IE named entities, relationships, events NLP
Dandelion API plain text, HTML, URL REST no no automatic no yes SA named entities, concepts statistical methods JSON domain-independent multilingual
DBpedia Spotlight [19] plain text, HTML dump, SPARQL yes yes automatic no yes SA annotation to each word, annotation to non-stopwords NLP, statistical methods, machine learning RDFa domain-independent English
[20] plain text, HTML dump yes yes automatic no yes IE, OL, SA annotation to each word, annotation to non-stopwords rule-based grammar XML domain-independent English, German, Dutch
K-Extractor[21][22] plain text, HTML, XML, PDF, MS Office, e-mail dump, SPARQL yes yes automatic no yes IE, OL, SA concepts, named entities, instances, concept hierarchy, generic relationships, user-defined relationships, events, modality, tense, entity linking, event linking, sentiment NLP, machine learning, heuristic rules RDF, OWL, proprietary XML domain-independent English, Spanish
iDocument [23] HTML, PDF, DOC SPARQL yes yes OBIE instances, property values NLP personal, business
NetOwl Extractor [24] plain text, HTML, XML, SGML, PDF, MS Office dump no yes automatic yes yes IE named entities, relationships, events NLP XML, JSON, RDF-OWL, others multiple domains English, Arabic, Chinese (Simplified and Traditional), French, Korean, Persian (Farsi and Dari), Russian, Spanish
OntoGen [25] semi-automatic yes OL concepts, concept hierarchy, non-taxonomic relations, instances NLP, machine learning, clustering
OntoLearn [26] plain text, HTML dump no yes automatic yes no OL concepts, concept hierarchy, instances NLP, statistical methods proprietary domain-independent English
OntoLearn Reloaded plain text, HTML dump no yes automatic yes no OL concepts, concept hierarchy, instances NLP, statistical methods proprietary domain-independent English
OntoSyphon [27] HTML, PDF, DOC dump, search engine queries no yes automatic yes no OBIE concepts, relations, instances NLP, statistical methods RDF domain-independent English
ontoX [28] plain text dump no yes semi-automatic yes no OBIE instances, datatype property values heuristic-based methods proprietary domain-independent language-independent
OpenCalais plain text, HTML, XML dump no yes automatic yes no SA annotation to entities, annotation to events, annotation to facts NLP, machine learning RDF domain-independent English, French, Spanish
PoolParty Extractor [29] plain text, HTML, DOC, ODT dump no yes automatic yes yes OBIE named entities, concepts, relations, concepts that categorize the text, enrichments NLP, machine learning, statistical methods RDF, OWL domain-independent English, German, Spanish, French
Rosoka plain text, HTML, XML, SGML, PDF, MS Office dump yes yes automatic no yes IE named entity extraction, entity resolution, relationship extraction, attributes, concepts, multi-vector sentiment analysis, geotagging, language identification, machine learning NLP XML, JSON, POJO multiple domains multilingual (200+ languages)
SCOOBIE plain text, HTML dump no yes automatic no no OBIE instances, property values, RDFS types NLP, machine learning RDF, RDFa domain-independent English, German
SemTag [30][31] HTML dump no yes automatic yes no SA machine learning database record domain-independent language-independent
smart FIX plain text, HTML, PDF, DOC, e-mail dump yes no automatic no yes OBIE named entities NLP, machine learning proprietary domain-independent English, German, French, Dutch, Polish
Text2Onto [32] plain text, HTML, PDF dump yes no semi-automatic yes yes OL concepts, concept hierarchy, non-taxonomic relations, instances, axioms NLP, statistical methods, machine learning, rule-based methods OWL domain-independent English, German, Spanish
Text-To-Onto [33] plain text, HTML, PDF, PostScript dump semi-automatic yes yes OL concepts, concept hierarchy, non-taxonomic relations, lexical entities referring to concepts, lexical entities referring to relations NLP, machine learning, clustering, statistical methods German
ThatNeedle plain text dump automatic no concepts, relations, hierarchy NLP, proprietary JSON multiple domains English
The Wiki Machine [34] plain text, HTML, PDF, DOC dump no yes automatic yes yes SA annotation to proper nouns, annotation to common nouns machine learning RDFa domain-independent English, German, Spanish, French, Portuguese, Italian, Russian
ThingFinder [35] IE named entities, relationships, events multilingual

Knowledge discovery

Knowledge discovery describes the process of automatically searching large volumes of data for patterns that can be considered knowledge about the data.[36] It is often described as deriving knowledge from the input data. Knowledge discovery developed out of the data mining domain and is closely related to it in terms of both methodology and terminology.[37]

The most well-known branch of data mining is knowledge discovery, also known as knowledge discovery in databases (KDD). Like many other forms of knowledge discovery, it creates abstractions of the input data. The knowledge obtained through the process may become additional data that can be used for further analysis and discovery. Often the outcomes of knowledge discovery are not actionable; actionable knowledge discovery, also known as domain driven data mining,[38] aims to discover and deliver actionable knowledge and insights.
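As a small illustration of searching data for patterns, the sketch below counts item pairs that occur together often. Frequent-pair counting is just one simple pattern-discovery technique picked for this example; the function name, the support threshold and the sample data are assumptions and do not come from the cited sources.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(transactions, min_support):
    """Keep item pairs that co-occur in at least min_support transactions."""
    counts = Counter()
    for items in transactions:
        # Deduplicate and sort so each pair has one canonical form.
        counts.update(combinations(sorted(set(items)), 2))
    return {pair: n for pair, n in counts.items() if n >= min_support}


# Made-up input data: each inner list is one transaction.
SAMPLE = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "milk"],
    ["bread", "milk", "eggs"],
]
```

Here `frequent_pairs(SAMPLE, 3)` keeps only the bread and milk pair, which appears in three of the four transactions. The resulting dictionary is itself data and could feed a further discovery step, as the paragraph above describes.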

Another promising application of knowledge discovery is in the area of software modernization, weakness discovery and compliance, which involves understanding existing software artifacts. This process is related to the concept of reverse engineering. Usually the knowledge obtained from existing software is presented in the form of models to which specific queries can be made when necessary. An entity relationship is a frequent format for representing knowledge obtained from existing software. The Object Management Group (OMG) developed the Knowledge Discovery Metamodel (KDM) specification, which defines an ontology for software assets and their relationships for the purpose of performing knowledge discovery in existing code. Knowledge discovery from existing software systems, also known as software mining, is closely related to data mining, since existing software artifacts contain enormous value for risk management and business value, which is key for the evaluation and evolution of software systems. Instead of mining individual data sets, software mining focuses on metadata, such as process flows (e.g. data flows, control flows, and call maps), architecture, database schemas, and business rules, terms, and processes.
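To give a flavor of extracting a call map from existing code, the sketch below uses Python's standard `ast` module to record which plain function names each top-level function calls. It is a deliberately small assumption-laden example: it ignores methods, nested definitions and attribute calls, and it is not related to the KDM specification.

```python
import ast


def call_map(source):
    """Map each top-level function name to the plain function names it calls."""
    tree = ast.parse(source)
    calls = {}
    for node in tree.body:
        if isinstance(node, ast.FunctionDef):
            callees = set()
            for inner in ast.walk(node):
                # Only direct calls by name; obj.method() calls are skipped.
                if isinstance(inner, ast.Call) and isinstance(inner.func, ast.Name):
                    callees.add(inner.func.id)
            calls[node.name] = sorted(callees)
    return calls


# Hypothetical code base to analyze, kept as a string.
SAMPLE = '''
def load(path):
    return open(path).read()

def report(path):
    text = load(path)
    print(len(text))
'''
```

Running `call_map(SAMPLE)` shows that `report` depends on `load`, which is the kind of relationship a model of the software could then be queried for.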

Input data

Output formats

See also


  1. RDB2RDF Working Group, website, charter, and R2RML: RDB to RDF Mapping Language.
  2. LOD2 EU Deliverable 3.1.1, "Knowledge Extraction from Structured Sources".
  3. "Life in the Linked Data Cloud". Retrieved 2009-11-10. Wikipedia has a Linked Data twin called DBpedia. DBpedia has the same structured information as Wikipedia – but translated into a machine-readable format.
  4. Tim Berners-Lee (1998), "Relational Databases on the Semantic Web". Retrieved: February 20, 2011.
  5. Hu et al. (2007), "Discovering Simple Mappings Between Relational Database Schemas and Ontologies", In Proc. of 6th International Semantic Web Conference (ISWC 2007), 2nd Asian Semantic Web Conference (ASWC 2007), LNCS 4825, pages 225-238, Busan, Korea, 11-15 November 2007.
  6. R. Ghawi and N. Cullot (2007), "Database-to-Ontology Mapping Generation for Semantic Interoperability". In Third International Workshop on Database Interoperability (InterDB 2007).
  7. Li et al. (2005), "A Semi-automatic Ontology Acquisition Method for the Semantic Web", WAIM, volume 3739 of Lecture Notes in Computer Science, pages 209-220. Springer. doi:10.1007/11563952_19
  8. Tirmizi et al. (2008), "Translating SQL Applications to the Semantic Web", Lecture Notes in Computer Science, Volume 5181/2008 (Database and Expert Systems Applications).
  9. Farid Cerbah (2008). "Learning Highly Structured Semantic Repositories from Relational Databases", The Semantic Web: Research and Applications, volume 5021 of Lecture Notes in Computer Science, Springer, Berlin / Heidelberg.
  10. Wimalasuriya, Daya C.; Dou, Dejing (2010). "Ontology-based information extraction: An introduction and a survey of current approaches", Journal of Information Science, 36(3), p. 306 - 323, (retrieved: 18.06.2012).
  11. Cunningham, Hamish (2005). "Information Extraction, Automatic", Encyclopedia of Language and Linguistics, 2, p. 665 - 677, (retrieved: 18.06.2012).
  12. Erdmann, M.; Maedche, Alexander; Schnurr, H.-P.; Staab, Steffen (2000). "From Manual to Semi-automatic Semantic Annotation: About Ontology-based Text Annotation Tools", Proceedings of the COLING, (retrieved: 18.06.2012).
  13. Rao, Delip; McNamee, Paul; Dredze, Mark (2011). "Entity Linking: Finding Extracted Entities in a Knowledge Base", Multi-source, Multi-lingual Information Extraction and Summarization, (retrieved: 18.06.2012).
  14. Rocket Software, Inc. (2012). "technology for extracting intelligence from text", (retrieved: 18.06.2012).
  15. Orchestr8 (2012). "AlchemyAPI Overview", (retrieved: 18.06.2012).
  16. The University of Sheffield (2011). "ANNIE: a Nearly-New Information Extraction System", (retrieved: 18.06.2012).
  17. ILP Network of Excellence. "ASIUM (LRI)", (retrieved: 18.06.2012).
  18. Attensity (2012). "Exhaustive Extraction", (retrieved: 18.06.2012).
  19. Mendes, Pablo N.; Jakob, Max; García-Silva, Andrés; Bizer, Christian (2011). "DBpedia Spotlight: Shedding Light on the Web of Documents", Proceedings of the 7th International Conference on Semantic Systems, p. 1 - 8, (retrieved: 18.06.2012).
  20. Cite error: The named reference entityclassifier was invoked but never defined.
  21. Balakrishna, Mithun; Moldovan, Dan (2013). "Automatic Building of Semantically Rich Domain Models from Unstructured Data", Proceedings of the Twenty-Sixth International Florida Artificial Intelligence Research Society Conference (FLAIRS), p. 22 - 27, (retrieved: 11.08.2014).
  22. Moldovan, Dan; Blanco, Eduardo (2012). "Polaris: Lymba's Semantic Parser", Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC), p. 66 - 72, (retrieved: 11.08.2014).
  23. Adrian, Benjamin; Maus, Heiko; Dengel, Andreas (2009). "iDocument: Using Ontologies for Extracting Information from Text", (retrieved: 18.06.2012).
  24. SRA International, Inc. (2012). "NetOwl Extractor", (retrieved: 18.06.2012).
  25. Fortuna, Blaz; Grobelnik, Marko; Mladenic, Dunja (2007). "OntoGen: Semi-automatic Ontology Editor", Proceedings of the 2007 conference on Human interface, Part 2, p. 309 - 318, (retrieved: 18.06.2012).
  26. Missikoff, Michele; Navigli, Roberto; Velardi, Paola (2002). "Integrated Approach to Web Ontology Learning and Engineering", Computer, 35(11), p. 60 - 63, (retrieved: 18.06.2012).
  27. McDowell, Luke K.; Cafarella, Michael (2006). "Ontology-driven Information Extraction with OntoSyphon", Proceedings of the 5th international conference on The Semantic Web, p. 428 - 444, (retrieved: 18.06.2012).
  28. Yildiz, Burcu; Miksch, Silvia (2007). "ontoX - A Method for Ontology-Driven Information Extraction", Proceedings of the 2007 international conference on Computational science and its applications, 3, p. 660 - 673, (retrieved: 18.06.2012).
  29. (2011). "PoolParty Extractor", (retrieved: 18.06.2012).
  30. Dill, Stephen; Eiron, Nadav; Gibson, David; Gruhl, Daniel; Guha, R.; Jhingran, Anant; Kanungo, Tapas; Rajagopalan, Sridhar; Tomkins, Andrew; Tomlin, John A.; Zien, Jason Y. (2003). "SemTag and Seeker: Bootstrapping the Semantic Web via Automated Semantic Annotation", Proceedings of the 12th international conference on World Wide Web, p. 178 - 186, (retrieved: 18.06.2012).
  31. Uren, Victoria; Cimiano, Philipp; Iria, José; Handschuh, Siegfried; Vargas-Vera, Maria; Motta, Enrico; Ciravegna, Fabio (2006). "Semantic annotation for knowledge management: Requirements and a survey of the state of the art", Web Semantics: Science, Services and Agents on the World Wide Web, 4(1), p. 14 - 28, (retrieved: 18.06.2012).
  32. Cimiano, Philipp; Völker, Johanna (2005). "Text2Onto - A Framework for Ontology Learning and Data-Driven Change Discovery", Proceedings of the 10th International Conference of Applications of Natural Language to Information Systems, 3513, p. 227 - 238, (retrieved: 18.06.2012).
  33. Maedche, Alexander; Volz, Raphael (2001). "The Ontology Extraction & Maintenance Framework Text-To-Onto", Proceedings of the IEEE International Conference on Data Mining, (retrieved: 18.06.2012).
  34. Machine Linking. "We connect to the Linked Open Data cloud", (retrieved: 18.06.2012).
  35. Inxight Federal Systems (2008). "Inxight ThingFinder and ThingFinder Professional", (retrieved: 18.06.2012).
  36. Frawley, William F. et al. (1992), "Knowledge Discovery in Databases: An Overview", AI Magazine (Vol 13, No 3), 57-70.
  37. Fayyad, U. et al. (1996), "From Data Mining to Knowledge Discovery in Databases", AI Magazine (Vol 17, No 3), 37-54.
  38. Cao, L. (2010). "Domain driven data mining: challenges and prospects". IEEE Trans. on Knowledge and Data Engineering. 22 (6): 755–769. doi:10.1109/tkde.2010.32.
