Inside–outside–beginning (tagging)
The IOB format (short for inside, outside, beginning), also called the BIO format, is a common tagging format for marking tokens in a chunking task in computational linguistics, especially in named-entity recognition (NER). It was proposed by Ramshaw and Marcus in 1995.[1][2]
Example
A sentence can be parsed in many ways. Usually, a full parsing would result in a parse tree. In a tree, a constituent can contain other constituents. In chunking, a sentence is not parsed into a tree with nested constituents, but into non-overlapping "chunks".
For example, the sentence "The morning flight from Denver has arrived" can be chunked into noun phrases (NP), verb phrases (VP), and prepositional phrases (PP):
The        B-NP
morning    I-NP
flight     I-NP
from       B-PP
Denver     B-NP
has        B-VP
arrived    I-VP
The B- prefix before a tag indicates that the token begins a chunk, and the I- prefix indicates that the token is inside a chunk. Chunking only the NPs would instead result in:
The        B-NP
morning    I-NP
flight     I-NP
from       O
Denver     B-NP
has        O
arrived    O
Here, all tokens outside of an NP are tagged O, for "outside".

The same example after filtering out stop words:

morning    B-NP
flight     I-NP
Denver     B-NP
arrived    O

A related tagging scheme is BIOES, which consists of the tags B, I, O, E, and S, where E marks the last token of a chunk and S marks a chunk containing a single token. Chunks of length two or more always start with the B tag and end with the E tag:[3]
The        B-NP
morning    I-NP
flight     E-NP
from       O
Denver     S-NP
has        O
arrived    O
The same scheme is also called BILOU, where "E" becomes "L" ("last") and "S" becomes "U" ("unit"), or BMEWO, where "I" becomes "M" ("middle") and "S" becomes "W" ("whole").
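Conversion between these schemes is mechanical. The following minimal Python sketch (the function name, and the assumption that the input uses the IOB variant shown above, in which every chunk starts with B, are choices made for this illustration) rewrites IOB tags as BIOES tags:

def iob_to_bioes(tags):
    """Convert a list of IOB tags (B-X, I-X, O) into BIOES tags."""
    bioes = []
    for i, tag in enumerate(tags):
        if tag == "O":
            bioes.append(tag)
            continue
        prefix, chunk_type = tag.split("-", 1)
        # A chunk ends here unless the next tag continues it with I-.
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        ends_here = next_tag != "I-" + chunk_type
        if prefix == "B":
            bioes.append("S-" + chunk_type if ends_here else tag)
        else:  # prefix == "I"
            bioes.append("E-" + chunk_type if ends_here else tag)
    return bioes

# The NP-only example above:
print(iob_to_bioes(["B-NP", "I-NP", "I-NP", "O", "B-NP", "O", "O"]))
# ['B-NP', 'I-NP', 'E-NP', 'O', 'S-NP', 'O', 'O']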
Drawbacks
IOB syntax does not permit any nesting, so it cannot (unless extended) represent even very simple phenomena such as sentence boundaries (which are not trivial to locate reliably), the scope of parenthetical expressions in sentences, grammatical structure, or nested named entities such as "University of Wisconsin Dept. of Computer Science". It also leaves no place for metadata that is commonplace in NLP systems, such as an identifier for the particular sample or the confidence level of the NER assignment.
Because of these limitations, data must often be converted out of IOB format, or projects must create custom extensions, which has led to a large number of not-quite-interoperable "IOB-like" formats. Many extended variants will also "pass" a non-extended parser, so it is easy to process such data incorrectly without noticing.
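For example, the lenient IOB reader sketched below in Python (the function name and its policy of treating unrecognized tags as "O" are assumptions for illustration) accepts BIOES-tagged input without raising any error, but silently truncates or drops chunks:

def read_chunks(tokens, tags):
    """Lenient IOB reader: collect B-/I- runs, treat any other tag as O."""
    chunks, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append(current)
            current = (tag[2:], [token])
        elif tag.startswith("I-") and current:
            current[1].append(token)
        else:  # "O", but also unrecognized tags such as BIOES "E-" and "S-"
            if current:
                chunks.append(current)
            current = None
    if current:
        chunks.append(current)
    return chunks

tokens = ["The", "morning", "flight", "from", "Denver", "has", "arrived"]
bioes = ["B-NP", "I-NP", "E-NP", "O", "S-NP", "O", "O"]
print(read_chunks(tokens, bioes))
# [('NP', ['The', 'morning'])] -- "flight" is truncated and "Denver" is lost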
The space and "O" (meaning "not in any chunk") convey no information and could simply be omitted. The same is true for putting the "type" suffix on "I-" or "E-" markers as in some variants of "BIOES"; and for marking both "I" and "E" (if you have begun and not ended you are "in", and if you are "in", you have begun and not ended). Some other formats deploy verbosity to improve readability and/or error-checking, but no such benefits appear to come to IOB in exchange for its verbosity.
IOB's "one token per tag" depends on the tokenization used, even though tokenization is not standardized in NLP, and details of tokenization do not have to be entangled with the representations of NERs. "11/31/2019" could be anywhere from one to five tokens in different systems, but the NER is the same. Some systems even permit whitespace within tokens, and space as a delimiter collides with this, narrowing the applicability of IOB and motivating more extensions. "space" might or might not include tab, multiple spaces, hard spaces, and so on, differences which are difficult to detect when proofreading.
IOB variants that allow multiple tokens per tag often use "/" or another reserved character to separate the tag from the token. This effectively "reserves" that character, which then cannot occur in tokens, or must be escaped, introducing more incompatibilities.
IOB files have no place to put commonly needed metadata, such as the character encoding being used, the data source, internal location markers, and so on.
XML format
More powerful formats (like XML, JSON or s-expressions) can handle far more diverse annotations, have far less variation between implementations, and are often shorter and more readable as well. For example:
<PER>Alex</PER> is going with <PER>Marty A. Rick</PER> to <LOC>Los Angeles</LOC>
XML also supports sentence boundaries, part-of-speech annotations, location markers, and other features commonly needed in NLP systems. Breaking the text at particular token boundaries is not strictly part of the NER task, but if every token were tagged as well (like "<T>is</T>"), the result would be:
<PER><T>Alex</T></PER><T>is</T><T>going</T><T>with</T><PER><T>Marty</T><T>A.</T><T>Rick</T></PER><T>to</T><LOC><T>Los</T><T>Angeles</T></LOC>
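Such markup can also be converted back to IOB mechanically, as in the minimal Python sketch below (the wrapping <root> element is added here only to make the fragment well-formed XML; the tag names PER, LOC, and T come from the example above):

import xml.etree.ElementTree as ET

xml = ("<root><PER><T>Alex</T></PER><T>is</T><T>going</T><T>with</T>"
       "<PER><T>Marty</T><T>A.</T><T>Rick</T></PER><T>to</T>"
       "<LOC><T>Los</T><T>Angeles</T></LOC></root>")

pairs = []
for node in ET.fromstring(xml):
    if node.tag == "T":  # a token outside any entity
        pairs.append((node.text, "O"))
    else:  # an entity element wrapping its tokens
        for i, tok in enumerate(node):
            pairs.append((tok.text, ("B-" if i == 0 else "I-") + node.tag))

print(pairs)
# [('Alex', 'B-PER'), ('is', 'O'), ('going', 'O'), ('with', 'O'),
#  ('Marty', 'B-PER'), ('A.', 'I-PER'), ('Rick', 'I-PER'), ('to', 'O'),
#  ('Los', 'B-LOC'), ('Angeles', 'I-LOC')]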
References
- Jurafsky, Daniel; Martin, James H. (2008). "13.5 Partial Parsing". Speech and Language Processing (2nd ed.). Upper Saddle River, N.J.: Prentice Hall. ISBN 978-0131873216.
- ^ Ramshaw, Lance; Marcus, Mitch (1995). "Text Chunking using Transformation-Based Learning". Third Workshop on Very Large Corpora.
- ^ Ramshaw, L. A.; Marcus, M. P. (1999), Armstrong, Susan; Church, Kenneth; Isabelle, Pierre; Manzi, Sandra (eds.), "Text Chunking Using Transformation-Based Learning", Natural Language Processing Using Very Large Corpora, Dordrecht: Springer Netherlands, pp. 157–176, doi:10.1007/978-94-017-2390-9_10, ISBN 978-94-017-2390-9, retrieved 29 May 2025
- ^ Krishnan; Ganapathy (2005). "Named Entity Recognition" (PDF). Stanford CS229 course project. http://cs229.stanford.edu/proj2005/KrishnanGanapathy-NamedEntityRecognition.pdf. Archived 11 July 2019 at the Wayback Machine.
External links
- Bob Carpenter (2009). "Coding Chunkers as Taggers: IO, BIO, BMEWO, and BMEWO+". Archived from the original on 5 August 2017.