
Part-of-speech tagging: Difference between revisions

From Wikipedia, the free encyclopedia
Tag: Reverted
Tags: Mobile edit Mobile web edit Advanced mobile edit
 
(44 intermediate revisions by 19 users not shown)
Line 1: Line 1:
{{Short description|Identifying parts of speech in a text corpus}}
In [[corpus linguistics]], '''part-of-speech tagging''' ('''POS tagging''' or '''PoS tagging''' or '''POST'''), also called '''[[grammar|grammatical]] tagging''' is the process of marking up a word in a text (corpus) as corresponding to a particular [[parts of speech|part of speech]],<ref>{{cite web |url=https://www.sketchengine.eu/pos-tags/ |title=POS tags |author=<!--Not stated--> |date=2018-03-27 |website=[[Sketch Engine]] |publisher=Lexical Computing |access-date=2018-04-06 |quote=}}</ref> based on both its definition and its [[context (language use)|context]].
{{More citations needed|date=March 2021}}
In [[corpus linguistics]], '''part-of-speech tagging''' ('''POS tagging''' or '''PoS tagging''' or '''POST'''), also called '''grammatical tagging''', is the process of marking up a word in a text (corpus) as corresponding to a particular [[parts of speech|part of speech]],<ref>{{cite web |url=https://www.sketchengine.eu/pos-tags/ |title=POS tags |author=<!--Not stated--> |date=2018-03-27 |website=[[Sketch Engine]] |publisher=Lexical Computing |access-date=2018-04-06 }}</ref> based on both its definition and its [[context (language use)|context]].
A simplified form of this is commonly taught to school-age children, in the identification of words as [[noun]]s, [[verb]]s, [[adjective]]s, [[adverb]]s, etc.
A simplified form of this is commonly taught to school-age children, in the identification of words as [[noun]]s, [[verb]]s, [[adjective]]s, [[adverb]]s, etc.


Line 5: Line 7:


==Principle==
==Principle==
{{unsourced|section|date=May 2023}}
Part-of-speech tagging is harder than just having a list of words and their parts of speech, because some words can represent more than one part of speech at different times, and because some parts of speech are complex or unspoken. This is not rare—in [[natural language]]s (as opposed to many [[Constructed language|artificial language]]s), a large percentage of word-forms are ambiguous. For example, even "dogs", which is usually thought of as just a plural noun, can also be a verb:
Part-of-speech tagging is harder than just having a list of words and their parts of speech, because some words can represent more than one part of speech at different times, and because some parts of speech are complex. This is not rare—in [[natural language]]s (as opposed to many [[Constructed language|artificial language]]s), a large percentage of word-forms are [[semantic ambiguity|ambiguous]]. For example, even "dogs", which is usually thought of as just a plural noun, can also be a verb:
: The sailor dogs the hatch.
: The sailor dogs the hatch.
Correct grammatical tagging will reflect that "dogs" is here used as a verb, not as the more common plural noun. Grammatical context is one way to determine this; [[Semantic analysis (linguistics)|semantic analysis]] can also be used to infer that "sailor" and "hatch" implicate "dogs" as 1) in the nautical context and 2) an action applied to the object "hatch" (in this context, "dogs" is a [[seamanship|nautical]] term meaning "fastens (a watertight door) securely").
Correct [[grammar|grammatical]] tagging will reflect that "dogs" is here used as a verb, not as the more common plural noun. Grammatical context is one way to determine this; [[Semantic analysis (linguistics)|semantic analysis]] can also be used to infer that "sailor" and "hatch" implicate "dogs" as 1) in the nautical context and 2) an action applied to the object "hatch" (in this context, "dogs" is a [[seamanship|nautical]] term meaning "fastens (a watertight door) securely").


===Tag sets===
===Tag sets===
Line 16: Line 19:
The most popular "tag set" for POS tagging for American English is probably the Penn tag set, developed in the Penn Treebank project. It is largely similar to the earlier Brown Corpus and LOB Corpus tag sets, though much smaller. In Europe, tag sets from the [[Eagles Guidelines]] see wide use and include versions for multiple languages.
The most popular "tag set" for POS tagging for American English is probably the Penn tag set, developed in the Penn Treebank project. It is largely similar to the earlier Brown Corpus and LOB Corpus tag sets, though much smaller. In Europe, tag sets from the [[Eagles Guidelines]] see wide use and include versions for multiple languages.


POS tagging work has been done in a variety of languages, and the set of POS tags used varies greatly with language. Tags usually are designed to include overt morphological distinctions, although this leads to inconsistencies such as case-marking for pronouns but not nouns in English, and much larger cross-language differences. The tag sets for heavily inflected languages such as [[Greek language|Greek]] and [[Latin]] can be very large; tagging ''words'' in [[agglutinative language]]s such as [[Inuit languages]] may be virtually impossible. At the other extreme, Petrov et al.<ref>{{cite arXiv |last1=Petrov |first1=Slav|last2=Das|first2=Dipanjan|last3=McDonald|first3=Ryan|eprint=1104.2086 |title=A Universal Part-of-Speech Tagset |class= cs.CL|date=11 Apr 2011}}</ref> have proposed a "universal" tag set, with 12 categories (for example, no subtypes of nouns, verbs, punctuation, etc.; no distinction of "to" as an infinitive marker vs. preposition (hardly a "universal" coincidence), etc.). Whether a very small set of very broad tags or a much larger set of more precise ones is preferable, depends on the purpose at hand. Automatic tagging is easier on smaller tag-sets.
POS tagging work has been done in a variety of languages, and the set of POS tags used varies greatly with language. Tags usually are designed to include overt morphological distinctions, although this leads to inconsistencies such as case-marking for pronouns but not nouns in English, and much larger cross-language differences. The tag sets for heavily inflected languages such as [[Greek language|Greek]] and [[Latin]] can be very large; tagging ''words'' in [[agglutinative language]]s such as [[Inuit languages]] may be virtually impossible. At the other extreme, Petrov et al.<ref>{{cite arXiv |last1=Petrov |first1=Slav|last2=Das|first2=Dipanjan|last3=McDonald|first3=Ryan|eprint=1104.2086 |title=A Universal Part-of-Speech Tagset |class= cs.CL|date=11 Apr 2011}}</ref> have proposed a "universal" tag set, with 12 categories (for example, no subtypes of nouns, verbs, punctuation, and so on). Whether a very small set of very broad tags or a much larger set of more precise ones is preferable depends on the purpose at hand. Automatic tagging is easier with smaller tag sets.


==History==
==History==


===The Brown Corpus===
===The Brown Corpus===
{{unsourced|section|date=May 2023}}
Research on part-of-speech tagging has been closely tied to [[corpus linguistics]]. The first major corpus of English for computer analysis was the [[Brown Corpus]] developed at [[Brown University]] by [[Henry Kučera]] and [[W. Nelson Francis]], in the mid-1960s. It consists of about 1,000,000 words of running English prose text, made up of 500 samples from randomly chosen publications. Each sample is 2,000 or more words (ending at the first sentence-end after 2,000 words, so that the corpus contains only complete sentences).
Research on part-of-speech tagging has been closely tied to [[corpus linguistics]]. The first major corpus of English for computer analysis was the [[Brown Corpus]] developed at [[Brown University]] by [[Henry Kučera]] and [[W. Nelson Francis]], in the mid-1960s. It consists of about 1,000,000 words of running English prose text, made up of 500 samples from randomly chosen publications. Each sample is 2,000 or more words (ending at the first sentence-end after 2,000 words, so that the corpus contains only complete sentences).


The [[Brown Corpus]] was painstakingly "tagged" with part-of-speech markers over many years. A first approximation was done with a program by Greene and Rubin, which consisted of a huge handmade list of what categories could co-occur at all. For example, the article then nouns can occur, but the article verb (arguably) cannot. The program got about 70% correct. Its results were repeatedly reviewed and corrected by hand, and later users sent in errata so that by the late 70s the tagging was nearly perfect (allowing for some cases on which even human speakers might not agree).
The [[Brown Corpus]] was painstakingly "tagged" with part-of-speech markers over many years. A first approximation was done with a program by Greene and Rubin, which consisted of a huge handmade list of what categories could co-occur at all. For example, article then noun can occur, but article then verb (arguably) cannot. The program got about 70% correct. Its results were repeatedly reviewed and corrected by hand, and later users sent in errata so that by the late 1970s the tagging was nearly perfect (allowing for some cases on which even human speakers might not agree).


This corpus has been used for innumerable studies of word-frequency and of part-of-speech and inspired the development of similar "tagged" corpora in many other languages. Statistics derived by analyzing it formed the basis for most later part-of-speech tagging systems, such as [[CLAWS (linguistics)]] and [[VOLSUNGA]]. However, by this time (2005) it has been superseded by larger corpora such as the 100 million word [[British National Corpus]].
This corpus has been used for innumerable studies of word frequency and of parts of speech and inspired the development of similar "tagged" corpora in many other languages. Statistics derived by analyzing it formed the basis for most later part-of-speech tagging systems, such as [[CLAWS (linguistics)|CLAWS]] and [[VOLSUNGA]]. However, by 2005 it had been superseded by larger corpora such as the 100-million-word [[British National Corpus]], even though larger corpora are rarely so thoroughly curated.


For some time, part-of-speech tagging was considered an inseparable part of [[natural language processing]], because there are certain cases where the correct part of speech cannot be decided without understanding the [[semantics]] or even the [[pragmatics]] of the context. This is extremely expensive, especially because analyzing the higher levels is much harder when multiple part-of-speech possibilities must be considered for each word.
For some time, part-of-speech tagging was considered an inseparable part of [[natural language processing]], because there are certain cases where the correct part of speech cannot be decided without understanding the [[semantics]] or even the [[pragmatics]] of the context. This is extremely expensive, especially because analyzing the higher levels is much harder when multiple part-of-speech possibilities must be considered for each word.
Line 36: Line 40:
When several ambiguous words occur together, the possibilities multiply. However, it is easy to enumerate every combination and to assign a relative probability to each one, by multiplying together the probabilities of each choice in turn. The combination with the highest probability is then chosen. The European group developed CLAWS, a tagging program that did exactly this and achieved accuracy in the 93–95% range.
When several ambiguous words occur together, the possibilities multiply. However, it is easy to enumerate every combination and to assign a relative probability to each one, by multiplying together the probabilities of each choice in turn. The combination with the highest probability is then chosen. The European group developed CLAWS, a tagging program that did exactly this and achieved accuracy in the 93–95% range.
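
The enumeration described above can be sketched in Python as follows. This is only a minimal illustration, not the CLAWS implementation: the toy lexicon, the tag names, every probability value, and the function names are assumptions invented for this example.

```python
from itertools import product

# Hypothetical toy model; all numbers are invented for illustration.
# P(tag | word) for each word form.
LEXICAL = {
    "the": {"DET": 1.0},
    "sailor": {"NOUN": 1.0},
    "dogs": {"NOUN": 0.8, "VERB": 0.2},
    "hatch": {"NOUN": 0.7, "VERB": 0.3},
}
# P(tag | previous tag); "START" marks the beginning of the sentence.
TRANSITION = {
    ("START", "DET"): 0.6, ("START", "NOUN"): 0.3, ("START", "VERB"): 0.1,
    ("DET", "NOUN"): 0.9, ("DET", "VERB"): 0.1,
    ("NOUN", "NOUN"): 0.2, ("NOUN", "VERB"): 0.6, ("NOUN", "DET"): 0.2,
    ("VERB", "DET"): 0.6, ("VERB", "NOUN"): 0.3, ("VERB", "VERB"): 0.1,
}


def sequence_probability(words, tags):
    """Multiply together the probabilities of each tag choice in turn."""
    prob = 1.0
    prev = "START"
    for word, tag in zip(words, tags):
        prob *= TRANSITION.get((prev, tag), 0.0) * LEXICAL[word][tag]
        prev = tag
    return prob


def best_tagging_by_enumeration(words):
    """Enumerate every combination of candidate tags and keep the most probable."""
    candidates = product(*(LEXICAL[w] for w in words))
    return max(candidates, key=lambda tags: sequence_probability(words, tags))


sentence = "the sailor dogs the hatch".split()
print(best_tagging_by_enumeration(sentence))
```

The number of candidate sequences is the product of the number of tags per word, so the cost grows exponentially with the number of ambiguous words in a row.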


It is worth remembering, as [[Eugene Charniak]] points out in ''Statistical techniques for natural language parsing'' (1997),<ref>[http://www.cs.brown.edu/people/ec/home.html Eugene Charniak]</ref> that merely assigning the most common tag to each known word and the tag "[[proper noun]]" to all unknowns will approach 90% accuracy because many words are unambiguous, and many others only rarely represent their less-common parts of speech.
[[Eugene Charniak]] points out in ''Statistical techniques for natural language parsing'' (1997)<ref>[http://www.cs.brown.edu/people/ec/home.html Eugene Charniak]</ref> that merely assigning the most common tag to each known word and the tag "[[proper noun]]" to all unknowns will approach 90% accuracy because many words are unambiguous, and many others only rarely represent their less-common parts of speech.
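
A baseline of this kind can be sketched in a few lines of Python. The tiny training corpus, the tag names, the lower-casing of words, and the function names are assumptions made for this illustration; the roughly 90% figure applies to large tagged corpora, not to toy data like this.

```python
from collections import Counter, defaultdict

# Hypothetical hand-tagged training data, invented for illustration.
TOY_CORPUS = [
    [("the", "DET"), ("dogs", "NOUN"), ("bark", "VERB")],
    [("the", "DET"), ("sailor", "NOUN"), ("dogs", "VERB"),
     ("the", "DET"), ("hatch", "NOUN")],
    [("dogs", "NOUN"), ("sleep", "VERB")],
]


def train_baseline(tagged_sentences):
    """Record the most frequent tag seen for each known word."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tag_counts.most_common(1)[0][0]
            for word, tag_counts in counts.items()}


def tag_baseline(words, most_common_tag, unknown_tag="PROPN"):
    """Tag known words with their most common tag, unknown words as proper nouns."""
    return [(w, most_common_tag.get(w.lower(), unknown_tag)) for w in words]


model = train_baseline(TOY_CORPUS)
print(tag_baseline(["Kim", "dogs", "the", "hatch"], model))
```

In this sketch "dogs" is always tagged as a noun, which shows both why the baseline is strong (the noun reading is the common one) and where it fails (the nautical verb).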


CLAWS pioneered the field of HMM-based part of speech tagging but were quite expensive since it enumerated all possibilities. It sometimes had to resort to backup methods when there were simply too many options (the Brown Corpus contains a case with 17 ambiguous words in a row, and there are words such as "still" that can represent as many as 7 distinct parts of speech (DeRose 1990, p.&nbsp;82)).
CLAWS pioneered the field of HMM-based part-of-speech tagging but was quite expensive since it enumerated all possibilities. It sometimes had to resort to backup methods when there were simply too many options (the Brown Corpus contains a case with 17 ambiguous words in a row, and there are words such as "still" that can represent as many as 7 distinct parts of speech).<ref>DeRose 1990, p.&nbsp;82.</ref>


HMMs underlie the functioning of stochastic taggers and are used in various algorithms, one of the most widely used being the bi-directional inference algorithm.<ref>[http://yatsko.zohosites.com/cll-tagger.html CLL POS-tagger]</ref>
HMMs underlie the functioning of stochastic taggers and are used in various algorithms, one of the most widely used being the bi-directional inference algorithm.<ref>[http://yatsko.zohosites.com/cll-tagger.html CLL POS-tagger]</ref>


===Dynamic programming methods===
===Dynamic programming methods===
In 1987, [[Steven DeRose]]<ref>DeRose, Steven J. 1988. "Grammatical category disambiguation by statistical optimization." Computational Linguistics 14(1): 31&ndash;39. [http://portal.acm.org/citation.cfm?id=49087&CFID=44218900&CFTOKEN=10419730]</ref> and [[Kenneth W. Church|Ken Church]]<ref>{{cite journal|author=Kenneth Ward Church|year=1988|title=A stochastic parts program and noun phrase parser for unrestricted text|journal=ANLC '88: Proceedings of the Second Conference on Applied Natural Language Processing. Association for Computational Linguistics Stroudsburg, PA|pages=136|doi=10.3115/974235.974260|doi-access=free}}</ref> independently developed [[dynamic programming]] algorithms to solve the same problem in vastly less time. Their methods were similar to the [[Viterbi algorithm]] known for some time in other fields. DeRose used a table of pairs, while Church used a table of triples and a method of estimating the values for triples that were rare or nonexistent in the Brown Corpus (an actual measurement of triple probabilities would require a much larger corpus). Both methods achieved an accuracy of over 95%. DeRose's 1990 dissertation at [[Brown University]] included analyses of the specific error types, probabilities, and other related data, and replicated his work for Greek, where it proved similarly effective.
In 1987, [[Steven DeRose]]<ref>{{ cite journal| last=DeRose| first= Steven J. |date=1988|title=Grammatical category disambiguation by statistical optimization |work= Computational Linguistics|volume =14|issue=1|pages=31–39}}</ref> and Kenneth W. Church<ref>{{cite conference|author=Kenneth Ward Church| chapter=A stochastic parts program and noun phrase parser for unrestricted text |year=1988|title=ANLC '88: Proceedings of the Second Conference on Applied Natural Language Processing| publisher= Association for Computational Linguistics |editor= Norm Sondheimer |page =136|doi=10.3115/974235.974260|doi-access=free}}</ref> independently developed [[dynamic programming]] algorithms to solve the same problem in vastly less time. Their methods were similar to the [[Viterbi algorithm]] known for some time in other fields. DeRose used a table of pairs, while Church used a table of triples and a method of estimating the values for triples that were rare or nonexistent in the Brown Corpus (an actual measurement of triple probabilities would require a much larger corpus). Both methods achieved an accuracy of over 95%. DeRose's 1990 dissertation at [[Brown University]] included analyses of the specific error types, probabilities, and other related data, and replicated his work for Greek, where it proved similarly effective.
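
The dynamic programming idea can be illustrated with a Viterbi-style search over a table of tag pairs. The sketch below is not a reconstruction of DeRose's or Church's programs: the tag set, all probabilities, and the names `viterbi`, `START_P`, `TRANS_P` and `EMIT_P` are assumptions made for this example.

```python
import math

# Hypothetical toy model; all numbers are invented for illustration.
TAGS = ["DET", "NOUN", "VERB"]
START_P = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}           # P(first tag)
TRANS_P = {                                                 # P(tag | previous tag)
    "DET": {"NOUN": 0.9, "VERB": 0.1},
    "NOUN": {"NOUN": 0.2, "VERB": 0.6, "DET": 0.2},
    "VERB": {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1},
}
EMIT_P = {                                                  # P(word | tag)
    "DET": {"the": 1.0},
    "NOUN": {"sailor": 0.4, "dogs": 0.3, "hatch": 0.3},
    "VERB": {"dogs": 0.6, "hatch": 0.4},
}


def _log(p):
    """Log-probability, with log(0) treated as negative infinity."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(words, tags=TAGS, start_p=START_P, trans_p=TRANS_P, emit_p=EMIT_P):
    """Return the most probable tag sequence under a tag-pair (bigram) model."""
    if not words:
        return []
    # best[i][t]: log-probability of the best path that ends in tag t at word i.
    best = [{t: _log(start_p.get(t, 0)) + _log(emit_p[t].get(words[0], 0))
             for t in tags}]
    back = [{}]  # back[i][t]: previous tag on that best path
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prev, score = max(
                ((p, best[i - 1][p] + _log(trans_p[p].get(t, 0))) for p in tags),
                key=lambda pair: pair[1],
            )
            best[i][t] = score + _log(emit_p[t].get(words[i], 0))
            back[i][t] = prev
    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]


print(viterbi("the sailor dogs the hatch".split()))
```

Because only the best path into each tag is kept at every word, the work grows linearly with sentence length (times the square of the number of tags) instead of exponentially as with full enumeration.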


These findings were surprisingly disruptive to the field of natural language processing. The accuracy reported was higher than the typical accuracy of very sophisticated algorithms that integrated part of speech choice with many higher levels of linguistic analysis: syntax, morphology, semantics, and so on. CLAWS, DeRose's and Church's methods did fail for some of the known cases where semantics is required, but those proved negligibly rare. This convinced many in the field that part-of-speech tagging could usefully be separated from the other levels of processing; this, in turn, simplified the theory and practice of computerized language analysis and encouraged researchers to find ways to separate other pieces as well. Markov Models are now the standard method for the part-of-speech assignment.
These findings were surprisingly disruptive to the field of natural language processing. The accuracy reported was higher than the typical accuracy of very sophisticated algorithms that integrated part-of-speech choice with many higher levels of linguistic analysis: syntax, morphology, semantics, and so on. CLAWS, DeRose's and Church's methods did fail for some of the known cases where semantics is required, but those proved negligibly rare. This convinced many in the field that part-of-speech tagging could usefully be separated from the other levels of processing; this, in turn, simplified the theory and practice of computerized language analysis and encouraged researchers to find ways to separate other pieces as well. Markov Models became the standard method for part-of-speech assignment.


===Unsupervised taggers===
===Unsupervised taggers===
{{unsourced|section|date=May 2023}}
The methods already discussed involve working from a pre-existing corpus to learn tag probabilities. It is, however, also possible to [[Bootstrapping (linguistics)|bootstrap]] using "unsupervised" tagging. Unsupervised tagging techniques use an untagged corpus for their training data and produce the tagset by induction. That is, they observe patterns in word use, and derive part-of-speech categories themselves. For example, statistics readily reveal that "the", "a", and "an" occur in similar contexts, while "eat" occurs in very different ones. With sufficient iteration, similarity classes of words emerge that are remarkably similar to those human linguists would expect; and the differences themselves sometimes suggest valuable new insights.
The methods already discussed involve working from a pre-existing corpus to learn tag probabilities. It is, however, also possible to [[Bootstrapping (linguistics)|bootstrap]] using "unsupervised" tagging. Unsupervised tagging techniques use an untagged corpus for their training data and produce the tagset by induction. That is, they observe patterns in word use, and derive part-of-speech categories themselves. For example, statistics readily reveal that "the", "a", and "an" occur in similar contexts, while "eat" occurs in very different ones. With sufficient iteration, similarity classes of words emerge that are remarkably similar to those human linguists would expect; and the differences themselves sometimes suggest valuable new insights.
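
The observation that words such as "the" and "a" occur in similar contexts can be made concrete with neighbour counts. The following sketch covers only that first step (measuring contextual similarity), not the iterative induction of a full tag set; the toy sentences and the function names are assumptions made for this example.

```python
import math
from collections import Counter, defaultdict

# Hypothetical untagged corpus, invented for illustration.
TOY_SENTENCES = [
    "the dog will eat a bone".split(),
    "a cat will eat the fish".split(),
    "an owl will eat a fish".split(),
]


def context_vectors(sentences):
    """Count the left and right neighbours of every word."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        padded = ["<s>"] + sentence + ["</s>"]
        for i in range(1, len(padded) - 1):
            vectors[padded[i]][("L", padded[i - 1])] += 1
            vectors[padded[i]][("R", padded[i + 1])] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


vectors = context_vectors(TOY_SENTENCES)
print(cosine(vectors["the"], vectors["a"]))    # articles share contexts
print(cosine(vectors["the"], vectors["eat"]))  # article vs. verb
```

A clustering step over such vectors, repeated with the induced classes fed back in as context, is one way the similarity classes mentioned above could emerge.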


Line 54: Line 59:
===Other taggers and methods===
===Other taggers and methods===


Some current major algorithms for part-of-speech tagging include the [[Viterbi algorithm]], [[Brill tagger]], [[Constraint Grammar]], and the [[Baum-Welch algorithm]] (also known as the forward-backward algorithm). Hidden Markov model and [[Markov model|visible Markov model]] taggers can both be implemented using the Viterbi algorithm. The rule-based Brill tagger is unusual in that it learns a set of rule patterns, and then applies those patterns rather than optimizing a statistical quantity. Unlike the Brill tagger where the rules are ordered sequentially, the POS and morphological tagging toolkit [http://rdrpostagger.sourceforge.net/ RDRPOSTagger] stores rule in the form of a [[ripple-down rules]] tree.
Some current major algorithms for part-of-speech tagging include the [[Viterbi algorithm]], [[Brill tagger]], [[Constraint Grammar]], and the [[Baum-Welch algorithm]] (also known as the forward-backward algorithm). Hidden Markov model and [[Markov model|visible Markov model]] taggers can both be implemented using the Viterbi algorithm. The rule-based Brill tagger is unusual in that it learns a set of rule patterns, and then applies those patterns rather than optimizing a statistical quantity.
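
The way a transformation-based tagger applies its patterns can be sketched as follows. In the Brill tagger the rules are learned from a tagged corpus; here a single rule is written by hand, and the rule template ("change tag A to B when the previous tag is C"), the tag names, and the function name `apply_rules` are assumptions made for this illustration.

```python
def apply_rules(tagged, rules):
    """Apply ordered rewrite rules to an initial tagging.

    Each rule is (from_tag, to_tag, prev_tag): change from_tag to to_tag
    when the preceding word currently carries prev_tag.
    """
    tags = [tag for _, tag in tagged]
    for from_tag, to_tag, prev_tag in rules:
        for i in range(1, len(tags)):
            if tags[i] == from_tag and tags[i - 1] == prev_tag:
                tags[i] = to_tag
    return [(word, tag) for (word, _), tag in zip(tagged, tags)]


# Initial tagging, e.g. from a most-frequent-tag lookup (hypothetical values).
initial = [("the", "DET"), ("sailor", "NOUN"), ("dogs", "NOUN"),
           ("the", "DET"), ("hatch", "NOUN")]
# One hand-written rule: a noun directly after a noun is retagged as a verb.
rules = [("NOUN", "VERB", "NOUN")]
print(apply_rules(initial, rules))
```

The rules run in order, each over the output of the previous one, so no probability is optimized at tagging time.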


Many [[machine learning]] methods have also been applied to the problem of POS tagging. Methods such as [[Support vector machine|SVM]], [[maximum entropy classifier]], [[perceptron]], and [[K-nearest neighbor algorithm|nearest-neighbor]] have all been tried, and most can achieve accuracy above 95%.
Many [[machine learning]] methods have also been applied to the problem of POS tagging. Methods such as [[Support vector machine|SVM]], [[maximum entropy classifier]], [[perceptron]], and [[K-nearest neighbor algorithm|nearest-neighbor]] have all been tried, and most can achieve accuracy above 95%.{{citation needed|date=September 2022}}


A direct comparison of several methods is reported (with references) at the ACL Wiki.<ref>[https://aclweb.org/aclwiki/POS_Tagging_(State_of_the_art) POS Tagging (State of the art)]</ref> This comparison uses the Penn tag set on some of the Penn Treebank data, so the results are directly comparable. However, many significant taggers are not included (perhaps because of the labor involved in reconfiguring them for this particular dataset). Thus, it should not be assumed that the results reported here are the best that can be achieved with a given approach; nor even the best that ''have'' been achieved with a given approach.
A direct comparison of several methods is reported (with references) at the ACL Wiki.<ref>[https://aclweb.org/aclwiki/POS_Tagging_(State_of_the_art) POS Tagging (State of the art)]</ref> This comparison uses the Penn tag set on some of the Penn Treebank data, so the results are directly comparable. However, many significant taggers are not included (perhaps because of the labor involved in reconfiguring them for this particular dataset). Thus, it should not be assumed that the results reported here are the best that can be achieved with a given approach; nor even the best that ''have'' been achieved with a given approach.


In 2014, a paper reporting using the [[structure regularization method]] for part-of-speech tagging, achieving 97.36% on the standard benchmark dataset.<ref>{{cite conference |author1=Xu Sun |title=Structure Regularization for Structured Prediction |conference=Neural Information Processing Systems (NIPS) |year=2014 |pages=2402–2410 |url=http://papers.nips.cc/paper/5643-structure-regularization-for-structured-prediction.pdf |access-date=2014-11-26 |archive-url=https://web.archive.org/web/20160403001850/http://papers.nips.cc/paper/5643-structure-regularization-for-structured-prediction.pdf |archive-date=2016-04-03 |url-status=dead }}</ref>
In 2014, a paper reported using the [[structure regularization method]] for part-of-speech tagging, achieving 97.36% accuracy on a standard benchmark dataset.<ref>{{cite conference|author1=Xu Sun|year=2014|title=Structure Regularization for Structured Prediction|url=https://proceedings.neurips.cc/paper/2014/file/838e8afb1ca34354ac209f53d90c3a43-Paper.pdf|conference=Neural Information Processing Systems (NIPS)|pages=2402–2410|archive-url=|archive-date=|access-date=2021-08-20|url-status=}}</ref>

==Issues==

While there is broad agreement about basic categories, several edge cases make it difficult to settle on a single "correct" set of tags, even in a particular language such as (say) English. For example, it is hard to say whether "fire" is an adjective or a noun in

the big green fire truck

A second important example is the [[use/mention distinction]], as in the following example, where "blue" could be replaced by a word from any POS (the Brown Corpus tag set appends the suffix "-NC" in such cases):

the word "blue" has 4 letters.

Words in a language other than that of the "main" text are commonly tagged as "foreign", usually in addition to a tag for the role the foreign word is playing in context.

There are also many cases where POS categories and "words" do not map one to one, for example:

as far as
David's
gonna
don't
vice versa
first-cut
cannot
pre- and post-secondary
look (a word) up

In the last example, "look" and "up" [[phrasal verb|combine to function as a single verbal unit]], despite the possibility of other words coming between them. Some tag sets (such as Penn) break hyphenated words, contractions, and possessives into separate tokens, thus avoiding some but far from all such problems.

Many tag sets treat words such as "be", "have", and "do" as categories in their own right (as in the Brown Corpus), while a few treat them all as simply verbs (for example, the LOB Corpus and the Penn [[Treebank]]). Because these particular words have more forms than other English verbs, and occur in quite different grammatical contexts, treating them merely as "verbs" means that a POS tagger has much less information to go on. For example, an HMM-based tagger would combine several rows and columns{{huh|date=April 2019}} that would otherwise be not only distinct but quite different. A more complex algorithm could also consider the particular word in each case; but with distinct tags, the HMM itself can often predict the correct finer-grained tag even for novel spelling variants, and thus provide better help to later processing.

A different issue is that some cases are ambiguous. [[Beatrice Santorini]] gives examples in "Part-of-speech Tagging Guidelines for the Penn Treebank Project", (3rd rev, June 1990 [ftp://ftp.cis.upenn.edu/pub/treebank/doc/tagguide.ps.gz]), including the following (p.&nbsp;32) case in which ''entertaining'' can be either an adjective or a verb, and there is no syntactic way to decide:

The Duchess was ''entertaining'' last night.


==See also==
==See also==
Line 102: Line 75:
==References==
==References==
<references />
<references />

===Works cited===
*Charniak, Eugene. 1997. "[http://www.aaai.org/ojs/index.php/aimagazine/article/viewFile/1320/1221 Statistical Techniques for Natural Language Parsing]". ''AI Magazine'' 18(4):33&ndash;44.
*Charniak, Eugene. 1997. "[http://www.aaai.org/ojs/index.php/aimagazine/article/viewFile/1320/1221 Statistical Techniques for Natural Language Parsing]". ''AI Magazine'' 18(4):33&ndash;44.
*Hans van Halteren, Jakub Zavrel, [[Walter Daelemans]]. 2001. Improving Accuracy in NLP Through Combination of Machine Learning Systems. ''Computational Linguistics''. 27(2): 199&ndash;229. [https://web.archive.org/web/20031005072950/http://acl.ldc.upenn.edu/J/J01/J01-2002.pdf PDF]
*Hans van Halteren, Jakub Zavrel, [[Walter Daelemans]]. 2001. Improving Accuracy in NLP Through Combination of Machine Learning Systems. ''Computational Linguistics''. 27(2): 199&ndash;229. [https://web.archive.org/web/20031005072950/http://acl.ldc.upenn.edu/J/J01/J01-2002.pdf PDF]
*DeRose, Steven J. 1990. "Stochastic Methods for Resolution of Grammatical Category Ambiguity in Inflected and Uninflected Languages." Ph.D. Dissertation. Providence, RI: Brown University Department of Cognitive and Linguistic Sciences. Electronic Edition available at [http://www.derose.net/steve/writings/dissertation/Diss.0.html]
*DeRose, Steven J. 1990. "Stochastic Methods for Resolution of Grammatical Category Ambiguity in Inflected and Uninflected Languages." Ph.D. Dissertation. Providence, RI: Brown University Department of Cognitive and Linguistic Sciences. Electronic Edition available at [http://www.derose.net/steve/writings/dissertation/Diss.0.html]
* D.Q. Nguyen, D.Q. Nguyen, D.D. Pham and S.B. Pham (2016). "A Robust Transformation-Based Learning Approach Using Ripple Down Rules for Part-Of-Speech Tagging." ''AI Communications'', vol. 29, no. 3, pages 409-422. [https://arxiv.org/abs/1412.4021 [.pdf]]
* D.Q. Nguyen, D.Q. Nguyen, D.D. Pham and S.B. Pham (2016). "A Robust Transformation-Based Learning Approach Using Ripple Down Rules for Part-Of-Speech Tagging." ''AI Communications'', vol. 29, no. 3, pages 409–422. [https://arxiv.org/abs/1412.4021 &#91;.pdf&#93;]


==External links==
{{toomanylinks|date=January 2016}}
*[[Spark NLP]] offers pre-trained POS taggers for more than 30 languages<ref>{{cite web |last1=Models Hub |url=https://nlp.johnsnowlabs.com/models?tag=pos |accessdate=12 October 2020}}</ref>.
*[https://github.com/datquocnguyen/jPTDP jPTDP] provides pre-trained models for joint POS tagging and dependency parsing for 40+ languages.
*[http://rdrpostagger.sourceforge.net/ RDRPOSTagger] - a robust toolkit for POS and morphological tagging (Python & Java). RDRPOSTagger supports pre-trained POS tagging models for 40+ languages.
*[https://smile-pos.appspot.com SMILE POS tagger] - free online service, includes an HMM-based POS tagger (Java API)
*[http://www-nlp.stanford.edu/links/statnlp.html#Taggers Overview of available taggers]
*[http://faculty.washington.edu/dillon/GramResources/GramResources.html Resources for Studying English Syntax Online]
*[http://ucrel.lancs.ac.uk/claws/ CLAWS]
*[http://www.alias-i.com/lingpipe LingPipe] Commercial Java natural language processing software including trainable part-of-speech taggers with first-best, n-best and per-tag confidence output.
*[http://opennlp.apache.org/index.html Apache OpenNLP] AL 2.0, includes a POS tagger based on maxent and perceptron classifiers
*[http://crftagger.sourceforge.net/ CRFTagger] Conditional Random Fields (CRFs) English POS Tagger
*[http://jtextpro.sourceforge.net/ JTextPro] A Java-based Text Processing Toolkit
*[https://github.com/danieldk/citar Citar] [[LGPL]] C++ [[Hidden Markov Model]] trigram POS tagger, a [[Java (programming language)|Java]] port named [https://github.com/danieldk/jitar Jitar] is also available
*[https://github.com/chaosprophet/Ninja-PoST Ninja-PoST] PHP port of GPoSTTL, based on Eric Brill's rule-based tagger
*[https://web.archive.org/web/20100110213327/http://www.complexityintelligence.com/en/homepage ComplexityIntelligence, LLC] Free and Commercial NLP Web Services for Part Of Speech Tagging (and Named Entity Recognition)
*[http://pastebin.com/UaT0AneH Part-of-Speech tagging based on Soundex features]
*[http://www.markwatson.com/opensource/ FastTag - LGPL Java POS tagger based on Eric Brill's rule-based tagger]
*[http://code.google.com/p/jspos/ jspos - LGPL Javascript port of FastTag]
*[https://pypi.python.org/pypi/topia.termextract/ Topia TermExtractor - Python implementation of the UPenn BioIE parts-of-speech algorithm]
*[http://nlp.stanford.edu/software/tagger.shtml Stanford Log-linear Part-Of-Speech Tagger]
*[http://morphadorner.northwestern.edu/morphadorner/postagger/ Northwestern MorphAdorner POS Tagger]
*[http://www.molinolabs.com/lematizador.html Part of speech tagger for Spanish]
* [http://okchakko.com/pos-tagger-for-french-language/ Part-of-speech tagger for French]
*[http://www.ling.su.se/english/nlp/tools/stagger/stagger-the-stockholm-tagger Stagger – The Stockholm Tagger, for Swedish]
*[http://www.coli.uni-saarland.de/~thorsten/tnt/ TnT -- Statistical Part-of-Speech Tagging, with one German and one English language model]
*[http://www.opentranslation.es/petratag/index.htm petraTAG Part-of-speech tagger] Open-source POS tagger written in Java with special features for tagging translated texts.
*[http://www.basistech.com/text-analytics/rosette/base-linguistics Rosette linguistics platform] Commercial POS tagger, lemmatizer, base noun phrase extractor and other morphological analysis in Java and C++
*[http://spacy.io spaCy] Open-source (MIT) Python NLP library including trainable part-of-speech tagger
{{Natural Language Processing}}

[[Category:Corpus linguistics]]
[[Category:Tasks of natural language processing]]
[[Category:Markov models]]
[[Category:Word-sense disambiguation]]

Latest revision as of 02:30, 11 May 2024

In corpus linguistics, part-of-speech tagging (POS tagging, PoS tagging, or POST), also called grammatical tagging, is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech,[1] based on both its definition and its context. A simplified form of this is commonly taught to school-age children, in the identification of words as nouns, verbs, adjectives, adverbs, etc.

Once performed by hand, POS tagging is now done in the context of computational linguistics, using algorithms that associate discrete terms, as well as hidden parts of speech, with a set of descriptive tags. POS-tagging algorithms fall into two distinct groups: rule-based and stochastic. E. Brill's tagger, one of the first and most widely used English POS taggers, employs rule-based algorithms.

Principle

Part-of-speech tagging is harder than just having a list of words and their parts of speech, because some words can represent more than one part of speech at different times, and because some parts of speech are complex. This is not rare—in natural languages (as opposed to many artificial languages), a large percentage of word-forms are ambiguous. For example, even "dogs", which is usually thought of as just a plural noun, can also be a verb:

The sailor dogs the hatch.

Correct grammatical tagging will reflect that "dogs" is used here as a verb, not as the more common plural noun. Grammatical context is one way to determine this; semantic analysis can also be used to infer from "sailor" and "hatch" that "dogs" 1) belongs to a nautical context and 2) denotes an action applied to the object "hatch" (in this context, "dogs" is a nautical term meaning "fastens (a watertight door) securely").

Tag sets

Schools commonly teach that there are 9 parts of speech in English: noun, verb, article, adjective, preposition, pronoun, adverb, conjunction, and interjection. However, there are clearly many more categories and sub-categories. For nouns, the plural, possessive, and singular forms can be distinguished. In many languages, words are also marked for their "case" (role as subject, object, etc.), grammatical gender, and so on, while verbs are marked for tense, aspect, and other things. In some tagging systems, different inflections of the same root word will get different parts of speech, resulting in a large number of tags. For example, NN is used for singular common nouns, NNS for plural common nouns, and NP for singular proper nouns (see the POS tags used in the Brown Corpus). Other tagging systems use a smaller number of tags and ignore fine differences or model them as features somewhat independent from part-of-speech.[2]

In part-of-speech tagging by computer, it is typical to distinguish from 50 to 150 separate parts of speech for English. Work on stochastic methods for tagging Koine Greek (DeRose 1990) has used over 1,000 parts of speech and found that about as many words were ambiguous in that language as in English. A morphosyntactic descriptor in the case of morphologically rich languages is commonly expressed using very short mnemonics, such as Ncmsan for Category=Noun, Type=common, Gender=masculine, Number=singular, Case=accusative, Animate=no.
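The positional reading of a mnemonic such as Ncmsan can be sketched in a few lines of Python. This is only an illustration: the attribute order and the value codes below are taken from the single example above, the function name decode_msd is hypothetical, and real tag sets define their own positions and codes for each category.

```python
# Attribute names for a noun, in the order used by the example "Ncmsan".
# Assumption: only the codes that appear in that one example are listed.
NOUN_ATTRIBUTES = ["Type", "Gender", "Number", "Case", "Animate"]

CATEGORY_CODES = {"N": "Noun"}
VALUE_CODES = {
    "Type": {"c": "common"},
    "Gender": {"m": "masculine"},
    "Number": {"s": "singular"},
    "Case": {"a": "accusative"},
    "Animate": {"n": "no"},
}


def decode_msd(tag):
    """Expand a positional descriptor such as 'Ncmsan' into named features."""
    features = {"Category": CATEGORY_CODES[tag[0]]}
    for name, code in zip(NOUN_ATTRIBUTES, tag[1:]):
        # Codes missing from the table are kept as-is rather than guessed.
        features[name] = VALUE_CODES[name].get(code, code)
    return features


print(decode_msd("Ncmsan"))
```

Keeping each feature in a fixed position is what lets a six-character string stand in for six attribute-value pairs.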

The most popular "tag set" for POS tagging for American English is probably the Penn tag set, developed in the Penn Treebank project. It is largely similar to the earlier Brown Corpus and LOB Corpus tag sets, though much smaller. In Europe, tag sets from the Eagles Guidelines see wide use and include versions for multiple languages.

POS tagging work has been done in a variety of languages, and the set of POS tags used varies greatly with language. Tags usually are designed to include overt morphological distinctions, although this leads to inconsistencies such as case-marking for pronouns but not nouns in English, and much larger cross-language differences. The tag sets for heavily inflected languages such as Greek and Latin can be very large; tagging words in agglutinative languages such as Inuit languages may be virtually impossible. At the other extreme, Petrov et al.[3] have proposed a "universal" tag set, with 12 categories (for example, no subtypes of nouns, verbs, punctuation, and so on). Whether a very small set of very broad tags or a much larger set of more precise ones is preferable depends on the purpose at hand. Automatic tagging is easier on smaller tag sets.

History

The Brown Corpus

Research on part-of-speech tagging has been closely tied to corpus linguistics. The first major corpus of English for computer analysis was the Brown Corpus developed at Brown University by Henry Kučera and W. Nelson Francis, in the mid-1960s. It consists of about 1,000,000 words of running English prose text, made up of 500 samples from randomly chosen publications. Each sample is 2,000 or more words (ending at the first sentence-end after 2,000 words, so that the corpus contains only complete sentences).

The Brown Corpus was painstakingly "tagged" with part-of-speech markers over many years. A first approximation was done with a program by Greene and Rubin, which consisted of a huge handmade list of what categories could co-occur at all. For example, article then noun can occur, but article then verb (arguably) cannot. The program got about 70% correct. Its results were repeatedly reviewed and corrected by hand, and later users sent in errata so that by the late 1970s the tagging was nearly perfect (allowing for some cases on which even human speakers might not agree).

This corpus has been used for innumerable studies of word frequency and of part of speech, and it inspired the development of similar "tagged" corpora in many other languages. Statistics derived by analyzing it formed the basis for most later part-of-speech tagging systems, such as CLAWS and VOLSUNGA. However, as of 2005 it has been superseded by larger corpora such as the 100-million-word British National Corpus, even though larger corpora are rarely so thoroughly curated.

For some time, part-of-speech tagging was considered an inseparable part of natural language processing, because there are certain cases where the correct part of speech cannot be decided without understanding the semantics or even the pragmatics of the context. Such understanding is extremely expensive to compute, especially because analyzing the higher levels is much harder when multiple part-of-speech possibilities must be considered for each word.

Use of hidden Markov models

In the mid-1980s, researchers in Europe began to use hidden Markov models (HMMs) to disambiguate parts of speech, when working to tag the Lancaster-Oslo-Bergen Corpus of British English. HMMs involve counting cases (such as from the Brown Corpus) and making a table of the probabilities of certain sequences. For example, once you've seen an article such as 'the', perhaps the next word is a noun 40% of the time, an adjective 40%, and a number 20%. Knowing this, a program can decide that "can" in "the can" is far more likely to be a noun than a verb or a modal. The same method can, of course, be used to benefit from knowledge about the following words.
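The "the can" decision can be sketched as a lookup in a small probability table. The transition figures below are the illustrative ones from the paragraph above; the emission values for "can" and the function name best_tag are hypothetical and not taken from any corpus. Treating unseen tag pairs as probability zero is a simplification that real taggers usually avoid by smoothing.

```python
# P(next tag | previous tag), using the illustrative figures from the text.
TRANSITIONS = {
    "ARTICLE": {"NOUN": 0.4, "ADJECTIVE": 0.4, "NUMBER": 0.2},
}

# Hypothetical P(word | tag) for the ambiguous word "can"; not measured data.
EMISSIONS = {
    "can": {"MODAL": 0.3, "NOUN": 0.001, "VERB": 0.0005},
}


def best_tag(word, previous_tag):
    """Score each candidate tag as transition probability times emission."""
    scores = {}
    for tag, emission in EMISSIONS[word].items():
        # Tag pairs missing from the table are treated as impossible here.
        transition = TRANSITIONS[previous_tag].get(tag, 0.0)
        scores[tag] = transition * emission
    return max(scores, key=scores.get), scores


tag, scores = best_tag("can", "ARTICLE")
print(tag, scores)
```

Even though "can" on its own is far more often a modal in this toy table, the transition from an article rules that reading out, which is the effect the paragraph describes.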

More advanced ("higher-order") HMMs learn the probabilities not only of pairs but triples or even larger sequences. So, for example, if you've just seen a noun followed by a verb, the next item may be very likely a preposition, article, or noun, but much less likely another verb.

When several ambiguous words occur together, the possibilities multiply. However, it is easy to enumerate every combination and to assign a relative probability to each one, by multiplying together the probabilities of each choice in turn. The combination with the highest probability is then chosen. The European group developed CLAWS, a tagging program that did exactly this and achieved accuracy in the 93–95% range.
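A minimal sketch of this enumerate-and-multiply approach follows. It is not the CLAWS implementation: the three-word sentence, the candidate tags, and every probability in the tables are hypothetical values chosen only to make the arithmetic visible.

```python
from itertools import product

# Hypothetical candidate tags per word.
CANDIDATES = {
    "the": ["ART"],
    "can": ["NOUN", "MODAL", "VERB"],
    "rusts": ["VERB", "NOUN"],
}

# Hypothetical P(tag | previous tag); "<s>" marks the start of the sentence.
TRANSITION = {
    ("<s>", "ART"): 0.6,
    ("ART", "NOUN"): 0.4,
    ("ART", "VERB"): 0.01,
    ("NOUN", "VERB"): 0.5,
    ("NOUN", "NOUN"): 0.2,
    ("MODAL", "VERB"): 0.8,
    ("MODAL", "NOUN"): 0.05,
    ("VERB", "NOUN"): 0.3,
    ("VERB", "VERB"): 0.05,
}

# Hypothetical P(word | tag).
LEXICAL = {
    ("the", "ART"): 0.7,
    ("can", "NOUN"): 0.001,
    ("can", "MODAL"): 0.3,
    ("can", "VERB"): 0.0005,
    ("rusts", "VERB"): 0.0002,
    ("rusts", "NOUN"): 0.0001,
}


def enumerate_taggings(words):
    """Score every combination of candidate tags and sort best-first."""
    results = []
    for tags in product(*(CANDIDATES[word] for word in words)):
        probability = 1.0
        previous = "<s>"
        for word, tag in zip(words, tags):
            probability *= TRANSITION.get((previous, tag), 0.0)
            probability *= LEXICAL.get((word, tag), 0.0)
            previous = tag
        results.append((probability, tags))
    return sorted(results, reverse=True)


ranked = enumerate_taggings(["the", "can", "rusts"])
print(len(ranked), "combinations; best:", ranked[0][1])
```

The number of combinations is the product of the candidate counts (here 1 × 3 × 2 = 6), so the cost grows exponentially with the length of an ambiguous run.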

Eugene Charniak points out in Statistical techniques for natural language parsing (1997)[4] that merely assigning the most common tag to each known word and the tag "proper noun" to all unknowns will approach 90% accuracy because many words are unambiguous, and many others only rarely represent their less-common parts of speech.
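The baseline Charniak describes is simple enough to sketch directly. The tiny training list below is hypothetical, the function names are illustrative, and the tag NP for unknown words follows the Brown Corpus convention for singular proper nouns mentioned earlier.

```python
from collections import Counter, defaultdict


def train_baseline(tagged_words):
    """Map each known word to the tag it carries most often in training."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        counts[word.lower()][tag] += 1
    return {
        word: tag_counts.most_common(1)[0][0]
        for word, tag_counts in counts.items()
    }


def tag_baseline(words, most_common_tag, unknown_tag="NP"):
    """Tag known words with their commonest tag, unknown words as proper nouns."""
    return [(word, most_common_tag.get(word.lower(), unknown_tag)) for word in words]


# Hypothetical training data: "dogs" is seen twice as a noun, once as a verb.
training = [
    ("the", "AT"),
    ("dogs", "NNS"),
    ("dogs", "NNS"),
    ("dogs", "VBZ"),
    ("bark", "VB"),
]
model = train_baseline(training)
print(tag_baseline(["The", "dogs", "Fido"], model))
```

The sketch ignores context entirely, so it would mis-tag "dogs" in "The sailor dogs the hatch"; that kind of case makes up the remaining errors that contextual taggers try to recover.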

CLAWS pioneered the field of HMM-based part-of-speech tagging but was quite expensive since it enumerated all possibilities. It sometimes had to resort to backup methods when there were simply too many options (the Brown Corpus contains a case with 17 ambiguous words in a row, and there are words such as "still" that can represent as many as 7 distinct parts of speech).[5]

HMMs underlie the functioning of stochastic taggers and are used in various algorithms, one of the most widely used being the bi-directional inference algorithm.[6]

Dynamic programming methods

In 1987, Steven DeRose[7] and Kenneth W. Church[8] independently developed dynamic programming algorithms to solve the same problem in vastly less time. Their methods were similar to the Viterbi algorithm known for some time in other fields. DeRose used a table of pairs, while Church used a table of triples and a method of estimating the values for triples that were rare or nonexistent in the Brown Corpus (an actual measurement of triple probabilities would require a much larger corpus). Both methods achieved an accuracy of over 95%. DeRose's 1990 dissertation at Brown University included analyses of the specific error types, probabilities, and other related data, and replicated his work for Greek, where it proved similarly effective.
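The dynamic programming idea can be sketched as a Viterbi-style search over a table of tag pairs. This is a generic illustration, not a reconstruction of DeRose's or Church's programs: the function name, the sentence, and all probabilities are hypothetical.

```python
def viterbi(words, candidates, transition, lexical, start="<s>"):
    """Return (probability, tags) for the most probable tag sequence."""
    # best[tag] holds the best-scoring path that ends in that tag so far.
    best = {start: (1.0, [])}
    for word in words:
        new_best = {}
        for tag in candidates[word]:
            emission = lexical.get((word, tag), 0.0)
            # Keep only the single best way of arriving at this tag.
            probability, path = max(
                (
                    (p * transition.get((previous, tag), 0.0) * emission, previous_path)
                    for previous, (p, previous_path) in best.items()
                ),
                key=lambda item: item[0],
            )
            new_best[tag] = (probability, path + [tag])
        best = new_best
    return max(best.values(), key=lambda item: item[0])


# Hypothetical model values, for illustration only.
candidates = {
    "the": ["ART"],
    "can": ["NOUN", "MODAL", "VERB"],
    "rusts": ["VERB", "NOUN"],
}
transition = {
    ("<s>", "ART"): 0.6,
    ("ART", "NOUN"): 0.4,
    ("ART", "VERB"): 0.01,
    ("NOUN", "VERB"): 0.5,
    ("NOUN", "NOUN"): 0.2,
    ("MODAL", "VERB"): 0.8,
    ("VERB", "NOUN"): 0.3,
    ("VERB", "VERB"): 0.05,
}
lexical = {
    ("the", "ART"): 0.7,
    ("can", "NOUN"): 0.001,
    ("can", "MODAL"): 0.3,
    ("can", "VERB"): 0.0005,
    ("rusts", "VERB"): 0.0002,
    ("rusts", "NOUN"): 0.0001,
}

probability, tags = viterbi(["the", "can", "rusts"], candidates, transition, lexical)
print(tags)
```

Because only one best path per tag is kept at each word, the work grows with the sentence length times the square of the number of candidate tags, rather than with the number of complete tag combinations.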

These findings were surprisingly disruptive to the field of natural language processing. The accuracy reported was higher than the typical accuracy of very sophisticated algorithms that integrated part-of-speech choice with many higher levels of linguistic analysis: syntax, morphology, semantics, and so on. The methods of CLAWS, DeRose, and Church did fail for some of the known cases where semantics is required, but those proved negligibly rare. This convinced many in the field that part-of-speech tagging could usefully be separated from the other levels of processing; this, in turn, simplified the theory and practice of computerized language analysis and encouraged researchers to find ways to separate other pieces as well. Markov models became the standard method for part-of-speech assignment.

Unsupervised taggers

The methods already discussed involve working from a pre-existing corpus to learn tag probabilities. It is, however, also possible to bootstrap using "unsupervised" tagging. Unsupervised tagging techniques use an untagged corpus for their training data and produce the tagset by induction. That is, they observe patterns in word use, and derive part-of-speech categories themselves. For example, statistics readily reveal that "the", "a", and "an" occur in similar contexts, while "eat" occurs in very different ones. With sufficient iteration, similarity classes of words emerge that are remarkably similar to those human linguists would expect; and the differences themselves sometimes suggest valuable new insights.
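The first step of such induction, measuring how similar the contexts of two words are, can be sketched as follows. The four-sentence corpus is hypothetical and far too small for real use, the function names are illustrative, and a complete system would go on to cluster words by these similarities, which this sketch does not do.

```python
from collections import Counter, defaultdict
from math import sqrt


def context_vectors(sentences):
    """Count the immediate left and right neighbours of every word."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        padded = ["<s>"] + sentence + ["</s>"]
        for i in range(1, len(padded) - 1):
            vectors[padded[i]]["L:" + padded[i - 1]] += 1
            vectors[padded[i]]["R:" + padded[i + 1]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Hypothetical untagged corpus.
corpus = [
    "the dog will eat the bone".split(),
    "a dog will eat a bone".split(),
    "an owl will eat the mouse".split(),
    "the owl saw a mouse".split(),
]
vectors = context_vectors(corpus)
print("the ~ a:  ", round(cosine(vectors["the"], vectors["a"]), 3))
print("the ~ eat:", round(cosine(vectors["the"], vectors["eat"]), 3))
```

In this toy corpus "the" and "a" share most of their neighbours while "the" and "eat" share none, which is the kind of signal from which similarity classes can be built.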

Supervised and unsupervised taggers alike can be further subdivided into rule-based, stochastic, and neural approaches.

Other taggers and methods

Some current major algorithms for part-of-speech tagging include the Viterbi algorithm, Brill tagger, Constraint Grammar, and the Baum-Welch algorithm (also known as the forward-backward algorithm). Hidden Markov model and visible Markov model taggers can both be implemented using the Viterbi algorithm. The rule-based Brill tagger is unusual in that it learns a set of rule patterns, and then applies those patterns rather than optimizing a statistical quantity.
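Applying learned rule patterns can be sketched as below. Only the application phase is shown; the rule-learning phase is omitted. The Rule shape (change one tag to another when the previous tag matches), the single hand-written rule, and the initial tags are all hypothetical simplifications rather than the Brill tagger's actual rule templates.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    from_tag: str  # tag to be replaced
    to_tag: str    # replacement tag
    prev_tag: str  # apply only when the preceding word carries this tag


def apply_rules(tags, rules):
    """Apply each rule in order across the whole tag sequence."""
    tags = list(tags)
    for rule in rules:
        for i in range(1, len(tags)):
            if tags[i] == rule.from_tag and tags[i - 1] == rule.prev_tag:
                tags[i] = rule.to_tag
    return tags


# Hypothetical initial tagging: each word gets its most frequent tag.
words = ["the", "can", "rusted"]
initial_tags = ["ART", "MODAL", "VERB"]

# Hypothetical rule: a "modal" right after an article is really a noun.
rules = [Rule(from_tag="MODAL", to_tag="NOUN", prev_tag="ART")]

print(list(zip(words, apply_rules(initial_tags, rules))))
```

The tagger's output is thus the result of a fixed sequence of symbolic corrections to an initial guess, rather than the maximum of a probability score.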

Many machine learning methods have also been applied to the problem of POS tagging. Methods such as SVM, maximum entropy classifier, perceptron, and nearest-neighbor have all been tried, and most can achieve accuracy above 95%.[citation needed]

A direct comparison of several methods is reported (with references) at the ACL Wiki.[9] This comparison uses the Penn tag set on some of the Penn Treebank data, so the results are directly comparable. However, many significant taggers are not included (perhaps because of the labor involved in reconfiguring them for this particular dataset). Thus, it should not be assumed that the results reported here are the best that can be achieved with a given approach; nor even the best that have been achieved with a given approach.

In 2014, a paper reported using the structure regularization method for part-of-speech tagging, achieving 97.36% accuracy on a standard benchmark dataset.[10]

See also

References

  1. ^ "POS tags". Sketch Engine. Lexical Computing. 2018-03-27. Retrieved 2018-04-06.
  2. ^ Universal POS tags
  3. ^ Petrov, Slav; Das, Dipanjan; McDonald, Ryan (11 Apr 2011). "A Universal Part-of-Speech Tagset". arXiv:1104.2086 [cs.CL].
  4. ^ Eugene Charniak
  5. ^ DeRose 1990, p. 82.
  6. ^ CLL POS-tagger
  7. ^ DeRose, Steven J. (1988). "Grammatical category disambiguation by statistical optimization". Computational Linguistics. 14 (1): 31–39.
  8. ^ Kenneth Ward Church (1988). "A stochastic parts program and noun phrase parser for unrestricted text". In Norm Sondheimer (ed.). ANLC '88: Proceedings of the Second Conference on Applied Natural Language Processing. Association for Computational Linguistics. p. 136. doi:10.3115/974235.974260.
  9. ^ POS Tagging (State of the art)
  10. ^ Xu Sun (2014). Structure Regularization for Structured Prediction (PDF). Neural Information Processing Systems (NIPS). pp. 2402–2410. Retrieved 2021-08-20.

Works cited

  • Charniak, Eugene. 1997. "Statistical Techniques for Natural Language Parsing". AI Magazine 18(4):33–44.
  • Hans van Halteren, Jakub Zavrel, Walter Daelemans. 2001. Improving Accuracy in NLP Through Combination of Machine Learning Systems. Computational Linguistics. 27(2): 199–229.
  • DeRose, Steven J. 1990. "Stochastic Methods for Resolution of Grammatical Category Ambiguity in Inflected and Uninflected Languages." Ph.D. Dissertation. Providence, RI: Brown University Department of Cognitive and Linguistic Sciences. Electronic Edition available at [1]
  • D.Q. Nguyen, D.Q. Nguyen, D.D. Pham and S.B. Pham (2016). "A Robust Transformation-Based Learning Approach Using Ripple Down Rules for Part-Of-Speech Tagging." AI Communications, vol. 29, no. 3, pages 409–422.