UNIT 5 APPLICATIONS
AI Applications – Language Models – Information Retrieval – Information Extraction – Natural Language Processing – Machine Translation – Speech Recognition.
5.1 LANGUAGE MODELS
Language can be defined as a set of strings; “print(2+2)” is a legal program in the language Python, whereas “2) + (2 print” is not. Since there are an infinite number of legal programs, they cannot be enumerated; instead they are specified by a set of rules called a grammar. Formal languages also have rules that define the meaning, or semantics, of a program; for example, the rules say that the “meaning” of “2 + 2” is 4, and the meaning of “1/0” is that an error is signaled.
1. Natural languages, such as English or Spanish, cannot be characterized as a definite set of sentences. For example, everyone agrees that “Not to be invited is sad” is a sentence of English, but people disagree on the grammaticality of “To be not invited is sad”. Therefore, it is more fruitful to define a natural language model as a probability distribution over sentences rather than a definitive set. That is, rather than asking if a string of words is or is not a member of the set defining the language, we instead ask for P(S = words) – the probability that a random sentence would be words. Natural languages are also ambiguous. “He saw her duck” can mean either that he saw a waterfowl belonging to her, or that he saw her move to evade something. Thus, again, we cannot speak of a single meaning for a sentence, but rather of a probability distribution over possible meanings.
2. Finally, natural languages are difficult to deal with because they are very large and constantly changing. Thus, our language models are, at best, an approximation. We start with the simplest possible approximation and move up from there.
5.2 N-GRAM CHARACTER MODELS
1. An n-gram character model is defined as a Markov chain of order n − 1. In a Markov chain the probability of character ci depends only on the immediately preceding characters, not on any other characters. So in a trigram model (Markov chain of order 2) we have

P(ci | c1:i−1) = P(ci | ci−2:i−1)

We can define the probability of a sequence of characters P(c1:N) under the trigram model by first factoring with the chain rule and then using the Markov assumption:

P(c1:N) = ∏i=1..N P(ci | c1:i−1) = ∏i=1..N P(ci | ci−2:i−1)
For a trigram character model in a language with 100 characters, P(ci | ci−2:i−1) has a million entries, and can be accurately estimated by counting character sequences in a body of text of 10 million characters or more. We call a body of text a corpus (plural corpora), from the Latin word for body.
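The counting procedure can be sketched in a few lines of Python. This is only an illustrative maximum-likelihood estimate (no smoothing); the function and variable names are our own, not from any particular library.

```python
# Minimal sketch of estimating a trigram character model P(c_i | c_{i-2:i-1})
# by counting, with plain maximum likelihood and no smoothing.
from collections import Counter, defaultdict

def train_char_trigrams(corpus: str):
    """Return counts of two-character contexts and of their continuations."""
    context_counts = Counter()
    trigram_counts = defaultdict(Counter)
    for i in range(2, len(corpus)):
        context = corpus[i-2:i]
        context_counts[context] += 1
        trigram_counts[context][corpus[i]] += 1
    return context_counts, trigram_counts

def p_char(c, context, context_counts, trigram_counts):
    """Maximum-likelihood estimate of P(c | context); 0.0 if context unseen."""
    if context_counts[context] == 0:
        return 0.0
    return trigram_counts[context][c] / context_counts[context]

context_counts, trigram_counts = train_char_trigrams("the theory of the thing")
print(p_char("e", "th", context_counts, trigram_counts))
```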
2. Language identification: given a text, determine what natural language it is written in. This is a relatively easy task; even with short texts such as “Hello, world” or “Wie geht es dir”, it is easy to identify the first as English and the second as German. Computer systems identify languages with greater than 99% accuracy; occasionally, closely related languages, such as Swedish and Norwegian, are confused.
3. One approach to language identification is to first build a trigram character model of each candidate language, P(ci | ci−2:i−1, ℓ), where the variable ℓ ranges over languages. For each ℓ the model is built by counting trigrams in a corpus of that language. (About 100,000 characters of each language are needed.) That gives us a model of P(Text | Language), but we want to select the most probable language given the text, so we apply Bayes’ rule followed by the Markov assumption to get the most probable language:
ℓ* = argmaxℓ P(ℓ | c1:N)
   = argmaxℓ P(ℓ) P(c1:N | ℓ)
   = argmaxℓ P(ℓ) ∏i=1..N P(ci | ci−2:i−1, ℓ)
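A hedged sketch of this decision rule is shown below: pick the language ℓ that maximizes log P(ℓ) plus the summed log trigram probabilities. It assumes per-language models built with the train_char_trigrams() function from the earlier sketch, and uses add-one smoothing so that unseen trigrams do not give zero probability.

```python
# Language identification via Bayes' rule over per-language trigram models.
import math

def log_prob(text, context_counts, trigram_counts, vocab_size=100):
    """Log-likelihood of text under one language's trigram model (add-one smoothed)."""
    lp = 0.0
    for i in range(2, len(text)):
        context, c = text[i-2:i], text[i]
        num = trigram_counts[context][c] + 1
        den = context_counts[context] + vocab_size
        lp += math.log(num / den)
    return lp

def identify(text, models, priors):
    """models: {lang: (context_counts, trigram_counts)}; priors: {lang: P(lang)}."""
    return max(models,
               key=lambda lang: math.log(priors[lang]) + log_prob(text, *models[lang]))
```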
4. Other tasks for character models include spelling correction, genre classification, and named-entity recognition. Genre classification means deciding if a text is a news story, a legal document, a scientific article, etc. While many features help make this classification, counts of punctuation and other character n-gram features go a long way (Kessler et al., 1997).
5. Named-entity recognition is the task of finding names of things in a document and deciding what class they belong to. For example, in the text “Mr. Sopersteen was prescribed aciphex”, we should recognize that “Mr. Sopersteen” is the name of a person and “aciphex” is the name of a drug. Character-level models are good for this task because they can associate the character sequence “ex” (“ex” followed by a space) with a drug name and “steen” with a person name, and thereby identify words that they have never seen before.
5.3 SMOOTHING N-GRAM MODELS
1. The major complication of n-gram models is that the training corpus provides only an estimate of the true probability distribution.
2. For common character sequences such as “th” any English corpus will give a good estimate: about 1.5% of all trigrams.
3. On the other hand, “ht” is very uncommon – no dictionary words start with ht. It is likely that the sequence would have a count of zero in a training corpus of standard English.
4. Does that mean we should assign P(“ht”) = 0? If we did, then the text “The program issues an http request” would have an English probability of zero, which seems wrong.
5. The process of adjusting the probability of low-frequency counts is called smoothing.
6. A better approach is a backoff model, in which we start by estimating n-gram counts, but for any particular sequence that has a low (or zero) count, we back off to (n−1)-grams. Linear interpolation smoothing is a backoff model that combines trigram, bigram, and unigram models by linear interpolation. It defines the probability estimate as

P*(ci | ci−2:i−1) = λ3 P(ci | ci−2:i−1) + λ2 P(ci | ci−1) + λ1 P(ci), where λ3 + λ2 + λ1 = 1.

It is also possible to have the values of λi depend on the counts: if we have a high count of trigrams, then we weigh them relatively more; if only a low count, then we put more weight on the bigram and unigram models.
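A minimal sketch of this interpolation is given below. The unigram, bigram, and trigram probability functions and the λ weights are assumed to be supplied by the caller; the default weights shown are illustrative, not prescribed by the text.

```python
# Linear interpolation smoothing for a character model, following the formula above.
def interpolated_p(c, c1, c2, p_uni, p_bi, p_tri, lambdas=(0.1, 0.3, 0.6)):
    """P*(c | c1 c2) = lambda1*P(c) + lambda2*P(c | c2) + lambda3*P(c | c1 c2)."""
    l1, l2, l3 = lambdas
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9      # weights must sum to 1
    return l1 * p_uni(c) + l2 * p_bi(c, c2) + l3 * p_tri(c, c1, c2)
```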
5.4 MODEL EVALUATION
1. The evaluation can be a task-specific metric, such as measuring accuracy on language identification.
2. Alternatively, we can have a task-independent measure of language model quality: calculate the probability assigned to the validation corpus by the model; the higher the probability, the better. This metric is inconvenient because the probability of a large corpus will be a very small number, and floating-point underflow becomes an issue. A different way of describing the probability of a sequence is with a measure called perplexity, defined as

Perplexity(c1:N) = P(c1:N)^(−1/N)
3. Perplexity can be thought of as the reciprocal of probability, normalized by sequence length.
4. It can also be thought of as the weighted average branching factor of a model. Suppose there are 100 characters in our language, and our model says they are all equally likely. Then for a sequence of any length, the perplexity will be 100. If some characters are more likely than others, and the model reflects that, then the model will have a perplexity less than 100.
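The sketch below computes perplexity in log space to avoid the floating-point underflow mentioned above. The char_log_prob argument is a placeholder for log P(ci | ci−2:i−1) from whatever model is being evaluated.

```python
# Perplexity of a character sequence, computed in log space.
import math

def perplexity(text, char_log_prob):
    n = len(text)
    total_log_p = sum(char_log_prob(i, text) for i in range(n))
    return math.exp(-total_log_p / n)          # P(c_1:N)^(-1/N)

# A uniform model over 100 characters gives perplexity 100, as in the text:
print(perplexity("x" * 50, lambda i, t: math.log(1 / 100)))   # -> 100.0
```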
5.5 N-GRAM WORD MODEL
1. n-gram models can be defined over words rather than characters.
2. All the same mechanisms apply equally to word and character models.
3. The main difference is that the vocabulary – the set of symbols that make up the corpus and the model – is larger.
4.           
There are only about 100 characters in most languages,
and sometimes we build character models that are even more restrictive, for example by treating “A” and “a” as the same symbol or by treating all
punctuation as the same symbol. But with word
models we have at least tens of thousands of symbols, and sometimes
millions. The wide range is
because it is not clear what constitutes a word.
5. Word n-gram models need to deal with out-of-vocabulary words.
6.             
With word models there is always the chance of a new
word that was not seen in the training corpus, so we need to model that
explicitly in our language model.
7.             
This can be done by adding just one new word to the vocabulary: <UNK>, standing for the unknown word.
8.             
Sometimes multiple unknown-word symbols are used, for different
classes. For example, any string of digits might be
replaced with <NUM>, or any email address with <EMAIL>.
5.6 INFORMATION RETRIEVAL
DEFINITION: Information retrieval is the task of finding documents that are relevant to a user’s need for information.
The best-known examples of information retrieval systems are search engines on the World Wide Web. A Web user can type a query such as [AI book] into a search engine and see a list of relevant pages. In this section, we will see how such systems are built. An information retrieval (henceforth IR) system can be characterized by:
1.             
A corpus of documents. Each system must decide what it wants to treat
as a document: a paragraph, a page, or a multipage text.
2.             
Queries posed in a query language. A query specifies
what the user wants to know. The query language
can be just a list of words,
such as [AI book]; or
it can specify a phrase of words that must be adjacent, as in [“AI book”]; it can contain Boolean operators
as
in [AI AND book]; it can include non-Boolean operators such as [AI NEAR book] or [AI book site:www.aaai.org].
3.             
A result set. This is the subset
of documents that the IR system judges
to be relevant to the query. By relevant, we mean likely to
be of use to the person who posed the query,
for the particular
information need expressed in the query.
4. A presentation of the result set. This can be as simple as a ranked list of document titles or as complex as a rotating color map of the result set projected onto a three-dimensional space, rendered as a two-dimensional display.
The earliest IR systems worked on a Boolean keyword model. Each word in the document collection is treated as a Boolean feature that is true of a document if the word occurs in the document and false if it does not.
Advantage
1. Simple to explain and implement.
Disadvantages
1.             
The degree of relevance of a document
is a single bit, so there is no guidance
as to how to order the relevant documents for presentation.
2.             
Boolean expressions are unfamiliar to users who are
not programmers or logicians. Users
find it unintuitive that when they want to know about farming in the states of Kansas
and Nebraska they need
to issue the query [farming (Kansas OR Nebraska)].
3.             
It can be hard to formulate an appropriate query, even
for a skilled user. Suppose we try [information
AND retrieval AND models AND optimization] and get an empty result set. We could try [information OR
retrieval OR models OR optimization], but if
that returns too many results, it is difficult to know what to try next.
5.7 IR SCORING FUNCTIONS
1.             
Most IR systems have abandoned the Boolean model and
use models based on the statistics of word counts.
2.             
A scoring function
takes a document
and a query and returns
a numeric score;
the most relevant documents have the
highest scores
3.             
In the BM25 scoring function, the score is a linear
weighted combination of scores for each of the
words that make up the query
4.             
Three factors affect
the weight of a query
term:
·               
First, the frequency with which a query
term appears in a document (also known as TF for term frequency). For the query [farming in Kansas], documents that
mention “farming” frequently will
have higher scores.
·               
Second, the inverse document frequency of the term, or IDF. The word “in” appears in almost every document, so it has a high document frequency, and thus a low inverse document frequency, and thus it is not as important to the query as “farming” or “Kansas.”
·               
Third, the length
of the document. A million-word document will probably
mention all the query words, but may not actually be
about the query. A short document that mentions all the words is a much better candidate.
5. The BM25 function takes all three of these into account:

BM25(dj, q1:N) = Σi=1..N IDF(qi) · TF(qi, dj) · (k + 1) / (TF(qi, dj) + k · (1 − b + b · |dj| / L))

where |dj| is the length of document dj in words, and L is the average document length in the corpus, L = Σi |di| / N. We have two parameters, k and b, that can be tuned by cross-validation; typical values are k = 2.0 and b = 0.75. IDF(qi) is the inverse document frequency of word qi.
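A hedged sketch of this score is shown below, using the typical k = 2.0 and b = 0.75. Here a document is a list of word tokens, df[q] is the number of documents containing q, and N is the corpus size; the particular IDF variant used (log((N − df + 0.5)/(df + 0.5))) is one common choice, not necessarily the exact one intended above.

```python
# BM25 score of one document for a multi-word query.
import math

def bm25(doc, query_terms, df, N, avg_len, k=2.0, b=0.75):
    score = 0.0
    dl = len(doc)                                 # |d_j| in words
    for q in query_terms:
        tf = doc.count(q)                         # TF(q_i, d_j)
        idf = math.log((N - df.get(q, 0) + 0.5) / (df.get(q, 0) + 0.5))
        score += idf * tf * (k + 1) / (tf + k * (1 - b + b * dl / avg_len))
    return score
```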
6. Systems create an index ahead of time that lists, for each vocabulary word, the documents that contain the word. This is called the hit list for the word. Then when given a query, we intersect the hit lists of the query words and only score the documents in the intersection.
5.8 IR SYSTEM EVALUATION
Imagine that an IR system has returned a result set for a single query, for which we know which documents are and are not relevant, out of a corpus of 100 documents. The document counts in each category are given in the following table.
Table 5.1
|  | In result set | Not in result set | 
| Relevant | 30 | 20 | 
| Not relevant | 10 | 40 | 
1. Precision measures the proportion of documents in the result set that are actually relevant. In our example, the precision is 30/(30 + 10) = 0.75. The false positive rate is 1 − 0.75 = 0.25.
2. Recall measures the proportion of all the relevant documents in the collection that are in the result set. In our example, recall is 30/(30 + 20) = 0.60. The false negative rate is 1 − 0.60 = 0.40.
3. In a very large document collection, such as the World Wide Web, recall is difficult to compute, because there is no easy way to examine every page on the Web for relevance.
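The arithmetic above can be checked directly from the counts in Table 5.1 (30 relevant documents retrieved, 10 irrelevant documents retrieved, 20 relevant documents missed):

```python
# Precision, recall, and the complementary rates quoted in the text.
relevant_retrieved, irrelevant_retrieved, relevant_missed = 30, 10, 20

precision = relevant_retrieved / (relevant_retrieved + irrelevant_retrieved)   # 0.75
recall = relevant_retrieved / (relevant_retrieved + relevant_missed)           # 0.60
print(precision, recall, 1 - precision, 1 - recall)
```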
5.9 IR REFINEMENTS
1.          
One common refinement is a better model of the effect of document
length on relevance.
2.          
The BM25 scoring function uses a word model that
treats all words as completely independent, but we know that some words are correlated: “couch”
is closely related
to both “couches”
and “sofa.” Many IR systems
attempt to account
for these correlations.
3.          
For example, if the query is [couch],
it would be a
shame to exclude from the result set those
documents that mention “COUCH” or “couches” but not “couch.” Most IR systems
do case folding of “COUCH” to “couch,” and some use a stemming
algorithm to reduce “couches” to the stem form “couch,”
both in the query and the documents.
4.          
The next step is to recognize synonyms,
such as “sofa”
for “couch.” As with stemming, this has the potential for small gains in recall, but can hurt
precision. Synonyms and related words
can be found in dictionaries or by looking for correlations in documents or in
queries.
5.          
As a final refinement, IR can be improved by
considering metadata—data outside of the
text of the document. Examples include human-supplied keywords and publication data. On the Web, hypertext
links between documents
are a crucial source of information.
5.10 PAGERANK ALGORITHM
PageRank (PR) is an algorithm used by Google Search to rank web pages in their search engine results. PageRank was named after Larry Page, one of the founders of Google. PageRank is a way of measuring the importance of website pages. According to Google: PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites.
We will see how the recursion bottoms out. The PageRank for a page p is defined as

PR(p) = (1 − d) / N + d Σi PR(ini) / C(ini)

where PR(p) is the PageRank of page p, N is the total number of pages in the corpus, ini are the pages that link in to p, and C(ini) is the count of the total number of out-links on page ini. The constant d is a damping factor. It can be understood through the random surfer model: imagine a Web surfer who starts at some random page and begins exploring.
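An illustrative sketch of iterating this equation until the scores stabilize is shown below. The links dictionary maps each page to the pages it links to; the damping factor d = 0.85 is a conventional choice, not taken from the text.

```python
# Iterative PageRank over a tiny hand-made link graph.
def pagerank(links, d=0.85, iterations=50):
    pages = list(links)
    N = len(pages)
    pr = {p: 1.0 / N for p in pages}
    for _ in range(iterations):
        new_pr = {}
        for p in pages:
            in_links = [q for q in pages if p in links[q]]
            new_pr[p] = (1 - d) / N + d * sum(pr[q] / len(links[q]) for q in in_links)
        pr = new_pr
    return pr

print(pagerank({"A": ["B"], "B": ["A", "C"], "C": ["A"]}))
```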
5.11 THE HITS ALGORITHM
1.             
The Hyperlink-Induced Topic Search algorithm, also known as “Hubs and Authorities” or HITS,
is another influential link-analysis algorithm.
2. HITS differs from PageRank in several ways. First, it is a query-dependent measure: it rates pages with respect to a query.
3. HITS first finds a set of pages that are relevant to the query. It does that by intersecting the hit lists of the query words.
4. It then adds pages in the link neighborhood of these pages – pages that link to or are linked from one of the pages in the original relevant set.
5.             
Each page in this set is considered an authority on
the query to the degree that other pages
in the relevant set point to it. A page is considered a hub to the degree that
it points to other authoritative pages in the relevant set.
6.             
Just as with PageRank, we don’t want to merely
count the number
of links; we want to give more value
to the high-quality hubs and
authorities.
7.             
Thus, as with PageRank, we iterate a process
that updates the authority score
of a page to be the sum
of the hub scores of the pages that point to it, and the hub score to be the sum of
the authority scores of the pages it points to.
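A minimal sketch of this mutual update is given below; it is not the published algorithm's exact formulation, and it renormalizes the scores each round so they do not grow without bound. The links dictionary maps each page in the relevant set to the pages it points to within that set.

```python
# Hub/authority iteration in the style of HITS.
import math

def hits(links, iterations=50):
    pages = list(links)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # authority(p) = sum of hub scores of pages pointing to p
        auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
        # hub(p) = sum of authority scores of pages p points to
        hub = {p: sum(auth[q] for q in links[p]) for p in pages}
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth
```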
8.           
Both PageRank and HITS played important roles in
developing our understanding of Web information retrieval. These algorithms
and their extensions
are used in ranking
billions of queries daily as search engines steadily develop better ways of extracting yet finer signals of search relevance.
5.12 INFORMATION EXTRACTION
Information extraction is the process of acquiring knowledge by skimming a text and looking for occurrences of a particular class of object and for relationships among objects. A typical task is to extract instances of addresses from Web pages, with database fields for street, city, state, and zip code; or instances of storms from weather reports, with fields for temperature, wind speed, and precipitation. In a limited domain, this can be done with high accuracy.
5.13 FINITE-STATE AUTOMATA FOR INFORMATION EXTRACTION
1.             
The simplest type of information extraction system is
an attribute-based extraction system that assumes that the entire
text refers to a single
object and the task is to extract
attributes of that object.
2.             
One step up from attribute-based extraction systems
are relational extraction systems, which deal with multiple objects and the relations among them.
3.             
A typical relational-based extraction system is
FASTUS, which handles news stories about corporate mergers and acquisitions.
4.             
A relational extraction system can be built as a series of cascaded
finite-state transducers.
5.             
That is, the system consists of a series of small,
efficient finite-state automata (FSAs), where each automaton receives
text as input,
transduces the text into a different format,
and passes it along to the next automaton. FASTUS consists of five stages:
(1) tokenization, (2) complex-word handling, (3) basic-group handling, (4) complex-phrase handling, and (5) structure merging.
6.             
FASTUS’s first stage is tokenization, which segments
the stream of characters into tokens
(words, numbers, and punctuation). For English, tokenization can be fairly simple; just separating characters at
white space or punctuation does a fairly good job. Some tokenizers also deal with markup languages such as
HTML, SGML, and XML.
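A minimal tokenizer in the spirit of this first stage (not the actual FASTUS implementation) can be written with a single regular expression that keeps words and numbers together and peels punctuation off as separate tokens:

```python
# Simple whitespace-and-punctuation tokenizer.
import re

def tokenize(text):
    # runs of word characters, or single punctuation characters
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("Bridgestone Sports Co. said Friday it has set up a joint venture."))
```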
7.             
The second stage handles complex words, including
collocations such as “set up” and “joint
venture,” as well as proper names such as “Bridgestone Sports Co.” These are recognized by a combination of lexical
entries and finite- state grammar
rules.
8.             
The third stage handles basic groups, meaning noun
groups and verb groups. The idea is to chunk these into units that will be managed
by the later stages.
9.             
The fourth stage combines the basic groups into
complex phrases. Again, the aim is to have rules
that are finite-
state and thus can be processed quickly,
and that result
in
unambiguous (or nearly unambiguous) output phrases. One type of combination rule deals with domain-specific events.
10.          
The final stage merges structures that were built up
in the previous step. If the next sentence
says “The joint venture will start production in January,” then this step will notice that there are two references to a
joint venture, and that they should be merged
into one. This is an instance of the
identity uncertainty problem.
5.14 PROBABILISTIC MODELS FOR INFORMATION EXTRACTION
1.             
The simplest probabilistic model for sequences
with hidden state is the hidden Markov
model, or HMM.
2.             
HMMs have two big advantages over FSAs for extraction.
·               
First, HMMs are probabilistic, and thus tolerant to
noise. In a regular expression, if a single
expected character is missing, the regex fails to match; with HMMs there is graceful
degradation with missing
characters/words, and we get a probability indicating the degree of match, not just a Boolean match/fail.
·               
Second, HMMs can be trained from data; they don’t
require laborious engineering of templates, and thus
they can more easily be kept up
to date as text changes
over time.
Figure 5.2 Hidden Markov model for the speaker of a talk announcement. The two square states are the target (note the second target state has a self-loop, so the target can match a string of any length), the four circles to the left are the prefix, and the one on the right is the postfix. For each state, only a few of the high-probability words are shown. From Freitag and McCallum (2000).
3.             
Once the HMMs have been learned, we can apply them to
a text, using the Viterbi algorithm
to find the most likely path through
the HMM states. One approach is to apply each attribute HMM separately; in this case you would
expect most of the HMMs to spend most of their time in background states. This is appropriate when the extraction is sparse - when the
number of extracted words is small compared to the length of the text.
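A hedged sketch of the Viterbi algorithm is given below: it finds the most likely state path (for example background/prefix/target/postfix) for a token sequence. The model is passed in as dictionaries of start, transition, and emission probabilities; in a real extraction system these would be learned from labeled data.

```python
# Viterbi decoding for a discrete HMM over tokens.
def viterbi(tokens, states, start_p, trans_p, emit_p):
    V = [{s: start_p[s] * emit_p[s].get(tokens[0], 1e-9) for s in states}]
    back = [{}]
    for t in range(1, len(tokens)):
        V.append({})
        back.append({})
        for s in states:
            prob, prev = max(
                (V[t-1][r] * trans_p[r][s] * emit_p[s].get(tokens[t], 1e-9), r)
                for r in states)
            V[t][s] = prob
            back[t][s] = prev
    # Follow back-pointers from the best final state.
    state = max(V[-1], key=V[-1].get)
    path = [state]
    for t in range(len(tokens) - 1, 0, -1):
        state = back[t][state]
        path.append(state)
    return list(reversed(path))
```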
4.             
The other approach
is to combine all the individual attributes into one big HMM, which
would then find a path that wanders
through different target attributes, first finding a speaker target, then a date target, etc. Separate HMMs are better when we expect just one of each attribute in a text and one big HMM is better
when the texts are more free-form and dense with attributes.
5.             
HMMs have the advantage of supplying probability
numbers that can help make the choice. If some targets
are missing, we need to decide if this is an instance
of the desired relation at
all, or if the targets found are false positives. A machine learning algorithm can be trained to make this choice.
5.15 ONTOLOGY EXTRACTION FROM LARGE CORPORA
1.             
A different application of extraction technology
is building a large knowledge
base or ontology of facts from a corpus. This is different in three ways:
·               
First it is open-ended—we want to acquire
facts about all types of domains, not just one specific domain.
·               
Second, with a large corpus,
this task is dominated by precision, not recall—just as with question answering on the Web
·               
Third, the results can be statistical aggregates
gathered from multiple
sources, rather than
being extracted from one specific text.
2.             
Here is one of the most productive templates: NP such as NP (, NP)* (,)? ((and | or) NP)?.
3.             
Here the bold words and commas must appear literally
in the text, but the parentheses are
for grouping, the asterisk means repetition of zero or more, and the question
mark means optional.
4.             
NP is a variable standing for a noun phrase
5.             
This template matches
the texts “diseases
such as rabies affect your dog” and “supports network protocols such as DNS,”
concluding that rabies is a disease and DNS is a network protocol.
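A rough regex rendering of this template is sketched below, treating a noun phrase as a simple run of words; a real system would use a noun-phrase chunker, so the boundaries found here are crude.

```python
# Crude regex approximation of "NP such as NP (, NP)* (,)? ((and | or) NP)?".
import re

NP = r"[A-Za-z][\w-]*(?: [A-Za-z][\w-]*)*?"
PATTERN = re.compile(
    rf"({NP}) such as ({NP}(?:, {NP})*(?:,)?(?: (?:and|or) {NP})?)")

m = PATTERN.search("supports network protocols such as DNS")
print(m.groups())   # ('supports network protocols', 'DNS') -- crude NP boundaries
```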
6.             
Similar templates can be constructed with the key words “including,” “especially,” and “or other.” Of course these templates
will fail to match many relevant passages, like “Rabies is a disease.” That is intentional.
7.             
The “NP is a NP” template does indeed sometimes denote
a subcategory relation, but it often
means something else, as in “There is a God” or “She is a little tired.” With a large corpus
we can afford to be picky; to
use only the high-precision
templates.
8.             
We’ll miss many statements of a subcategory
relationship, but most likely we’ll find a
paraphrase of the statement somewhere else in the corpus in
a form we can use.
5.16 AUTOMATED TEMPLATE CONSTRUCTION
Clearly these are examples of the author–title relation, but the learning system had no knowledge of authors or titles. The words in these examples were used in a search over a Web corpus, resulting in 199 matches. Each match is defined as a tuple of seven strings, (Author, Title, Order, Prefix, Middle, Postfix, URL), where Order is true if the author came first and false if the title came first, Middle is the characters between the author and title, Prefix is the 10 characters before the match, Postfix is the 10 characters after the match, and URL is the Web address where the match was made.
1.             
Each template has the
same seven components as a match.
2.             
The Author and Title are regexes consisting of any characters (but beginning and ending in letters) and constrained to have a
length from half the minimum length of the examples to twice
the maximum length.
3.             
The prefix, middle,
and postfix are restricted to literal strings,
not regexes.
4.             
The middle is the easiest to learn: each distinct
middle string in the set of matches is a distinct
candidate template. For each such candidate, the template’s Prefix is then defined
as the longest common suffix of all the prefixes
in the matches, and the Postfix is defined
as the longest common prefix
of all the postfixes in the matches.
5.             
If either of these
is of length zero, then the
template is rejected.
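The Prefix/Postfix rule in step 4 can be sketched in a few lines; the example prefix and postfix strings below are hypothetical, purely to show the longest-common-suffix and longest-common-prefix computations.

```python
# Learning a template's Prefix and Postfix from a set of matches.
import os

def longest_common_prefix(strings):
    return os.path.commonprefix(list(strings))

def longest_common_suffix(strings):
    return longest_common_prefix([s[::-1] for s in strings])[::-1]

prefixes = ["reviews of ", "copies of "]          # hypothetical match prefixes
postfixes = [" is on sale", " is out of print"]   # hypothetical match postfixes
print(repr(longest_common_suffix(prefixes)))      # 's of '
print(repr(longest_common_prefix(postfixes)))     # ' is o'
```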
6.             
The URL of the template is defined as the longest
prefix of the URLs in the matches. The
biggest weakness in this approach is the sensitivity to noise. If one of the
first few templates is incorrect, errors
can propagate quickly.
One way to limit this problem is to not accept a new example unless it is
verified by multiple templates, and not accept a new template unless it discovers multiple examples that are
also found by other templates.
5.17 MACHINE READING
1. Unlike a traditional information extraction system that is targeted at a few relations, a machine-reading system is more like a human reader who learns from the text itself; because of this the field has been called machine reading.
2.             
A representative machine-reading system is TEXTRUNNER.
TEXTRUNNER uses cotraining to boost
its performance, but it needs
something to bootstrap from.
3.             
Because TEXTRUNNER is domain-independent, it cannot
rely on predefined lists of nouns and verbs.
Figure 5.3 Eight general templates that cover about 95% of the ways that relations are expressed in English.
TEXTRUNNER achieves a precision of 88% and recall of 45% (F1 of 60%) on a large Web corpus. TEXTRUNNER has extracted hundreds of millions of facts from a corpus of half a billion Web pages.
5.18 NATURAL LANGUAGE PROCESSING
Machine translation, sometimes referred to by the abbreviation MT (not to be confused with computer-aided translation, machine-aided human translation (MAHT) or interactive translation) is a sub-field of computational linguistics that investigates the use of software to translate text or speech from one language to another.
On a basic level, MT performs simple substitution of words in one language for words in another, but that alone usually cannot produce a good translation of a text, because recognition of whole phrases and their closest counterparts in the target language is needed. Solving this problem with corpus-based statistical and neural techniques is a rapidly growing field that is leading to better translations, handling differences in linguistic typology, translation of idioms, and the isolation of anomalies.
MACHINE TRANSLATION SYSTEMS
All translation systems must model the source and target languages, but systems vary in the type of models they use. Some systems attempt to analyze the source-language text all the way into an interlingua knowledge representation and then generate sentences in the target language from that representation. This is difficult because it involves three unsolved problems: creating a complete knowledge representation of everything; parsing into that representation; and generating sentences from that representation. Other systems are based on a transfer model.
Figure 5.4 The Vauquois triangle: schematic diagram of the choices for a machine translation system (Vauquois, 1968). We start with English text at the top. An interlingua-based system follows the solid lines, parsing English first into a syntactic form, then into a semantic representation and an interlingua representation, and then through generation to a semantic, syntactic, and lexical form in French. A transfer-based system uses the dashed lines as a shortcut. Different systems make the transfer at different points; some make it at multiple points.
STATISTICAL MACHINE TRANSLATION
1.             
Find
parallel texts: First, gather a parallel bilingual corpus. For
example, a Hansard is a record of parliamentary debate. Canada,
Hong Kong, and other countries produce bilingual Hansards,
the European Union publishes its official documents
in 11 languages, and the United Nations publishes multilingual
documents. Bilingual text is also
available online; some Web sites publish parallel content with parallel URLs,
for example, /en/ for the English
page and /fr/ for the corresponding French page. The leading statistical translation systems train on hundreds of millions of words of parallel text and
billions of words of monolingual
text.
2. Segment into sentences: The unit of translation is a sentence, so we will have to break the corpus into sentences. Periods are strong indicators of the end of a sentence, but consider “Dr. J. R. Smith of Rodeo Dr. paid $29.99 on 9.9.09”; only the final period ends a sentence. One way to decide if a period ends a sentence is to train a model that takes as features the surrounding words and their parts of speech. This approach achieves about 98% accuracy.
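The sketch below illustrates the feature idea only: for each period it collects a few surrounding-token features that a trained classifier (with parts of speech added) could use to decide whether the period ends a sentence. The particular feature set is illustrative, not the one used by any specific system.

```python
# Collect simple features for each period in a text.
import re

def period_features(text):
    features = []
    for m in re.finditer(r"\.", text):
        left = text[:m.start()].split()
        right = text[m.end():].split()
        features.append({
            "prev_token": left[-1] if left else "",
            "next_token": right[0] if right else "",
            "next_is_capitalized": bool(right) and right[0][:1].isupper(),
            "prev_is_single_letter": bool(left) and len(left[-1].rstrip(".")) == 1,
        })
    return features

for f in period_features("Dr. J.R. Smith of Rodeo Dr. paid $29.99 on 9.9.09."):
    print(f)
```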
3. Align sentences: For each sentence in the English version, determine what sentence(s) it corresponds to in the French version. Usually, the next sentence of English corresponds to the next sentence of French in a 1:1 match, but sometimes there is variation: one sentence in one language will be split, giving a 2:1 match, or the order of two sentences will be swapped, resulting in a 2:2 match. By looking at the sentence lengths alone (i.e. short sentences should align with short sentences), it is possible to align them (1:1, 1:2, or 2:2, etc.) with accuracy in the 90% to 99% range using a variation on the Viterbi algorithm.
4.             
Align
phrases: Within a sentence, phrases can be aligned by a process that
is similar to that used for sentence
alignment, but requiring
iterative improvement. When we start,
we have no way of knowing that “qui dort” aligns with “sleeping”, but we can arrive at that alignment by a process of aggregation of evidence.
5.             
Extract
distortions: Once we have an alignment of phrases, we can define
distortion probabilities. Simply
count how often distortion occurs in the corpus for each distance.
6. Improve estimates with EM: Use expectation–maximization to improve the estimates of the P(f | e) and P(d) values. We compute the best alignments with the current values of these parameters in the E step, then update the estimates in the M step, and iterate the process until convergence.
SPEECH RECOGNITION
Definition: Speech recognition is the task of identifying a sequence of words uttered by a speaker, given the acoustic signal. It has become one of the mainstream applications of AI.
1. Example: the phrase “recognize speech” sounds almost the same as “wreck a nice beach” when spoken quickly. Even this short example shows several of the issues that make speech problematic.
2. First, segmentation: written words in English have spaces between them, but in fast speech there are no pauses in “wreck a nice” that would distinguish it as a multiword phrase as opposed to the single word “recognize”.
3. Second, coarticulation: when speaking quickly the “s” sound at the end of “nice” merges with the “b” sound at the beginning of “beach”, yielding something that is close to a “sp”. Another problem that does not show up in this example is homophones – words like “to”, “too” and “two” that sound the same but differ in meaning.

argmax_word1:t P(word1:t | sound1:t) = argmax_word1:t P(sound1:t | word1:t) P(word1:t)

Here P(sound1:t | word1:t) is the acoustic model. It describes the sound of words – that “ceiling” begins with a soft “c” and sounds the same as “sealing”. P(word1:t) is known as the language model. It specifies the prior probability of each utterance – for example, that “ceiling fan” is about 500 times more likely as a word sequence than “sealing fan”.
4.             
Once we define the acoustic and language models, we
can solve for the most likely sequence of words using the Viterbi algorithm.
Acoustic Model
1. An analog-to-digital converter measures the size of the current – which approximates the amplitude of the sound wave – at discrete intervals called the sampling rate.
2. The precision of each measurement is determined by the quantization factor; speech recognizers typically keep 8 to 12 bits. That means that a low-end system, sampling at 8 kHz with 8-bit quantization, would require nearly half a megabyte per minute of speech.
3.             
A phoneme is
the smallest unit of sound that has a distinct meaning to speakers of a particular language.
For example the “t” in “stick” sounds similar enough to the “t” in “tick” that speakers of English consider them the same phoneme.
Each frame is summarized by a vector of features, as illustrated in the figure below.
Figure 5.5 Translating the acoustic signal into a sequence of frames. In this diagram each frame is described by the discretized values of three acoustic features; a real system would have dozens of features.
Building a speech recognizer
1.             
The quality of a speech recognition system depends on the quality
of all of its components – the language model, the
word-pronunciation models, the phone models,
and the signal processing algorithms used to extract
spectral features from the acoustic
signals.
2. The systems with the highest accuracy work by training a different model for each speaker, thereby capturing differences in dialect as well as male/female and other variations. This training can require several hours of interaction with the speaker, so the systems with the most widespread adoption do not create speaker-specific models.
3. The accuracy of a system depends on a number of factors. First, the quality of the signal matters: a high-quality directional microphone aimed at a stationary mouth in a padded room will do much better than a cheap microphone transmitting a signal over phone lines from a car in traffic with the radio playing. The vocabulary size matters: when recognizing digit strings with a vocabulary of 11 words (1–9 plus “oh” and “zero”), the word error rate will be below 0.5%, whereas it rises to about 10% on news stories with a 20,000-word vocabulary, and 20% on a corpus with a 64,000-word vocabulary. The task matters too: when the system is trying to accomplish a specific task – book a flight or give directions to a restaurant – the task can often be accomplished perfectly even with a word error rate of 10% or more.
5.19 ROBOT
1.             
Robots are physical
agents that perform tasks by manipulating the physical world.
2.             
To do so, they are equipped with effectors such as legs, wheels, joints,
and grippers.
3.             
Effectors have a single
purpose: to assert physical
forces on the environment.
4.             
Robots are also equipped with sensors, which
allow them to perceive their environment.
5.             
Present day robotics employs a diverse set of sensors,
including cameras and lasers to measure
the environment, and gyroscopes and accelerometers to measure the robot’s own motion.
6. Most of today’s robots fall into one of three primary categories. Manipulators, or robot arms, are physically anchored to their workplace, for example in a factory assembly line or on the International Space Station.
Robot Hardware
1.             
Sensors are the perceptual interface between robot and environment.
2.             
Passive sensors, such as cameras, are true observers
of the environment: they capture signals that are
generated by other sources in the environment.
3.             
Active sensors, such as sonar, send energy into the
environment. They rely on the fact that
this energy is reflected back to the sensor. Active sensors tend to provide
more information than passive
sensors, but at the expense of increased power consumption and with a danger of interference when
multiple active sensors are used at the same
time. Whether active or passive, sensors can be divided into three
types, depending on whether they sense the environment, the robot’s location,
or the robot’s internal configuration.
4. Range finders are sensors that measure the distance to nearby objects. In the early days of robotics, robots were commonly equipped with sonar sensors. Sonar sensors emit directional sound waves, which are reflected by objects, with some of the sound making it back to the sensor.
5.             
Stereo vision relies on multiple cameras to image the environment from slightly different viewpoints, analyzing the
resulting parallax in these images to compute the range of surrounding objects. For mobile ground robots, sonar
and stereo vision are now rarely used, because they are
not reliably accurate.
6.             
Other range sensors use laser beams and special
1-pixel cameras that can be directed using
complex arrangements of mirrors or rotating elements. These sensors are called scanning
lidars (short for light detection and ranging).
7.           
Other common range sensors include radar, which is
often the sensor of choice for UAVs.
Radar sensors can measure distances of multiple kilometers. On the other extreme end of range sensing are tactile
sensors such as whiskers, bump panels, and touch-sensitive
skin. These sensors measure range based on physical contact, and can be deployed
only for sensing objects very close to the
robot.
8.             
A second important
class of sensors
is location sensors.
Most location sensors
use range sensing
as a primary component to determine location.
Outdoors, the Global Positioning System
(GPS) is the most common solution to the localization
problem.
9. The third important class is proprioceptive sensors, which inform the robot of its own motion. To measure the exact configuration of a robotic joint, motors are often equipped with shaft decoders that count the revolutions of motors in small increments.
10.        
Inertial sensors, such as gyroscopes, rely on the resistance
of mass to the change of velocity. They can help reduce uncertainty.
11.          
Other important aspects of robot state are measured
by force sensors
and torque sensors.
These are indispensable when robots handle fragile objects or objects
whose exact shape and location is unknown.
Robotic Perception
1.             
Perception is the process by which robots map sensor
measurements into internal representations
of the environment. Perception is difficult because sensors are noisy, and the environment is partially observable, unpredictable, and often dynamic. In other words,
robots have all the problems of state estimation (or filtering)
2.             
As a rule of thumb,
good internal representations for robots have three properties: they contain enough
information for the robot to make good decisions, they are structured so that they can be updated efficiently,
and they are natural in the sense that internal variables correspond
to natural state variables in the
physical world.
Figure 5.7 Robot perception can be viewed as temporal inference from a sequence of actions and measurements, as illustrated by this dynamic Bayes network.
2.             
Another machine learning technique enables robots to
continuously adapt to broad changes in sensor measurements.
3. Adaptive perception techniques enable robots to adjust to such changes. Methods that make robots collect their own training data (with labels!) are called self-supervised. In this instance, the robot uses machine learning to leverage a short-range sensor that works well for terrain classification into a sensor that can see much farther.
5.20 PLANNING TO MOVE
1.             
All of a robot’s deliberations ultimately come down to deciding
how to move effectors.
2.             
The point-to-point motion problem is to deliver the
robot or its end effector to a designated target location.
3.             
A greater challenge is the compliant motion problem,
in which a robot moves while being in physical contact with an obstacle.
4.             
An example of compliant motion is a robot manipulator
that screws in a light bulb, or a robot that pushes a box across a table top. We begin by finding
a suitable representation in which motion-planning problems can be described and solved. It turns
out that the configuration space—the space of robot states defined by location, orientation, and joint angles—is a better place to work than the original 3D space.
5. The path planning problem is to find a path from one configuration to another in configuration space. We have already encountered various versions of the path-planning problem throughout this book; the complication added by robotics is that path planning involves continuous spaces.
6. There are two main approaches: cell decomposition and skeletonization. Each reduces the continuous path-planning problem to a discrete graph-search problem. In this section, we assume that motion is deterministic and that localization of the robot is exact. Subsequent sections will relax these assumptions.
7.             
The second major family of path-planning algorithms is based on the idea of skeletonization. These algorithms reduce
the robot’s free space to a one-dimensional representation, for which the planning problem
is easier. This lower-dimensional representation is called a skeleton
of the configuration space.
5.20.1 Configuration Space
1. Consider a robot arm with two joints that move independently. Moving the joints alters the (x, y) coordinates of the elbow and the gripper. (The arm cannot move in the z direction.) This suggests that the robot’s configuration can be described by a four-dimensional coordinate: (xe, ye) for the location of the elbow relative to the environment and (xg, yg) for the location of the gripper. Clearly, these four coordinates characterize the full state of the robot. They constitute what is known as the workspace representation.
2.             
Configuration spaces have their own problems. The task of a robot
is usually expressed
in workspace coordinates, not
in configuration space coordinates. This raises the question
of how to map between
workspace coordinates and configuration space.
3.             
These transformations are linear for prismatic joints
and trigonometric for revolute joints. This chain of coordinate transformation
is known as kinematics.
4. The inverse problem of calculating the configuration of a robot whose effector location is specified in workspace coordinates is known as inverse kinematics. The configuration space can be decomposed into two subspaces: the space of all configurations that a robot may attain, commonly called free space, and the space of unattainable configurations, called occupied space.
5.20.2 Cell Decomposition Methods
1.             
The simplest cell decomposition consists of a regularly spaced
grid.
2.             
Grayscale shading indicates the value of each
free-space grid cell—i.e., the cost of the
shortest path from that cell
to the goal.
3. Cell decomposition methods can be improved in a number of ways, to alleviate some of these problems. The first approach allows further subdivision of the mixed cells – perhaps using cells of half the original size. This can be continued recursively until a path is found that lies entirely within free cells. (Of course, the method only works if there is a way to decide if a given cell is a mixed cell, which is easy only if the configuration space boundaries have relatively simple mathematical descriptions.)
4. This method is complete provided there is a bound on the smallest passageway through which a solution must pass. One algorithm that implements this is hybrid A*.
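A minimal sketch of planning over a regular grid decomposition of free space is given below: it runs breadth-first search from the start cell to the goal cell over free cells only, ignoring the cost-shading, mixed-cell subdivision, and hybrid A* refinements discussed above.

```python
# Breadth-first path planning over a grid decomposition of free space.
from collections import deque

def grid_plan(free, start, goal):
    """free: set of (x, y) free cells; returns a list of cells or None."""
    frontier = deque([start])
    came_from = {start: None}
    while frontier:
        cell = frontier.popleft()
        if cell == goal:
            path = []
            while cell is not None:
                path.append(cell)
                cell = came_from[cell]
            return list(reversed(path))
        x, y = cell
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if nxt in free and nxt not in came_from:
                came_from[nxt] = cell
                frontier.append(nxt)
    return None
```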
5.20.3 Modified Cost Functions
1. A potential field is a function defined over state space, whose value grows with the distance to the closest obstacle.
2.             
The potential field can be used as an additional cost term in the shortest-path calculation.
3. This induces an interesting trade-off. On the one hand, the robot seeks to minimize path length to the goal. On the other hand, it tries to stay away from obstacles by virtue of minimizing the potential function.
4.             
There exist many other ways to modify the cost
function. For example, it may be desirable to smooth the control parameters over time.
5.20.4 Skeletonization methods
1.             
The second major family of path-planning algorithms is based on the idea of
skeletonization.
2.             
These algorithms reduce the robot’s
free space to a one-dimensional representation, for which
the planning problem is easier.
3.             
This lower-dimensional representation is called a skeleton of the configuration space.
4.             
It is a Voronoi
graph of the free space
- the set of all points that are equidistant to two or more obstacles. To do path planning
with a Voronoi graph, the robot first changes
its present configuration to a point
on the Voronoi graph.
5.             
It is easy to show that this can always be achieved by a straight-line motion in configuration space. Second, the robot
follows the Voronoi graph until it reaches the
point nearest to the target configuration. Finally,
the robot leaves the Voronoi
graph
and moves to the target. Again, this final step involves straight-line motion in configuration space.
Figure 5.8 (a) A repelling potential field pushes the robot away from obstacles. (b) Path found by simultaneously minimizing path length and the potential.
Figure 5.9 (a) The Voronoi graph is the set of points equidistant to two or more obstacles in configuration space. (b) A probabilistic roadmap, composed of 100 randomly chosen points in free space.
Robust methods
1.             
A robust method is one that assumes
a bounded amount of uncertainty in each aspect
of a problem, but does
not assign probabilities to values
within the allowed interval.
2.             
A robust solution
is one that works no matter what actual values occur, provided
they are within the assumed intervals.
3.             
An extreme form of robust method
is the conformant planning approach.
Figure 5.11 The first motion command and the resulting envelope of possible robot motions. No matter what the error, we know the final configuration will be to the left of the hole.
Define Speech Recognition. Explain different types of speech recognition systems.
Speech recognition is an interdisciplinary subfield of computer science and computational linguistics that develops methodologies and technologies that enable the recognition and translation of spoken language into text by computers. It is also known as automatic speech recognition (ASR), computer speech recognition, or speech to text (STT). Speech recognition is the process by which a computer maps an acoustic speech signal to some form of abstract meaning of the speech. The fundamental aspect of speech recognition is the translation of sound into text and commands.
This process is highly difficult, since the incoming sound must be matched against stored sound patterns that it rarely matches exactly, so further analysis is needed. Feature extraction and pattern matching techniques play an important role in a speech recognition system in maximizing the recognition rate across different speakers.
Types of speech recognition system based on utterances
1. Isolated words: An isolated-word recognition system recognizes single utterances, i.e. single words. Isolated word recognition is suitable for situations where the user is required to give only one-word responses or commands. It is simple and easy to implement because word boundaries are obvious and the words tend to be clearly pronounced, which is the major advantage of this type. E.g. Yes/No, Listen/Not-Listen.
2. Connected words: A connected-words system is similar to isolated words, but it allows separate utterances to be “run together” with a minimal pause between them. An utterance is the vocalization of a word or words that represent a single meaning to the computer. E.g. drive car (action + object), baby book (agent + object).
3. Continuous speech: A continuous speech recognition system allows users to speak almost naturally, while the computer determines the content. Continuous speech recognition systems are difficult to develop. E.g. dictation.
4. Spontaneous speech: A spontaneous speech recognition system recognizes natural speech – speech that comes spontaneously from the mouth. An ASR system for spontaneous speech must handle a variety of natural speech features such as words being run together, mispronunciations, false starts (“will you – will you go to your college?”) and non-words (“ums”).
Explain about Automatic Speech Recognition System
Speech recognition is an interdisciplinary subfield of computer science and computational linguistics that develops methodologies and technologies that enable the recognition and translation of spoken language into text by computers. It is also known as automatic speech recognition (ASR), computer speech recognition or speech to text (STT).
Figure 5.12 System architecture of an automatic speech recognition system.
Pre-processing/Digital Processing: The recorded acoustic signal is an analog signal. An analog signal cannot be fed directly to an ASR system, so the speech signal must first be transformed into a digital signal before it can be processed. The digital signal is then passed through a first-order filter to spectrally flatten it; this procedure increases the energy of the signal at higher frequencies.
Feature Extraction: The feature extraction step finds the set of parameters of utterances that have an acoustic correlation with the speech signal; these parameters are computed by processing the acoustic waveform. They are known as features or feature vectors (x₁, x₂, x₃, ...). The main focus of the feature extractor is to keep the relevant information and discard the irrelevant. The feature extractor divides the acoustic signal into frames of 10–25 ms, and the data in each frame is multiplied by a window function. Many window functions can be used, such as Hamming, Rectangular, Blackman, Welch, or Gaussian. Common feature extraction methods include Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), Wavelets, and Independent Component Analysis (ICA).
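The framing-and-windowing step can be sketched as follows: split the sampled signal into short overlapping frames and multiply each by a Hamming window before computing spectral features. The frame and hop sizes shown are typical values, not prescribed by the text.

```python
# Split a sampled signal into overlapping, Hamming-windowed frames.
import numpy as np

def frame_signal(signal, sample_rate=8000, frame_ms=25, hop_ms=10):
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    window = np.hamming(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop_len):
        frames.append(signal[start:start + frame_len] * window)
    return np.array(frames)

frames = frame_signal(np.random.randn(8000))   # one second of fake audio
print(frames.shape)                            # (98, 200)
```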