LIBR 202 – Section 18
Information Retrieval – Professor Mary Bolin
Midterm Exam
Patricia Ayame Thomson
Answer to Question # 4
“What is the difference between precision and recall as measurements of an effective search?”
In his book “Ambient Findability,” Morville (2005) makes a salient statement: “Words intended to represent concepts: that is the questionable foundation upon which information retrieval is built” (p. 51). Here Morville emphasizes the importance of language in the context of subject representation. Whether the entity is a three-dimensional object, a photograph, or computer software, words are the tools used to describe it and to create the metadata for its representation and retrieval.
Furthermore, Morville (2005) describes the ambiguity and complexity of language as follows: “Words are imprecise, ambiguous, indeterminate, vague, opaque . . .” (p. 51). Ultimately, the responsibility lies with the programmer, designer, indexer, and user to weigh the power of words carefully. Realistically, however, of the four contributors mentioned above, the untrained user is most likely to be the least aware that the choice of words entered in a query is of critical importance. This is what Morville calls “The People Problem.” Elaborating, Morville pointedly states: “Today we call this infuriating variable ‘the user’ and we recognize that research must integrate rather than isolate the goals, behaviors, and idiosyncrasies of the people who use the systems” (p. 54).
Morville (2005) explains further how words influence and shape the information retrieval process: “In the context of retrieval, we might interpret these as the forces of description and discrimination” (Blair, 2002). In addition, Morville emphasizes the need for exhaustive description in the metadata. Drawing on Blair, Morville explains: “The force of description dictates that the intellectual content of documents should be described as completely as possible,” and adds: “The force of discrimination dictates that documents should be distinguished from other documents in the system” (p. 52). The crucial point is that without concise and exhaustive attributes assigned in the metadata, the system has no way of knowing what the user wants.
Ideally, the user would retrieve exactly what he or she is looking for on the first attempt, every time. The race is on for programmers and software designers to build a system whose successful retrieval rate is as close to one hundred percent as possible. In other words, a near-perfect system would retrieve precise and relevant information on the first attempt. It therefore stands to reason that the two most important measures in information retrieval are precision and recall.
Precision and recall are one method of measuring the effectiveness of searches. Precision measures the proportion of retrieved results that are relevant to the user’s query. Recall measures the proportion of all the relevant documents in the system that are retrieved in response to the user’s query. In other words, the relationship between “precision and recall” is similar to that between “quality and quantity.” Morville describes the concept more eloquently and precisely: “Precision and recall, our most basic measures of effectiveness, are built upon this common-sense definition. Precision measures how well a system retrieves only the relevant documents. Recall measures how well a system retrieves all the relevant documents” (p. 49). Morville also makes an amusing, astute, and insightful observation: the user has no way of knowing how much relevant information slipped through his or her fingers and is missing from the retrieval results. As the old adage goes, “They don’t know what they’re missing.”
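The two measures can be illustrated with a short calculation. The following is only a minimal sketch in Python, not something taken from Morville: the function name `precision_recall` and the document identifiers are hypothetical, and it assumes we somehow already know which documents in the collection are relevant (which, as noted above, a real user never does).

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    retrieved: identifiers of the documents the system returned
    relevant:  identifiers of every relevant document in the collection
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = retrieved & relevant  # relevant documents that were actually retrieved

    # Precision: what share of the retrieved documents are relevant?
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    # Recall: what share of the relevant documents were retrieved?
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: the collection holds five relevant documents;
# the system returns four documents, three of which are relevant.
relevant_docs = {"d1", "d2", "d3", "d4", "d5"}
retrieved_docs = ["d1", "d2", "d3", "d9"]

precision, recall = precision_recall(retrieved_docs, relevant_docs)
print(f"precision={precision:.2f} recall={recall:.2f}")
# prints: precision=0.75 recall=0.60
```

In this example the two relevant documents that were never retrieved (d4 and d5) lower recall, but the user who sees only the four results has no way of noticing them.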
Morville explains further: “The relative importance of these metrics varies based on the type of search.” As an example, Morville states: “For the sample search in which a few good documents are sufficient, precision outweighs recall” (p. 49). This brings to mind Zipf’s “Principle of Least Effort,” which applies when users want the fastest and most convenient access to the most important information (Morville, 2005, p. 44). He goes on to explain that precision is crucial when the user knows the information already exists: “Precision is even more important for the known-item or existence search in which a specific document (or web site) is desired” (p. 49). This type of search has one correct answer. Finally, Morville describes the last of these common search types: “For the exhaustive search when all or nearly all relevant documents are desired, recall is the key metric” (p. 50).
Morville (2005) claims: “The upshot of all this analysis is that while recall fails fastest, precision also drops precipitously as full-text retrieval systems grow larger” (p. 52). In other words, as the system grows larger and stores a greater number of documents, its ability to retrieve all the relevant documents fails fastest, but the relevance of what it does retrieve also declines as the system becomes inundated with more and more information. As Morville puts it: “The larger system returns too many results with too many meanings” (p. 53). In the face of these obstacles, Morville suggests there are things we can do to improve the system: “That’s where metadata enters the picture. Metadata tags applied by humans can indicate aboutness thereby improving precision” (p. 53). Thus, the more completely and in detail the aboutness of the entity is described in the metadata, the more precisely and efficiently the system can gather similar documents together (aggregate), set apart the ones that are not relevant (discriminate), and achieve successful retrieval for the user.
Search engines using full text and natural language can be automated because the process is relatively simple and does not involve human thought. The system scans, uploads, and stores the entire content of the text or document. Natural language, moreover, is simply the ordinary way we speak and write. The search engine matches the terms or keywords entered by the user against the full text and retrieves the documents in which they occur. As Morville (2005) puts it: “Full text is biased towards description” (p. 52).
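The matching process described above can be sketched in a few lines of Python. This is a simplified illustration under my own assumptions, not the method of any actual search engine: the function names `tokenize` and `full_text_search` and the three sample documents are hypothetical.

```python
import re


def tokenize(text):
    """Split text into lowercase words, ignoring punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def full_text_search(query, documents):
    """Return the ids of documents whose full text contains every query term."""
    terms = set(tokenize(query))
    if not terms:
        return []
    results = []
    for doc_id, text in documents.items():
        words = set(tokenize(text))
        if terms <= words:  # every query term occurs somewhere in the text
            results.append(doc_id)
    return results


# Hypothetical documents: the same word carries three different meanings.
documents = {
    "doc1": "Mercury is the planet closest to the sun.",
    "doc2": "Mercury poisoning is a health hazard.",
    "doc3": "Freddie Mercury sang with the band Queen.",
}

print(full_text_search("mercury", documents))
# prints: ['doc1', 'doc2', 'doc3']
```

The program matches strings of characters and nothing more. It returns all three documents for the query “mercury” because it cannot tell which one the user means, which is how precision suffers when a system depends on word matching alone.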
Dr. Bolin (2011) notes: “Recently, more sophisticated search engines have been developed that retrieves by ‘relevance,’ rather than the number of occurrences of the word” (Lecture 8). Morville, however, cautions: “Though relevance ranking algorithms can factor in the location and frequency of word occurrence, there is no way for software to accurately determine aboutness” (p. 53).
Unfortunately, metadata is not perfect either. Attributes and values are assigned by humans, whose thought processes vary and are at times fallible. On the other hand, electronic devices do not yet have the capacity to think like human beings or to describe an object’s aboutness. Morville (2005) candidly states: “Despite the hype surrounding artificial intelligence, Bayesian pattern matching, and information visualization, computers aren’t even close to extracting or understanding or visually representing meaning” (p. 54). For instance, search engines function by relying on pre-indexed notions and symbols (including words) in the metadata, and they use the contents of surrogate records as cues to search, match, and retrieve results for the user’s query.
In conclusion, Morville (2005) provides two methods that will improve the system’s precision and recall. To enhance precision, Morville suggests the use of controlled vocabularies: “Controlled vocabularies (organized lists of approved words and phrases) for populating metadata fields can further improve precision through their discriminatory power” (p. 53). To enhance recall, Morville suggests connecting the various non-linear relationships between terms, in other words, integrating a syndetic structure into the system. He describes these relationships as follows: “The specification of equivalence, hierarchical, and associative relationships can enhance recall by linking synonyms, acronyms, misspellings, and broader, narrower, and related terms” (p. 53). In “Ambient Findability,” Morville makes many insightful and valuable recommendations to facilitate information retrieval and, ultimately, to improve the infrastructure of the information architecture.
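As a brief illustration of how an equivalence relationship can enhance recall, the sketch below maps variant terms to one preferred term before matching. It is a minimal example under my own assumptions, not drawn from Morville: the `PREFERRED` table, the function names, and the sample records are all hypothetical, and a real controlled vocabulary would also include hierarchical and associative relationships.

```python
# Hypothetical equivalence relationships: each variant points to one
# preferred (approved) term in the controlled vocabulary.
PREFERRED = {
    "car": "automobile",
    "cars": "automobile",
    "auto": "automobile",
    "automobile": "automobile",
}


def normalize(term):
    """Map a term to its preferred form; unknown terms pass through unchanged."""
    term = term.lower()
    return PREFERRED.get(term, term)


def search_by_subject(query_term, records):
    """Return ids of records whose subject terms match the query after normalization."""
    wanted = normalize(query_term)
    return [
        record_id
        for record_id, subjects in records.items()
        if wanted in {normalize(s) for s in subjects}
    ]


# Hypothetical surrogate records with subject terms assigned by an indexer.
records = {
    "r1": ["Automobile"],
    "r2": ["cars"],
    "r3": ["bicycle"],
}

print(search_by_subject("auto", records))
# prints: ['r1', 'r2']
```

Without the equivalence table, a query for “auto” would match neither record literally. With it, both relevant records are retrieved, so recall improves.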