Rearview Mirror: PalmQA, a Question-Answering Ensemble for e-Learning and Research

My first larger AI system, developed under the supervision of Mohamed Amine Chatti & Ulrik Schroeder.

“The scientific mind does not so much provide the right answers
as ask the right questions.”

Claude Lévi-Strauss

1  Introduction

Systems that enable rapid access to information have been a central topic of computer science since its beginning: from Emanuel Goldberg’s Statistical Machine and Vannevar Bush’s Memex, through collaboration systems like Ted Nelson’s Project Xanadu, Douglas Engelbart’s oN-Line System (NLS) and Tim Berners-Lee’s ENQUIRE, to Apple’s Knowledge Navigator concept from 1987. That concept has now been largely implemented in personal assistants like Amazon Echo, Google Assistant, Microsoft Cortana and Apple’s Siri, as well as in question answering (QA) systems like IBM’s DeepQA, arguably the first breakthrough application of the Semantic Web, which defeated human grand champions of Jeopardy! in 2011.

This work leverages QA systems for e-learning and research because of several potential benefits. Piaget observed that cognition is a constructive process. While constructivism is not a uniform theory but an amalgamation of different fields like neuroscience, cybernetics, psychology and computer science, it is clear that QA systems have the potential to facilitate learning as individual knowledge construction, since they allow learners to freely and individually explore a field as well as the relations between its concepts. The same holds for LaaN theory, a connectivist approach that understands learning as an emergent network-forming process or, more precisely, the continuous creation of personal knowledge networks (PKNs), which consist of tacit and explicit knowledge as well as the norms, values, strategies and assumptions that accompany them. PKNs are formed with a set of services, tools and devices that embody the convergence of formal and informal self-directed network learning and are called the personal learning environment (PLE). The most significant hindrance of a PLE is information overload, which makes knowledge filters necessary; the two most relevant groups are social filtering and recommendation systems. QA systems have the potential to form a third pillar of knowledge filters, since they are able to read large corpora of text, extract relevant passages, form an answer and explain why this particular answer was chosen.

A second motivation is the so-called “Long Tail”. While the term was coined and used much earlier, e.g. in the work of Benoit Mandelbrot, Chris Anderson carried it over to the field of business to describe the popularity distribution of titles: while there are very few very popular titles, the so-called “short head”, there is also a “long tail” of less popular and often niche content that is frequently ignored by mainstream media and search providers, which rely heavily on popularity.
QA systems do not have to rely on popularity, but can instead draw on an NLP-based understanding of the content, which makes the long tail accessible to learners and researchers. This not only helps to find interesting and less well-known works and hypotheses, but also helps to weed out obsolete or wrong hypotheses and misinterpretations from science, which often persist on the strength of their prevalence alone. Unknown to many users, QA systems are not limited to text: they can also work on image and video content using computer vision techniques as well as on audio via speech recognition. In general, they are promising candidates for supplementing or even replacing traditional search engines.

1.1 Research Questions

The obvious research questions are which open source system is most suitable for this purpose, how its response times and answer rate can be improved and whether the system can be extended with additional information and adapted to specific domains. Additionally, it had to be investigated how non-explorative text box interaction and the restriction to factoid questions can be alleviated as well as how to integrate the resulting system with existing learning management systems like RWTH Aachen’s L²P or analytics platforms like PALM.


2  Related Work

Question answering systems receive a question in natural language as input and provide an answer. They can be open- or closed-domain and are often limited to factoid questions. Early QA systems like Baseball (Green et al., 1961) are mainly factoid and closed-domain. LUNAR (Woods, 1973) from 1971 is an NLIDB (Natural Language Interface to DataBases), which has the advantage that users do not have to know the underlying database codes and structure; this is why the approach was still in use 30 years later in Microsoft’s English Query. There are scenario-based QA systems like SHRDLU (Winograd, 1971), which allows manipulating blocks via natural language, whereas QUALM (Lehnert, 1978) can answer questions about a brief story; the corresponding dissertation also describes how to apply the pipeline to general QA (GQA). The precision of closed solutions is sometimes remarkably high, reaching up to 75 percent.

The 70s bring about the advent of knowledge bases, leading to expert systems that are highly popular during the 80s, usually handcrafted by domain experts and targeting a wide range of decision support tasks. While many new approaches like logic programming and rule-based systems are pioneered (Prolog is still used in DeepQA today), expert systems ultimately fail. While more systems are released during the 80s and 90s, like the UNIX Consultant (Wilensky et al., 1988), which can answer basic questions about UNIX, or the LILOG knowledge and inference engine (Bollinger & Pletat, 1991), there is a certain gap of about 20 years. Symbolic approaches to NLU largely fail, but result in new IR techniques like PoS tagging and NER. The first QA system that combines IR techniques with NLP is MURAX (Kupiec, 1999) in 1993, which leads to a revival of the field supported by the proliferation of mobile devices and search engines. In 1999, the TREC QA track is established. TREC and the US AQUAINT program support the development of IBM’s PIQUANT (Prager et al., 2003; Chu-Carroll et al., 2004), leading to a long-term endeavor that culminates in the debut of Watson and its DeepQA pipeline, which defeated human Jeopardy! grand champions in a televised challenge and sparked new interest in the field.

DeepQA (Ferrucci, 2012; Lally et al., 2012; McCord et al., 2012; Chu-Carroll et al., 2012c; Fan et al., 2012; Chu-Carroll et al., 2012b; Murdock et al., 2012b;a; Wang et al., 2012; Kalyanpur et al., 2012a; Prager et al., 2012; Chu-Carroll et al., 2012a; Kalyanpur et al., 2012b; Gondek et al., 2012; Epstein et al., 2012; Tesauro et al., 2012; Lewis, 2012) leverages both structured and unstructured knowledge sources. It is built upon the annotation-based NLP architecture UIMA (Ferrucci & Lally, 2004; Ferrucci et al., 2009). Like the earlier MULTEXT, UIMA is abstractly defined; it has been an OASIS standard since 2009. Furthermore, it was donated to Apache, supports non-textual media like GATE, and comes with its own cluster controller DUCC and its own rule engine Ruta. Its central element is the Common Analysis Structure (CAS), which contains the artifact to be analyzed, the so-called Sofa (Subject OF Analysis), along with all metadata passed between pipeline components and created by Analysis Engines (AEs). Each annotator defines its own type system, and these type systems are afterwards merged into one. The whole analysis pipeline is built by UIMA based on a Collection Processing Engine (CPE), which is an XML descriptor defining the sequence of CollectionReaders, AEs and CAS Consumers. Non-sequential pipelines are possible via FlowControllers that describe workflow routes in BPEL. Their execution can be distributed via UIMA-AS (Asynchronous Scaleout) using Annotators as a Service, which can be accessed via methods like JMS, SOAP and Vinci, as well as AE proxies, allowing massively parallel execution.

In order to determine an answer, DeepQA first extracts keyphrases from the question as well as the focus, i.e. the part of the question that stands for the answer. The focus is used to determine the Lexical Answer Type (LAT), with the basic idea that candidate answers can be matched to it in a process called Type Coercion (TyCor). Since no ontology could be found that is expressive enough, the LAT is a bag of words. Finally, the question is classified, and the subsequent pipeline steps are chosen based on this classification. Afterwards, multiple search engines are used to collect candidates in three variants: Title-Oriented Documents (TODs) like wikis, passage search and structured data. For each generated candidate, the evidence is called an Answer-Justifying Document (AJD). DeepQA uses nearly 500 strategies for scoring both answers and passages; the scores are then combined into one confidence value via logistic regression. The open system YodaQA developed by Petr Baudis et al. follows DeepQA closely, but has some limitations. It only works on TODs and factoid questions and is currently largely restricted to search rather than more involved techniques like textual entailment. Nevertheless, at the time of writing it was the most advanced UIMA-based QA system, with a reported accuracy@1 of 32.6%.
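
The final merging step described above, combining many evidence scores into a single confidence via logistic regression, can be illustrated with a minimal sketch. The three scorers, their weights, the bias and the candidate answers are hypothetical values chosen for illustration; they are not taken from DeepQA or YodaQA, where the weights would be learned from training data.

```python
import math


def confidence(scores, weights, bias=0.0):
    """Combine per-scorer evidence scores into one confidence in (0, 1)."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    z = bias + sum(w * s for w, s in zip(weights, scores))
    return 1.0 / (1.0 + math.exp(-z))  # logistic (sigmoid) function


# Hypothetical weights for three illustrative scorers (normally learned).
WEIGHTS = [2.0, 1.5, 0.5]
BIAS = -2.0

# Hypothetical candidate answers with one score per scorer.
candidates = {
    "Paris": [0.9, 0.8, 0.3],
    "Lyon": [0.2, 0.1, 0.4],
}

# Rank candidates by their combined confidence, best first.
ranked = sorted(
    candidates,
    key=lambda c: confidence(candidates[c], WEIGHTS, BIAS),
    reverse=True,
)
print(ranked)
```

The logistic function keeps the result between 0 and 1, so confidences of different candidates stay comparable regardless of how many scorers contribute.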


3  Contributions

Figure 1: Overview of Architecture

The project follows a microservice architecture (cf. Figure 1) and was developed with a Rapid Application Development (RAD) methodology under continuous user involvement. QA ensembles were chosen early on to boost accuracy significantly, after synergies between Wolfram Alpha and YodaQA were observed and even explicitly pointed out by subjects during the pilot study. The initial ensemble included Amazon’s Evi, Google, Kngine, MIT’s Start, IBM’s Watson QA API (now discontinued) as well as Wolfram Alpha. Later, Bing was added. QAKiS was not included, since its precision is too low and its main source, DBpedia, is also covered by YodaQA. Besides the ensemble, two user interfaces were developed to accommodate both content consumers and creators, a web interface was devised for easy integration into other systems, a classifier was trained to detect predefined answers, a new messaging metaphor was invented, and a suite to extend the corpus as well as to extract keyphrases and distributional semantics was developed. Regarding the latter aspects, it became apparent that the ensemble approach used for the QA systems is also applicable to keyphrase extraction and distributional semantics, which is why it was extended to offer general components that can be applied to generic backends beyond QA. Finally, the decision was made to leverage IBM’s cognitive cloud services and its PaaS offering Bluemix, resulting in a hybrid cloud design. The software can work offline and on-premise, relying primarily on YodaQA, or alternatively use additional services when connectivity is available.

3.1 Ensemble

The ensemble supports QA, keyphrase extraction and distributional semantics. All backends are deployed in Docker containers.

3.1.1 Question Answering

To add additional QA backends, APIs were used for Watson and Wolfram Alpha, whereas for all other backends AJAX parsing via HTMLUnit was employed to extract answers from web interfaces. During the pilot study all user interaction along with the system responses was logged to an H2 database via Hibernate. From this log a set of 185 test queries was isolated in order to determine the number of correct and wrong responses as well as abstentions, from which precision and recall can be derived for each backend. While the ensemble offers filters like thresholding of confidences and mergers for lists and answers, e.g. by majority vote, the best strategy seemed to be to call the backends in order of descending precision and take the first answer. This way, only one answer is provided instead of a list. Rankers are conceptually included, but currently neither implemented nor used. Several pairwise similarity metrics (Block, Chapman, Cosine, Dice, Euclidean, Jaccard, Jaro, Jaro-Winkler, Levenshtein, Monge-Elkan, Needleman-Wunsch, QGrams, Smith-Waterman, Soundex, Tag-Link) as well as the longest common substring were briefly tried on candidates in order to determine whether they should be merged, but this approach was not very successful, most likely because it lacks semantic information. Moreover, the performance of the given heuristic was close enough to the optimal result that it made sense to focus additional efforts elsewhere; the optimum can be determined from the individual responses by considering a test query solved if any backend answers it correctly.
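
The selection heuristic described above (query backends in order of descending precision and return the first non-empty answer) can be sketched as follows. The backend callables are stand-in stubs and the `first_answer` helper is a hypothetical name; the actual ensemble is implemented in Java and talks to real services.

```python
from typing import Callable, Optional, Sequence, Tuple

# A backend is (name, measured precision, function returning an answer or None).
Backend = Tuple[str, float, Callable[[str], Optional[str]]]


def first_answer(question: str, backends: Sequence[Backend]) -> Optional[Tuple[str, str]]:
    """Ask backends by descending precision; return the first answer given.

    Returns (backend name, answer), or None if every backend abstains.
    """
    for name, _precision, ask in sorted(backends, key=lambda b: b[1], reverse=True):
        answer = ask(question)
        if answer:  # None or "" counts as abstention
            return name, answer
    return None


# Stub backends for illustration only; the precision values mirror the
# evaluation section, the canned answers are made up.
stub_backends = [
    ("Wolfram Alpha", 0.90, lambda q: "1969"),
    ("Google", 0.98, lambda q: None),        # abstains on this question
    ("Evi", 0.925, lambda q: "July 1969"),
]

print(first_answer("When did Apollo 11 land on the Moon?", stub_backends))
```

Because a high-precision backend that abstains simply hands over to the next one, the strategy trades a small amount of precision for a much higher overall answer rate.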

FAQ Classifier. Since some questions are too complex for current QA pipelines and instructors might want to extend answers to FAQs, it makes sense to intercept predefined questions. To test this, 100 questions about core topics from two e-learning lectures and one seminar were identified along with their answers. Afterwards, each question was rephrased a few times and the new variants were added to the dataset in order to train a classifier on them, first with IBM’s Natural Language Classifier and then with Weka. Of course, students in an LMS could collaboratively extend the questions and answers to improve topic coverage and accuracy. To train a classifier manually, the questions are first split into word vectors and then each word is weighted by its frequency based on the 13 million most frequent words derived from the Google Web Trillion Word Corpus as well as keyphrases extracted from the chair’s corpus. In order to create Weka’s Attribute-Relation File Format (ARFF) files efficiently, the keyphrases are inserted into a PATRICIA trie that allows looking up words by prefix and only needs to create one node and edge to insert a new entry.
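
To illustrate the prefix lookup that motivates the trie, here is a minimal sketch. Note that it is a plain character trie, not a path-compressed PATRICIA trie as used in the project, so an insert may create several nodes rather than one; the class name and the sample keyphrases are hypothetical.

```python
_END = object()  # sentinel key marking the end of a stored keyphrase


class PrefixTrie:
    """Plain (uncompressed) character trie with prefix lookup."""

    def __init__(self):
        self._root = {}

    def insert(self, phrase):
        node = self._root
        for ch in phrase:
            node = node.setdefault(ch, {})
        node[_END] = True

    def with_prefix(self, prefix):
        """Return all stored phrases starting with `prefix`, sorted."""
        node = self._root
        for ch in prefix:
            if ch not in node:
                return []
            node = node[ch]
        results = []
        stack = [(node, prefix)]
        while stack:  # iterative depth-first traversal of the subtree
            current, text = stack.pop()
            for key, child in current.items():
                if key is _END:
                    results.append(text)
                else:
                    stack.append((child, text + key))
        return sorted(results)


trie = PrefixTrie()
for phrase in ["learning analytics", "learning", "learner model", "recommender"]:
    trie.insert(phrase)

print(trie.with_prefix("learn"))
```

A PATRICIA trie stores the same information but collapses chains of single-child nodes into one edge, which is what keeps inserts and memory use small for long keyphrases.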

3.1.2 Keyphrase Extraction

Keyphrase extraction is implemented via both Rapid Automatic Keyword Extraction (RAKE), which was ported to Python 3 for this purpose, and IBM’s Alchemy API. Both backends can be run simultaneously in order to return only the keyphrases they have in common, which significantly improves quality, but also discards some rare but valid entries.
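
Returning only the keyphrases that both backends agree on amounts to a set intersection. The following sketch assumes each backend delivers a list of strings; the normalisation (trimming and lower-casing) and the sample phrases are assumptions made for illustration.

```python
def common_keyphrases(*backend_results):
    """Return the keyphrases reported by every backend (case-insensitive)."""
    normalised = [
        {phrase.strip().lower() for phrase in result}
        for result in backend_results
    ]
    if not normalised:
        return set()
    return set.intersection(*normalised)


# Hypothetical outputs of two keyphrase extraction backends.
rake_result = ["Learning Analytics", "question answering", "long tail"]
alchemy_result = ["learning analytics", "MOOC", "Question Answering "]

print(sorted(common_keyphrases(rake_result, alchemy_result)))
```

The example also shows the downside mentioned above: "long tail" and "MOOC" are dropped even though either might be a valid keyphrase.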

3.1.3 Distributional Semantics

One way IBM addressed domain adaptation, after experiencing a serious degradation of performance when switching from Jeopardy! to specific domains, is the introduction of distributional semantics via the JoBim Text framework (Biemann et al., 2013) developed by IBM and TU Darmstadt (TUDA). Distributional semantics follows the notion that a word’s meaning can be derived from its surrounding words. As John R. Firth famously said: “You shall know a word by the company it keeps.” The idea was formulated as the Distributional Hypothesis by Zellig Harris (Harris, 1951). Dr. Martin Riedl from TUDA kindly published their Wikipedia model and also computed a distributional thesaurus (DT) for the chair corpus, since the local Hadoop cluster ran out of memory. Based on this DT, JBT can be used to cluster words, extract Hearst patterns and ultimately label sense clusters, which are stored in a MariaDB instance that can then be queried at runtime. This is a first step towards supporting domain-specific corpora. The practical use is that one can now perform lexical expansion: for instance, the word “exceptionally” can be expanded to “extremely, extraordinarily, incredibly, exceedingly, remarkably” and so on, which allows, for example, expanding the LATs and investigating each expansion in a separate thread. Beyond JBT, the word embedding approaches word2vec and GloVe were added to complete the ensemble and provide an elaborate test bed for upcoming research.
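
Lexical expansion as described above can be sketched with an in-memory thesaurus. The dictionary below stands in for the distributional thesaurus that the project queries from MariaDB; the `expand` helper and its `top_n` parameter are hypothetical, and only the single entry from the example above is used.

```python
def expand(terms, thesaurus, top_n=3):
    """Expand each term with its top_n most similar thesaurus entries.

    The original terms are kept, duplicates are dropped and order is preserved.
    """
    expanded = []
    for term in terms:
        expanded.append(term)
        expanded.extend(thesaurus.get(term, [])[:top_n])
    return list(dict.fromkeys(expanded))


# Stand-in for the distributional thesaurus, entries ordered by similarity.
DT = {
    "exceptionally": [
        "extremely", "extraordinarily", "incredibly", "exceedingly", "remarkably",
    ],
}

print(expand(["exceptionally"], DT, top_n=2))
```

Terms without a thesaurus entry pass through unchanged, so the expansion can be applied to every word of a LAT without special cases.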

3.1.4 Containerization

In order to handle the different microservices, each one was placed in its own Docker image; together, the images are orchestrated via Docker Compose. YodaQA requires DBpedia (Lehmann et al., 2014), Freebase (Bollacker et al., 2008), enwiki in a Solr database as well as two label services and an image for itself. Freebase and DBpedia both rely on Fuseki and can thus be combined in one image; the same goes for the label services. Volume mapping and the command determine which “phenotype” the image exhibits at runtime. Additionally, the ensemble and one interface for all QA systems in the cloud are placed in images, as are one backend for Bluemix services and one for keyphrase extraction based on RAKE, which was ported to Python 3 and extended with a REST interface. Finally, there are three images for distributional semantics: a wrapper for JoBim Text (JBT), one based on MariaDB that wraps the database for JBT, and one extending gensim in order to support word embeddings via both word2vec and GloVe. Please note that while communication is usually performed via REST, the ensemble also features a Thrift mode to integrate with the intelligent personal assistant Lucida developed by the University of Michigan, which now uses it to improve its QA performance. YodaQA was modified to allow dynamically switching the backend URLs, since otherwise its code would have to be modified to be used with Docker links. The changes have been accepted upstream, and both the Dockerfiles and the Docker Compose configuration have been added to the official repository.

3.2 Corpus Import

Document Conversion in Bluemix is limited to 100 MB in the free tier. While its output is very clean, with little clutter and formulas converted to ASCII, it was therefore advantageous to use Tika directly in order to convert several gigabytes of e-learning papers. Tika also supports language detection, which makes sure that only English papers are selected. In addition, the LAK dataset was converted from RDF, and Prof. Dr. Daniel Schneider from the University of Geneva provided an XML dump of the Edutech wiki that was also converted to Solr’s XML format using JSoup and some manual cleaning in Emacs. Note that the second label service has to be deactivated in YodaQA by making the method return an empty list, since it is difficult to replace, as the source mentions explicitly: “The dictionary is a large-scale resource which would be difficult to reconstruct in a university setting, without access to a comprehensive web-crawl.” The file for the other label service can be generated, but the overall performance was so poor that domain adaptation was postponed for now, which does not come as a surprise, since it is a very demanding task.

3.3 User Experience

Figure 2: GUI Modularization
Figure 3a: Overview of End User GUIs
Figure 3b: Overview of Performance Analysis and Development GUIs

As mentioned above and illustrated in Figure 3, there are two main applications: one for content consumers and one for content creators. The first has four modes: a main menu, QA, a dialog component and a web view to display sources like Wikipedia articles, DBpedia entries or Bing hits. In principle, the web view can also be used to query web QA systems manually, which sometimes makes sense, e.g. because it displays Wolfram’s content pods better (especially for Pro).

Just as the backends were modeled as decoupled microservices, the GUI is highly modular as well, as depicted in Figure 2. The header is independent of the interchangeable bodies, answers and their evidence are displayed in embedded web views, tables are JavaFX properties bound via lambda expressions, and dialog labels as well as FAQ widgets are generated programmatically. Each GUI component is defined in FXML and has its own controller, which is orchestrated by a main controller.

3.3.1 Messaging Metaphor

The messaging metaphor represents every QA backend as a contact in a list, similar to messaging applications. Furthermore, it allows users to explore topics via a simple scripted dialog, which presents a list of top-level topics that can then be explored further in subsequent questions. The messaging metaphor is not a core part of this research and thus neither a full dialog system nor an IPA, but merely a demo to give the subjects an impression of the idea in order to provide a practical background for questions about it in the main user study.

Figure 4: Overview of Messaging Metaphor

3.3.2 Web GUI and Integration

The web GUI is based on Vaadin and is a modified version of its addressbook example. As such, it can be embedded into any iframe, works well across browsers and can be developed in Java, which is then compiled into HTML and JavaScript. This keeps the project coherent, since all main components and GUIs are implemented in Java. Besides the web GUI, the system is integrated with PALM, RWTH Aachen’s academic learner model, which provides access to distributed publication information. This allows collecting papers by author or conference, which is useful to augment the corpus. A final integration that has been added very recently is with the dialog system platform Lucida developed at the University of Michigan. By implementing its Thrift interface and replacing the existing QA backend, it is possible to use the ensemble instead of the default choice, OpenEphyra.

PalmQA integrated into RWTH Aachen’s PALM system for scientific literature research.

3.4 Backends as Microservices

The following picture shows the backends running as microservices. Part of this was pushed upstream into YodaQA to help make their deployment easier as well.

Backend Services Running in Docker Compose
Additional IBM Cloud Services PalmQA Can Leverage

4  Evaluation

Three user studies were conducted – a pilot, a main study and a Grand Challenge – as well as several quantitative analyses to assess the system.

4.1 Quantitative Evaluation

YodaQA was slowest in its default configuration, which uses non-local data backends (avg. 23.9, speedup 1, min. 5, max. 64, all in seconds), slightly faster in the official demo on ailao.eu (19.8, 1.21, 2, 51) and already twice as fast when the data backends are run locally (12.03, 1.98). YodaQA exhibits conspicuously high IO rates, indicating IO rather than CPU as the bottleneck, and indeed using a small, ordinary consumer SSD with Freebase remaining on HDD yields a substantial improvement (8.53s avg., 2.8x speedup, min. 1s, max. 25s). Interestingly, neither bigger SSDs nor (peculiarly) a professional Samsung 950 Pro with M.2 NVMe over PCIe 3.0 and much better throughput (1582MB/s instead of 269MB/s for the consumer 850 Evo) resulted in significantly better response times. Even if Freebase resides on an 850 Evo 500GB and all other backends on the 950 Pro, performance stays comparable (8.3, 2.88, 1, 19). Docker overhead is almost negligible: when all data backends reside on one 850 Evo 500GB, the results with and without Docker are (9.53, 2.51, 2, 28) vs. (10.2, 2.34, 2, 29), i.e. the lower speedup equals 93% of the higher one. Strangely, even with each backend on its own SSD these numbers stay almost constant at (10.2, 2.34, 1, 29). Finally, it should be noted that when going from 32GB RAM to 16GB RAM response times only increase from (8.6, 2.78, 1, 27) to (9.23, 2.59, 2, 26). The bottom line is that adding one SSD provides a huge benefit, whereas all other parameters have a much lower impact.
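
The speedups quoted above are consistent with dividing the baseline average response time (23.9 s in the default configuration) by the average of the respective setup. A short sketch of that arithmetic, with configuration labels chosen here for readability:

```python
BASELINE_AVG_S = 23.9  # default configuration with non-local data backends


def speedup(avg_seconds, baseline=BASELINE_AVG_S):
    """Speedup relative to the baseline average response time."""
    return baseline / avg_seconds


# Average response times in seconds as reported above.
configs = {
    "official demo": 19.8,
    "local data backends": 12.03,
    "consumer SSD, Freebase on HDD": 8.53,
}

for name, avg in configs.items():
    print(f"{name}: {speedup(avg):.2f}x")

# Ratio of the two speedups in the Docker comparison (2.34 vs. 2.51).
docker_ratio = 2.34 / 2.51
print(f"Docker ratio: {docker_ratio:.0%}")
```

Small deviations in the last digit relative to the reported figures are due to rounding versus truncation.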

To get an impression of the ensemble members’ performance, they were run against 185 factoid questions. This revealed the following ordering of web QA backends by precision, where the first four numbers give the counts of correct answers, partially correct answers, incorrect answers and abstentions, followed by precision and recall: Google (58, 0, 1, 126, 0.98, 0.31), Kngine (19, 1, 1, 163, 0.95, 0.10), Evi (74, 5, 6, 100, 0.925, 0.40), Wolfram Alpha (82, 2, 9, 92, 0.90, 0.44) and Start (73, 5, 10, 96, 0.88, 0.39).
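
The reported values are consistent with precision computed as correct / (correct + incorrect), i.e. leaving out partially correct answers and abstentions, and recall computed as correct / 185. This reading is inferred from the numbers and is an assumption, not a definition stated in the text. A minimal sketch:

```python
TOTAL_QUESTIONS = 185


def precision_recall(correct, incorrect, total_questions=TOTAL_QUESTIONS):
    """Precision over clearly right/wrong answers; recall over all questions.

    Assumption: partially correct answers and abstentions are excluded from
    the precision denominator, which reproduces the figures reported above.
    """
    answered = correct + incorrect
    precision = correct / answered if answered else 0.0
    recall = correct / total_questions
    return precision, recall


# (correct, incorrect) counts as reported above.
backends = {
    "Google": (58, 1),
    "Kngine": (19, 1),
    "Evi": (74, 6),
    "Wolfram Alpha": (82, 9),
    "Start": (73, 10),
}

for name, (correct, incorrect) in backends.items():
    p, r = precision_recall(correct, incorrect)
    print(f"{name}: precision={p:.3f}, recall={r:.2f}")
```

Under this definition, a backend that abstains often (like Google here) can reach very high precision at low recall, which is why the ensemble benefits from falling back to less precise but more talkative backends.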

4.2 User Studies

There were three user studies: the pilot, the main study and the Grand Challenge. While there have been many studies researching the ideal number of participants, for instance by Dalal and Mallows (Dalal & Mallows, 1988; 1990), an interesting summary of 16 evaluations was provided by Nielsen and Landauer (Nielsen & Landauer, 1993), who observed that the optimal benefit/cost ratio is already reached at 4 participants, with saturation around 16 subjects. Accordingly, the divide and conquer principle applied to the architecture and GUI also makes sense for studies: it is more beneficial to conduct multiple small studies than a huge one.

The pilot followed a within-subjects design and had 12 participants, who compared the ensemble with Watson and afterwards disclosed their general impression of QA systems. All age groups, English proficiency levels and educational levels except for Bachelor at University of Applied Sciences were covered. The interest was notably high, with 812 questions asked in total, which equals 67.7 questions per user on average. While the shortest completion time was 24 minutes, some users voluntarily spent over 2 hours experimenting with the system. The first ensemble was rated as either promising (7) or already useful (5), whereas Watson was consistently considered to be faster than the ensemble. 8 reported Watson to be already useful, which is 3 more than Yoda, but 3 thought it needs a lot of work (compared to none for Yoda). Watson was considered useful for explanations by 5 (vs. 1) subjects (both regarding Healthcare and Travel). For each system, 5 participants even considered it for research. Overall, nobody considered QA systems a toy or a technical revolution. 8 thought of them as a valuable addition to search engines and 3 saw the potential to replace them. 11 found them useful for learning and everyday questions, 7 for research.

The main study compared YodaQA in its default configuration with YodaQA plus the ensemble and used a between-subjects design with a 2 × 2 Latin square in order to counterbalance which system was used first and thus served as the point of comparison. Furthermore, the 16 subjects were equally divided into male and female participants. As a result, the ensemble was reported to be both faster (median 4.5, avg. 4.44) and to provide more correct answers (5, 4.81). The answer quality (4, 4.31), answer details (4, 4.06) and messaging metaphor (4, 4.19) were rated well, and the predefined answer classifier even achieved excellent results (5, 4.625).

The final user study was a Grand Challenge in which 26 humans (13 male, 13 female) competed against the QA ensemble on 30 factoid questions provided by a colleague not involved in the project. The best human managed to answer 24 questions correctly, while the human average was 11. The ensemble, on the other hand, was able to answer 25 questions, one more than the best human and more than twice as many as the average human competitor. This improvement is statistically highly significant: a one-tailed single-sample t-test at a significance level of 0.01 yields a p-value below 0.00001.


5  Conclusion

While, as shown, the ensemble exhibits superhuman performance (twice as many correct answers as before) with average response times significantly lower than 10 seconds (speedup of almost 3 times) and has evoked tremendous user engagement, it also has clear limitations. While at least three out of four factoid questions are answered correctly, oftentimes along with evidence like Wikipedia articles, which provides a starting point for learning and research, the system still lacks the deeper understanding to give an account of a field, to provide explanations or to automatically add meaningful information where appropriate. While intercepting answers alleviates the issue and is well accepted by users, it is clearly a workaround until better reasoning capabilities become available. The messaging metaphor has proven itself useful to explore a topic, but Niegemann has pointed out that highly self-directed learning rarely leads to better results. Thus, instead of stopping at QA, the instructor should be kept in the loop similar to Wang et al.’s approach. The biggest challenge, however, is turning QA ensembles into actual research tools. While response times and answer rates are already fine for an initial version, the system needs to support papers and books. Distributional semantics, keyphrase extraction, the ensemble itself as well as document conversion are steps in this direction, but much work remains, especially since YodaQA is still limited to TODs. It is an interesting property that the system supports both lists of answers as well as selecting the most promising answer. This way, the response can be adapted to the target group: Based on the experience with IBM Watson, to accommodate researchers it might be beneficial to provide text passages rather than isolated answers and go back to lists of the most promising answers. Summa summarum, the good news is that all remaining work except context sensitivity is already prepared in the current project. 
The system can be easily deployed via Docker, many components are now also available on Docker Hub, and with very few modifications it can be integrated into existing systems through its web interface. In the simplest scenario, one could slightly change the final step of this research project to allow students to either ask an instructor or search the web if a question was not answered to their satisfaction, and use the system in production today.


References

   Biemann, C. and Riedl, M. Text: Now in 2D! A framework for lexical expansion with contextual similarity. Journal of Language Modelling, 1(1):55–95, April 2013.

   Bollacker, Kurt, Evans, Colin, Paritosh, Praveen, Sturge, Tim, and Taylor, Jamie. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data, pp. 1247–1250. ACM, 2008.

   Bollinger, Toni and Pletat, Udo. The LILOG knowledge representation system. SIGART Bulletin, 2(3):22–27, 1991. URL http://dblp.uni-trier.de/db/journals/sigart/sigart2.html#BollingerP91.

   Chu-Carroll, J., Brown, E. W., Lally, A., and Murdock, J. W. Identifying implicit relationships. IBM Journal of Research and Development, 56(3.4):12:1–12:10, May 2012a. ISSN 0018-8646. doi: 10.1147/JRD.2012.2188154.

   Chu-Carroll, J., Fan, J., Boguraev, B. K., Carmel, D., Sheinwald, D., and Welty, C. Finding needles in the haystack: Search and candidate generation. IBM Journal of Research and Development, 56(3.4):6:1–6:12, May 2012b. ISSN 0018-8646. doi: 10.1147/JRD.2012.2186682.

   Chu-Carroll, J., Fan, J., Schlaefer, N., and Zadrozny, W. Textual resource acquisition and engineering. IBM Journal of Research and Development, 56(3.4):4:1–4:11, May 2012c. ISSN 0018-8646. doi: 10.1147/JRD.2012.2185901.

   Chu-Carroll, Jennifer, Czuba, Krzysztof, Prager, John M., Ittycheriah, Abraham, and Blair-Goldensohn, Sasha. IBM’s PIQUANT II in TREC 2004. In Voorhees, Ellen M. and Buckland, Lori P. (eds.), TREC, volume Special Publication 500-261. National Institute of Standards and Technology (NIST), 2004. URL http://dblp.uni-trier.de/db/conf/trec/trec2004.html#Chu-CarrollCPIB04.

   Dalal, S. R. and Mallows, C. L. When should one stop testing software? Journal of the American Statistical Association, 83(403):872–879, 1988.

   Dalal, S. R. and Mallows, C. L. Some graphical aids for deciding when to stop testing software. IEEE Journal on Selected Areas in Communications, 8(2):169–175, Feb 1990. ISSN 0733-8716. doi: 10.1109/49.46868.

   Epstein, E. A., Schor, M. I., Iyer, B. S., Lally, A., Brown, E. W., and Cwiklik, J. Making Watson fast. IBM Journal of Research and Development, 56(3.4):15:1–15:12, May 2012. ISSN 0018-8646. doi: 10.1147/JRD.2012.2188761.

   Fan, J., Kalyanpur, A., Gondek, D. C., and Ferrucci, D. A. Automatic knowledge extraction from documents. IBM Journal of Research and Development, 56(3.4):5:1–5:10, May 2012. ISSN 0018-8646. doi: 10.1147/JRD.2012.2186519.

   Ferrucci, D. A. Introduction to “This is Watson”. IBM Journal of Research and Development, 56(3.4):1:1–1:15, May 2012. ISSN 0018-8646. doi: 10.1147/JRD.2012.2184356.

   Ferrucci, David and Lally, Adam. UIMA: An architectural approach to unstructured information processing in the corporate research environment. Nat. Lang. Eng., 10(3-4):327–348, September 2004. ISSN 1351-3249. doi: 10.1017/S1351324904003523. URL http://dx.doi.org/10.1017/S1351324904003523.

   Ferrucci, David, Lally, Adam, Verspoor, Karin, and Nyberg, Eric. Unstructured information management architecture (UIMA) version 1.0, OASIS standard. Technical report, OASIS, March 2009.

   Gondek, D. C., Lally, A., Kalyanpur, A., Murdock, J. W., Duboue, P. A., Zhang, L., Pan, Y., Qiu, Z. M., and Welty, C. A framework for merging and ranking of answers in DeepQA. IBM Journal of Research and Development, 56(3.4):14:1–14:12, May 2012. ISSN 0018-8646. doi: 10.1147/JRD.2012.2188760.

   Green, Jr., Bert F., Wolf, Alice K., Chomsky, Carol, and Laughery, Kenneth. Baseball: An automatic question-answerer. In Papers Presented at the May 9-11, 1961, Western Joint IRE-AIEE-ACM Computer Conference, IRE-AIEE-ACM ’61 (Western), pp. 219–224, New York, NY, USA, 1961. ACM. doi: 10.1145/1460690.1460714. URL http://doi.acm.org/10.1145/1460690.1460714.

   Harris, Zellig. Methods in Structural Linguistics. University of Chicago Press, 1951.

   Kalyanpur, A., Boguraev, B. K., Patwardhan, S., Murdock, J. W., Lally, A., Welty, C., Prager, J. M., Coppola, B., Fokoue-Nkoutche, A., Zhang, L., Pan, Y., and Qiu, Z. M. Structured data and inference in DeepQA. IBM Journal of Research and Development, 56(3.4):10:1–10:14, May 2012a. ISSN 0018-8646. doi: 10.1147/JRD.2012.2188737.

   Kalyanpur, A., Patwardhan, S., Boguraev, B. K., Lally, A., and Chu-Carroll, J. Fact-based question decomposition in DeepQA. IBM Journal of Research and Development, 56(3.4):13:1–13:11, May 2012b. ISSN 0018-8646. doi: 10.1147/JRD.2012.2188934.

   Kupiec, Julian M. Natural Language Information Retrieval, chapter MURAX: Finding and Organizing Answers from Text Search, pp. 311–332. Springer Netherlands, Dordrecht, 1999. ISBN 978-94-017-2388-6. doi: 10.1007/978-94-017-2388-6_13. URL http://dx.doi.org/10.1007/978-94-017-2388-6_13.

   Lally, A., Prager, J. M., McCord, M. C., Boguraev, B. K., Patwardhan, S., Fan, J., Fodor, P., and Chu-Carroll, J. Question analysis: How Watson reads a clue. IBM Journal of Research and Development, 56(3.4):2:1–2:14, May 2012. ISSN 0018-8646. doi: 10.1147/JRD.2012.2184637.

   Lehmann, Jens, Isele, Robert, Jakob, Max, Jentzsch, Anja, Kontokostas, Dimitris, Mendes, Pablo N., Hellmann, Sebastian, Morsey, Mohamed, van Kleef, Patrick, et al. DBpedia – a large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web, 2014.

   Lehnert, W.G. The Process of Question Answering: A Computer Simulation of Cognition. Artificial Intelligence Series. L. Erlbaum Associates, 1978. ISBN 9780470264850. URL https://books.google.de/books?id=iupQAAAAMAAJ.

   Lewis, B. L. In the game: The interface between Watson and Jeopardy! IBM Journal of Research and Development, 56(3.4):17:1–17:6, May 2012. ISSN 0018-8646. doi: 10.1147/JRD.2012.2188932.

   McCord, M. C., Murdock, J. W., and Boguraev, B. K. Deep parsing in Watson. IBM Journal of Research and Development, 56(3.4):3:1–3:15, May 2012. ISSN 0018-8646. doi: 10.1147/JRD.2012.2185409.

   Murdock, J. W., Fan, J., Lally, A., Shima, H., and Boguraev, B. K. Textual evidence gathering and analysis. IBM Journal of Research and Development, 56(3.4):8:1–8:14, May 2012a. ISSN 0018-8646. doi: 10.1147/JRD.2012.2187249.

   Murdock, J. W., Kalyanpur, A., Welty, C., Fan, J., Ferrucci, D. A., Gondek, D. C., Zhang, L., and Kanayama, H. Typing candidate answers using type coercion. IBM Journal of Research and Development, 56(3.4):7:1–7:13, May 2012b. ISSN 0018-8646. doi: 10.1147/JRD.2012.2187036.

   Nielsen, Jakob and Landauer, Thomas K. A mathematical model of the finding of usability problems. In Proceedings of the INTERACT ’93 and CHI ’93 Conference on Human Factors in Computing Systems, CHI ’93, pp. 206–213, New York, NY, USA, 1993. ACM. ISBN 0-89791-575-5. doi: 10.1145/169059.169166. URL http://doi.acm.org/10.1145/169059.169166.

   Prager, J. M., Brown, E. W., and Chu-Carroll, J. Special questions and techniques. IBM Journal of Research and Development, 56(3.4):11:1–11:13, May 2012. ISSN 0018-8646. doi: 10.1147/JRD.2012.2187392.

   Prager, John M., Chu-Carroll, Jennifer, Czuba, Krzysztof, Welty, Christopher A., Ittycheriah, Abraham, and Mahindru, Ruchi. IBM’s PIQUANT in TREC2003. In Voorhees, Ellen M. and Buckland, Lori P. (eds.), TREC, volume Special Publication 500-255, pp. 283–292. National Institute of Standards and Technology (NIST), 2003. URL http://dblp.uni-trier.de/db/conf/trec/trec2003.html#PragerCCWIM03.

   Tesauro, G., Gondek, D. C., Lenchner, J., Fan, J., and Prager, J. M. Simulation, learning, and optimization techniques in Watson’s game strategies. IBM Journal of Research and Development, 56(3.4):16:1–16:11, May 2012. ISSN 0018-8646. doi: 10.1147/JRD.2012.2188931.

   Wang, C., Kalyanpur, A., Fan, J., Boguraev, B. K., and Gondek, D. C. Relation extraction and scoring in DeepQA. IBM Journal of Research and Development, 56(3.4):9:1–9:12, May 2012. ISSN 0018-8646. doi: 10.1147/JRD.2012.2187239.

   Wilensky, Robert, Chin, David N., Luria, Marc, Martin, James, Mayfield, James, and Wu, Dekai. The Berkeley UNIX Consultant project. Comput. Linguist., 14(4):35–84, December 1988. ISSN 0891-2017. URL http://dl.acm.org/citation.cfm?id=65120.65123.

   Winograd, Terry A. Procedures as a representation for data in a computer program for understanding natural language. PhD thesis, 1971. URL http://opac.inria.fr/record=b1000605.

   Woods, W. A. Progress in natural language understanding: An application to lunar geology. In Proceedings of the June 4-8, 1973, National Computer Conference and Exposition, AFIPS ’73, pp. 441–450, New York, NY, USA, 1973. ACM. doi: 10.1145/1499586.1499695. URL http://doi.acm.org/10.1145/1499586.1499695.

“You are tired of always needing answers. Always answering questions. Always
asking questions that demand answers. Pretending all the questions have to be
answered. Pretending there are actually answers. And even getting paid to
convince others they’re true. That there is such a thing as right answers.”

Thomas Lloyd Qualls, Waking Up at Rembrandt’s
