research blog

Corpus Viewer

Sean Stewart - 2 December 2012

The Corpus Viewer is finished!

Allowing our users to view our corpus easily and functionally is a top priority. Rather than building a hierarchy of pages connecting section to chapter, we settled on a floating chapter navigator to solve the issue.

The chapter navigator is only available in the Corpus Viewer. It loads at the bottom of the page, presenting the user with a quick way to switch sections via controls at either end of the navigation bar. Selecting a different section tints the navigation bar accordingly, letting the user recognise at a glance which section they are currently in.

Try it out and give us some feedback on how the presentation works. What would you do to make it better with ease-of-use and simplicity in mind?

Verne Countdown

Beth DeVito - 30 November 2012

The countdown is on! With less than two weeks before the project deadline, team Verne has continued to make updates to its website, and has drawn up a critical checklist of the final activities needed to complete the analysis and presentation of Verne’s verbs! At this point, we have completed certain key aspects of our site, including the overall design and a main project page explaining our research objectives. To this we have added Source page information, with our different markup strategies represented in the provided links. In this way, users will be able to explore on their own the techniques used throughout the progression of our project, develop a better understanding of how the various software is used, and perhaps treat the page as a resource for further research (whether our own or others').

However, there is still much to be done! Our current tasks include finalizing our hash table so that verbs receive appropriate categorization in their conjugated forms. Once it is ready, we will begin data collection, which we must then realize in visual representations on our site. The conclusions page will be a crucial element that will require much attention. Apart from this, we need to place our text in a readable form on the website. We had initially planned to present it in a simulated book form, but may still tinker a bit with other display possibilities. We also want to provide an interactive presentation of the verbs within the text, and are considering linking each verb to a search page, or adding a hovering window that provides verbal information. So, lots to do with little time… We’re off to work!

New Pages Added

Sean Stewart - 30 November 2012

Not much of this week was focused on the linguistic data of the work. Instead, we mostly focused on updating our website to fill in a lot of what we had pushed off to the side for "later". Well, later's almost here, so we need to get truckin'.

Beth and I put some good thought into how to present general information about our project in ways that are succinct and visually appealing. We created a Source page that contains links to all of the in-house things we used: XSLT sheets, our corpus – available in full or chapter-by-chapter – a program we wrote to communicate with Xerox, photos of how we planned our website, and a few other things as well.

In addition to this Source page, we also drafted an About page that talks a little bit about our motivation for this project, why we chose this topic and author, and, more importantly, a little bit about us: the researchers.

We're certain that our design won't be changing any time soon, so we're melding these new bits of inspiration we have to fit the site's layout in general, being sure to keep with colour schemes and the like. The only thing we've got left to do is present our corpus to our readers, which - well - we're going to need some help with.

Thanksgiving Time

Beth DeVito - 21 November 2012

So, we have a site updated with our project info! This past week and a half, we've been doing some website layout strategizing, which we will delve into more deeply once we've collected some results.

Sean has been laboriously developing a hash table, which will help label our verbs further – Xerox gave us each verb, but no tense. The hash table will categorize each verb by its form, so we don't have to do that over 4,000 times (woohoo!). Some irregulars may have to undergo human markup, but the hash table will surely save us lots of time! Once it is fully in place and ready to go, we will be able to classify every verb, and after that, data collection will be our last step!
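The idea behind the hash table can be sketched in a few lines of Python (the suffix table below is illustrative, showing a handful of standard regular -er endings – our real table is much larger, and irregulars like fut fall through to manual markup):

```python
# Sketch: map regular French verb endings to a tense label.
# The suffix table here is illustrative, not our full production table.
SUFFIX_TO_TENSE = {
    "ait": "imperfect",      # e.g. parlait
    "aient": "imperfect",    # e.g. parlaient
    "a": "simple past",      # e.g. parla (-er verbs)
    "èrent": "simple past",  # e.g. parlèrent
    "era": "future",         # e.g. parlera
}

def guess_tense(verb: str) -> str:
    """Return a tense label by matching the longest known suffix."""
    for suffix in sorted(SUFFIX_TO_TENSE, key=len, reverse=True):
        if verb.endswith(suffix):
            return SUFFIX_TO_TENSE[suffix]
    return "unknown"  # irregulars fall through to manual markup

print(guess_tense("parlait"))    # imperfect
print(guess_tense("parlèrent"))  # simple past
print(guess_tense("fut"))        # unknown -> manual markup
```

Matching the longest suffix first matters: parlait must hit "ait" (imperfect) rather than the bare "a" of the simple past.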

Time is ticking, and we're moving along!

Verne's Verbs

Beth DeVito - 9 November 2012

Within the past week, the team has worked on markup strategies as well as the layout and content organization of the webpage. First, coding strategies. With the help of the Xerox software, we first translated its grammatical markings into XML-acceptable tags. Using some XSLT stylesheets, Sean worked his magic in creating a list of the distinct verb forms that appear. This cut our verb list down from 19,959 to a mere ~4,500 – whew! Many verbs, such as était (‘was’), contributed to the inflated number because of their frequency. Our job now is to link the different forms of the same verb, and then mark up each according to tense and conjugation. To do this, we are developing a sort of hash table that will serve as an automatic determiner of verb tense. If we have correctly identified the general spelling patterns, the table will hopefully give us appropriate output for regular verb constructions. We still expect to do some manual markup ourselves after having looped through this process.
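The deduplication step that produced the ~4,500 distinct forms was done in XSLT; the same idea can be sketched in Python (the token list below is a made-up sample, not our data):

```python
# Sketch: collapse a token list to distinct verb forms, keeping counts.
# Mirrors what distinct-values() did for us in the XSLT stylesheets.
from collections import Counter

tokens = ["était", "dit", "était", "vit", "était", "dit"]  # toy sample

counts = Counter(tokens)   # frequency of each surface form
distinct = sorted(counts)  # the deduplicated verb list

print(distinct)            # ['dit', 'vit', 'était']
print(counts["était"])     # 3
```

Keeping the counts alongside the distinct list is what lets a frequent form like était collapse from hundreds of occurrences to a single entry.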

As far as the webpage is concerned, we have added content and brainstormed more specific display and UX components. We’ve written up some of the body of our research project goals section and have developed tentative section divisions. In addition to our home page, which has served as a blog-style update section, we plan to add an ‘About’ tab with general information about the project and the software used. The Corpus tab will, of course, display our text with replicated book pages and borders. We plan to add different navigational options to this text, as well as links on every verb: a box that appears when the mouse hovers, and a clickable hyperlink leading to results from the Search tab. This section may be far-fetched/too complicated, but we hope to give the user the ability to search for a certain word, with output giving that verb in all of its forms and their respective contexts (the sentence, perhaps?). From there, one could link the verb page back to the page (in the Text tab) where it is found. Our Data section will, of course, give different presentations of our results. Conclusions will correspond with this, as it will explain our results and how we got there. With this updated layout and the progression of our markup process, we think we’ve made great strides this week!

A Thousand Leagues Later

Sean Stewart - 6 November 2012

An XSLT transformation sheet was created to accommodate the mass of data received from Xerox, which arrived as malformed HTML. All of the incorrect HTML structure was corrected with HTML Tidy: we wrote a C program that uses OS X's built-in libtidy library and sent all of our chapters from Xerox through it to rectify the malformedness.
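Our actual cleanup program was written in C against libtidy, but a quick sanity check that each repaired chapter really is well-formed is just to try parsing it – sketched here in Python:

```python
# Sketch: check whether a (tidied) chapter now parses as well-formed XML.
import xml.etree.ElementTree as ET

def is_well_formed(markup: str) -> bool:
    """True if the markup parses without error."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False

print(is_well_formed("<p>le capitaine Nemo</p>"))  # True
print(is_well_formed("<p>le capitaine Nemo"))      # False (unclosed tag)
```

Running every tidied chapter through a check like this catches any file the repair step missed before the XSLT stage sees it.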

Xerox ultimately provided us with disambiguated lexemes and senses, which we sorted through to create a new XML page for each chapter, in which each surface form carries the correct underlying form (e.g. le vs. l') and the correct part of speech.
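The per-chapter output looks roughly like the following sketch (the element and attribute names here are illustrative – our real transformation sheets define the actual schema):

```python
# Sketch: emit a per-chapter XML file pairing each surface form with its
# underlying form and part of speech. Element/attribute names are
# illustrative; the real schema lives in our transformation sheets.
import xml.etree.ElementTree as ET

# (surface form, underlying form, part of speech) -- toy sample
tokens = [
    ("l'", "le", "DET"),
    ("homme", "homme", "NOUN"),
    ("parlait", "parler", "VERB"),
]

chapter = ET.Element("chapter", n="1")
for surface, underlying, pos in tokens:
    w = ET.SubElement(chapter, "w", lemma=underlying, pos=pos)
    w.text = surface

xml_out = ET.tostring(chapter, encoding="unicode")
print(xml_out)
```

Keeping the surface form as element text and the underlying form as an attribute means the original text can always be reconstructed by concatenating the element contents.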

Check our Source page for access to these transformation sheets and XML files.

Update to the Stanford Tagger

Sean Stewart - 19 October 2012

While the Stanford tagger proved to be a step in the right direction for our project, David and Na Rae helped us find a better solution: the Xerox NLP Tools.

Integrating with Xerox was the difficult part. We had to develop an additional program (in Objective-C in this instance, though realistically any language would have worked) that uses sockets to transmit JSON-encoded data to and from Xerox's JSON REST API. After battling with the API for hours, we discovered how to get Xerox to mark up Verne's 20,000 Leagues Under the Sea. We let the process run overnight: because Xerox's NLP tools disambiguate meaning accurately, they had to build a network of relations between all the words in the text. The entire process took exactly 12,700 seconds (about 3.53 hours).
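We can't reproduce Xerox's API schema here, but the shape of the exchange – a JSON body framed in an HTTP request and pushed over a socket – looks roughly like this sketch. The endpoint path, host, and payload fields are all placeholders, and the request is built but deliberately not sent:

```python
# Sketch: frame a JSON request for an NLP REST service. The path, host,
# and field names are placeholders -- Xerox's real API schema differs.
import json

def build_request(text: str) -> bytes:
    payload = {"lang": "fr", "text": text}      # hypothetical fields
    body = json.dumps(payload).encode("utf-8")
    headers = (
        "POST /analyze HTTP/1.1\r\n"            # hypothetical path
        "Host: nlp.example.com\r\n"             # placeholder host
        "Content-Type: application/json\r\n"
        f"Content-Length: {len(body)}\r\n\r\n"
    ).encode("ascii")
    return headers + body

request = build_request("Le capitaine parlait.")
print(request.decode("utf-8"))
```

Our actual program hands bytes like these to a socket and reads the JSON response back the same way, one chapter at a time.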

Check our Source page to download the XCode project responsible for our Xerox communication.

Stanford Part-of-Speech Tagger

Sean Stewart - 18 October 2012

For our project, Beth and I are going to use the Stanford Part-of-Speech Tagger. For our research, we wish to find all of the verbs used by Jules Verne in his work "Vingt mille lieues sous les mers" (Twenty Thousand Leagues Under the Sea). To do so, we could have read the entire text and marked each verb's tense and aspect. Instead, we have acquired a pre-marked text and will send it through the Stanford POS tagger, which will spit out, for example, a string such as: A_DT passenger_NN plane_NN has_VBZ crashed_VBN shortly_RB after_IN take-off_NN
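Pulling the verbs out of that word_TAG format is straightforward; a minimal sketch (using the English demo string above – the French model has its own tagset, but its verb tags likewise start with V):

```python
# Sketch: extract the verbs from the tagger's word_TAG output.
# Sample line is the English demo string; verb tags start with "V".
tagged = "A_DT passenger_NN plane_NN has_VBZ crashed_VBN shortly_RB after_IN take-off_NN"

verbs = [
    word
    for word, _, tag in (tok.rpartition("_") for tok in tagged.split())
    if tag.startswith("V")
]

print(verbs)  # ['has', 'crashed']
```

Splitting on the last underscore (rpartition) rather than the first keeps hyphen- and underscore-bearing words like take-off intact.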

We will then analyse each verb (denoted _V in the French tagset) to see which form (literary or non-literary) it has been written in. Ultimately, our goal is to analyse the text as a whole to see whether any of Verne's characters use the literary past tense in quotations, or whether the literary past is used solely in Verne's own narration of past events.
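The narration-versus-quotation question comes down to deciding, for each verb occurrence, whether it falls inside quoted speech. A minimal sketch, assuming the text marks quotations with French guillemets « » (the sample sentence is invented):

```python
# Sketch: decide whether a verb occurrence falls inside quoted speech,
# assuming quotations are marked with French guillemets « ».
def in_quotation(text: str, index: int) -> bool:
    """True if the character at `index` sits inside a « ... » span."""
    depth = 0
    for i, ch in enumerate(text):
        if i == index:
            return depth > 0
        if ch == "«":
            depth += 1
        elif ch == "»":
            depth = max(0, depth - 1)
    return False

sample = "Le narrateur parla. « Je parlai », dit Aronnax."
print(in_quotation(sample, sample.index("parla")))   # False (narration)
print(in_quotation(sample, sample.index("parlai")))  # True (quoted speech)
```

Cross-referencing a check like this against the hash table's tense labels is what would let us count literary-past verbs separately for narration and dialogue.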