Ashurbanipal, a text recommendation engine

Posted on October 6, 2015 by Tommy McGuire
Labels: ashurbanipal, books, digital humanities, java

Summary

The Ashurbanipal project is a prototype of a text recommendation engine based entirely on attributes internal to the texts, such as style and topic. It is based on a snapshot of English texts from Project Gutenberg in 2010 and includes approximately 24,000 works of no-longer-in-copyright or otherwise freely-available fiction and non-fiction.

The URL is http://dpg.crsr.net.

Introduction

Recommendation engines are very important, both economically and, well, culturally. Now, I assume I can leave the first side of that "and" to your imagination, but for the second, they have become some of the most important ways that cultural artifacts such as books and videos are discovered in this world. Not necessarily by replacing other means, but by being more ubiquitous and by being at least reasonably successful.

On the other hand, most recommendation engines have a significant flaw: they are, essentially, popularity contests. Consider the Amazon recommendation system, for books and for everything else you need in daily life, or the Netflix system that was the subject of a relatively recent computer science contest: both are built on either

In neither case do the recommendation systems look at any attributes in the articles being recommended. Certainly, they look at attributes of the articles, like author, cast and category, but as far as I have been able to find out, Amazon doesn't open the text of the books being recommended and the Netflix Prize contest certainly involved nothing of the contents of the movies in the data set.

Existing recommendation systems work well. (I'm still wondering how my gen-one Tivo knew to record a cool old psychic documentary narrated by Leonard Nimoy after I'd had it set up for less than six hours.) There is no question that they are pretty good at what they do and what they do is a big part of the solution to the overall problem of discovery.

But popularity-based recommendation engines have a glaring weakness: unpopular artifacts. If a video is too new to have any review data and is not a production of a known cast, it's not likely to appear in any recommendation lists, even if it's exactly what someone wants to see. Books by long-dead authors are still pretty readable and still pretty good. (Well, ok, I admit I'm not a wild fan of Tess of the d'Urbervilles.) But I have yet to hear of anyone who has had H. Rider Haggard show up in their Amazon recommendations uninvited, in spite of having a plethora of published options. If an artifact has few connections to other, well-known artifacts, like ratings or shopping-cart interactions, then it isn't really even a candidate for recommendation.

What to do about this situation? I personally know of a couple of ongoing attempts to build for music what I'll call "internal" recommendation engines (as opposed to recommendation engines built on factors "external" to the item under recommendation): you like that beat, we'll find you other songs with beats like that, and so on.

Unfortunately, I'm not especially interested in music and I really have no idea what I'm doing. Fortunately, however, I am interested in books (Muahahaha, books, books, hahaha.) (Why does that always happen?) and I do have some experience munging around with text.

My attempt at building an internal text recommendation engine is Ashurbanipal.

Ashurbanipal

Ashurbanipal himself was the last (successful) king of the Neo-Assyrian empire. Like all of the other Assyrian kings (and most of the other ancient near-eastern rulers) he was a right rat bastard, cruel to his enemies and big-talking in his monuments. In fact, the British Museum (Naturally; like, where else would it be?) has a relief of him and his wife enjoying a lovely garden party with the king of Elam's head hanging from a nearby tree.

But Ashurbanipal's claim to my interest is his library, which is the largest single known collection of ancient cuneiform literature, if I recall correctly. Ashurbanipal may or may not have been a scholar-prince, taught to read and write because he was not expected to take the throne. The library also may or may not have been the first ancient library to be indexed, may or may not have been intended to preserve Sumerian and Akkadian cuneiform literature and culture in the face of the post-bronze age Aramaic culture, and may or may not have been the inspiration for Alexander's library in Alexandria.

Ashurbanipal, the software, is a collection of mostly Java (and a little Rust, at the moment) applications currently built to

Probably the most interesting program in this collection is run-tag-todolist and the Java program TagTodoList.java that backs it, the program which processes Project Gutenberg text files (stored in .zip form on the April 2010 DVD image; the simplest way to download a good data set) and produces the style and topic data that is the subject of this current bloggage. But more on that later.

What the beast does

Texts, particularly but not exclusively fiction, have a number of interesting internal attributes which could provide a set of handles for a recommendation engine. Attributes like style, topic, plot, characterization, and undoubtedly (well, hopefully) others. Ashurbanipal currently uses the first two of those.

Style is my first target, because it has been extensively studied (sort-of) and approached in a fashion that I can use for recommendations (if you squint a little).

Style

In practice, stylometry is typically used for authorship attribution questions: Did Shakespeare really write this piece of garbage? Is Lief Erickson responsible for the stupid joke about whoever is buried in Grant's tomb? This seems a reasonable situation; if you like Charles Dickens' writing in A Tale of Two Cities, you might very well like his writing in Bleak House. (One thing I've noticed is that many style recommendation lists lead rather shortly to Bleak House. I've got no idea what that means.)

Many different approaches have been taken to stylometry, from plausible sounding but ultimately unhelpful things like sentence length to completely bogus, did-anyone-ever-buy-this? things (cough, cough, cusum). One standard method, however, seems to have bubbled to the top, due both to success and computational ease: the proportion of various "function words" or "stop words" in a segment of the text. Function words are those which carry little actual meaning in the text, but which serve to provide the grammatical structure on which content or lexical words hang like shiny cherries on the tree. They're sometimes known as "stop words" because they're ignored in most diddling-about with words, a short-sighted and uncouth fact rarely mentioned by "traditional" "computational" "linguists".

The idea is that you grub out say 5000 words from a text and count the number of uses of "an", "the", "or" and so on. The counts of each roughly match other segments from the same author and significantly differ for different authors.
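To make the function-word idea concrete, here is a minimal sketch in Java. It is not part of Ashurbanipal (which, as described next, takes a different route); the class name FunctionWordProfile, the eight-word list, and the crude regex tokenizer are all assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class FunctionWordProfile {
    // A tiny, illustrative list; a real study would use a much longer one.
    static final List<String> FUNCTION_WORDS =
        List.of("a", "an", "the", "and", "or", "of", "to", "in");

    /** Proportion of each function word among all word tokens in the segment. */
    static Map<String, Double> profile(String segment) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String w : FUNCTION_WORDS) {
            counts.put(w, 0);
        }
        int total = 0;
        // Crude tokenizer: lower-case, then split on anything that is not a letter or apostrophe.
        for (String token : segment.toLowerCase(Locale.ROOT).split("[^a-z']+")) {
            if (token.isEmpty()) {
                continue;
            }
            total++;
            counts.computeIfPresent(token, (k, v) -> v + 1);
        }
        Map<String, Double> result = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            result.put(e.getKey(), total == 0 ? 0.0 : e.getValue() / (double) total);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(profile("The whale and the ship went to the bottom of the sea."));
    }
}
```

Two segments by the same author should give roughly similar profiles; comparing them is then a matter of picking a distance measure.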

In my own nigh-infinite wisdom, I completely ignored this tactic. (In fact, I didn't read about it until I'd already started writing code, and I rather like my approach so I'm running with it.) Instead, in a fine application of my "If all you have is a hammer and a screwdriver, every problem looks like a threaded nail" principle, I took an off-the-shelf part-of-speech tagger and counted the number of each reported part of speech for each text, which I then normalized by dividing the counts by the total number of words in the text.
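The counting-and-normalizing step can be sketched as follows. This is a simplified stand-in, not the code in TagTodoList.java: the class name PosProportions is hypothetical, and the tagger's output is assumed to have been reduced already to a flat list with one Penn Treebank tag per word.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class PosProportions {
    /** Maps each tag to (occurrences of that tag) / (total number of tagged words). */
    static Map<String, Double> fromTags(List<String> tags) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String tag : tags) {
            counts.merge(tag, 1, Integer::sum);
        }
        Map<String, Double> proportions = new TreeMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            proportions.put(e.getKey(), e.getValue() / (double) tags.size());
        }
        return proportions;
    }

    public static void main(String[] args) {
        // Tags for something like "the whale ate the ship"
        System.out.println(fromTags(List.of("DT", "NN", "VBD", "DT", "NN")));
    }
}
```

Dividing by the word count is what lets a short story and a three-volume novel be compared on the same scale.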

The result is a matrix with one row per text, containing approximately 45 columns with headings like "singular common noun", "determiner", and less obviously, "numeral". (Actually, it uses the Penn Treebank tag-set, so those are "NN", "DT", and "CD".) Each value in the row is a number between zero and one, typically very close to zero. To make style recommendations based on a chosen text, I compute the Euclidean distance between that text's row and the rows of all the others, then sort the list by the resulting distances. It seems to produce reasonable answers.
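As a rough illustration of the ranking step, here is a sketch that assumes each text's row has been loaded into a double array keyed by etext number. The class and method names are made up for this example, and the real programs may organize the data quite differently.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

final class StyleRanking {
    /** Euclidean distance between two rows of the style matrix. */
    static double distance(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("rows must have the same number of columns");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Etext numbers ordered from nearest to farthest; the chosen text itself comes first. */
    static List<Integer> rank(int chosen, Map<Integer, double[]> rows) {
        double[] base = rows.get(chosen);
        if (base == null) {
            throw new IllegalArgumentException("unknown text: " + chosen);
        }
        List<Integer> ids = new ArrayList<>(rows.keySet());
        ids.sort(Comparator.comparingDouble((Integer id) -> distance(base, rows.get(id))));
        return ids;
    }
}
```

With roughly 24,000 rows of about 45 columns each, a brute-force scan like this is cheap enough that no clever indexing is needed.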

(I fully intend to collect function-word information at some point soonish and compare those stylistic results to the POS results I have. However, so far I have done little in the way of cross validation. So there, nyah.)

For one example, the first book from a different author in the list of style recommendations for Jane Austen's Sense and Sensibility is His Heart's Queen by Mrs. Georgie Sheldon (1843-1926; slightly later than I would have expected). Using the "Page 63" test (i.e., turn to page 63 of a book and read it to see if the author has been smoking too much crack to be acceptable; in actual fact, I scrolled down until the tabs were a ways down and approximately equivalent), I find

"No; my feelings are not often shared, not often understood. But sometimes they are." As she said this, she sunk into a reverie for a few moments; but rousing herself again, "Now, Edward," said she, calling his attention to the prospect, "here is Barton valley. Look up to it, and be tranquil if you can. Look at those hills! Did you ever see their equals? To the left is Barton park, amongst those woods and plantations. You may see the end of the house. And there, beneath that farthest hill, which rises with such grandeur, is our cottage."

"It is a beautiful country," he replied; "but these bottoms must be dirty in winter."

"How can you think of dirt, with such objects before you?"

"Because," replied he, smiling, "among the rest of the objects before me, I see a very dirty lane."

"How strange!" said Marianne to herself as she walked on.

"Have you an agreeable neighbourhood here? Are the Middletons pleasant people?"

"No, not all," answered Marianne; "we could not be more unfortunately situated."

"Marianne," cried her sister, "how can you say so? How can you be so unjust? They are a very respectable family, Mr. Ferrars; and towards us have behaved in the friendliest manner. Have you forgot, Marianne, how many pleasant days we have owed to them?"

"No," said Marianne, in a low voice, "nor how many painful moments."

Elinor took no notice of this; and directing her attention to their visitor, endeavoured to support something like discourse with him, by talking of their present residence, its conveniences, &c. extorting from him occasional questions and remarks. His coldness and reserve mortified her severely; she was vexed and half angry; but resolving to regulate her behaviour to him by the past rather than the present, she avoided every appearance of resentment or displeasure, and treated him as she thought he ought to be treated from the family connection.

from Sense and Sensibility; and this

"Oh, I was afraid you would think me very bold---that you would regard me with contempt," Violet sighed, tremulously. "After my letter had gone, and I tried to think over what I had written more calmly, and to wonder how you would regard it, I was almost sorry that I had sent it."

"'Almost,' but not really sorry?" questioned Wallace, with a fond smile.

"No, for I had to tell you the truth, if I told you anything, and no one can be sorry for being strictly candid," she returned, "and," with a resolute uplifting of her pretty head, while she looked him straight in the eyes, "why should I not tell you just what was in my heart? Why does the world think that a woman must never speak, no matter if she ruins two lives by her silence? You told me that you loved me, although you did not ask me if I returned your affection; but I knew that my life would be ruined if I did not make you understand it. I do love you, Wallace, and I will not be ashamed because I have told you of it."

The young man was deeply moved by this frank, artless confession. He knew there was not a grain of indelicacy or boldness in it; it was simply a truthful expression of a pure and noble nature, the spontaneous outburst of a holy affection responding to the sacred love of his own heart, and the avowal aroused a profound reverence for an ingenuousness that was as rare as it was perfect.

He bent down and touched his lips to her silken hair.

"There is no occasion," he said, earnestly, "and you have changed all my life, my dear one, by adopting such a straightforward course. Still," he added, with a slight smile, "I did not come here intending to tell you just this, or with the hope that our interview would result in such open confessions."

"Did you not?" Violet asked, quickly, and darting a startling look at him.

from His Heart's Queen. Using the same technique, from Moby Dick

By the mainmast; Starbuck leaning against it.

My soul is more than matched; she's overmanned; and by a madman! Insufferable sting, that sanity should ground arms on such a field! But he drilled deep down, and blasted all my reason out of me! I think I see his impious end; but feel that I must help him to it. Will I, nill I, the ineffable thing has tied me to him; tows me with a cable I have no knife to cut. Horrible old man! Who's over him, he cries;---aye, he would be a democrat to all above; look, how he lords it over all below! Oh! I plainly see my miserable office,---to obey, rebelling; and worse yet, to hate with touch of pity! For in his eyes I read some lurid woe would shrivel me up, had I it. Yet is there hope. Time and tide flow wide. The hated whale has the round watery world to swim in, as the small gold-fish has its glassy globe. His heaven-insulting purpose, God may wedge aside. I would up heart, were it not like lead. But my whole clock's run down; my heart the all-controlling weight, I have no key to lift again.

[A burst of revelry from the forecastle.]

Oh, God! to sail with such a heathen crew that have small touch of human mothers in them! Whelped somewhere by the sharkish sea. The white whale is their demigorgon. Hark! the infernal orgies! that revelry is forward! mark the unfaltering silence aft! Methinks it pictures life. Foremost through the sparkling sea shoots on the gay, embattled, bantering bow, but only to drag dark Ahab after it, where he broods within his sternward cabin, builded over the dead water of the wake, and further on, hunted by its wolfish gurglings. The long howl thrills me through! Peace! ye revellers, and set the watch! Oh, life! 'tis in an hour like this, with soul beat down and held to knowledge,---as wild, untutored things are forced to feed---Oh, life! 'tis now that I do feel the latent horror in thee! but 'tis not me! that horror's out of me! and with the soft feeling of the human in me, yet will I try to fight ye, ye grim, phantom futures! Stand by me, hold me, bind me, O ye blessed influences!

And the first non-Herman Melville, non-Various text is Edgar Allan Poe's The Works of Edgar Allan Poe --- Volume 4:

But now a new horror presented itself, and one indeed sufficient to startle the strongest nerves. My eyes, from the cruel pressure of the machine, were absolutely starting from their sockets. While I was thinking how I should possibly manage without them, one actually tumbled out of my head, and, rolling down the steep side of the steeple, lodged in the rain gutter which ran along the eaves of the main building. The loss of the eye was not so much as the insolent air of independence and contempt with which it regarded me after it was out. There it lay in the gutter just under my nose, and the airs it gave itself would have been ridiculous had they not been disgusting. Such a winking and blinking were never before seen. This behavior on the part of my eye in the gutter was not only irritating on account of its manifest insolence and shameful ingratitude, but was also exceedingly inconvenient on account of the sympathy which always exists between two eyes of the same head, however far apart. I was forced, in a manner, to wink and to blink, whether I would or not, in exact concert with the scoundrelly thing that lay just under my nose. I was presently relieved, however, by the dropping out of the other eye. In falling it took the same direction (possibly a concerted plot) as its fellow. Both rolled out of the gutter together, and in truth I was very glad to get rid of them.

The bar was now four inches and a half deep in my neck, and there was only a little bit of skin to cut through. My sensations were those of entire happiness, for I felt that in a few minutes, at farthest, I should be relieved from my disagreeable situation. And in this expectation I was not at all deceived. At twenty-five minutes past five in the afternoon, precisely, the huge minute-hand had proceeded sufficiently far on its terrible revolution to sever the small remainder of my neck. I was not sorry to see the head which had occasioned me so much embarrassment at length make a final separation from my body. It first rolled down the side of the steeple, then lodge, for a few seconds, in the gutter, and then made its way, with a plunge, into the middle of the street.

Ok, that's gross. "A Predicament", I'm afraid; I'd never read that one before. But there we go! That was the goal! Success!

(Anyway, the problem with Various is that it is, in this particular case, an issue of Atlantic Monthly with no attributed articles. A collection by various authors, writing on different subjects, may well match some other text in toto, but it's not very likely to be valid. I'm not saying that an enthusiastic reader of Moby Dick wouldn't find something good in a given issue of Atlantic Monthly, but I don't want to say that they would, either. And the Page 63 thing isn't going to find it.)

In terms of non-fiction, here is The Descent of Man

I have remarked that sexual selection would be a simple affair if the males were considerably more numerous than the females. Hence I was led to investigate, as far as I could, the proportions between the two sexes of as many animals as possible; but the materials are scanty. I will here give only a brief abstract of the results, retaining the details for a supplementary discussion, so as not to interfere with the course of my argument. Domesticated animals alone afford the means of ascertaining the proportional numbers at birth; but no records have been specially kept for this purpose. By indirect means, however, I have collected a considerable body of statistics, from which it appears that with most of our domestic animals the sexes are nearly equal at birth. Thus 25,560 births of race- horses have been recorded during twenty-one years, and the male births were to the female births as 99.7 to 100. In greyhounds the inequality is greater than with any other animal, for out of 6878 births during twelve years, the male births were to the female as 110.1 to 100. It is, however, in some degree doubtful whether it is safe to infer that the proportion would be the same under natural conditions as under domestication; for slight and unknown differences in the conditions affect the proportion of the sexes. Thus with mankind, the male births in England are as 104.5, in Russia as 108.9, and with the Jews of Livonia as 120, to 100 female births. But I shall recur to this curious point of the excess of male births in the supplement to this chapter. At the Cape of Good Hope, however, male children of European extraction have been born during several years in the proportion of between 90 and 99 to 100 female children.

close to which comes Cactus Culture for Amateurs Being Descriptions of the Various Cactuses Grown in This Country, With Full and Practical Instructions for Their Successful Cultivation (hey, I'm really not making this stuff up).

C. speciosissimus (most beautiful). --Although not a night-flowering kind, nor yet a climber, yet this species resembles in habit the above rather than the columnar-stemmed ones. It is certainly the species best adapted for cultivation in small greenhouses or in the windows of dwelling-houses, as it grows quickly, remains healthy under ordinary treatment, is dwarf in habit, and flowers freely---characters which, along with the vivid colours and large size of the blossoms, render it of exceptional value as a garden plant. Its stems are slender, and it may be grown satisfactorily when treated as a wall plant. For its cultivation, the treatment advised for Phyllocactuses will be found suitable. When well grown and flowered it surpasses in brilliancy of colours almost every other plant known. Specimens with thirty stems each 6 ft. high, and bearing from sixty to eighty buds and flowers upon them at one time, may be grown by anyone possessing a warm greenhouse. The stems are three to five angled, spiny, the tufts of spines set in little disks of whitish wool. The flowers are as large as tea saucers, with tubes about 4 in. long, the colour being an intense crimson or violet, so intense and bright as to dazzle the eyes when looked at in bright sunlight. When cut and placed in water they will last three or four days. April and May. Mexico, 1820. "Numberless varieties have been raised from this Cereus, as it seeds freely and crosses readily with other species. Many years ago, Mr. D. Beaton raised scores of seedlings from crosses between this and C. flagelliformis, and has stated that he never found a barren seedling. Much attention was given to these plants about fifty years ago, for Sir E. Antrobus is said to have exhibited specimens with from 200 to 300 flowers each. 
I have been informed that an extremely large plant of this Cereus, producing hundreds of flowers every season, is grown on the back wall of a vinery at the Grange, Barnet, the residence of Sir Charles Nicholson, Bart." (L. Castle).

In point of fact, this choice of stylometric data is quite good at separating fiction from non-fiction, as well as (usually) finding texts authored in roughly the same time period.

Now that I can identify texts which read similarly (hey, they do to me, although I may be completely tone-deaf here), it would be nice to find texts which are related by something more solid.

Topic

As you can see by the Moby Dick to Edgar Allan Poe and Charles Darwin to cactus growing recommendations, style alone is likely to be a poor (or at least weird) overall choice. A more obvious recommendation would be based on something I think of as "topic".

A topic, as I think of the term in this context, is something combining elements of genre and setting, subject matter and background. As far as topics go, Sense and Sensibility and His Heart's Queen are English romances of a certain period; Moby Dick is a sea story with whales; Edgar Allan Poe is, well..., "A Predicament" is sort of a clock thing, or possibly an anatomical thing; The Descent of Man is science, natural history; and Cactus Culture for Amateurs similarly biological.

Defining a topic is hard, perhaps impossible. However, they're pretty easy to spot when you see them. For Ashurbanipal, I use the part-of-speech data to pick out common nouns from the text, count the nouns, and record the 200 most common nouns in each text. (This is, in fact, the screwdriver from my threaded-nail.)
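The noun-counting step might look something like the sketch below. It assumes the common nouns have already been picked out and reduced to base forms; the class name TopNouns and the alphabetical tie-break are choices made for this illustration, not something taken from the project.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class TopNouns {
    /** The `limit` most frequent nouns, most common first; ties are broken alphabetically. */
    static Set<String> mostCommon(List<String> nouns, int limit) {
        Map<String, Integer> counts = new HashMap<>();
        for (String noun : nouns) {
            counts.merge(noun, 1, Integer::sum);
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((x, y) -> {
            int byCount = Integer.compare(y.getValue(), x.getValue());
            return byCount != 0 ? byCount : x.getKey().compareTo(y.getKey());
        });
        Set<String> top = new LinkedHashSet<>();
        for (Map.Entry<String, Integer> e : entries) {
            if (top.size() >= limit) {
                break;
            }
            top.add(e.getKey());
        }
        return top;
    }
}
```

For a real text, limit would be 200; a text with fewer distinct nouns than that simply yields a smaller set.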

In order to avoid multiple entries for the same word, I use the Stanford POS tagger's edu.stanford.nlp.process.Morphology class to "lemmatize" the words, a process of, in effect, stemming the words with knowledge of their part of speech. This process should be able to tell the difference between the noun "meeting" and the verb "to meet".

Comparing texts is fairly easy; I use the Jaccard distance between the two sets of nouns to compute a number between zero and one representing the distance between the two sets. A smaller distance means the texts are closer together and therefore the candidate is a more likely recommendation.
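For reference, the Jaccard distance is one minus the size of the intersection divided by the size of the union. A minimal sketch follows; the class name is hypothetical, and treating two empty sets as distance zero is an arbitrary convention picked here.

```java
import java.util.HashSet;
import java.util.Set;

final class JaccardDistance {
    /** 1 - |A intersect B| / |A union B|; 0 means identical sets, 1 means nothing in common. */
    static double distance(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0; // convention chosen for this sketch
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int union = a.size() + b.size() - intersection.size();
        return 1.0 - (double) intersection.size() / union;
    }

    public static void main(String[] args) {
        System.out.println(distance(Set.of("whale", "ship", "sea"),
                                    Set.of("whale", "ship", "captain")));
    }
}
```

In the example in main, two of the four distinct nouns are shared, so the distance is one half.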

Book | Recommendations
Sense and Sensibility | Pride and Prejudice, Emma, Persuasion, Mansfield Park, Northanger Abbey, and Maria Edgeworth's Tales and Novels
Moby Dick | Fighting the Whales by R.M. Ballantyne, Old Jack by William Henry Giles Kingston, and Great Sea Stories
The Descent of Man | The Origin of Species, The Variation of Animals and Plants Under Domestication, and Darwinism by Alfred Russel Wallace
A Tale of Two Cities | Little Dorrit, Barnaby Rudge, Bleak House, and Our Mutual Friend

Matthew Jockers' excellent Macroanalysis presents an alternative, algorithmic way of identifying topics. He, too, chooses to separate nouns, but then uses a topic modeling technique, latent Dirichlet allocation, to categorize the nouns into weighted clusters. Further, he seems to have struck on the same overall approach of Euclidean distance, both in terms of stylometric measures and topic modeling categories, to relate texts by similarity. (Honestly, I did not discover Macroanalysis until I had most of Ashurbanipal written. I claim independent discovery.)

Combination

Having multiple recommendation techniques is nice, but combining them into a single, "best" recommendation would be most useful. Unfortunately, this is the part of the system that requires the most validation and I have had very little feedback. Currently, I am doing the simplest thing possible: I multiply the Euclidean distance in style by the Jaccard distance in topic; since the latter is always between 0 and 1, it serves to reduce the style distance proportionally to the topic distance.
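In code, the combination is a one-liner; the sketch below wraps it in a small ranking helper. The class name, the parallel-array layout, and the sample numbers in the note afterwards are invented for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class CombinedScore {
    /** Style distance scaled down by the topic distance, which lies between 0 and 1. */
    static double combine(double styleDistance, double topicDistance) {
        return styleDistance * topicDistance;
    }

    /** Titles ordered by combined score, smallest (best) first. */
    static List<String> rank(String[] titles, double[] style, double[] topic) {
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < titles.length; i++) {
            order.add(i);
        }
        order.sort(Comparator.comparingDouble((Integer i) -> combine(style[i], topic[i])));
        List<String> ranked = new ArrayList<>();
        for (int i : order) {
            ranked.add(titles[i]);
        }
        return ranked;
    }
}
```

Given a candidate at style distance 0.10 and topic distance 0.9, and another at style distance 0.20 and topic distance 0.3, the second wins: it is stylistically farther away but much closer in topic. One side effect of multiplying is that a topic distance of exactly zero wipes out the style distance entirely.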

You pays your money and you takes your chances.

Ashurbanipal, the code

Way back at the top, I mentioned run-tag-todolist, which calls the Java program TagTodoList.java. This program computes the part-of-speech and noun count data that is used to make recommendations, given a to-do list of etext numbers, language, content type, and file location information. In TagTodoList.java, the list of things to process goes down and around and eventually winds up in a thread pool running a Callable class called TaggerCallable.java. TaggerCallable reads the text file for English books out of the .zip file in the DVD image. The Project Gutenberg licensing and advertising information are stripped out (mostly successfully) by code stolen and translated from Clemens Wolff, and then the text is broken into approximately 10kb chunks to prevent the Stanford POS tagger from blowing through the heap. The actual break is made between paragraphs, or at least on an empty line which should be a paragraph transition. Going back to TaggerCallable, each of the fragments is processed by the tagger and the results are accumulated for each text.
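The chunk-then-process-in-a-pool shape described above can be sketched without the tagger. In the sketch below, counting whitespace-separated words stands in for the part-of-speech tagging; the class name, method names, and the character-based size target are assumptions, not the code in TaggerCallable.java.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ChunkedProcessing {
    /** Splits text into chunks of at least targetChars, breaking only after an empty line. */
    static List<String> chunk(String text, int targetChars) {
        List<String> chunks = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String line : text.split("\n", -1)) {
            current.append(line).append('\n');
            if (line.trim().isEmpty() && current.length() >= targetChars) {
                chunks.add(current.toString());
                current.setLength(0);
            }
        }
        if (!current.toString().trim().isEmpty()) {
            chunks.add(current.toString());
        }
        return chunks;
    }

    /** Stand-in for the tagger: counts whitespace-separated words in one chunk. */
    static int countWords(String chunk) {
        String trimmed = chunk.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    /** One task per text; each task accumulates results over that text's chunks. */
    static List<Integer> processAll(List<String> texts, int threads, int targetChars) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Integer>> tasks = new ArrayList<>();
            for (String text : texts) {
                tasks.add(() -> {
                    int words = 0;
                    for (String c : chunk(text, targetChars)) {
                        words += countWords(c);
                    }
                    return words;
                });
            }
            List<Integer> totals = new ArrayList<>();
            for (Future<Integer> f : pool.invokeAll(tasks)) {
                totals.add(f.get());
            }
            return totals;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Breaking only on empty lines means a chunk can overshoot the target by up to one paragraph, which is fine when the point is merely to keep any single tagger call from eating the whole heap.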

In case you're interested in such things, here is a skeleton of the code invoking the tagger:


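import java.util.List;

import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.ling.HasWord;
import edu.stanford.nlp.ling.TaggedWord;
import edu.stanford.nlp.process.CoreLabelTokenFactory;
import edu.stanford.nlp.process.DocumentPreprocessor;
import edu.stanford.nlp.process.Morphology;
import edu.stanford.nlp.process.PTBTokenizer;
import edu.stanford.nlp.process.TokenizerFactory;
import edu.stanford.nlp.tagger.maxent.MaxentTagger;
...
private final TokenizerFactory<CoreLabel> tokenizerFactory
    = PTBTokenizer.factory(new CoreLabelTokenFactory(),
                           "asciiQuotes,untokenizable=noneKeep");
private final MaxentTagger tagger
    = new MaxentTagger("english-left3words-distsim.tagger");
private final Morphology morphology = new Morphology();
...
{
    final DocumentPreprocessor documentPreprocessor
        = new DocumentPreprocessor(text);
    documentPreprocessor.setTokenizerFactory(tokenizerFactory);
    ...
    for (List<HasWord> sentence : documentPreprocessor) {
        for (TaggedWord word : tagger.tagSentence(sentence)) {
            // word count
            words++;
            // Penn Treebank tag assigned to this word
            final String tag = word.tag();
            // only singular and plural common nouns feed the topic data
            if ("NN".equals(tag) || "NNS".equals(tag)) {
                // get base form of word
                String lemma = morphology.stem(word).toString();
                if (lemma == null) {
                    lemma = word.toString();
                    ...
                }
                ...
            }
            ...
        }
        ...
    }
    ...
}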

I assembled that mess by looking at the commands from the Stanford POS tagger distribution and cut-n-pasting things from their source. It looks like it works, anyway.

As far as the programs using the collected data go, I have already discussed much of them in Reimplementing ashurbanipal.web in Rust, where I walked through the process of converting the Java servlets into a Rust program in order to reduce their memory footprint (and improve their speed). Command-line versions of the recommendation programs are also to be found in the Ashurbanipal project.

The prototype

The way Ashurbanipal recommendations work begins with a book selected by the user, something similar to which he or she would like to read. To use the prototype, go to the page, find a book that you know you like (using the search field at the upper left), and get style-based, topic-based, and combined recommendations for books which are in some way similar to your choice.

Here's the current web site: http://dpg.crsr.net.

To find a book, enter an author's name, a title, or some likely subject term (or one or more words from any of those) into the text field. The lovely and talented server will provide a drop-down list of possibilities; select one.

The information about the text you have selected will appear to the right of the text box, while the recommendations will appear in three rows below. Each list of recommendations can be scrolled left or right via the arrows; left indicates a higher recommendation and right a lower. Initially, the left-most visible book is the highest recommendation and the book you have selected will be the only thing to appear if you scroll left.

Click on a recommendation's title to select it as the base book for more recommendations, and on the Project Gutenberg link to go to the book's page at PG, where the book can be downloaded in a plethora of formats.

In the future, well, I have an idea for supporting plot recommendations and am actively looking for further attributes.
