Thursday, July 10, 2008

Ruby - Build My Own Dictionary

One day, I suddenly had the idea of building my own English dictionary. To make a dictionary, I needed a source of words for my database, and the source had to be in a form that would be convenient to manipulate later. After searching the internet, I found one website suitable as the source for my database: http://www.synonym.com/.

That website provides a search engine for synonyms, antonyms, and definitions. Users can also browse through lists of words categorized alphabetically.

After studying how the website works, I came up with 3 steps to build my own dictionary using Ruby.

Step #1: Generate a list of words

The website has browse pages where the words are categorized alphabetically by two-letter prefix, from "AA" to "ZZ". The URLs for the browse pages are shown below:
http://www.synonym.com/synonyms/
http://www.synonym.com/antonym/
http://www.synonym.com/definition/

The URLs for each prefix are straightforward. For example, synonyms starting with AA are listed on the page with this URL:
http://www.synonym.com/synonyms/browse/AA
The same rule applies to antonyms and definitions.

By looping from "AA" to "ZZ", I managed to list all the words available on that website. The total is 145291 words. Hmm... quite a big database. How long will Ruby take to build the whole thing?
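The loop over prefixes can be sketched roughly like this. It only builds the browse URLs for the synonyms section; fetching each page and pulling the words out of its HTML is left out, because the page structure is not described here. The names BASE_URL, PREFIXES and browse_urls are made up for illustration.

```ruby
# Browse URL pattern for the synonyms section, as given in the post.
BASE_URL = "http://www.synonym.com/synonyms/browse/"

# A Ruby string range steps through "AA", "AB", ..., "AZ", "BA", ..., "ZZ".
PREFIXES = ("AA".."ZZ").to_a

# Build one browse URL per two-letter prefix (hypothetical helper).
def browse_urls(base = BASE_URL, prefixes = PREFIXES)
  prefixes.map { |prefix| "#{base}#{prefix}" }
end

urls = browse_urls
puts urls.size    # 26 * 26 = 676 pages per section
puts urls.first
puts urls.last
```

Each of these URLs would then be fetched (for example with Net::HTTP from the standard library) and parsed to collect the words.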


Step #2: Retrieve each word's definition, synonym and antonym to build a database

After generating the word list, I need to retrieve the information about each word: its synonyms, antonyms, and definition. The way to retrieve this information is straightforward too. For example, the definitions of "account" and "account for" are on the pages with the URLs below, respectively:
http://www.synonym.com/definition/account/
http://www.synonym.com/definition/account%20for/
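Building these URLs from a word is mostly a matter of percent-encoding the space. A minimal sketch, assuming every word follows the same pattern as the two examples above; DEFINITION_BASE and definition_url are names I made up:

```ruby
require "erb"

# Definition URL prefix, taken from the two example URLs in the post.
DEFINITION_BASE = "http://www.synonym.com/definition/"

# Percent-encode the word (a space becomes %20) and append the trailing slash.
def definition_url(word)
  "#{DEFINITION_BASE}#{ERB::Util.url_encode(word)}/"
end

puts definition_url("account")
puts definition_url("account for")
```

ERB::Util.url_encode is used here because it encodes a space as %20, which matches the example URL; URI.encode_www_form_component would produce "+" instead.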

I estimated that I would need at least 160 hours (8 hours * 20 days) to finish retrieving the information. But if I split the word list into a number of smaller word lists and process them at the same time, the retrieval finishes much more quickly. For example, 20 small word lists are estimated to need only 8 hours (or 1 day) to finish.
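The arithmetic behind that estimate, plus the splitting itself, can be sketched as below. It assumes the 20 lists are processed at the same time (separate processes or machines) and that every word takes about the same time to fetch. The placeholder words and the split_words helper are made up for illustration.

```ruby
TOTAL_WORDS = 145_291   # word count from Step #1
TOTAL_HOURS = 160       # 8 hours * 20 days
PARTS       = 20        # number of small word lists

# Average time per word implied by the 160-hour estimate.
seconds_per_word = TOTAL_HOURS * 3600.0 / TOTAL_WORDS

# Split a list into roughly equal parts (hypothetical helper).
def split_words(words, parts)
  slice_size = [(words.size / parts.to_f).ceil, 1].max
  words.each_slice(slice_size).to_a
end

# Placeholder words stand in for the real list.
words  = (1..TOTAL_WORDS).map { |i| "word#{i}" }
chunks = split_words(words, PARTS)

# Time for the largest list, which bounds the total when all lists run together.
hours_per_chunk = chunks.map(&:size).max * seconds_per_word / 3600

puts "seconds per word: #{seconds_per_word.round(2)}"
puts "lists: #{chunks.size}, largest list: #{chunks.map(&:size).max} words"
puts "hours per list: #{hours_per_chunk.round(1)}"
```

Each small list could then be written to its own file and handed to a separate run of the retrieval script.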


Step #3: Post-process the database to suit my own needs

After I have the "raw" database of all words, I need to manipulate, filter, transform, and categorize each word to get the style and format that I want. In this step, Ruby did a great job!


Some interesting information from http://www.synonym.com/

  • The longest word is tetrabromo-phenolsulfonephthalein (without spaces), or
    blood-oxygenation level dependent functional magnetic resonance imaging (with spaces).
  • Most words start with the letter 'S' (15700), followed by 'C' (14038) and 'P' (11319). The fewest words start with 'X' (174), followed by 'Z' (390) and 'Y' (503).
  • By part of speech: nouns (75.61%), adjectives (13.90%), verbs (7.58%) and adverbs (2.91%).
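Figures like the first two can be computed from a plain array of words. Here is a minimal sketch on a tiny sample list, not the real database; the sample words and the helper names are made up for illustration.

```ruby
# Tiny sample list; the real input would be the full word list from Step #1.
sample = ["sun", "sea", "cat", "xylophone", "account for",
          "tetrabromo-phenolsulfonephthalein"]

# Longest entry in a list of words (nil for an empty list).
def longest_word(words)
  words.max_by(&:length)
end

# Count words by their upper-cased first letter, most common letter first.
def first_letter_counts(words)
  counts = Hash.new(0)
  words.each { |word| counts[word[0].upcase] += 1 }
  counts.sort_by { |letter, count| [-count, letter] }
end

# Separate entries containing a space from single-token entries.
with_space, without_space = sample.partition { |word| word.include?(" ") }

puts "longest without space: #{longest_word(without_space)}"
puts "longest with space:    #{longest_word(with_space)}"
first_letter_counts(sample).each { |letter, count| puts "#{letter}: #{count}" }
```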
