Experience Level: Intermediate
The project reads bookmarks from a web browser and extracts information such as the title and URL of each bookmark. As an alternative to reading them directly from the browser, the bookmarks can be read from a database.

For each bookmarked URL, the program crawls the site, reads its content, and saves it as a text document in a MySQL database, with HTML, style and script tags, punctuation, numbers, etc. removed. From these saved documents, term frequencies and inverse document frequencies are generated and saved in the database. This will probably require a bookmarks table (fields: url, title, sizeofdocument, etc.) and a words table (id, term, etc.) recording which word appears in which document. Because bookmarks and words have a many-to-many relationship, an additional linking table may be needed, e.g. bookmark_words.

I would then like the program to compare the documents (URLs) based on the frequencies of the words in each, using some form of term weighting (many examples are available on the Internet), and to group bookmarks according to their semantic closeness, based on the words they share. The result is then shown in a graph illustrating which bookmarks have similar or nearly similar content. One simple technique is a tag cloud: take, say, the first 150 words and look up the bookmarks that contain them. The techniques used in information retrieval to match similar documents can be applied here too.
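To make the text-cleaning and term-weighting steps concrete, here is a minimal sketch in Python using only the standard library. It is one possible reading of the description, not a required design. The function names (html_to_text, tokenize, tf_idf_vectors, cosine), the example.com URLs, the choice of length-normalised term frequency with log(N / df) as the inverse document frequency, and the use of cosine similarity to compare documents are all assumptions made for illustration. Crawling and MySQL storage are left out.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and ignores the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Strip tags, scripts and styles from an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def tokenize(text):
    """Lowercase the text and keep alphabetic words only (drops numbers and punctuation)."""
    return re.findall(r"[a-z]+", text.lower())


def tf_idf_vectors(docs):
    """docs maps URL -> plain text; returns URL -> {term: tf-idf weight}."""
    counts = {url: Counter(tokenize(text)) for url, text in docs.items()}
    n_docs = len(counts)
    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for term_counts in counts.values():
        df.update(term_counts.keys())
    vectors = {}
    for url, term_counts in counts.items():
        total = sum(term_counts.values()) or 1
        vectors[url] = {
            term: (n / total) * math.log(n_docs / df[term])
            for term, n in term_counts.items()
        }
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical bookmarks; real input would come from the crawler.
docs = {
    "http://example.com/a": html_to_text("<p>Python web crawler tutorial</p>"),
    "http://example.com/b": html_to_text("<p>Python crawler for web pages</p>"),
    "http://example.com/c": html_to_text("<p>Chocolate cake recipe</p>"),
}
vectors = tf_idf_vectors(docs)
```

Calling cosine(vectors[u1], vectors[u2]) for every pair of URLs gives a similarity matrix that a clustering or graph-drawing step could use to group related bookmarks. In a database-backed version, the per-document counts would presumably live in the bookmark_words table rather than in memory.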