Looking for a webservice for automatic text / document classification


As we are involved in the development of web-based knowledge management systems (KMSs), we would like to know whether web services are available that can help automatically tag documents / ASCII files. Our own metadata platform can handle the creation of taxonomies and vocabularies, as well as storing the text (fragments) that have to be tagged. What is missing is a way to link the right taxonomy items to the documents. Basically, it should work as follows (I guess):

stage 1 : creation of taxonomies / vocabularies
stage 2 : uploading a training set of documents
stage 3 : manual tagging of documents = training
stage 4 : uploading documents to be tagged
stage 5 : automatic classification (a document is tagged only when certainty of classification is above a threshold)
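
The five stages above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses a plain multinomial naive Bayes model (not any particular vendor's algorithm), a whitespace tokenizer, and a hypothetical class name `NaiveBayesTagger` with an arbitrary threshold of 0.8. The point is stage 5: a document receives a tag only when the estimated certainty reaches the threshold.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase whitespace tokenizer; a real system would do far more."""
    return text.lower().split()


class NaiveBayesTagger:
    """Hypothetical multinomial naive Bayes tagger with a certainty threshold."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold
        self.word_counts = defaultdict(Counter)  # tag -> word frequencies
        self.doc_counts = Counter()              # tag -> number of training docs
        self.vocabulary = set()

    def train(self, text, tag):
        """Stages 2-3: add one manually tagged training document."""
        words = tokenize(text)
        self.word_counts[tag].update(words)
        self.doc_counts[tag] += 1
        self.vocabulary.update(words)

    def probabilities(self, text):
        """Return the estimated probability of each known tag for the text."""
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        log_scores = {}
        for tag, n_docs in self.doc_counts.items():
            total_words = sum(self.word_counts[tag].values())
            score = math.log(n_docs / total_docs)
            for word in tokenize(text):
                # Laplace (add-one) smoothing avoids zero probabilities.
                count = self.word_counts[tag][word]
                score += math.log((count + 1) / (total_words + vocab_size))
            log_scores[tag] = score
        # Normalise in log space so the probabilities sum to 1.
        peak = max(log_scores.values())
        exp_scores = {t: math.exp(s - peak) for t, s in log_scores.items()}
        norm = sum(exp_scores.values())
        return {t: v / norm for t, v in exp_scores.items()}

    def tag(self, text):
        """Stage 5: return the best tag, or None if certainty is too low."""
        probs = self.probabilities(text)
        best = max(probs, key=probs.get)
        return best if probs[best] >= self.threshold else None


# Stage 1: the taxonomy here is just two made-up items, "finance" and "it".
tagger = NaiveBayesTagger(threshold=0.8)
tagger.train("invoice payment due amount tax", "finance")
tagger.train("quarterly budget revenue tax report", "finance")
tagger.train("server outage network router reboot", "it")
tagger.train("install software patch server update", "it")

# Stages 4-5: new documents are tagged only above the threshold.
print(tagger.tag("tax invoice for the quarterly payment"))  # finance
print(tagger.tag("meeting"))                                # None
```

The second document contains no word seen in training, so both tags come out equally likely (0.5) and the document is left untagged rather than guessed at.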

Tom,

With training sets as part of your requirements, it appears that you are looking for more "intelligent" classification based on latent semantic algorithms or similar. If concepts and "conceptual search" are the end goal, and the conceptual search aspect is a very important piece of your overall system, you may have to look into commercial solutions that you can run yourself (OEM) and that have an API which you can wrap with web services to expose the functions you need.
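
As a rough illustration of what wrapping an engine's API with a web service could look like, here is a minimal sketch using only the Python standard library. Everything specific in it is an assumption: `classify` is a placeholder standing in for whatever call a vendor's API actually provides, and the JSON shapes (`{"text": ...}` in, `{"tags": [...]}` out) and the port number are invented for the example.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def classify(text):
    """Placeholder for the vendor engine's API call (hypothetical)."""
    # Toy rule so the sketch runs without any vendor software installed.
    return ["finance"] if "invoice" in text.lower() else []


def handle_request(body):
    """Turn a JSON request body into an (HTTP status, JSON response) pair."""
    error = json.dumps({"error": "expected a JSON object with a 'text' string"})
    try:
        payload = json.loads(body)
    except ValueError:
        return 400, error
    if not isinstance(payload, dict) or not isinstance(payload.get("text"), str):
        return 400, error
    return 200, json.dumps({"tags": classify(payload["text"])})


class TaggingHandler(BaseHTTPRequestHandler):
    """Exposes handle_request over HTTP POST."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        status, response = handle_request(self.rfile.read(length))
        data = response.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def serve(port=8080):
    """Run the wrapper until interrupted (port is an arbitrary choice)."""
    HTTPServer(("localhost", port), TaggingHandler).serve_forever()
```

Keeping the request logic in `handle_request`, separate from the HTTP handler, means the engine behind `classify` can be swapped without touching the service layer.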

The training sets alone pose some difficulty, since I would imagine that the training set payloads would be quite large: the full text of a month's worth of foreign-language newspapers might be needed to train the system.

You can't get away from the I/O on the back end for processing; it has to run someplace. That might as well be the same place where you run the document tagging/categorization. Put another way, these should not be separate systems but the same technology responsible for index and search. You will get different mileage from any of these technologies depending on your volume requirements.

For example, I needed to apply concept search algorithms, along with index and search, to large pools of documents. The documents averaged 250K in size and totaled 10 million per day. I did not want two systems, one for search and index and another for concept search. After testing all of the largest commercial vendors, some of the smaller ones, and some startups, we could not find any that could handle our volume requirements, at least not without a ridiculous amount of additional infrastructure, which would make a commercial offering cost-prohibitive, not to mention very difficult to support. We reduced the testing requirements to a million documents, to be processed within 10 hours on a single CPU, and even that became a stretch. I realized that the goal of having a single application for search and index that included concept search was too lofty for the time, and that a search-and-index system plus a specialized concept search adapter would have to be developed in order to use different engines against smaller subsets of documents.
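
To put those volumes in perspective, here is a back-of-the-envelope calculation of the sustained rates they imply. It assumes that "250K" means 250 kilobytes (taken as 250,000 bytes) and that the load is spread evenly over the time window; the function and constant names are made up for the example.

```python
SECONDS_PER_HOUR = 3600
SECONDS_PER_DAY = 24 * SECONDS_PER_HOUR
AVG_DOC_BYTES = 250_000  # assumption: "250K" means 250 kilobytes


def required_rate(documents, seconds):
    """Return (documents per second, megabytes per second) for a workload."""
    docs_per_second = documents / seconds
    mb_per_second = docs_per_second * AVG_DOC_BYTES / 1_000_000
    return docs_per_second, mb_per_second


full = required_rate(10_000_000, SECONDS_PER_DAY)          # production volume
reduced = required_rate(1_000_000, 10 * SECONDS_PER_HOUR)  # reduced test

print(f"full load:    {full[0]:.1f} docs/s, {full[1]:.1f} MB/s")
# full load:    115.7 docs/s, 28.9 MB/s
print(f"reduced test: {reduced[0]:.1f} docs/s, {reduced[1]:.1f} MB/s")
# reduced test: 27.8 docs/s, 6.9 MB/s
```

Under these assumptions the full load is about 2.5 TB of text per day, and even the reduced test leaves roughly 36 milliseconds of single-CPU time per document.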

The evaluation exercise did convince me of one thing: the need for training sets is a problem. If you are building a "service" where you do everything in the background, that might be OK. But if you are building a commercial offering in which you license the software, then training sets are painful. There are algorithms (see Recommind) that do not need training sets. A few other vendors are listed below. Autonomy is an obvious vendor, although pretty expensive the last time I checked.

Good Luck,
Peter
October 2008

* http://www.exalead.com
* http://www.recommind.com/
* http://www.teragram.com/

