Sentence classification and named identity detection with automatic retraining

Started by
2 comments, last by markypooch 5 years, 9 months ago

Hi Folks,

I am learning artificial intelligence and trying out my first real-life AI application. I want to take various sentences as input and classify each one into one of X categories, based on keywords and the 'action' in the sentence.

The keywords are, for example, merger, acquisition, award, product launch, etc. In essence, I am trying to detect whether a given sentence talks about a merger between two organizations, an acquisition by an organization, a person or an organization winning an award, the launch of a new product, and so on.

To do this, I have built a custom model for each keyword, based on the basic NLTK package model, and I am trying to improve classification by dynamically tagging/updating the models with related keywords, synonyms, etc. Also, given a set of sentences, I present the user with the detected categorization and ask whether it is correct; if it is wrong, the user supplies the correct categorization and also identifies the entities.

So the objective is to first classify the sentence into a category and then detect the named entities in the sentence, based on that category.

The idea is to automatically retrain the models based on this feedback, improving their performance over time with as little manual intervention as possible. For the sake of this project, we can assume that user feedback is accurate.

The problem I am facing is that NLTK only allows fixed-length entities during training, so, for example, a two-word award is detected as two awards.

What should my approach be to solve this problem? Is there a better NLU (even a commercial one) that can address it? It seems to me that this would be a common AI problem and that I am missing something basic. I would love your input on this.

Thanks & Regards

Camillelola


Please note that this is a "game AI" forum. 99% of the people here wouldn't have any clue about the problem, and of those that do, perhaps 25% might have an inkling.

Regardless, a game AI forum is not a great place to ask questions about non-game AI.

Dave Mark - President and Lead Designer of Intrinsic Algorithm LLC
Professional consultant on game AI, mathematical modeling, simulation modeling
Co-founder and 10 year advisor of the GDC AI Summit
Author of the book, Behavioral Mathematics for Game AI
Blogs I write:
IA News - What's happening at IA | IA on AI - AI news and notes | Post-Play'em - Observations on AI of games I play

"Reducing the world to mathematical equations!"

Hey there,

IADave has a point: typically this forum is reserved for game AI, not natural language processing or other forms of knowledge-based mining algorithms. I haven't worked much with NLTK in Python, but I have used Naive Bayes/ID3 to perform sentiment analysis on sentences.

It sounds like you are working with a supervised model. That is, you present your algorithm with labeled training examples of what constitutes each classification. The disconnect seems to be that you aren't using a bag-of-words approach but a keyword approach. How would your current algorithm attempt to classify "Our engineers will be launching a rocket into low-earth orbit this afternoon."?

Unless you know in advance the type of sentences your algorithm will be expected to classify, fixating on keywords that you think are relevant may not be the best approach. Instead, I'd advise a simple occurrence + bag-of-words approach. That is, keep track of all unique words and their occurrences in your training data in relation to the training label (acquisition, merger, launch, etc.), remove stop words (and, its, the, etc.), perform stemming on your words ((programmer, programming) == program), and present that data to your algorithm so it can determine which features of a sentence, given the training data, indicate label 'x'.

Quote

The problem I am facing is that NLTK only allows fixed-length entities during training, so, for example, a two-word award is detected as two awards.

I don't know the level of abstraction you are working with, but that doesn't sound like an issue you should have if you are working with a bag-of-words approach.

Quote

What should my approach be to solve this problem? Is there a better NLU (even a commercial one) that can address it? It seems to me that this would be a common AI problem and that I am missing something basic. I would love your input on this.

NLTK is probably one of the easier frameworks out there for quickly performing natural language classification. I think it might help you not to start with a high-level framework, but to actually implement an algorithm yourself for learning purposes. Take a peek at the following link to implement a simple supervised classifier yourself:

https://monkeylearn.com/blog/practical-explanation-naive-bayes-classifier/
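For a rough idea of what implementing it yourself looks like, here is a small multinomial Naive Bayes sketch with add-one (Laplace) smoothing. It is not taken from the linked article; the `NaiveBayes` class and the training sentences are hypothetical, and stop-word removal and stemming are left out to keep it short.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

class NaiveBayes:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of sentences
        self.vocab = set()

    def train(self, examples):
        for text, label in examples:
            self.doc_counts[label] += 1
            for word in tokenize(text):
                self.word_counts[label][word] += 1
                self.vocab.add(word)

    def classify(self, text):
        words = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            # Log prior plus summed log likelihoods avoids numeric underflow.
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in words:
                count = self.word_counts[label][word]
                # Add-one smoothing so unseen words don't zero out a label.
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Made-up training data, purely for illustration.
TRAINING = [
    ("Acme Corp announced a merger with Globex", "merger"),
    ("The two banks agreed to merge next year", "merger"),
    ("Initech acquired a small startup", "acquisition"),
    ("The acquisition of the studio was completed", "acquisition"),
    ("The studio is launching a new product today", "launch"),
    ("The company launched its latest phone", "launch"),
]

clf = NaiveBayes()
clf.train(TRAINING)
print(clf.classify("They are launching a new product"))
```

With this little data the scores are very close together, so treat it as a learning aid only. Once you have a few hundred labeled sentences per category, the same idea starts to become useful.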

 

This topic is closed to new replies.
