{"nbformat":4,"nbformat_minor":0,"metadata":{"kernelspec":{"display_name":"Python 3","language":"python","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.7.6"},"colab":{"name":"Sentiment_classification_scikit_learn.ipynb","provenance":[{"file_id":"1zV24gqXke5eJXNkbWNLpfQCxUfcemx-T","timestamp":1607100651393}]}},"cells":[{"cell_type":"markdown","metadata":{"id":"W7WERJpaqbom"},"source":["# Sentiment Classification\n","\n","https://scikit-learn.org/dev/tutorial/text_analytics/working_with_text_data.html\n","\n","Download the \"Sentiment Polarity Dataset Version 2.0\" from http://www.nltk.org/nltk_data/ and extract it into the folder set in `moviedir` below. \n","The dataset zip file is available at https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/movie_reviews.zip"]},{"cell_type":"code","metadata":{"id":"0lmjTveWqbon"},"source":["import sklearn\n","from sklearn.datasets import load_files\n","moviedir = r'./movies'"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"yhhoLUBeqbop"},"source":["# Data loading and preparation\n","### Load the dataset and inspect its contents"]},{"cell_type":"code","metadata":{"id":"7isCrTGJqbop","outputId":"663daa72-941a-47c7-de80-4a9ecc633b9b"},"source":["movie = load_files(moviedir, shuffle=True)\n","len(movie.data)"],"execution_count":null,"outputs":[{"output_type":"execute_result","data":{"text/plain":["2000"]},"metadata":{"tags":[]},"execution_count":3}]},{"cell_type":"code","metadata":{"id":"PNqhlVeXqbos","outputId":"9420c071-2c5f-40fb-f565-ae449e288974"},"source":["movie.target_names"],"execution_count":null,"outputs":[{"output_type":"execute_result","data":{"text/plain":["['neg', 
'pos']"]},"metadata":{"tags":[]},"execution_count":4}]},{"cell_type":"code","metadata":{"id":"1_QC1O1hqbos","outputId":"c0a28499-56fd-4a95-9785-5e835a5e193f"},"source":["movie.data[0]"],"execution_count":null,"outputs":[{"output_type":"execute_result","data":{"text/plain":["b\"arnold schwarzenegger has been an icon for action enthusiasts , since the late 80's , but lately his films have been very sloppy and the one-liners are getting worse . \\nit's hard seeing arnold as mr . freeze in batman and robin , especially when he says tons of ice jokes , but hey he got 15 million , what's it matter to him ? \\nonce again arnold has signed to do another expensive blockbuster , that can't compare with the likes of the terminator series , true lies and even eraser . \\nin this so called dark thriller , the devil ( gabriel byrne ) has come upon earth , to impregnate a woman ( robin tunney ) which happens every 1000 years , and basically destroy the world , but apparently god has chosen one man , and that one man is jericho cane ( arnold himself ) . \\nwith the help of a trusty sidekick ( kevin pollack ) , they will stop at nothing to let the devil take over the world ! \\nparts of this are actually so absurd , that they would fit right in with dogma . \\nyes , the film is that weak , but it's better than the other blockbuster right now ( sleepy hollow ) , but it makes the world is not enough look like a 4 star film . \\nanyway , this definitely doesn't seem like an arnold movie . \\nit just wasn't the type of film you can see him doing . \\nsure he gave us a few chuckles with his well known one-liners , but he seemed confused as to where his character and the film was going . \\nit's understandable , especially when the ending had to be changed according to some sources . \\naside form that , he still walked through it , much like he has in the past few films . \\ni'm sorry to say this arnold but maybe these are the end of your action days . 
\\nspeaking of action , where was it in this film ? \\nthere was hardly any explosions or fights . \\nthe devil made a few places explode , but arnold wasn't kicking some devil butt . \\nthe ending was changed to make it more spiritual , which undoubtedly ruined the film . \\ni was at least hoping for a cool ending if nothing else occurred , but once again i was let down . \\ni also don't know why the film took so long and cost so much . \\nthere was really no super affects at all , unless you consider an invisible devil , who was in it for 5 minutes tops , worth the overpriced budget . \\nthe budget should have gone into a better script , where at least audiences could be somewhat entertained instead of facing boredom . \\nit's pitiful to see how scripts like these get bought and made into a movie . \\ndo they even read these things anymore ? \\nit sure doesn't seem like it . \\nthankfully gabriel's performance gave some light to this poor film . \\nwhen he walks down the street searching for robin tunney , you can't help but feel that he looked like a devil . \\nthe guy is creepy looking anyway ! \\nwhen it's all over , you're just glad it's the end of the movie . \\ndon't bother to see this , if you're expecting a solid action flick , because it's neither solid nor does it have action . \\nit's just another movie that we are suckered in to seeing , due to a strategic marketing campaign . \\nsave your money and see the world is not enough for an entertaining experience . 
\\n\""]},"metadata":{"tags":[]},"execution_count":6}]},{"cell_type":"code","metadata":{"id":"1uRNOjFfqbot","outputId":"10073f16-e0e9-4183-9228-e82bce6d8a64"},"source":["movie.target[0]"],"execution_count":null,"outputs":[{"output_type":"execute_result","data":{"text/plain":["0"]},"metadata":{"tags":[]},"execution_count":7}]},{"cell_type":"markdown","metadata":{"id":"Y3L8nQSHqbou"},"source":["### Split the data between train and test"]},{"cell_type":"code","metadata":{"id":"I5mdohrAqbou"},"source":["import nltk\n","from sklearn.feature_extraction.text import CountVectorizer\n","from sklearn.feature_extraction.text import TfidfTransformer\n","from sklearn.model_selection import train_test_split\n","\n","docs_train, docs_test, y_train, y_test = train_test_split(movie.data, movie.target, \n"," test_size = 0.20, random_state = 12)\n"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"h8AJdN7cqbou"},"source":["### Compute word dictionaries and word-doc frequencies matrix"]},{"cell_type":"code","metadata":{"id":"SBFj32GZqbov","outputId":"87a258fe-a37f-4c3c-f106-1c2d827a3dac"},"source":["movieVzer= CountVectorizer(min_df=2, tokenizer=nltk.word_tokenize, max_features=3000)\n","docs_train_counts = movieVzer.fit_transform(docs_train)\n","docs_test_counts = movieVzer.transform(docs_test)\n","docs_train_counts.shape"],"execution_count":null,"outputs":[{"output_type":"execute_result","data":{"text/plain":["(1600, 3000)"]},"metadata":{"tags":[]},"execution_count":9}]},{"cell_type":"markdown","metadata":{"id":"XuOjdVjhqbov"},"source":["### TF-IDF weighting\n"]},{"cell_type":"code","metadata":{"id":"5vp4a73yqbov"},"source":["movieTfmer = TfidfTransformer()\n","docs_train_tfidf = movieTfmer.fit_transform(docs_train_counts)\n","docs_test_tfidf = movieTfmer.transform(docs_test_counts)"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"H9IcWEGhqbov"},"source":["# Model Training\n","\n"," 
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html\n"," \n"," - Parameters relevant to class imbalance: class_weight.\n"," - Parameters relevant to regularization: penalty, C.\n"," - Parameters relevant to stopping criteria: tol, max_iter.\n"]},{"cell_type":"code","metadata":{"id":"dj2vOi1gqbov"},"source":["from sklearn.linear_model import LogisticRegression\n","clf = LogisticRegression(random_state=0).fit(docs_train_tfidf, y_train)"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"1W0eH4HFqbow"},"source":["The learned model coefficients are stored in the `coef_` attribute:"]},{"cell_type":"code","metadata":{"id":"ep2qbBaXqbow","outputId":"bc07c949-8a9b-4770-ff9b-ec7570c76f6f"},"source":["print(clf.coef_)"],"execution_count":null,"outputs":[{"output_type":"stream","text":["[[-0.98246218 -0.02484443 -0.07597804 ... 0.08979332 0.07692553\n"," 0.02427724]]\n"],"name":"stdout"}]},{"cell_type":"markdown","metadata":{"id":"PHVzdgpaqbow"},"source":["## Evaluation\n","\n","https://scikit-learn.org/stable/modules/model_evaluation.html"]},{"cell_type":"code","metadata":{"id":"NV69B6uCqbow"},"source":["import numpy as np\n","from sklearn.metrics import classification_report\n","predict_train = clf.predict(docs_train_tfidf)\n","predicted_test = clf.predict(docs_test_tfidf)"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"6_mECfl5qbow"},"source":["### Training results"]},{"cell_type":"code","metadata":{"id":"CSQr77Hfqbow","outputId":"ed7fbf42-a840-4a7c-be39-59b0ba73ad67"},"source":["target_names = ['neg', 'pos']\n","print(classification_report(predict_train, y_train, target_names=target_names))"],"execution_count":null,"outputs":[{"output_type":"stream","text":[" precision recall f1-score support\n","\n"," neg 0.90 0.91 0.90 785\n"," pos 0.91 0.90 0.91 815\n","\n"," accuracy 0.91 1600\n"," macro avg 0.91 0.91 0.91 1600\n","weighted avg 0.91 0.91 0.91 
1600\n","\n"],"name":"stdout"}]},{"cell_type":"markdown","metadata":{"id":"JxEvKYOnqbox"},"source":["### Test results"]},{"cell_type":"code","metadata":{"id":"AZovxTThqbox","outputId":"06d15880-de83-4d89-cf3b-dc6e02ca8c7c"},"source":["print(classification_report(predicted_test, y_test, target_names=target_names))"],"execution_count":null,"outputs":[{"output_type":"stream","text":[" precision recall f1-score support\n","\n"," neg 0.73 0.81 0.77 186\n"," pos 0.82 0.74 0.78 214\n","\n"," accuracy 0.78 400\n"," macro avg 0.78 0.78 0.77 400\n","weighted avg 0.78 0.78 0.78 400\n","\n"],"name":"stdout"}]}]}