{"nbformat":4,"nbformat_minor":0,"metadata":{"kernelspec":{"name":"python3","display_name":"Python 3"},"colab":{"name":"RI2020_entities_lab_students.ipynb","provenance":[{"file_id":"1SxajcE0YPz-qq6HGUiDlkYs2P2MPaTEx","timestamp":1607101158870},{"file_id":"1k12usDT9BvmzPkP6T9viPkTyT23UUzP-","timestamp":1606324780304},{"file_id":"1mk6PTIpHJa7k1IMMfQL2EqxyhZXOY2GK","timestamp":1606232878279},{"file_id":"19dXRLvO_FrtOLyvaX1JAaPmOX8XGPCgy","timestamp":1605565248930}]}},"cells":[{"cell_type":"code","metadata":{"id":"YUoz0D6z-CZ3"},"source":["!pip install transformers\n","!pip install spacy\n","!python -m spacy download en_core_web_sm"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"RZdcEAAk8x3m"},"source":["# NER using spaCy\n","\n","spaCy's pretrained NER model recognizes, among others, the following categories:\n","\n","- PERSON: names of people.\n","- GPE: geopolitical entities such as countries, cities, and states.\n","- ORG: organizations or companies.\n","- WORK_OF_ART: titles of books, films, songs, and other works of art.\n","- PRODUCT: products such as vehicles, food items, and furniture.\n","- EVENT: historical events such as wars and disasters.\n","- LANGUAGE: any named language."]},{"cell_type":"code","metadata":{"id":"QwMgsYe2e692"},"source":["sample_treccast2019_topic = {\n"," \"number\": 1,\n"," \"description\": \"Considering career options for becoming a physician\\u0027s assistant vs a nurse. 
Discussion topics include required education (including time, cost), salaries, and which is better overall.\",\n"," \"turn\": [\n"," {\n"," \"number\": 1,\n"," \"raw_utterance\": \"What is a physician\\u0027s assistant?\"\n"," },\n"," {\n"," \"number\": 2,\n"," \"raw_utterance\": \"What are the educational requirements required to become one?\"\n"," },\n"," {\n"," \"number\": 3,\n"," \"raw_utterance\": \"What does it cost?\"\n"," },\n"," {\n"," \"number\": 4,\n"," \"raw_utterance\": \"What\\u0027s the average starting salary in the UK?\"\n"," },\n"," {\n"," \"number\": 5,\n"," \"raw_utterance\": \"What about in the US?\"\n"," },\n"," {\n"," \"number\": 6,\n"," \"raw_utterance\": \"What school subjects are needed to become a registered nurse?\"\n"," },\n"," {\n"," \"number\": 7,\n"," \"raw_utterance\": \"What is the PA average salary vs an RN?\"\n"," },\n"," {\n"," \"number\": 8,\n"," \"raw_utterance\": \"What the difference between a PA and a nurse practitioner?\"\n"," },\n"," {\n"," \"number\": 9,\n"," \"raw_utterance\": \"Do NPs or PAs make more?\"\n"," },\n"," {\n"," \"number\": 10,\n"," \"raw_utterance\": \"Is a PA above a NP?\"\n"," },\n"," {\n"," \"number\": 11,\n"," \"raw_utterance\": \"What is the fastest way to become a NP?\"\n"," },\n"," {\n"," \"number\": 12,\n"," \"raw_utterance\": \"How much longer does it take to become a doctor after being an NP?\"\n"," }\n"," ],\n"," \"title\": \"Career choice for Nursing and Physician\\u0027s Assistant\"\n"," }"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"FeaPJuFl8x4o"},"source":["passages = [\n"," \"A physician assistant in the United States, Canada and other select countries or physician associate in the United Kingdom (PA) is an Advanced Practice Provider (APP).\",\n"," \"PAs are medical professionals who diagnose illness, develop and manage treatment plans, prescribe medications, and often serve as a patient’s principal healthcare provider.\",\n"," \"Jim Kenney, the Democratic 
mayor of Philadelphia, has just spoken at a press conference with election officials.\",\n"," \"The input to BERT is a sequence of words, and the output is a sequence of vectors.\"\n","]"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"kle1ScWO8x30","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1606241064549,"user_tz":0,"elapsed":1469,"user":{"displayName":"Gustavo Goncalves","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GiLwNLTlOO2DvPqQk-wE10aRoIPjYWBSq4eK2QGUGU=s64","userId":"14277131920487386112"}},"outputId":"a3ca16fb-d1ad-4b11-bce9-a7dba13d77d8"},"source":["import pprint\n","import spacy\n","\n","nlp = spacy.load('en_core_web_sm')\n","\n","passages_entities = [nlp(p).ents for p in passages]\n","pprint.pprint(passages_entities)"],"execution_count":null,"outputs":[{"output_type":"stream","text":["[(the United States, Canada, the United Kingdom, APP),\n"," (),\n"," (Jim Kenney, Democratic, Philadelphia),\n"," (BERT,)]\n"],"name":"stdout"}]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"sOBqXdiT-Zu6","executionInfo":{"status":"ok","timestamp":1606241369879,"user_tz":0,"elapsed":789,"user":{"displayName":"Gustavo Goncalves","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GiLwNLTlOO2DvPqQk-wE10aRoIPjYWBSq4eK2QGUGU=s64","userId":"14277131920487386112"}},"outputId":"8af8c1bd-f5b2-4cdf-9a9f-ded16bb0eded"},"source":["entities_per_passage = []\n","for passage_entities in passages_entities:\n"," entities_per_passage.append([[(e.text, e.start_char, e.end_char, e.label_) for e in passage_entities]])\n","pprint.pprint(entities_per_passage)"],"execution_count":null,"outputs":[{"output_type":"stream","text":["[[[('the United States', 25, 42, 'GPE'),\n"," ('Canada', 44, 50, 'GPE'),\n"," ('the United Kingdom', 104, 122, 'GPE'),\n"," ('APP', 162, 165, 'ORG')]],\n"," [[]],\n"," [[('Jim Kenney', 0, 10, 'PERSON'),\n"," ('Democratic', 16, 26, 'NORP'),\n"," 
('Philadelphia', 36, 48, 'GPE')]],\n"," [[('BERT', 13, 17, 'ORG')]]]\n"],"name":"stdout"}]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"34I2kdXLZco9","executionInfo":{"status":"ok","timestamp":1606242427950,"user_tz":0,"elapsed":809,"user":{"displayName":"Gustavo Goncalves","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GiLwNLTlOO2DvPqQk-wE10aRoIPjYWBSq4eK2QGUGU=s64","userId":"14277131920487386112"}},"outputId":"e43a5549-a9ca-430f-b1ab-68b2938ec170"},"source":["queries_entities = []\n","for i in range(len(sample_treccast2019_topic['turn'])):\n"," utterance = sample_treccast2019_topic['turn'][i]['raw_utterance']\n"," entities = nlp(utterance).ents\n"," queries_entities.append([[(e.text, e.start_char, e.end_char, e.label_) for e in entities]])\n","pprint.pprint(queries_entities)"],"execution_count":null,"outputs":[{"output_type":"stream","text":["[[[]],\n"," [[]],\n"," [[]],\n"," [[('UK', 42, 44, 'GPE')]],\n"," [[('US', 18, 20, 'GPE')]],\n"," [[]],\n"," [[('RN', 36, 38, 'ORG')]],\n"," [[]],\n"," [[]],\n"," [[('NP', 16, 18, 'ORG')]],\n"," [[('NP', 36, 38, 'ORG')]],\n"," [[('NP', 63, 65, 'ORG')]]]\n"],"name":"stdout"}]},{"cell_type":"markdown","metadata":{"id":"d5agx8lcTGbS"},"source":["## Boosting Queries on Elasticsearch\n","\n","The following function is a simple implementation of entity boosting at query time against the Elasticsearch index: the full query text and each entity become `should` clauses of a `bool` query, and each entity clause carries its own boost.\n","\n","**You should copy this function to your ElasticSearchSimpleAPI.py file.**"]},{"cell_type":"code","metadata":{"id":"PTDAyspiTL1-"},"source":["# entities_query_template = {\"query\": {\"bool\": {\"should\": [{\"match\": {\"body\": {\"query\": \"Neverending Story\", \"boost\": 1.0}}}, {\"match\": {\"body\": \"Tell me about the Neverending Story film.\"}}]}}}\n","\n","# !pip install elasticsearch\n","# import ElasticSearchSimpleAPI as es\n","\n","def search_with_boosted_entities(query_text, entities_list, boost_list, numDocs=10):\n"," assert len(entities_list) 
== len(boost_list)\n"," assert len(entities_list) > 0\n"," assert isinstance(entities_list[0], str)\n"," assert isinstance(boost_list[0], (int, float))\n","\n"," # Start with the full query text as an unboosted clause.\n"," entities_query_template = {\"query\": {\"bool\": {\"should\": [{\"match\": {\"body\": query_text}}]}}}\n","\n"," for entity, boost in zip(entities_list, boost_list):\n","     # Build a fresh clause per entity: copying one shared template with dict() is shallow,\n","     # so every clause would end up with the last entity and boost.\n","     boosted_clause = {\"match\": {\"body\": {\"query\": entity, \"boost\": boost}}}\n","     entities_query_template[\"query\"][\"bool\"][\"should\"].append(boosted_clause)\n","\n"," # Assumes `elastic` (the ElasticSearchSimpleAPI client wrapper) and `json_normalize` (e.g. from pandas)\n"," # are defined in the module this function is copied into.\n"," result = elastic.client.search(index='msmarco', body=entities_query_template, size=numDocs)\n"," return json_normalize(result[\"hits\"][\"hits\"])\n","\n","search_with_boosted_entities('What are the educational requirements required to become one?', ['educational', 'requirements'], [2.0, 1.0])"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"Dqu62Mpc8x4Z"},"source":["## Analyzers in Elasticsearch\n","\n","In Elasticsearch (ES), a text processing pipeline is called an analyzer.\n","Consider carefully how this pipeline transforms text when boosting entities: with the configuration below, tokens lose their possessive suffix, are lowercased, are filtered against a stop-word list, and are stemmed.\n","Here is the configuration for our index:\n","\n","\"analysis\": {\n"," \"filter\": {\n"," \"english_stemmer\": {\n"," \"type\": \"kstem\"\n"," },\n"," \"english_stop\": {\n"," \"type\": \"stop\",\n"," \"stopwords_path\": \"indri.txt\"\n"," },\n"," \"english_possessive_stemmer\": {\n"," \"type\": \"stemmer\",\n"," \"language\": \"possessive_english\"\n"," }\n"," },\n"," \"analyzer\": {\n"," \"rebuilt_english\": {\n"," \"filter\": [\n"," \"english_possessive_stemmer\",\n"," \"lowercase\",\n"," \"english_stop\",\n"," \"english_stemmer\"\n"," ],\n"," \"tokenizer\": \"standard\"\n"," }\n"," }\n","\n","This excerpt comes from the full index settings:\n","\n","{\n"," \"settings\": {\n"," \"index\": {\n"," \"search\": {\n"," 
\"slowlog\": {\n"," \"level\": \"trace\",\n"," \"threshold\": {\n"," \"fetch\": {\n"," \"warn\": \"1s\",\n"," \"trace\": \"0ms\",\n"," \"debug\": \"500ms\",\n"," \"info\": \"1ms\"\n"," },\n"," \"query\": {\n"," \"warn\": \"10s\",\n"," \"trace\": \"0ms\",\n"," \"debug\": \"2s\",\n"," \"info\": \"1ms\"\n"," }\n"," }\n"," }\n"," },\n"," \"number_of_shards\": \"1\",\n"," \"provided_name\": \"msmarco\",\n"," \"similarity\": {\n"," \"default\": {\n"," \"type\": \"BM25\"\n"," },\n"," \"lmd\": {\n"," \"mu\": \"1000\",\n"," \"type\": \"LMDirichlet\"\n"," }\n"," },\n"," \"creation_date\": \"1571155715084\",\n"," \"analysis\": {\n"," \"filter\": {\n"," \"english_stemmer\": {\n"," \"type\": \"kstem\"\n"," },\n"," \"english_stop\": {\n"," \"type\": \"stop\",\n"," \"stopwords_path\": \"indri.txt\"\n"," },\n"," \"english_possessive_stemmer\": {\n"," \"type\": \"stemmer\",\n"," \"language\": \"possessive_english\"\n"," }\n"," },\n"," \"analyzer\": {\n"," \"rebuilt_english\": {\n"," \"filter\": [\n"," \"english_possessive_stemmer\",\n"," \"lowercase\",\n"," \"english_stop\",\n"," \"english_stemmer\"\n"," ],\n"," \"tokenizer\": \"standard\"\n"," }\n"," }\n"," },\n"," \"number_of_replicas\": \"0\",\n"," \"uuid\": \"li5OY6rOQeuafF_g9emnuQ\",\n"," \"version\": {\n"," \"created\": \"7040099\"\n"," }\n"," }\n"," }\n","}"]},{"cell_type":"markdown","metadata":{"id":"CRZqku7f8x6f"},"source":["## More\n","\n"," - Spacy NER: (https://spacy.io/usage/spacy-101#annotations-ner)\n","\n","- Elastic Search Boolean Queries:\n"," - (https://www.elastic.co/blog/how-to-improve-elasticsearch-search-relevance-with-boolean-queries)\n"," - (https://www.elastic.co/guide/en/elasticsearch/reference/7.4/query-dsl-bool-query.html)\n","\n","- Elastic Search Query Boosting:\n"," - (https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-terms-query.html#query-dsl-terms-query)\n"," - (https://logz.io/blog/elasticsearch-queries/)\n","\n"]}]}