{"id":11523,"date":"2022-01-27T15:43:04","date_gmt":"2022-01-27T23:43:04","guid":{"rendered":"https:\/\/threecloud.wpengine.com\/?p=11523"},"modified":"2023-09-18T16:49:59","modified_gmt":"2023-09-18T23:49:59","slug":"featurizing-text-with-googles-t5-text-to-text-transformer","status":"publish","type":"post","link":"https:\/\/3cloudsolutions.com\/resources\/featurizing-text-with-googles-t5-text-to-text-transformer\/","title":{"rendered":"Featurizing text with Google\u2019s T5 Text to Text Transformer"},"content":{"rendered":"<article>\n<section>\n<div>\n<p id=\"4169\" data-selectable-paragraph=\"\">In this article we will demonstrate how to featurize text in tabular data using Google\u2019s state-of-the-art T5 Text to Text Transformer. You can follow along using the Jupyter Notebook from\u00a0<a href=\"https:\/\/github.com\/mikewcasale\/nlp_primitives\" rel=\"noopener nofollow\">this repository<\/a>.<\/p>\n<p id=\"225e\" data-selectable-paragraph=\"\">When trying to leverage real-world data in a machine learning pipeline, it is common to come across written text \u2014 for example, when predicting real estate valuations there are many numerical features, such as:<\/p>\n<ul>\n<li id=\"d094\" data-selectable-paragraph=\"\">\u201cnumber of bedrooms\u201d<\/li>\n<li id=\"df62\" data-selectable-paragraph=\"\">\u201cnumber of bathrooms\u201d<\/li>\n<li id=\"0f4d\" data-selectable-paragraph=\"\">\u201carea in sqft\u201d<\/li>\n<li id=\"6c5e\" data-selectable-paragraph=\"\">\u201clatitude\u201d<\/li>\n<li id=\"4dff\" data-selectable-paragraph=\"\">\u201clongitude\u201d<\/li>\n<li id=\"500b\" data-selectable-paragraph=\"\">&amp;etc\u2026<\/li>\n<\/ul>\n<p id=\"dd5f\" data-selectable-paragraph=\"\">But also, there are large blobs of<span id=\"rmm\">\u00a0<\/span>written text, such as found in real estate listing descriptions on sites like Zillow. 
This text data can include a lot of valuable information which is not otherwise accounted for in the tabular data, for example:<\/p>\n<ul>\n<li id=\"c941\" data-selectable-paragraph=\"\">mentions of an open kitchen\/floor-plan<\/li>\n<li id=\"2a79\" data-selectable-paragraph=\"\">mentions of granite counters<\/li>\n<li id=\"4ded\" data-selectable-paragraph=\"\">mentions of hardwood floors<\/li>\n<li id=\"9436\" data-selectable-paragraph=\"\">mentions of stainless steel appliances<\/li>\n<li id=\"f877\" data-selectable-paragraph=\"\">mentions of recent renovations<\/li>\n<li id=\"b219\" data-selectable-paragraph=\"\">&amp;etc\u2026<\/li>\n<\/ul>\n<p id=\"c22b\" data-selectable-paragraph=\"\">Yet, surprisingly, many AutoML tools entirely disregard this information because written text cannot be directly consumed by popular tabular algorithms, such as XGBoost.<\/p>\n<p id=\"858a\" data-selectable-paragraph=\"\">This is where\u00a0<a href=\"https:\/\/github.com\/Featuretools\/featuretools\" rel=\"noopener nofollow\">Featuretools<\/a>\u00a0primitive functions\u00a0come in. 
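The idea of turning text mentions into numeric columns that a tabular model can consume can be sketched by hand before reaching for any library. A minimal sketch, where the mini-dataset, column names, and keyword list are all invented for illustration:

```python
import pandas as pd

# Hypothetical mini-dataset: tabular features plus a free-text listing description.
listings = pd.DataFrame({
    "bedrooms": [3, 2],
    "area_sqft": [1500, 900],
    "description": [
        "Renovated home with an open kitchen, granite counters and hardwood floors.",
        "Cozy condo close to downtown, recently painted.",
    ],
})

# Turn each keyword mention into a 0.0/1.0 column that a model like XGBoost can consume.
for phrase in ["open kitchen", "granite counters", "hardwood floors"]:
    col = "mentions_" + phrase.replace(" ", "_")
    listings[col] = listings["description"].str.lower().str.contains(phrase).astype(float)

print(listings.filter(like="mentions_"))
```

Hand-written keyword flags like these are brittle and task-specific, which is exactly the gap that automatically generated text features are meant to fill.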
Featuretools aims to automatically create\u00a0features\u00a0for different types of data, including text, which can then be consumed by tabular machine learning models.<\/p>\n<p id=\"b380\" data-selectable-paragraph=\"\">In this article we show how to extend the\u00a0nlp-primitives library\u00a0for use with Google\u2019s state-of-the-art T5 Text to Text Transformer model, and in doing so, we create the most important NLP primitive feature, which in turn improves upon the accuracy demonstrated in the Alteryx blog\u00a0<a href=\"https:\/\/innovation.alteryx.com\/natural-language-processing-featuretools\/\" rel=\"noopener nofollow\">Natural Language Processing for Automated Feature Engineering<\/a>.<\/p>\n<figure>\n<div tabindex=\"0\" role=\"button\">\n<div><\/div>\n<p><img loading=\"lazy\" decoding=\"async\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/max\/2000\/1*60g54I1mCfKKYv3wTgfq8A.gif\" sizes=\"auto, 700px\" srcset=\"https:\/\/miro.medium.com\/max\/345\/1*60g54I1mCfKKYv3wTgfq8A.gif 276w, https:\/\/miro.medium.com\/max\/690\/1*60g54I1mCfKKYv3wTgfq8A.gif 552w, https:\/\/miro.medium.com\/max\/800\/1*60g54I1mCfKKYv3wTgfq8A.gif 640w, https:\/\/miro.medium.com\/max\/875\/1*60g54I1mCfKKYv3wTgfq8A.gif 700w\" alt=\"\" width=\"1600\" height=\"600\" \/><\/p>\n<\/div><figcaption data-selectable-paragraph=\"\"><span style=\"font-family: arial, helvetica, sans-serif;\"><a href=\"https:\/\/arxiv.org\/abs\/1910.10683\" rel=\"noopener nofollow\"><em>Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer<\/em><\/a><\/span><\/figcaption><\/figure>\n<p id=\"a98b\" data-selectable-paragraph=\"\">For any readers unfamiliar with T5 \u2014 the T5 model was presented in Google\u2019s paper titled\u00a0<a href=\"https:\/\/arxiv.org\/pdf\/1910.10683.pdf\" rel=\"noopener nofollow\">Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer<\/a>\u00a0by\u00a0Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, 
Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu.\u00a0Here is the abstract:<\/p>\n<p><em>Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts every language problem into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new \u201cColossal Clean Crawled Corpus\u201d, we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our dataset, pre-trained models, and code.<\/em><\/p>\n<\/div>\n<div>\n<h3 id=\"b091\" style=\"font-weight: bold;\"><span style=\"font-family: arial, helvetica, sans-serif; color: #666666;\">A Machine Learning Demo Featurizing Text using Hugging Face T5<\/span><\/h3>\n<p>&nbsp;<\/p>\n<div>\n<div>\n<div><img loading=\"lazy\" decoding=\"async\" style=\"margin-left: auto; margin-right: auto; display: block;\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/max\/358\/1*T2MnnGumUGamznMk0yiyVA.png\" sizes=\"auto, 286px\" srcset=\"https:\/\/miro.medium.com\/max\/345\/1*T2MnnGumUGamznMk0yiyVA.png 276w, https:\/\/miro.medium.com\/max\/358\/1*T2MnnGumUGamznMk0yiyVA.png 286w\" alt=\"\" width=\"286\" height=\"210\" \/><\/div>\n<\/div>\n<\/div>\n<p><span style=\"font-family: arial, helvetica, sans-serif; color: #666666;\">I<\/span>mage\/logo by Hugging Face Transformers library \u2014 Transformers is a natural language processing 
library, and its hub is now open to all ML models, with support from libraries like\u00a0<a href=\"https:\/\/github.com\/flairNLP\/flair\" rel=\"noopener nofollow\">Flair<\/a>,\u00a0<a href=\"https:\/\/github.com\/asteroid-team\/asteroid\" rel=\"noopener nofollow\">Asteroid<\/a>,\u00a0<a href=\"https:\/\/github.com\/espnet\/espnet\" rel=\"noopener nofollow\">ESPnet<\/a>,\u00a0<a href=\"https:\/\/github.com\/pyannote\/pyannote-audio\" rel=\"noopener nofollow\">Pyannote<\/a>, and more.<\/p>\n<p>&nbsp;<\/p>\n<p id=\"0526\" data-selectable-paragraph=\"\">In order to extend the NLP primitives library for use with T5, we will build two custom\u00a0<code>TransformPrimitive<\/code>\u00a0classes. For experimental purposes we test two approaches:<\/p>\n<ul>\n<li id=\"c00d\" data-selectable-paragraph=\"\">Fine-tuning the\u00a0<a href=\"https:\/\/huggingface.co\/t5-base\" rel=\"noopener nofollow\">Hugging Face T5-base<\/a><\/li>\n<li id=\"30ba\" data-selectable-paragraph=\"\">An off-the-shelf\u00a0<a href=\"https:\/\/huggingface.co\/mrm8488\/t5-base-finetuned-imdb-sentiment\" rel=\"noopener nofollow\">Hugging Face T5 model pre-tuned for sentiment analysis<\/a><\/li>\n<\/ul>\n<p id=\"a077\" data-selectable-paragraph=\"\">First, let\u2019s load the base model.<\/p>\n<div style=\"overflow-x: auto; max-width: 100%; width: 100%; margin-left: auto; margin-right: auto;\" data-hs-responsive-table=\"true\">\n<table style=\"width: 100%; border-color: #99acc2; border-style: solid; border-collapse: collapse; table-layout: fixed;\" border=\"1\" cellpadding=\"4\">\n<tbody>\n<tr>\n<td style=\"width: 100%; background-color: #b6b8ba;\">\n<pre><span id=\"fe44\" data-selectable-paragraph=\"\">from simpletransformers.t5 import T5Model<\/span><span id=\"b06c\" data-selectable-paragraph=\"\">model_args = {\r\n    \"max_seq_length\": 196,\r\n    \"train_batch_size\": 8,\r\n    \"eval_batch_size\": 8,\r\n    \"num_train_epochs\": 1,\r\n    \"evaluate_during_training\": True,\r\n    \"evaluate_during_training_steps\": 
15000,\r\n    \"evaluate_during_training_verbose\": True,\r\n    \"use_multiprocessing\": False,\r\n    \"fp16\": False,\r\n    \"save_steps\": -1,\r\n    \"save_eval_checkpoints\": False,\r\n    \"save_model_every_epoch\": False,\r\n    \"reprocess_input_data\": True,\r\n    \"overwrite_output_dir\": True,\r\n    \"wandb_project\": None,\r\n}<\/span><span id=\"d9f6\" data-selectable-paragraph=\"\">model = T5Model(\"t5\", \"t5-base\", args=model_args)<\/span><\/pre>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<p id=\"28f3\" data-selectable-paragraph=\"\">Second, let\u2019s load the pre-tuned model.<\/p>\n<div style=\"overflow-x: auto; max-width: 100%; width: 100%; margin-left: auto; margin-right: auto;\" data-hs-responsive-table=\"true\">\n<table style=\"width: 100%; border-color: #99acc2; border-style: solid; border-collapse: collapse; table-layout: fixed;\" border=\"1\" cellpadding=\"4\">\n<tbody>\n<tr>\n<td style=\"width: 100%; background-color: #b6b8ba;\">\n<pre><span id=\"f1ae\" data-selectable-paragraph=\"\">model_pretuned_sentiment = T5Model(\"t5\",\r\n    \"mrm8488\/t5-base-finetuned-imdb-sentiment\",\r\n    use_cuda=True)\r\nmodel_pretuned_sentiment.args<\/span><\/pre>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<p id=\"3217\" data-selectable-paragraph=\"\">In order to fine-tune the\u00a0<span style=\"font-family: arial, helvetica, sans-serif;\"><code>t5-base<\/code><\/span>\u00a0model, we need to reorganize and format the data for training.<\/p>\n<figure style=\"text-align: center;\">\n<div><img decoding=\"async\" src=\"https:\/\/miro.medium.com\/proxy\/1*CQ98EF7xMBN6PjB_2b3ufA.png\" alt=\"png\" \/><\/div><figcaption data-selectable-paragraph=\"\"><span style=\"font-family: arial, helvetica, sans-serif;\">Original Kaggle dataset<\/span><\/figcaption><\/figure>\n<p id=\"8b8a\" data-selectable-paragraph=\"\">From the Kaggle dataset, we will map the\u00a0<span style=\"font-family: arial, helvetica, 
sans-serif;\"><code>review_text<\/code><\/span>\u00a0column to a new column called\u00a0<span style=\"font-family: arial, helvetica, sans-serif;\"><code>input_text<\/code><\/span>, and we will map the\u00a0<span style=\"font-family: arial, helvetica, sans-serif;\"><code>review_rating<\/code><\/span>\u00a0column to a new column called\u00a0<span style=\"font-family: arial, helvetica, sans-serif;\"><code>target_text<\/code><\/span>, meaning the\u00a0<span style=\"font-family: arial, helvetica, sans-serif;\"><code>review_rating<\/code><\/span>\u00a0is what we\u2019re trying to predict. These changes conform to the Simpletransformers library interface for fine-tuning T5, whereby the main additional requirement is to specify a \u201cprefix\u201d, which is meant to assist with multi-task training (note: this example focuses on a single task, so the prefix is not strictly necessary, but we define it anyway for ease of use).<\/p>\n<div style=\"overflow-x: auto; max-width: 100%; width: 100%; margin-left: auto; margin-right: auto;\" data-hs-responsive-table=\"true\">\n<table style=\"width: 100%; border-color: #99acc2; border-style: solid; border-collapse: collapse; table-layout: fixed;\" border=\"1\" cellpadding=\"4\">\n<tbody>\n<tr>\n<td style=\"width: 100%; background-color: #b6b8ba;\">\n<pre><span id=\"dd22\" data-selectable-paragraph=\"\">dft5 = df[['review_text', 'review_rating']].rename({\r\n    'review_text': 'input_text',\r\n    'review_rating': 'target_text'\r\n}, axis=1)<\/span><span id=\"23b1\" data-selectable-paragraph=\"\">dft5['prefix'] = ['t5-encode' for x in range(len(dft5))]<\/span><span id=\"ac15\" data-selectable-paragraph=\"\">dft5['target_text'] = dft5['target_text'].astype(str)<\/span><span id=\"c923\" data-selectable-paragraph=\"\">dft5<\/span><\/pre>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<figure style=\"text-align: center;\">\n<div><img decoding=\"async\" src=\"https:\/\/miro.medium.com\/proxy\/1*6MgcvDh7ljoBLodmS-ObIQ.png\" alt=\"png\" \/><\/div><figcaption 
data-selectable-paragraph=\"\"><strong><span style=\"font-family: arial, helvetica, sans-serif;\">Output<\/span><\/strong><\/figcaption><\/figure>\n<p id=\"724f\" data-selectable-paragraph=\"\">The target text in this example is the ratings consumers gave to a given restaurant. We can easily fine-tune the T5 model for this task as follows:<\/p>\n<div style=\"overflow-x: auto; max-width: 100%; width: 100%; margin-left: auto; margin-right: auto;\" data-hs-responsive-table=\"true\">\n<table style=\"width: 100%; border-color: #99acc2; border-style: solid; border-collapse: collapse; table-layout: fixed;\" border=\"1\" cellpadding=\"4\">\n<tbody>\n<tr>\n<td style=\"width: 100%; background-color: #b6b8ba;\">\n<pre><span id=\"4907\" data-selectable-paragraph=\"\">from sklearn.model_selection import train_test_split<\/span><span id=\"46ef\" data-selectable-paragraph=\"\">train_df, eval_df = train_test_split(dft5)<\/span><span id=\"a969\" data-selectable-paragraph=\"\">model.train_model(train_df, eval_data=eval_df)<\/span><\/pre>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<p id=\"a021\" data-selectable-paragraph=\"\">Next, we load the pre-tuned Hugging Face model.<\/p>\n<div style=\"overflow-x: auto; max-width: 100%; width: 100%; margin-left: auto; margin-right: auto;\" data-hs-responsive-table=\"true\">\n<table style=\"width: 100%; border-color: #99acc2; border-style: solid; border-collapse: collapse; table-layout: fixed;\" border=\"1\" cellpadding=\"4\">\n<tbody>\n<tr>\n<td style=\"width: 100%; background-color: #b6b8ba;\">\n<pre>model_pretuned_sentiment = T5Model(\"t5\",\r\n    \"mrm8488\/t5-base-finetuned-imdb-sentiment\",\r\n    use_cuda=True)<\/pre>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<p id=\"5c19\" data-selectable-paragraph=\"\">Let\u2019s test both models to better understand what they will predict.<\/p>\n<div style=\"overflow-x: auto; max-width: 100%; width: 100%; margin-left: auto; margin-right: auto;\" 
data-hs-responsive-table=\"true\">\n<table style=\"border-color: #99acc2; border-collapse: collapse; table-layout: fixed; margin-left: auto; margin-right: auto; width: 100%; border-style: solid;\" border=\"1\" cellpadding=\"4\">\n<tbody>\n<tr>\n<td style=\"width: 100%; background-color: #b6b8ba;\">\n<pre><span id=\"0fab\" data-selectable-paragraph=\"\">test = ['Great drinks and food',\r\n        'Good food &amp; beer',\r\n        'Pretty good beers']<\/span><span id=\"3c83\" data-selectable-paragraph=\"\">list(np.array(model.predict(test)).astype(float))\r\n\r\nOut[14]: [4.0, 4.0, 4.0]<\/span><\/pre>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>We can see that the fine-tuned model outputs a list of predicted\u00a0<span style=\"font-family: arial, helvetica, sans-serif; color: #666666;\"><code>review_rating<\/code><\/span>\u00a0values, [4.0, 4.0, 4.0], which is a direct attempt to predict the target for each review.<\/p>\n<\/div>\n<p id=\"a90e\" data-selectable-paragraph=\"\">Next, let\u2019s do a\u00a0test prediction\u00a0using the pre-tuned Hugging Face model.<\/p>\n<div style=\"overflow-x: auto; max-width: 100%; width: 100%; margin-left: auto; margin-right: auto;\" data-hs-responsive-table=\"true\">\n<table style=\"width: 100%; border-color: #99acc2; border-style: solid; border-collapse: collapse; table-layout: fixed;\" border=\"1\" cellpadding=\"4\">\n<tbody>\n<tr>\n<td style=\"width: 100%; background-color: #b6b8ba;\">\n<pre><span id=\"3f12\" data-selectable-paragraph=\"\">test = ['Great drinks and food',\r\n        'Good food &amp; beer',\r\n        'Pretty good beers']<\/span><span id=\"e03c\" data-selectable-paragraph=\"\">list(np.where(np.array(model_pretuned_sentiment.predict(test))=='positive', 1.0, 0.0))\r\n\r\nOut[15]: [1.0, 1.0, 1.0]<\/span><\/pre>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<p id=\"596a\" data-selectable-paragraph=\"\">Note that the pre-tuned model outputs a list of\u00a0<span style=\"font-family: arial, helvetica, sans-serif;\"><code>positive<\/code><\/span>\u00a0or\u00a0<span style=\"font-family: arial, helvetica, sans-serif;\"><code>negative<\/code><\/span>\u00a0labels, which we convert into float values for better integration with tabular modeling. 
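The label-to-float conversion itself does not depend on the model, so it can be sanity-checked with a stand-in for the model's output. In this sketch, `fake_predictions` is an invented substitute for what `model_pretuned_sentiment.predict` returns:

```python
import numpy as np

# Stand-in for model_pretuned_sentiment.predict(test): the pre-tuned T5 decodes
# each input to the string 'positive' or 'negative'.
fake_predictions = ["positive", "negative", "positive"]

# Same conversion as above: map sentiment labels to 1.0 / 0.0 floats so the
# result can be consumed as a tabular feature.
encoded = list(np.where(np.array(fake_predictions) == "positive", 1.0, 0.0))
print(encoded)
```

Keeping the conversion separate from the model call like this also makes the encoding logic easy to unit-test without loading any weights.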
In this case, all three test sentences are classified as positive, so the output becomes [1.0, 1.0, 1.0].<\/p>\n<p id=\"ab40\" data-selectable-paragraph=\"\">Now that we\u2019ve loaded our two versions of T5, we can build\u00a0<span style=\"font-family: arial, helvetica, sans-serif;\"><code>TransformPrimitive<\/code><\/span>\u00a0classes which will integrate with the NLP Primitives and Featuretools libraries.<\/p>\n<div style=\"overflow-x: auto; max-width: 100%; width: 100%; margin-left: auto; margin-right: auto;\" data-hs-responsive-table=\"true\">\n<table style=\"width: 100%; border-color: #99acc2; border-style: solid; border-collapse: collapse; table-layout: fixed;\" border=\"1\" cellpadding=\"4\">\n<tbody>\n<tr>\n<td style=\"width: 100%; background-color: #b6b8ba;\">\n<pre>from featuretools.primitives.base import TransformPrimitive\r\nfrom featuretools.variable_types import Numeric, Text\r\n\r\nclass T5Encoder(TransformPrimitive):\r\n\r\n    name = \"t5_encoder\"\r\n    input_types = [Text]\r\n    return_type = Numeric\r\n    default_value = 0\r\n\r\n    def __init__(self, model=model):\r\n        self.model = model\r\n\r\n    def get_function(self):\r\n\r\n        def t5_encoder(x):\r\n            self.model.args.use_multiprocessing = True\r\n            return list(np.array(self.model.predict(x.tolist())).astype(float))\r\n\r\n        return t5_encoder<\/pre>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<p id=\"3917\" data-selectable-paragraph=\"\">The above code creates a new class called\u00a0<span style=\"font-family: arial, helvetica, sans-serif;\"><code>T5Encoder<\/code><\/span>\u00a0which will use the\u00a0fine-tuned\u00a0T5 model, and the below code creates a new class called\u00a0<span style=\"font-family: arial, helvetica, sans-serif;\"><code>T5SentimentEncoder<\/code><\/span>\u00a0which will use the\u00a0pre-tuned\u00a0T5 model.<\/p>\n<div style=\"overflow-x: auto; max-width: 100%; width: 100%; margin-left: auto; margin-right: auto;\" data-hs-responsive-table=\"true\">\n<table style=\"width: 100%; border-color: #99acc2; border-style: solid; border-collapse: collapse; table-layout: fixed;\" border=\"1\" cellpadding=\"4\">\n<tbody>\n<tr>\n<td style=\"width: 100%; background-color: #b6b8ba;\">\n<pre>class T5SentimentEncoder(TransformPrimitive):\r\n\r\n    name = \"t5_sentiment_encoder\"\r\n    input_types = [Text]\r\n    return_type = Numeric\r\n    default_value = 0\r\n\r\n    def __init__(self, model=model_pretuned_sentiment):\r\n        self.model = model\r\n\r\n    def get_function(self):\r\n\r\n        def t5_sentiment_encoder(x):\r\n            self.model.args.use_multiprocessing = True\r\n            return list(np.where(np.array(self.model.predict(x.tolist())) == 'positive', 1.0, 0.0))\r\n\r\n        return t5_sentiment_encoder<\/pre>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<p>Featuretools will now know how to use T5 to featurize text columns, and it will even calculate aggregates using the T5 output, or perform operations with it, such as subtracting the value from other features. Having defined these new classes, we simply roll them up in the required Featuretools format along with the default classes, which will make them available for use with automated feature engineering.<\/p>\n<div style=\"overflow-x: auto; max-width: 100%; width: 100%; margin-left: auto; margin-right: auto;\" data-hs-responsive-table=\"true\">\n<table style=\"width: 100%; border-color: #99acc2; border-style: solid; border-collapse: collapse; table-layout: fixed;\" border=\"1\" cellpadding=\"4\">\n<tbody>\n<tr>\n<td style=\"width: 100%; background-color: #b6b8ba;\">\n<pre>trans = [\r\n    T5Encoder,\r\n    T5SentimentEncoder,\r\n    DiversityScore,\r\n    LSA,\r\n    MeanCharactersPerWord,\r\n    PartOfSpeechCount,\r\n    PolarityScore,\r\n    PunctuationCount,\r\n    StopwordCount,\r\n    TitleWordCount,\r\n    UniversalSentenceEncoder,\r\n    UpperCaseCount\r\n]\r\n\r\nignore = {'restaurants': ['rating'],\r\n          'reviews': ['review_rating']}\r\n\r\ndrop_contains = ['(reviews.UNIVERSAL']\r\n\r\nfeatures = ft.dfs(entityset=es,\r\n                  target_entity='reviews',\r\n                  trans_primitives=trans,\r\n                  verbose=True,\r\n                  features_only=True,\r\n                  ignore_variables=ignore,\r\n                  drop_contains=drop_contains,\r\n                  max_depth=4)<\/pre>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<p>As you can see in the output below, the Featuretools library is very powerful! In addition to the T5 features shown here, it also created hundreds more using all of the other NLP primitives specified. Pretty cool!<\/p>\n<div style=\"overflow-x: auto; max-width: 100%; width: 100%; margin-left: auto; margin-right: auto;\" data-hs-responsive-table=\"true\">\n<table style=\"width: 100%; border-color: #99acc2; border-style: solid; border-collapse: collapse; table-layout: fixed;\" border=\"1\" cellpadding=\"4\">\n<tbody>\n<tr>\n<td style=\"width: 100%; background-color: #b6b8ba;\">\n<pre><span style=\"font-family: arial, helvetica, sans-serif;\"><span id=\"9dc9\" data-selectable-paragraph=\"\">feature_matrix = ft.calculate_feature_matrix(features=features,\r\n                                             entityset=es,\r\n                                             verbose=True)<\/span><span id=\"b105\" data-selectable-paragraph=\"\">features<\/span><\/span><\/pre>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<h3 id=\"cfea\" style=\"font-weight: normal;\"><span style=\"font-family: arial, helvetica, sans-serif;\">Out[20]:<\/span><\/h3>\n<ul>\n<li id=\"a28a\" data-selectable-paragraph=\"\">&lt;Feature: T5_ENCODER(review_title)&gt;<\/li>\n<li id=\"d2af\" data-selectable-paragraph=\"\">&lt;Feature: T5_SENTIMENT_ENCODER(review_title)&gt;<\/li>\n<li id=\"8f79\" data-selectable-paragraph=\"\">&lt;Feature: restaurants.MAX(reviews.T5_ENCODER(review_title))&gt;<\/li>\n<li id=\"2af1\" data-selectable-paragraph=\"\">&lt;Feature: restaurants.MAX(reviews.T5_SENTIMENT_ENCODER(review_title))&gt;<\/li>\n<li id=\"62f6\" 
data-selectable-paragraph=\"\">&lt;Feature: restaurants.MEAN(reviews.T5_ENCODER(review_title))&gt;<\/li>\n<li id=\"5bcb\" data-selectable-paragraph=\"\">&lt;Feature: restaurants.MEAN(reviews.T5_SENTIMENT_ENCODER(review_title))&gt;<\/li>\n<li id=\"07a9\" data-selectable-paragraph=\"\">&lt;Feature: restaurants.MIN(reviews.T5_ENCODER(review_title))&gt;<\/li>\n<li id=\"f710\" data-selectable-paragraph=\"\">&lt;Feature: restaurants.MIN(reviews.T5_SENTIMENT_ENCODER(review_title))&gt;<\/li>\n<li id=\"ffb2\" data-selectable-paragraph=\"\">&lt;Feature: restaurants.SKEW(reviews.T5_ENCODER(review_title))&gt;<\/li>\n<li id=\"9fbd\" data-selectable-paragraph=\"\">&lt;Feature: restaurants.SKEW(reviews.T5_SENTIMENT_ENCODER(review_title))&gt;<\/li>\n<li id=\"85c5\" data-selectable-paragraph=\"\">&lt;Feature: restaurants.STD(reviews.T5_ENCODER(review_title))&gt;<\/li>\n<li id=\"81e5\" data-selectable-paragraph=\"\">&lt;Feature: restaurants.STD(reviews.T5_SENTIMENT_ENCODER(review_title))&gt;<\/li>\n<li id=\"9ca2\" data-selectable-paragraph=\"\">&lt;Feature: restaurants.SUM(reviews.T5_ENCODER(review_title))&gt;<\/li>\n<li id=\"91af\" data-selectable-paragraph=\"\">&lt;Feature: restaurants.SUM(reviews.T5_SENTIMENT_ENCODER(review_title))&gt;<\/li>\n<\/ul>\n<\/div>\n<\/section>\n<\/article>\n<h3 id=\"2859\" style=\"font-weight: bold;\"><span style=\"font-family: arial, helvetica, sans-serif; color: #666666;\">Machine Learning<\/span><\/h3>\n<p id=\"f576\" data-selectable-paragraph=\"\">Now we create and test various machine learning models from sklearn using the feature matrix which includes the newly created T5 primitives.<\/p>\n<p id=\"ae7e\" data-selectable-paragraph=\"\">As a reminder, we are going to be comparing the T5 enhanced accuracy against the accuracy demonstrated in the Alteryx blog\u00a0<a href=\"https:\/\/innovation.alteryx.com\/natural-language-processing-featuretools\/\" rel=\"noopener nofollow\">Natural Language Processing for Automated Feature 
Engineering<\/a>.<\/p>\n<h3 id=\"1927\" style=\"font-weight: bold;\"><span style=\"font-family: arial, helvetica, sans-serif; color: #666666;\">Using Logistic Regression:<\/span><\/h3>\n<article>\n<section>\n<figure>\n<div tabindex=\"0\" role=\"button\">\n<div>\n<div>\n<div><img loading=\"lazy\" decoding=\"async\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/max\/1125\/1*jubleZDZr-k6IAyeJB9wNg.png\" sizes=\"auto, 700px\" srcset=\"https:\/\/miro.medium.com\/max\/345\/1*jubleZDZr-k6IAyeJB9wNg.png 276w, https:\/\/miro.medium.com\/max\/690\/1*jubleZDZr-k6IAyeJB9wNg.png 552w, https:\/\/miro.medium.com\/max\/800\/1*jubleZDZr-k6IAyeJB9wNg.png 640w, https:\/\/miro.medium.com\/max\/875\/1*jubleZDZr-k6IAyeJB9wNg.png 700w\" alt=\"\" width=\"900\" height=\"180\" \/><\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/figure>\n<figure>\n<div tabindex=\"0\" role=\"button\"><\/div>\n<div tabindex=\"0\" role=\"button\"><img loading=\"lazy\" decoding=\"async\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/max\/1113\/1*4XiZ0e7L0tJgiQLZ5YT3qA.png\" sizes=\"auto, 700px\" srcset=\"https:\/\/miro.medium.com\/max\/345\/1*4XiZ0e7L0tJgiQLZ5YT3qA.png 276w, https:\/\/miro.medium.com\/max\/690\/1*4XiZ0e7L0tJgiQLZ5YT3qA.png 552w, https:\/\/miro.medium.com\/max\/800\/1*4XiZ0e7L0tJgiQLZ5YT3qA.png 640w, https:\/\/miro.medium.com\/max\/875\/1*4XiZ0e7L0tJgiQLZ5YT3qA.png 700w\" alt=\"\" width=\"890\" height=\"176\" \/><\/div>\n<\/figure>\n<p id=\"52a9\" data-selectable-paragraph=\"\">Note that the 0.64 Logistic Regression score above shows an improvement over the Featuretools native Logistic Regression score, which was 0.63.<\/p>\n<h3 id=\"4797\" style=\"font-weight: bold;\"><span style=\"font-family: arial, helvetica, sans-serif;\">Using\u00a0Random Forest Classifier:<\/span><\/h3>\n<figure>\n<div tabindex=\"0\" role=\"button\"><img loading=\"lazy\" decoding=\"async\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/max\/1203\/1*TWYZRGbT5580cx0Xwiybww.png\" sizes=\"auto, 700px\" 
srcset=\"https:\/\/miro.medium.com\/max\/345\/1*TWYZRGbT5580cx0Xwiybww.png 276w, https:\/\/miro.medium.com\/max\/690\/1*TWYZRGbT5580cx0Xwiybww.png 552w, https:\/\/miro.medium.com\/max\/800\/1*TWYZRGbT5580cx0Xwiybww.png 640w, https:\/\/miro.medium.com\/max\/875\/1*TWYZRGbT5580cx0Xwiybww.png 700w\" alt=\"\" width=\"962\" height=\"192\" \/><\/div>\n<\/figure>\n<figure>\n<div tabindex=\"0\" role=\"button\">\n<div><\/div>\n<p><img loading=\"lazy\" decoding=\"async\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/max\/1210\/1*lUiXcjniERfcGZMQCK1mtg.png\" sizes=\"auto, 700px\" srcset=\"https:\/\/miro.medium.com\/max\/345\/1*lUiXcjniERfcGZMQCK1mtg.png 276w, https:\/\/miro.medium.com\/max\/690\/1*lUiXcjniERfcGZMQCK1mtg.png 552w, https:\/\/miro.medium.com\/max\/800\/1*lUiXcjniERfcGZMQCK1mtg.png 640w, https:\/\/miro.medium.com\/max\/875\/1*lUiXcjniERfcGZMQCK1mtg.png 700w\" alt=\"\" width=\"968\" height=\"180\" \/><\/p>\n<\/div>\n<\/figure>\n<p id=\"2d0a\" data-selectable-paragraph=\"\">Note that the T5-enhanced Random Forest Classifier score of 0.65 above shows an improvement over the Featuretools native Random Forest Classifier score, which was 0.64.<\/p>\n<h3 id=\"823a\"><span style=\"font-family: arial, helvetica, sans-serif; color: #666666;\"><strong>Random Forest Classifier Feature Importance<\/strong><\/span><\/h3>\n<p id=\"181c\" data-selectable-paragraph=\"\">We can attribute the improved score to the new T5 primitives using the sklearn Random Forest Classifier feature importance.<\/p>\n<h3 id=\"baae\" style=\"font-weight: normal;\"><span style=\"font-family: arial, helvetica, sans-serif;\">Out[30]:<\/span><\/h3>\n<figure>\n<div><img decoding=\"async\" src=\"https:\/\/miro.medium.com\/proxy\/1*GRwlsgKC36UQODgwFgaLlA.png\" alt=\"png\" \/><\/div>\n<\/figure>\n<p id=\"d976\" data-selectable-paragraph=\"\">From the above table we 
can see that the highest feature importance of the Random Forest model is the newly created feature<\/p>\n<p data-selectable-paragraph=\"\"><span style=\"font-family: arial, helvetica, sans-serif;\"><strong>T5_SENTIMENT_ENCODER(review_title)!<\/strong><\/span><\/p>\n<figure style=\"text-align: center;\">\n<div><img decoding=\"async\" src=\"https:\/\/miro.medium.com\/proxy\/1*EDX9uno4ihH7BIsVMi5-xQ.png\" alt=\"png\" \/><\/div><figcaption data-selectable-paragraph=\"\"><span style=\"font-family: arial, helvetica, sans-serif;\">Random Forest Classifier feature importance, Image by author<\/span><\/figcaption><\/figure>\n<h3 id=\"10c6\" style=\"font-weight: bold;\"><\/h3>\n<figure>\n<div tabindex=\"0\" role=\"button\">\n<p>&nbsp;<\/p>\n<\/div>\n<\/figure>\n<\/section>\n<\/article>\n","protected":false},"excerpt":{"rendered":"<p>In this article we will demonstrate how to featurize text in tabular data using Google\u2019s&mldr;<\/p>\n","protected":false},"author":70,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"content-type":"","footnotes":""},"categories":[260],"tags":[429],"class_list":["post-11523","post","type-post","status-publish","format-standard","hentry","category-data-ai","tag-data-and-ai","topics-blog"],"acf":[],"_links":{"self":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/posts\/11523","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/users\/70"}],"replies":[{"embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/comments?post=11523"}],"version-history":[{"count":0,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/posts\/11523\/revisions"}],"wp:attachment":[{"href":"https:\/\/3cloudsolutions.com\/wp-js
on\/wp\/v2\/media?parent=11523"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/categories?post=11523"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/3cloudsolutions.com\/wp-json\/wp\/v2\/tags?post=11523"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}