Practical 9: Applications of text mining and NLP¶
Developed by Javier Garcia-Bernardo, Anastasia Giachanou. Updated by Pablo Mosteiro.
Applied Text Mining - Utrecht Summer School
In this practical you will answer a research question or solve a problem by building a pipeline for classification or clustering.
All the data has been preprocessed and can be found in the GitHub repository.
Here are some proposed research questions:
Classification¶
RQ1: Identification of fake news, hate speech or spam + Interpretability of results:¶
Data:
https://www.kaggle.com/datasets/clmentbisaillon/fake-and-real-news-dataset or
https://github.com/aitor-garcia-p/hate-speech-dataset (https://paperswithcode.com/dataset/hate-speech) or
https://archive.ics.uci.edu/ml/datasets/YouTube+Spam+Collection
Goal: Evaluate performance of different methods and interpret the results using LIME
RQ2: Evaluate the importance of metadata. Create a classification system that identifies the movie genre with and without metadata:¶
Data: https://www.kaggle.com/datasets/jrobischon/wikipedia-movie-plots
Options:
Create two classification systems, one using only metadata and one using only text, then stack them to create the best model (see the sketch after this list): https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.StackingClassifier.html
Use the functional API of Keras to create one model that handles both types of inputs: https://pyimagesearch.com/2019/02/04/keras-multiple-inputs-and-mixed-data/
Goal: Evaluate performance and interpret the results using LIME
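A minimal sketch of the stacking option, assuming the movie data sits in a DataFrame df_movies with a text column "Plot", a numeric metadata column "Release Year" and a target "Genre" (adjust the column names to whatever you keep from the Kaggle CSV):

# Sketch: stack a text-only and a metadata-only classifier.
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import StackingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

text_model = Pipeline([
    ("features", ColumnTransformer([("tfidf", TfidfVectorizer(), "Plot")])),
    ("clf", LogisticRegression(max_iter=1000))])
meta_model = Pipeline([
    ("features", ColumnTransformer([("scale", StandardScaler(), ["Release Year"])])),
    ("clf", LogisticRegression(max_iter=1000))])
stacked = StackingClassifier(estimators=[("text", text_model), ("meta", meta_model)],
                             final_estimator=LogisticRegression())
# stacked.fit(df_movies, df_movies["Genre"])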
Clustering:¶
RQ3: Create a recommendation system for movies based on their plot:¶
Data: https://www.kaggle.com/datasets/jrobischon/wikipedia-movie-plots
Output: What are the closest movies to "The Shawshank Redemption", "Goodfellas", and "Harry Potter and the Sorcerer's Stone"?
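A possible starting point (a sketch, assuming the plots are loaded into a DataFrame df_movies with "Title" and "Plot" columns): represent each plot as a TF-IDF vector and recommend the nearest neighbours under cosine distance.

# Sketch: nearest movies by plot similarity (TF-IDF + cosine distance).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

titles = df_movies["Title"].reset_index(drop=True)
vectors = TfidfVectorizer(stop_words="english", min_df=2).fit_transform(df_movies["Plot"])
nn = NearestNeighbors(n_neighbors=6, metric="cosine").fit(vectors)

pos = titles[titles == "The Shawshank Redemption"].index[0]
_, idx = nn.kneighbors(vectors[pos])
print(titles.iloc[idx[0][1:]])  # the five closest plots, excluding the query itself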
RQ4: Cluster headlines using word embeddings:¶
Data: https://www.ims.uni-stuttgart.de/en/research/resources/corpora/goodnewseveryone/ (https://aclanthology.org/2020.lrec-1.194.pdf)
Do the clusters correlate with emotions or media sources?
You can also come up with your own research question, using any dataset on text analysis, e.g. from:
UCI repository: https://archive.ics.uci.edu/ml/datasets.php?format=&task=&att=&area=&numAtt=&numIns=&type=text&sort=nameUp&view=table
Papers with code repository: https://paperswithcode.com/datasets?mod=texts&page=1
Kaggle (code examples are often included): https://www.kaggle.com/datasets?tags=13204-NLP (but given the time restrictions, choosing one of the above is recommended)
# path to the data
path_data = "./"
# Data wrangling
import pandas as pd
import numpy as np
# Machine learning tools
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
# Interpretable AI
!pip install lime
from lime.lime_text import LimeTextExplainer
# data_rq1_fake = pd.read_csv("rq1_fake_news.csv.gzip",sep="\t",compression="gzip")
# data_rq1_hate_speech = pd.read_csv("rq1_hate_speech.csv.gzip",sep="\t",compression="gzip")
# data_rq1_youtube = pd.read_csv("rq1_youtube.csv.gzip",sep="\t",compression="gzip")
# data_rq2_3 = pd.read_csv("rq2_3_wiki_movie_plots.csv.gzip",sep="\t",compression="gzip")
# data_rq4 = pd.read_csv("rq4_gne-release-v1.0.csv.gzip",sep="\t",compression="gzip")
# data_rq1_fake.shape, data_rq1_hate_speech.shape, data_rq1_youtube.shape, data_rq2_3.shape, data_rq4.shape
RQ1: Identification of hate speech¶
Data on hate speech: https://github.com/aitor-garcia-p/hate-speech-dataset (https://paperswithcode.com/dataset/hate-speech)
Data on fake vs. real news: https://www.kaggle.com/datasets/clmentbisaillon/fake-and-real-news-dataset
Data on YouTube spam messages: https://archive.ics.uci.edu/ml/datasets/YouTube+Spam+Collection
We provide code for the first dataset. Your goal is to improve the classifier by using a more advanced method.
Data: a dataset of Internet forum posts in English, annotated for hate speech at the sentence level. The source forum is Stormfront, a large online community of white nationalists. A total of 10,568 sentences were extracted from Stormfront and classified as conveying hate speech or not.
Step 1: Read data and create train-test split¶
df = pd.read_csv(f"{path_data}/rq1_hate_speech.csv.gzip",sep="\t",compression="gzip", index_col=0)
df["label"] = df["label"].map({"hate": 1, "noHate": 0})
df = df[["text","label"]]
df = df.dropna()
print(df.shape)
df.head()
(10703, 2)
file_id | text | label
---|---|---
12834217_1 | As of March 13th , 2014 , the booklet had been... | 0.0 |
12834217_2 | In order to help increase the booklets downloa... | 0.0 |
12834217_3 | ( Simply copy and paste the following text int... | 0.0 |
12834217_4 | Click below for a FREE download of a colorfull... | 1.0 |
12834217_5 | Click on the `` DOWNLOAD ( 7.42 MB ) '' green ... | 0.0 |
# split into train and test
X_train, X_test, y_train, y_test = train_test_split(df["text"].values, df["label"].values, test_size=0.33, random_state=42)
Step 2: Create pipeline and hyperparameter tuning¶
Create a pipeline that vectorizes the text with TF-IDF and classifies the sentences using LogisticRegression.
# Pipeline
pipe = Pipeline([
('vectorizer', TfidfVectorizer(stop_words='english', #remove stopwords
lowercase=True, #convert to lowercase
token_pattern=r'(?u)\b[A-Za-z][A-Za-z]+\b')), #tokens of at least 2 characters
('clf', LogisticRegression(max_iter=10000, dual=False, solver="saga")) #logistic regression
])
# Parameters to tune
param_grid = dict(vectorizer__ngram_range=[(1,1), (1,2), (1,3)], # creation of n-grams
vectorizer__min_df=[1, 10, 100], # minimum support for words
clf__C=[0.1, 1, 10, 100], # regularization
clf__penalty=["l2","l1"]) # type of regularization
# Run a grid search using cross-validation to find the best parameters
grid_search = GridSearchCV(pipe, param_grid=param_grid, verbose=True, n_jobs=-1)
# to speed things up we tune the hyperparameters on a sample and fit on the entire dataset later
grid_search.fit(X_train[:1000], y_train[:1000])
# best parameters, score and estimator
print(grid_search.best_params_)
print(grid_search.best_score_)
Fitting 5 folds for each of 72 candidates, totalling 360 fits {'clf__C': 10, 'clf__penalty': 'l2', 'vectorizer__min_df': 1, 'vectorizer__ngram_range': (1, 2)} 0.893
FitFailedWarning: 120 fits failed out of a total of 360; the scores on these train-test partitions are set to nan. All 120 failures raised the same error: ValueError: After pruning, no terms remain. Try a lower min_df or a higher max_df. (These are exactly the min_df=100 candidates: on the 1,000-sentence tuning sample, no term survives that document-frequency threshold.) A related UserWarning reports that one or more of the test scores are non-finite.
# print results
results = pd.DataFrame(grid_search.cv_results_)
results.sort_values(by="mean_test_score", ascending=False).head(10)
  | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_clf__C | param_clf__penalty | param_vectorizer__min_df | param_vectorizer__ngram_range | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
37 | 0.186329 | 0.002368 | 0.008961 | 0.002774 | 10 | l2 | 1 | (1, 2) | {'clf__C': 10, 'clf__penalty': 'l2', 'vectoriz... | 0.895 | 0.900 | 0.895 | 0.885 | 0.89 | 0.893 | 0.005099 | 1 |
0 | 0.084127 | 0.012519 | 0.012037 | 0.004752 | 0.1 | l2 | 1 | (1, 1) | {'clf__C': 0.1, 'clf__penalty': 'l2', 'vectori... | 0.895 | 0.895 | 0.890 | 0.890 | 0.89 | 0.892 | 0.002449 | 2 |
1 | 0.144696 | 0.008865 | 0.011563 | 0.001178 | 0.1 | l2 | 1 | (1, 2) | {'clf__C': 0.1, 'clf__penalty': 'l2', 'vectori... | 0.895 | 0.895 | 0.890 | 0.890 | 0.89 | 0.892 | 0.002449 | 2 |
28 | 0.166744 | 0.055881 | 0.008554 | 0.004299 | 1 | l1 | 1 | (1, 2) | {'clf__C': 1, 'clf__penalty': 'l1', 'vectorize... | 0.895 | 0.895 | 0.890 | 0.890 | 0.89 | 0.892 | 0.002449 | 2 |
56 | 0.485185 | 0.032435 | 0.009688 | 0.001688 | 100 | l2 | 1 | (1, 3) | {'clf__C': 100, 'clf__penalty': 'l2', 'vectori... | 0.895 | 0.885 | 0.905 | 0.885 | 0.89 | 0.892 | 0.007483 | 2 |
20 | 0.200785 | 0.029707 | 0.011157 | 0.002752 | 1 | l2 | 1 | (1, 3) | {'clf__C': 1, 'clf__penalty': 'l2', 'vectorize... | 0.895 | 0.895 | 0.890 | 0.890 | 0.89 | 0.892 | 0.002449 | 2 |
19 | 0.138288 | 0.009335 | 0.008347 | 0.004239 | 1 | l2 | 1 | (1, 2) | {'clf__C': 1, 'clf__penalty': 'l2', 'vectorize... | 0.895 | 0.895 | 0.890 | 0.890 | 0.89 | 0.892 | 0.002449 | 2 |
18 | 0.089412 | 0.003192 | 0.009061 | 0.001833 | 1 | l2 | 1 | (1, 1) | {'clf__C': 1, 'clf__penalty': 'l2', 'vectorize... | 0.895 | 0.895 | 0.890 | 0.890 | 0.89 | 0.892 | 0.002449 | 2 |
14 | 0.053019 | 0.006905 | 0.005858 | 0.005593 | 0.1 | l1 | 10 | (1, 3) | {'clf__C': 0.1, 'clf__penalty': 'l1', 'vectori... | 0.895 | 0.895 | 0.890 | 0.890 | 0.89 | 0.892 | 0.002449 | 2 |
13 | 0.051075 | 0.019864 | 0.006983 | 0.006018 | 0.1 | l1 | 10 | (1, 2) | {'clf__C': 0.1, 'clf__penalty': 'l1', 'vectori... | 0.895 | 0.895 | 0.890 | 0.890 | 0.89 | 0.892 | 0.002449 | 2 |
# Use the best parameters in the pipe and fit with the entire dataset
pipe = pipe.set_params(**grid_search.best_params_)
clf_best = pipe.fit(X_train, y_train)
# print vocabulary size
print(len(clf_best["vectorizer"].get_feature_names_out()))
#vocabulary
#clf_best["vectorizer"].vocabulary_
# accuracy on the training set
print(clf_best.score(X_train, y_train))
# accuracy on the test set
print(clf_best.score(X_test, y_test))
53376 0.9993027471761261 0.8955266138165345
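Note the gap between training accuracy (0.999) and test accuracy (0.896): the model heavily overfits the training set. Because the noHate class dominates the corpus, accuracy alone is also a weak measure here; a per-class report is more informative. A minimal sketch, reusing the fitted clf_best from above:

# Sketch: per-class precision, recall and F1 on the held-out test set.
from sklearn.metrics import classification_report

y_pred = clf_best.predict(X_test)
print(classification_report(y_test, y_pred, target_names=["noHate", "hate"]))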
# Add predictions to the dataframe
df["predicted"] = clf_best.predict(df["text"])
df["predicted_prob_hate"] = clf_best.predict_proba(df["text"])[:,1]
df
file_id | text | label | predicted | predicted_prob_hate
---|---|---|---|---
12834217_1 | As of March 13th , 2014 , the booklet had been... | 0.0 | 0.0 | 0.017500 |
12834217_2 | In order to help increase the booklets downloa... | 0.0 | 0.0 | 0.018843 |
12834217_3 | ( Simply copy and paste the following text int... | 0.0 | 0.0 | 0.012737 |
12834217_4 | Click below for a FREE download of a colorfull... | 1.0 | 1.0 | 0.692457 |
12834217_5 | Click on the `` DOWNLOAD ( 7.42 MB ) '' green ... | 0.0 | 0.0 | 0.016726 |
... | ... | ... | ... | ... |
33676864_5 | Billy - `` That guy would n't leave me alone ,... | 0.0 | 0.0 | 0.057325 |
33677019_1 | Wish we at least had a Marine Le Pen to vote f... | 0.0 | 0.0 | 0.048629 |
33677019_2 | Its like the choices are white genocide candid... | 0.0 | 0.0 | 0.040036 |
33677053_1 | Why White people used to say that sex was a si... | 1.0 | 0.0 | 0.113073 |
33677053_2 | Now I get it ! | 0.0 | 0.0 | 0.042859 |
10703 rows × 4 columns
Step 3: Interpretation of results¶
Interpretation of coefficients in the linear model¶
We can inspect the coefficients of the logistic regression: large positive coefficients push a sentence towards the hate class, large negative ones towards noHate.
# Extract the coefficients from the model
coefs = pd.DataFrame([clf_best["vectorizer"].get_feature_names_out(),
clf_best["clf"].coef_[0]]).T
coefs.columns = ["gram","coef"]
# top words influencing hate
display(coefs.sort_values(by="coef", ascending=False).head(10))
# top words influencing non-hate
display(coefs.sort_values(by="coef", ascending=True).head(10))
  | gram | coef
---|---|---
31434 | negroes | 8.127696 |
4218 | black | 7.271432 |
26064 | liberals | 6.233119 |
18671 | groid | 6.17585 |
15506 | filth | 6.049205 |
40573 | scum | 6.044036 |
1723 | ape | 5.952632 |
1732 | apes | 4.901957 |
686 | africa | 4.818615 |
30570 | mud | 4.783214 |
  | gram | coef
---|---|---
53076 | youtube | -3.433101 |
39652 | said | -2.727386 |
52725 | year | -2.445913 |
46099 | thanks | -2.220604 |
46670 | thread | -2.20559 |
30729 | music | -2.144751 |
27704 | lot | -2.081298 |
8422 | comes | -2.037127 |
19172 | hair | -2.014378 |
30983 | nationalist | -1.956421 |
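To eyeball these coefficients, a quick bar chart helps (a sketch, reusing the coefs dataframe from above; coef must be cast to float because the transposed construction leaves it as object dtype):

# Sketch: the ten strongest hate and non-hate grams by coefficient.
import matplotlib.pyplot as plt

coefs["coef"] = coefs["coef"].astype(float)
extremes = pd.concat([coefs.nlargest(10, "coef"), coefs.nsmallest(10, "coef")])
plt.figure(figsize=(6, 6))
plt.barh(extremes["gram"], extremes["coef"])
plt.xlabel("logistic regression coefficient (positive = hate)")
plt.tight_layout()
plt.show()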
Interpretation of predictions using LIME (Local Interpretable Model-agnostic Explanations)¶
LIME perturbs the input text (e.g. by removing words) and fits a simple local surrogate model to estimate how much each word contributes to the prediction.
# Find some extreme examples
df_confused = df.loc[df["label"] != df["predicted"]]
# the non-hateful sentence with the highest predicted probability of hate
pred_hate_not_hate = df_confused.loc[df_confused["label"]==0].sort_values(by="predicted_prob_hate").tail(1).values[0][0]
# the hateful sentence with the lowest predicted probability of hate
pred_not_hate_hate = df_confused.loc[df_confused["label"]==1].sort_values(by="predicted_prob_hate").head(1).values[0][0]
less_hate = df.sort_values(by="predicted_prob_hate").head(1).values[0][0]
most_hate = df.sort_values(by="predicted_prob_hate").tail(1).values[0][0]
pred_50_50 = "She says the class is out of control and the kids are unteachable , and the black administration does not support her "
print("Least hate: ", less_hate)
print("Most hate: ", most_hate)
print("Predicted very hate but not hateful: ", pred_hate_not_hate)
print("Predicted very innocuous but hateful: ", pred_not_hate_hate)
print("Predicted 50/50: ", pred_50_50)
Least hate: - YouTube Most hate: Look what happens when Whites leave black Countries alone to do what they do naturally The blacks in White Countries today should be on their knees thanking Whites for trying to civilize them Predicted very hateful but not hateful: Too many whites think they deserve what negroes dish out because of guilt . Predicted very innocuous but hateful: https://www.stormfront.org/forum/t1020784/ https : //www.stormfront.org/forum/t102 ... ghlight = sweden https : //www.stormfront.org/forum/t102 ... ghlight = sweden https : //www.stormfront.org/forum/t101 ... ghlight = sweden https : //www.stormfront.org/forum/t101 ... ghlight = sweden https : //www.stormfront.org/forum/t100 ... ghlight = sweden https : //www.stormfront.org/forum/t100 ... ghlight = sweden https : //www.stormfront.org/forum/t100 ... ghlight = sweden God save them ..... Predicted 50/50: She says the class is out of control and the kids are unteachable , and the black administration does not support her
# start the explainer
explainer = LimeTextExplainer(class_names = ["Innocuous", "Hateful"], bow=False)
# shows the explanation for our example instances
for text in [less_hate, most_hate, pred_hate_not_hate, pred_not_hate_hate, pred_50_50]:
exp = explainer.explain_instance(text,
clf_best.predict_proba,
num_features = 10,
num_samples = 1000)
exp.show_in_notebook(text=text)
print(exp.as_list())
print("-"*100)
[('YouTube', -0.008309710769313722)] ----------------------------------------------------------------------------------------------------
[('black', 0.14189904187020813), ('leave', 0.10066344815500812), ('Whites', 0.08548512608496868), ('blacks', 0.07472912947281263), ('Whites', 0.07311118237588993), ('Countries', 0.05216177835789108), ('Countries', 0.04932279484359763), ('today', -0.04590198911408145), ('knees', -0.045451196415093736), ('Look', -0.020692759209158997)] ----------------------------------------------------------------------------------------------------
[('negroes', 0.48984841287916525), ('whites', 0.17413391161180755), ('guilt', 0.047217196521942804), ('think', -0.020514170968904285), ('because', 0.009553107234928164), ('Too', 0.006875133294656022), ('many', 0.005661637921179765), ('deserve', -0.004755165144056283), ('dish', 0.004453272388944947), ('they', -0.0035315165694286676)] ----------------------------------------------------------------------------------------------------
[('www', -0.003850736844586429), ('www', -0.0025338757391648054), ('sweden', 0.001115389489276485), ('sweden', 0.0010925711903717315), ('sweden', 0.000996213830939601), ('sweden', 0.0009284376917061511), ('sweden', 0.0008663975352523936), ('God', 0.000644032004940194), ('sweden', 0.0005570123378235674), ('sweden', 0.0005318687978205907)] ----------------------------------------------------------------------------------------------------
[('black', 0.4147987450170454), ('control', 0.1764216962161887), ('administration', -0.13880103394457297), ('class', -0.08920177787894927), ('kids', -0.06011134491380739), ('does', -0.04766381688982231), ('says', 0.015229639747932996), ('support', 0.00803117611427855), ('not', 0.005193775609816279), ('and', 0.004636384478571871)] ----------------------------------------------------------------------------------------------------
exp = explainer.explain_instance("I believe Dutch people have inferior food and they should be colonized by Belgium",
clf_best.predict_proba,
num_features = 10,
num_samples = 1000)
exp.show_in_notebook(text=text)
print(exp.as_list())
print("-"*100)
[('people', -0.009147356379924244), ('food', 0.008588941565502791), ('inferior', 0.006945309702776044), ('Belgium', -0.0036581371043646867), ('believe', -0.0013060459325008474), ('and', 0.00029652599826202346), ('should', -0.00016267076063996192), ('they', -0.00012030062852876019), ('colonized', 0.00011655850035562251), ('by', -9.253762015795914e-05)] ----------------------------------------------------------------------------------------------------
Now it's your turn.¶
Either:
Adapt RQ1 using different models (e.g. a CNN, as shown below) or data (either the ones described under RQ1, or any other)
Or start on a different RQ
!pip install scikeras
!pip install Keras-Preprocessing
from scikeras.wrappers import KerasClassifier
#from keras.wrappers.scikit_learn import KerasClassifier
from keras_preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer
from keras.models import Sequential
from keras import layers, utils
import matplotlib.pyplot as plt

def plot_history(history, val=False):
    # plot training (and optionally validation) accuracy and loss per epoch
    acc = history['accuracy']
    loss = history['loss']
    if val:
        val_acc = history['val_accuracy'] # available when a validation set is passed to fit
        val_loss = history['val_loss']
    x = range(1, len(acc) + 1)
    plt.figure(figsize=(12, 5))
    plt.subplot(1, 2, 1)
    plt.plot(x, acc, 'b', label='Training accuracy')
    if val:
        plt.plot(x, val_acc, 'r', label='Validation accuracy')
    plt.ylabel('accuracy')
    plt.xlabel('epoch')
    plt.title('Accuracy')
    plt.legend()
    plt.subplot(1, 2, 2)
    plt.plot(x, loss, 'b', label='Training loss')
    if val:
        plt.plot(x, val_loss, 'r', label='Validation loss')
    plt.ylabel('loss')
    plt.xlabel('epoch')
    plt.title('Loss')
    plt.legend()
## CREATE MODEL
def create_model(num_filters=64, kernel_size=3, embedding_dim=50, maxlen=100, num_classes=2):
    model = Sequential()
    # vocab_size is a global, set below after fitting the tokenizer
    model.add(layers.Embedding(vocab_size, embedding_dim, input_length=maxlen))
    model.add(layers.Conv1D(num_filters, kernel_size, activation='relu'))
    model.add(layers.GlobalMaxPooling1D())
    model.add(layers.Dense(10, activation='relu'))
    # two sigmoid outputs over one-hot labels; softmax + categorical_crossentropy
    # would be the more conventional pairing for two mutually exclusive classes
    model.add(layers.Dense(num_classes, activation='sigmoid'))
    model.compile(optimizer='adam',
                  loss='binary_crossentropy',
                  metrics=['accuracy'])
    return model
## CLASS FOR PREPROCESSING (needed to work with pipelines)
class preprocessing():
def __init__(self, num_words=20000, maxlen=100):
self.maxlen = maxlen
self.tokenizer = Tokenizer(num_words=num_words)
def fit(self, X, y=None):
self.tokenizer.fit_on_texts(X)
return self
def transform(self, X, y=None):
X_ = self.tokenizer.texts_to_sequences(X)
return pad_sequences(X_, padding='post', maxlen=self.maxlen)
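A quick sanity check of what the transformer produces (a sketch; the inputs here are toy sentences, not the practical's data):

# Sketch: raw strings become fixed-length integer sequences, zero-padded
# at the end ('post'), ready for the Embedding layer.
prep = preprocessing(num_words=100, maxlen=8)
prep.fit(["the cat sat", "the dog sat on the mat"])
print(prep.transform(["the cat on the mat"]))  # a (1, 8) integer array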
## PROCESS DATA
X_train, X_test, y_train, y_test = train_test_split(df["text"].values, df["label"].values, test_size=0.33, random_state=42)
# One-hot encode the labels (two columns: noHate, hate)
y_train = utils.to_categorical(y_train)
y_test = utils.to_categorical(y_test)
## CREATE PIPELINES
# one pipeline for the tokenizer/padding step, one for the wrapped Keras model
pipe_preproc = Pipeline([
    ("preproc", preprocessing())])
pipe_est = Pipeline([
    ('clf', KerasClassifier(model=create_model,
                            epochs = 10,
                            batch_size=64,
                            verbose=True,
                            num_filters=32 )) # CNN classifier wrapped for scikit-learn
])
pipe_preproc.fit(X_train)
X_train_p = pipe_preproc.transform(X_train)
X_test_p = pipe_preproc.transform(X_test)
# +1 because index 0 is reserved for padding
vocab_size = len(pipe_preproc["preproc"].tokenizer.word_index) + 1
print(vocab_size)
# test it works
pipe_est.fit(X_train_p[:500], y_train[:500])
12771
Epoch 1/10 8/8 [==============================] - 1s 13ms/step - loss: 0.6559 - accuracy: 0.9020 Epoch 2/10 8/8 [==============================] - 0s 9ms/step - loss: 0.6074 - accuracy: 0.9020 Epoch 3/10 8/8 [==============================] - 0s 10ms/step - loss: 0.5551 - accuracy: 0.9020 Epoch 4/10 8/8 [==============================] - 0s 10ms/step - loss: 0.4977 - accuracy: 0.9020 Epoch 5/10 8/8 [==============================] - 0s 10ms/step - loss: 0.4343 - accuracy: 0.9020 Epoch 6/10 8/8 [==============================] - 0s 10ms/step - loss: 0.3756 - accuracy: 0.9020 Epoch 7/10 8/8 [==============================] - 0s 9ms/step - loss: 0.3311 - accuracy: 0.9020 Epoch 8/10 8/8 [==============================] - 0s 8ms/step - loss: 0.3062 - accuracy: 0.9020 Epoch 9/10 8/8 [==============================] - 0s 8ms/step - loss: 0.2940 - accuracy: 0.9020 Epoch 10/10 8/8 [==============================] - 0s 8ms/step - loss: 0.2849 - accuracy: 0.9020
Pipeline(steps=[('clf', KerasClassifier(batch_size=64, epochs=10, model=<function create_model at 0x000001E75E270180>, num_filters=32, verbose=True))])
pipe_est["clf"].model_.summary()
Model: "sequential" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= embedding (Embedding) (None, 100, 50) 638550 conv1d (Conv1D) (None, 98, 32) 4832 global_max_pooling1d (Glob (None, 32) 0 alMaxPooling1D) dense (Dense) (None, 10) 330 dense_1 (Dense) (None, 2) 22 ================================================================= Total params: 643734 (2.46 MB) Trainable params: 643734 (2.46 MB) Non-trainable params: 0 (0.00 Byte) _________________________________________________________________
## HYPERPARAMETER TUNING
param_grid = dict(clf__model__num_filters=[32, 64, 128],
clf__model__kernel_size=[3, 5, 7],
clf__model__embedding_dim=[50, 100],
clf__verbose=[False])
grid = RandomizedSearchCV(estimator=pipe_est,
param_distributions=param_grid,
cv=5,
n_jobs=-1,
verbose=True,
n_iter=10)
grid.fit(X_train_p[:1000], y_train[:1000])
Fitting 5 folds for each of 10 candidates, totalling 50 fits
[Repeated "INFO:tensorflow:Assets written to: …" messages and joblib worker warnings omitted.]
RandomizedSearchCV(cv=5, estimator=Pipeline(steps=[('clf', KerasClassifier(batch_size=64, epochs=10, model=<function create_model at 0x000001E75E270180>, num_filters=32, verbose=True))]), n_jobs=-1, param_distributions={'clf__model__embedding_dim': [50, 100], 'clf__model__kernel_size': [3, 5, 7], 'clf__model__num_filters': [32, 64, 128], 'clf__verbose': [False]}, verbose=True)
print(grid.best_score_)
print(grid.best_params_)
0.893 {'clf__verbose': False, 'clf__model__num_filters': 128, 'clf__model__kernel_size': 7, 'clf__model__embedding_dim': 50}
# Refit on the entire training set (passing the test set as validation data to monitor the epochs)
clf_best = pipe_est.fit(X_train_p, y_train,
                        clf__validation_data=(X_test_p, y_test))
Epoch 1/10 113/113 [==============================] - 2s 9ms/step - loss: 0.4504 - accuracy: 0.8760 - val_loss: 0.3393 - val_accuracy: 0.8933 Epoch 2/10 113/113 [==============================] - 1s 7ms/step - loss: 0.3433 - accuracy: 0.8858 - val_loss: 0.3146 - val_accuracy: 0.8933 Epoch 3/10 113/113 [==============================] - 1s 7ms/step - loss: 0.2727 - accuracy: 0.8859 - val_loss: 0.2571 - val_accuracy: 0.8944 Epoch 4/10 113/113 [==============================] - 1s 8ms/step - loss: 0.1662 - accuracy: 0.9278 - val_loss: 0.2506 - val_accuracy: 0.9074 Epoch 5/10 113/113 [==============================] - 1s 8ms/step - loss: 0.0830 - accuracy: 0.9736 - val_loss: 0.2744 - val_accuracy: 0.9060 Epoch 6/10 113/113 [==============================] - 1s 8ms/step - loss: 0.0384 - accuracy: 0.9933 - val_loss: 0.3088 - val_accuracy: 0.9040 Epoch 7/10 113/113 [==============================] - 1s 12ms/step - loss: 0.0191 - accuracy: 0.9968 - val_loss: 0.3622 - val_accuracy: 0.9043 Epoch 8/10 113/113 [==============================] - 1s 11ms/step - loss: 0.0101 - accuracy: 0.9986 - val_loss: 0.3978 - val_accuracy: 0.8995 Epoch 9/10 113/113 [==============================] - 1s 8ms/step - loss: 0.0052 - accuracy: 0.9993 - val_loss: 0.4414 - val_accuracy: 0.8989 Epoch 10/10 113/113 [==============================] - 1s 8ms/step - loss: 0.0027 - accuracy: 1.0000 - val_loss: 0.4534 - val_accuracy: 0.8972
plot_history(clf_best["clf"].history_, val=True)
# Refit using the best number of epochs (validation loss is lowest around epoch 4)
clf_best = pipe_est.fit(X_train_p, y_train,
                        clf__validation_data=(X_test_p, y_test),
                        clf__epochs=4)
Epoch 1/4 113/113 [==============================] - 2s 9ms/step - loss: 0.4133 - accuracy: 0.8858 - val_loss: 0.3394 - val_accuracy: 0.8933 Epoch 2/4 113/113 [==============================] - 1s 8ms/step - loss: 0.3370 - accuracy: 0.8858 - val_loss: 0.2967 - val_accuracy: 0.8933 Epoch 3/4 113/113 [==============================] - 1s 8ms/step - loss: 0.2594 - accuracy: 0.8858 - val_loss: 0.2510 - val_accuracy: 0.8933 Epoch 4/4 113/113 [==============================] - 1s 9ms/step - loss: 0.1727 - accuracy: 0.8858 - val_loss: 0.2421 - val_accuracy: 0.8933
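Rather than reading the best epoch off the curves and refitting, you could let Keras stop training automatically. A sketch, assuming scikeras routes the callbacks constructor parameter as usual (EarlyStopping then restores the weights from the best epoch; not run here):

# Sketch: early stopping instead of a hand-picked epoch count.
from keras.callbacks import EarlyStopping

early = EarlyStopping(monitor="val_loss", patience=2, restore_best_weights=True)
# pipe_est = pipe_est.set_params(clf__callbacks=[early], clf__epochs=20)
# clf_best = pipe_est.fit(X_train_p, y_train, clf__validation_data=(X_test_p, y_test))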
# Find some extreme examples
less_hate = "- YouTube"
most_hate = "Look what happens when Whites leave black Countries alone to do what they do naturally The blacks in White Countries today should be on their knees thanking Whites for trying to civilize them"
pred_hate_not_hate = "Too many whites think they deserve what negroes dish out because of guilt ."
pred_not_hate_hate = "https://www.stormfront.org/forum/t1020784/ https : //www.stormfront.org/forum/t102 ... ghlight = sweden https : //www.stormfront.org/forum/t102 ... ghlight = sweden https : //www.stormfront.org/forum/t101 ... ghlight = sweden https : //www.stormfront.org/forum/t101 ... ghlight = sweden https : //www.stormfront.org/forum/t100 ... ghlight = sweden https : //www.stormfront.org/forum/t100 ... ghlight = sweden https : //www.stormfront.org/forum/t100 ... ghlight = sweden God save them ....."
pred_50_50 = "She says the class is out of control and the kids are unteachable , and the black administration does not support her "
print("Least hate: ", less_hate)
print("Most hate: ", most_hate)
print("Predicted very hate but not hate: ", pred_hate_not_hate)
print("Predicted non hate but hate: ", pred_not_hate_hate)
print("Predicted 50/50: ", pred_50_50)
Least hate: - YouTube Most hate: Look what happens when Whites leave black Countries alone to do what they do naturally The blacks in White Countries today should be on their knees thanking Whites for trying to civilize them Predicted very hateful but not hateful: Too many whites think they deserve what negroes dish out because of guilt . Predicted very innocuous but hateful: https://www.stormfront.org/forum/t1020784/ https : //www.stormfront.org/forum/t102 ... ghlight = sweden https : //www.stormfront.org/forum/t102 ... ghlight = sweden https : //www.stormfront.org/forum/t101 ... ghlight = sweden https : //www.stormfront.org/forum/t101 ... ghlight = sweden https : //www.stormfront.org/forum/t100 ... ghlight = sweden https : //www.stormfront.org/forum/t100 ... ghlight = sweden https : //www.stormfront.org/forum/t100 ... ghlight = sweden God save them ..... Predicted 50/50: She says the class is out of control and the kids are unteachable , and the black administration does not support her
# start the explainer
explainer = LimeTextExplainer(class_names = ["Innocuous", "Hateful"], bow=False)

# LIME needs a function mapping a list of raw strings to an (n_samples, n_classes)
# array of probabilities, so we chain the preprocessing step with predict_proba.
# (Relying on the global pipe_preproc and clf_best objects is not too nice.)
def create_proba(text):
    t = pipe_preproc.transform(text)
    return clf_best.predict_proba(t)
# shows the explanation for our example instances
for text in [less_hate, most_hate, pred_hate_not_hate, pred_not_hate_hate, pred_50_50]:
exp = explainer.explain_instance(text,
create_proba,
num_features = 10,
num_samples = 1000)
exp.show_in_notebook(text=text)
print(exp.as_list())
print("-"*100)
[('YouTube', -0.005017741352829543)] ----------------------------------------------------------------------------------------------------
[('they', 0.07421747268064614), ('black', 0.06455750482601273), ('blacks', 0.057711908213506906), ('Whites', 0.028472882230942312), ('Whites', 0.028238350550043145), ('civilize', 0.026510297354229932), ('them', 0.02562221510983088), ('should', 0.02515117415376788), ('in', -0.02385456011812363), ('Countries', 0.018718012527703776)] ----------------------------------------------------------------------------------------------------
[('negroes', 0.1721776671139979), ('they', 0.07944878528195191), ('whites', 0.04622496345322855), ('deserve', 0.03123699828865405), ('guilt', 0.027752245031371074), ('Too', 0.026850542137581025), ('dish', 0.017901778094841763), ('what', 0.016136463673221767), ('think', -0.015373063810062496), ('many', 0.007079561282276488)] ----------------------------------------------------------------------------------------------------
[('them', 0.004379349408166603), ('God', 0.0017941072566711855), ('save', 0.0016293971091619662), ('sweden', 0.0016217969457270782), ('https', -0.0005610560979287297), ('stormfront', -0.0005201413165225269), ('org', 0.0005063619644461128), ('sweden', 0.0004786615557607865), ('stormfront', -0.0004680285469822718), ('sweden', 0.0004424819341155667)] ----------------------------------------------------------------------------------------------------
[('black', 0.17831546907909726), ('does', 0.06859904623619073), ('the', 0.03445653565611717), ('not', -0.025820972556917007), ('support', 0.020796583278954506), ('class', -0.01839936394386279), ('her', 0.015399122251065383), ('administration', -0.014223240930875682), ('kids', -0.012162133141891266), ('the', 0.01147971558017856)] ----------------------------------------------------------------------------------------------------