.. DO NOT EDIT. .. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY. .. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE: .. "auto_examples/technical_details/plot_cache_mechanism.py" .. LINE NUMBERS ARE GIVEN BELOW. .. only:: html .. note:: :class: sphx-glr-download-link-note :ref:`Go to the end ` to download the full example code. .. rst-class:: sphx-glr-example-title .. _sphx_glr_auto_examples_technical_details_plot_cache_mechanism.py: .. _example_cache_mechanism: =============== Cache mechanism =============== This example shows how :class:`~skore.EstimatorReport` and :class:`~skore.CrossValidationReport` use caching to speed up computations. .. GENERATED FROM PYTHON SOURCE LINES 13-19 Loading some data ================= First, we load a dataset from `skrub`. Our goal is to predict whether a company paid a physician; the broader aim is to detect potential conflicts of interest. .. GENERATED FROM PYTHON SOURCE LINES 19-25 .. code-block:: Python from skrub.datasets import fetch_open_payments dataset = fetch_open_payments() df = dataset.X y = dataset.y .. GENERATED FROM PYTHON SOURCE LINES 26-30 .. code-block:: Python from skrub import TableReport TableReport(df) .. raw:: html




.. GENERATED FROM PYTHON SOURCE LINES 31-35 .. code-block:: Python import pandas as pd TableReport(pd.DataFrame(y)) .. raw:: html




.. GENERATED FROM PYTHON SOURCE LINES 36-38 The dataset has over 70,000 records with only categorical features. Some categories are not well defined. .. GENERATED FROM PYTHON SOURCE LINES 41-46 Caching with :class:`~skore.EstimatorReport` and :class:`~skore.CrossValidationReport` ====================================================================================== We use `skrub` to create a simple predictive model that handles our dataset's challenges. .. GENERATED FROM PYTHON SOURCE LINES 46-52 .. code-block:: Python from skrub import tabular_pipeline model = tabular_pipeline("classifier") model .. raw:: html
Pipeline(steps=[('tablevectorizer',
                     TableVectorizer(low_cardinality=ToCategorical())),
                    ('histgradientboostingclassifier',
                     HistGradientBoostingClassifier())])


.. GENERATED FROM PYTHON SOURCE LINES 53-55 This model handles all types of data: numbers, categories, dates, and missing values. Let's train it on part of our dataset. .. GENERATED FROM PYTHON SOURCE LINES 56-64 .. code-block:: Python from skore import train_test_split X_train, X_test, y_train, y_test = train_test_split(df, y, random_state=42) # Let's keep a completely separate dataset X_train, X_external, y_train, y_external = train_test_split( X_train, y_train, random_state=42 ) .. rst-class:: sphx-glr-script-out .. code-block:: none ╭───────────────────────────── HighClassImbalanceWarning ──────────────────────────────╮ │ It seems that you have a classification problem with a high class imbalance. In this │ │ case, using train_test_split may not be a good idea because of high variability in │ │ the scores obtained on the test set. To tackle this challenge we suggest to use │ │ skore's CrossValidationReport with the `splitter` parameter of your choice. │ ╰──────────────────────────────────────────────────────────────────────────────────────╯ ╭───────────────────────────────── ShuffleTrueWarning ─────────────────────────────────╮ │ We detected that the `shuffle` parameter is set to `True` either explicitly or from │ │ its default value. In case of time-ordered events (even if they are independent), │ │ this will result in inflated model performance evaluation because natural drift will │ │ not be taken into account. We recommend setting the shuffle parameter to `False` in │ │ order to ensure the evaluation process is really representative of your production │ │ release process. │ ╰──────────────────────────────────────────────────────────────────────────────────────╯ ╭───────────────────────────── HighClassImbalanceWarning ──────────────────────────────╮ │ It seems that you have a classification problem with a high class imbalance. 
In this │ │ case, using train_test_split may not be a good idea because of high variability in │ │ the scores obtained on the test set. To tackle this challenge we suggest to use │ │ skore's CrossValidationReport with the `splitter` parameter of your choice. │ ╰──────────────────────────────────────────────────────────────────────────────────────╯ ╭───────────────────────────────── ShuffleTrueWarning ─────────────────────────────────╮ │ We detected that the `shuffle` parameter is set to `True` either explicitly or from │ │ its default value. In case of time-ordered events (even if they are independent), │ │ this will result in inflated model performance evaluation because natural drift will │ │ not be taken into account. We recommend setting the shuffle parameter to `False` in │ │ order to ensure the evaluation process is really representative of your production │ │ release process. │ ╰──────────────────────────────────────────────────────────────────────────────────────╯ .. GENERATED FROM PYTHON SOURCE LINES 65-73 Caching the predictions for fast metric computation ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ First, we focus on :class:`~skore.EstimatorReport`, as the same philosophy will apply to :class:`~skore.CrossValidationReport`. Let's explore how :class:`~skore.EstimatorReport` uses caching to speed up predictions. We start by training the model: .. GENERATED FROM PYTHON SOURCE LINES 73-85 .. code-block:: Python from skore import EstimatorReport report = EstimatorReport( model, X_train=X_train, y_train=y_train, X_test=X_test, y_test=y_test, pos_label="allowed", ) report.help() .. raw:: html


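The speedups measured below all come from one idea: memoize the predictions, keyed by the prediction method and the data source, so that every metric needing the same predictions reuses them. Here is a minimal pure-Python sketch of that idea (the ``SlowModel`` and ``CachedReport`` classes are hypothetical illustrations, not skore's actual implementation):

```python
class SlowModel:
    """Stand-in estimator whose predict() is expensive."""

    def __init__(self):
        self.n_predict_calls = 0

    def predict(self, X):
        self.n_predict_calls += 1
        return [x % 2 for x in X]  # dummy predictions


class CachedReport:
    """Memoize predictions keyed by (method, data source)."""

    def __init__(self, model, data):
        self.model = model
        self.data = data  # e.g. {"train": [...], "test": [...]}
        self._cache = {}

    def get_predictions(self, data_source="test", method="predict"):
        key = (method, data_source)
        if key not in self._cache:  # compute once, then reuse
            self._cache[key] = getattr(self.model, method)(self.data[data_source])
        return self._cache[key]


report = CachedReport(SlowModel(), {"train": [1, 2, 3], "test": [4, 5, 6]})
report.get_predictions("test")
report.get_predictions("test")  # served from the cache
print(report.model.n_predict_calls)  # prints 1: predict ran only once
```

Any metric computed from the same predictions then pays only the cost of the metric itself, which is the behaviour we time below with the real report.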
.. GENERATED FROM PYTHON SOURCE LINES 86-87 We compute the accuracy on our test set and measure how long it takes: .. GENERATED FROM PYTHON SOURCE LINES 88-95 .. code-block:: Python import time start = time.time() result = report.metrics.accuracy() end = time.time() result .. rst-class:: sphx-glr-script-out .. code-block:: none 0.9509516041326808 .. GENERATED FROM PYTHON SOURCE LINES 96-98 .. code-block:: Python print(f"Time taken: {end - start:.2f} seconds") .. rst-class:: sphx-glr-script-out .. code-block:: none Time taken: 1.56 seconds .. GENERATED FROM PYTHON SOURCE LINES 99-100 For comparison, here's how scikit-learn computes the same accuracy score: .. GENERATED FROM PYTHON SOURCE LINES 101-108 .. code-block:: Python from sklearn.metrics import accuracy_score start = time.time() result = accuracy_score(report.y_test, report.estimator_.predict(report.X_test)) end = time.time() result .. rst-class:: sphx-glr-script-out .. code-block:: none 0.9509516041326808 .. GENERATED FROM PYTHON SOURCE LINES 109-111 .. code-block:: Python print(f"Time taken: {end - start:.2f} seconds") .. rst-class:: sphx-glr-script-out .. code-block:: none Time taken: 1.59 seconds .. GENERATED FROM PYTHON SOURCE LINES 112-116 Both approaches take similar time. Now, watch what happens when we compute the accuracy again with our skore estimator report: .. GENERATED FROM PYTHON SOURCE LINES 117-122 .. code-block:: Python start = time.time() result = report.metrics.accuracy() end = time.time() result .. rst-class:: sphx-glr-script-out .. code-block:: none 0.9509516041326808 .. GENERATED FROM PYTHON SOURCE LINES 123-125 .. code-block:: Python print(f"Time taken: {end - start:.2f} seconds") .. rst-class:: sphx-glr-script-out .. code-block:: none Time taken: 0.00 seconds .. GENERATED FROM PYTHON SOURCE LINES 126-128 The second calculation is instant! This happens because the report saves previous calculations in its cache. Let's look inside the cache: .. 
GENERATED FROM PYTHON SOURCE LINES 129-131 .. code-block:: Python report._cache .. rst-class:: sphx-glr-script-out .. code-block:: none {(6161188341340137975, None, 'predict', 'test'): array(['allowed', 'disallowed', 'disallowed', ..., 'disallowed', 'disallowed', 'disallowed'], shape=(18390,), dtype=object), (6161188341340137975, 'test', 'predict_time'): 1.5427202010000087, (6161188341340137975, 'accuracy_score', 'test', None, ('mapping', ())): 0.9509516041326808} .. GENERATED FROM PYTHON SOURCE LINES 132-135 The cache stores predictions by type and data source. This means that computing metrics that use the same type of predictions will be faster. Let's try the precision metric: .. GENERATED FROM PYTHON SOURCE LINES 135-140 .. code-block:: Python start = time.time() result = report.metrics.precision() end = time.time() result .. rst-class:: sphx-glr-script-out .. code-block:: none 0.6574519230769231 .. GENERATED FROM PYTHON SOURCE LINES 141-143 .. code-block:: Python print(f"Time taken: {end - start:.2f} seconds") .. rst-class:: sphx-glr-script-out .. code-block:: none Time taken: 0.06 seconds .. GENERATED FROM PYTHON SOURCE LINES 144-149 Computing the precision takes only a few milliseconds because the predictions are already cached; only the metric itself needs to be computed. Since computing the predictions is the bottleneck, caching them yields a substantial speedup. .. GENERATED FROM PYTHON SOURCE LINES 151-155 Caching all the possible predictions at once ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ We can pre-compute all predictions at once using parallel processing: .. GENERATED FROM PYTHON SOURCE LINES 155-157 .. code-block:: Python report.cache_predictions(n_jobs=4) .. GENERATED FROM PYTHON SOURCE LINES 158-160 Now, all possible predictions are stored. Any metric calculation will be much faster, even on different data (like the training set): .. GENERATED FROM PYTHON SOURCE LINES 161-166 .. 
code-block:: Python start = time.time() result = report.metrics.log_loss(data_source="train") end = time.time() result .. rst-class:: sphx-glr-script-out .. code-block:: none 4.089944167790945 .. GENERATED FROM PYTHON SOURCE LINES 167-169 .. code-block:: Python print(f"Time taken: {end - start:.2f} seconds") .. rst-class:: sphx-glr-script-out .. code-block:: none Time taken: 0.09 seconds .. GENERATED FROM PYTHON SOURCE LINES 170-174 Caching for plotting ^^^^^^^^^^^^^^^^^^^^ The cache also speeds up plots. Let's create a ROC curve: .. GENERATED FROM PYTHON SOURCE LINES 174-180 .. code-block:: Python start = time.time() display = report.metrics.roc() display.plot() end = time.time() .. image-sg:: /auto_examples/technical_details/images/sphx_glr_plot_cache_mechanism_001.png :alt: ROC Curve for HistGradientBoostingClassifier Positive label: allowed Data source: Test set :srcset: /auto_examples/technical_details/images/sphx_glr_plot_cache_mechanism_001.png :class: sphx-glr-single-img .. GENERATED FROM PYTHON SOURCE LINES 181-183 .. code-block:: Python print(f"Time taken: {end - start:.2f} seconds") .. rst-class:: sphx-glr-script-out .. code-block:: none Time taken: 0.14 seconds .. GENERATED FROM PYTHON SOURCE LINES 184-185 The second plot is instant because it uses cached data: .. GENERATED FROM PYTHON SOURCE LINES 186-191 .. code-block:: Python start = time.time() display = report.metrics.roc() display.plot() end = time.time() .. image-sg:: /auto_examples/technical_details/images/sphx_glr_plot_cache_mechanism_002.png :alt: ROC Curve for HistGradientBoostingClassifier Positive label: allowed Data source: Test set :srcset: /auto_examples/technical_details/images/sphx_glr_plot_cache_mechanism_002.png :class: sphx-glr-single-img .. GENERATED FROM PYTHON SOURCE LINES 192-194 .. code-block:: Python print(f"Time taken: {end - start:.2f} seconds") .. rst-class:: sphx-glr-script-out .. code-block:: none Time taken: 0.13 seconds .. 
GENERATED FROM PYTHON SOURCE LINES 195-197 The cache stores the `display` object itself, not the rendered matplotlib figure, so we can still customize the cached plot before displaying it: .. GENERATED FROM PYTHON SOURCE LINES 198-201 .. code-block:: Python display.set_style(relplot_kwargs={"color": "tab:orange"}) display.plot() .. image-sg:: /auto_examples/technical_details/images/sphx_glr_plot_cache_mechanism_003.png :alt: ROC Curve for HistGradientBoostingClassifier Positive label: allowed Data source: Test set :srcset: /auto_examples/technical_details/images/sphx_glr_plot_cache_mechanism_003.png :class: sphx-glr-single-img .. GENERATED FROM PYTHON SOURCE LINES 202-203 Note that we can clear the cache whenever we want: .. GENERATED FROM PYTHON SOURCE LINES 204-207 .. code-block:: Python report.clear_cache() report._cache .. rst-class:: sphx-glr-script-out .. code-block:: none {} .. GENERATED FROM PYTHON SOURCE LINES 208-215 Nothing is stored in the cache anymore. Caching with :class:`~skore.CrossValidationReport` ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ :class:`~skore.CrossValidationReport` uses the same caching system for each cross-validation split, by relying on the :class:`~skore.EstimatorReport` machinery described above: .. GENERATED FROM PYTHON SOURCE LINES 216-221 .. code-block:: Python from skore import CrossValidationReport report = CrossValidationReport(model, X=df, y=y, splitter=5, n_jobs=4) report.help() .. raw:: html


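A cross-validation report can get the same benefit by holding one cached sub-report per split. A minimal sketch of that composition (the ``SplitReport`` and ``CVReport`` classes are hypothetical illustrations, not skore's actual implementation):

```python
class SplitReport:
    """Per-split report with its own memoized metric values."""

    def __init__(self, fold_id):
        self.fold_id = fold_id
        self._cache = {}
        self.n_computations = 0

    def accuracy(self):
        if "accuracy" not in self._cache:  # compute once per split
            self.n_computations += 1
            self._cache["accuracy"] = 0.9 + 0.01 * self.fold_id  # placeholder score
        return self._cache["accuracy"]


class CVReport:
    """Aggregate over one cached sub-report per split."""

    def __init__(self, n_splits=5):
        self.split_reports = [SplitReport(i) for i in range(n_splits)]

    def accuracy(self):
        scores = [r.accuracy() for r in self.split_reports]
        return sum(scores) / len(scores)


cv = CVReport(n_splits=5)
cv.accuracy()  # first call computes the metric on every split
cv.accuracy()  # second call hits each split's cache
print(sum(r.n_computations for r in cv.split_reports))  # prints 5: one per split
```

The first aggregated call pays the full per-split cost; every later call only reads the per-split caches, which is exactly the timing pattern shown below.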
.. GENERATED FROM PYTHON SOURCE LINES 222-226 Since a :class:`~skore.CrossValidationReport` uses many :class:`~skore.EstimatorReport` instances, we observe the same behaviour as described above. The first call is slow because it computes the predictions for each split. .. GENERATED FROM PYTHON SOURCE LINES 227-232 .. code-block:: Python start = time.time() result = report.metrics.summarize().frame() end = time.time() result .. raw:: html
                                   HistGradientBoostingClassifier
                                                  mean       std
Metric            Label / Average
Accuracy          None                        0.917969  0.036709
Precision         allowed                     0.437609  0.123632
                  disallowed                  0.960294  0.005575
Recall            allowed                     0.427645  0.100147
                  disallowed                  0.951808  0.043824
ROC AUC           None                        0.872908  0.033272
Brier score       None                        0.063584  0.033333
Fit time (s)      None                       24.027426  7.811560
Predict time (s)  None                        4.091575  1.631683


.. GENERATED FROM PYTHON SOURCE LINES 233-235 .. code-block:: Python print(f"Time taken: {end - start:.2f} seconds") .. rst-class:: sphx-glr-script-out .. code-block:: none Time taken: 12.60 seconds .. GENERATED FROM PYTHON SOURCE LINES 236-237 But the subsequent calls are fast because the predictions are cached. .. GENERATED FROM PYTHON SOURCE LINES 238-243 .. code-block:: Python start = time.time() result = report.metrics.summarize().frame() end = time.time() result .. raw:: html
                                   HistGradientBoostingClassifier
                                                  mean       std
Metric            Label / Average
Accuracy          None                        0.917969  0.036709
Precision         allowed                     0.437609  0.123632
                  disallowed                  0.960294  0.005575
Recall            allowed                     0.427645  0.100147
                  disallowed                  0.951808  0.043824
ROC AUC           None                        0.872908  0.033272
Brier score       None                        0.063584  0.033333
Fit time (s)      None                       24.027426  7.811560
Predict time (s)  None                        4.091575  1.631683


.. GENERATED FROM PYTHON SOURCE LINES 244-246 .. code-block:: Python print(f"Time taken: {end - start:.2f} seconds") .. rst-class:: sphx-glr-script-out .. code-block:: none Time taken: 0.01 seconds .. GENERATED FROM PYTHON SOURCE LINES 247-248 Hence, we observe the same behaviour as described above for :class:`~skore.EstimatorReport`. .. rst-class:: sphx-glr-timing **Total running time of the script:** (1 minutes 20.526 seconds) .. _sphx_glr_download_auto_examples_technical_details_plot_cache_mechanism.py: .. only:: html .. container:: sphx-glr-footer sphx-glr-footer-example .. container:: sphx-glr-download sphx-glr-download-jupyter :download:`Download Jupyter notebook: plot_cache_mechanism.ipynb ` .. container:: sphx-glr-download sphx-glr-download-python :download:`Download Python source code: plot_cache_mechanism.py ` .. container:: sphx-glr-download sphx-glr-download-zip :download:`Download zipped: plot_cache_mechanism.zip ` .. only:: html .. rst-class:: sphx-glr-signature `Gallery generated by Sphinx-Gallery `_