.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/use_cases/plot_fraud_prediction.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_use_cases_plot_fraud_prediction.py>`
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_use_cases_plot_fraud_prediction.py:

Tracking all the data processing
================================

To track all operations and be able to apply the fitted estimator to unseen
data, we need to include all the data wrangling in the estimator used for our
skore report. In very simple cases this can be done with a scikit-learn
Pipeline. When we have transformations not supported by the Pipeline (such as
transformations that change the number of rows, or that involve multiple
tables, as in joins), skore allows us to use a skrub DataOp instead.

In this example we consider a dataset that is simple, but still requires some
data wrangling (encoding, aggregation and joining) that could not be performed
inside a regular scikit-learn estimator. To track those operations, we use a
skrub DataOp, which can perform richer transformations than normal estimators
and also has built-in support from skore.

The dataset contains a list of online transactions (each corresponds to a
cart, or "basket"), each linked to one or more products for which we have a
description. The task is to predict which transactions involved credit fraud.

.. GENERATED FROM PYTHON SOURCE LINES 26-30

We start by defining our data-processing pipeline. Note that it contains
operations, such as aggregating and joining the product information after
vectorizing the text it contains, that would not be possible in a normal
estimator.

.. GENERATED FROM PYTHON SOURCE LINES 32-73

.. code-block:: Python

    import skore
    import skrub
    from sklearn.ensemble import HistGradientBoostingClassifier

    dataset = skrub.datasets.fetch_credit_fraud(split="all")

    products = skrub.var("products", dataset.products)
    baskets = skrub.var("baskets", dataset.baskets)
    basket_ids = baskets[["ID"]].skb.mark_as_X()
    fraud_flags = baskets["fraud_flag"].skb.mark_as_y()


    def filter_products(products, basket_ids):
        return products[products["basket_ID"].isin(basket_ids["ID"])]


    vectorized_products = products.skb.apply_func(filter_products, basket_ids).skb.apply(
        skrub.TableVectorizer(), exclude_cols="basket_ID"
    )


    def join_product_info(basket_ids, vectorized_products):
        return basket_ids.merge(
            vectorized_products.groupby("basket_ID").agg("mean").reset_index(),
            left_on="ID",
            right_on="basket_ID",
        ).drop(columns=["ID", "basket_ID"])


    pred = basket_ids.skb.apply_func(join_product_info, vectorized_products).skb.apply(
        HistGradientBoostingClassifier(), y=fraud_flags
    )

    # This would generate a report with previews of intermediate results & fitted
    # estimators:
    #
    # pred.skb.full_report()

    pred

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Downloading 'credit_fraud' from https://github.com/skrub-data/skrub-data-files/raw/refs/heads/main/credit_fraud.zip (attempt 1/3)
(In the HTML documentation, an interactive skrub report for ``pred`` is
rendered here. It previews the result, ``<Apply HistGradientBoostingClassifier>``,
and offers a "Show graph" toggle that draws the pipeline: the ``'baskets'``
variable supplies ``X`` (column ``'ID'``) and ``y`` (``'fraud_flag'``), while
``'products'`` flows through ``filter_products`` and a ``TableVectorizer``,
and is joined by ``join_product_info`` before feeding the
``HistGradientBoostingClassifier``. The report needs JavaScript to display.)
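To make the wrangling steps concrete, here is a minimal sketch of what the
``filter_products`` and ``join_product_info`` steps above compute, on a tiny
hand-made table (plain pandas, no skrub; the data is made up and only mirrors
the column names of the real dataset). Note how the number of rows changes --
four product rows become two per-basket feature rows -- which is exactly what a
regular scikit-learn Pipeline cannot express:

```python
import pandas as pd

# Toy stand-ins for the real tables; basket 3 has no matching basket.
baskets = pd.DataFrame({"ID": [1, 2], "fraud_flag": [0, 1]})
products = pd.DataFrame(
    {"basket_ID": [1, 1, 2, 3], "price": [10.0, 20.0, 5.0, 99.0]}
)
basket_ids = baskets[["ID"]]


def filter_products(products, basket_ids):
    # Keep only products belonging to a known basket (drops basket_ID 3).
    return products[products["basket_ID"].isin(basket_ids["ID"])]


def join_product_info(basket_ids, vectorized_products):
    # One row per basket: average the product features, then join on the ID.
    return basket_ids.merge(
        vectorized_products.groupby("basket_ID").agg("mean").reset_index(),
        left_on="ID",
        right_on="basket_ID",
    ).drop(columns=["ID", "basket_ID"])


features = join_product_info(basket_ids, filter_products(products, basket_ids))
print(features)  # basket 1 -> mean price 15.0, basket 2 -> 5.0
```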
.. GENERATED FROM PYTHON SOURCE LINES 74-81

Above we see a preview on the whole dataset. Click the "show graph" toggle to
see a drawing of the pipeline we have built. Just like a normal estimator, a
skrub DataOp can be used with skore reports. We can either pass separately a
SkrubLearner and training and testing data, or pass our DataOp with the data
it already contains and rely on the default train/test split:

.. GENERATED FROM PYTHON SOURCE LINES 83-86

.. code-block:: Python

    report = skore.EstimatorReport(pred, pos_label=1)
    report.metrics.roc_auc()

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    0.8582518262594516

.. GENERATED FROM PYTHON SOURCE LINES 87-89

.. code-block:: Python

    report.metrics.precision_recall().plot()

.. image-sg:: /auto_examples/use_cases/images/sphx_glr_plot_fraud_prediction_001.png
   :alt: Precision-Recall Curve for DataOp Positive label: 1 Data source: Test set
   :srcset: /auto_examples/use_cases/images/sphx_glr_plot_fraud_prediction_001.png
   :class: sphx-glr-single-img
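The ROC-AUC reported above is the standard area under the ROC curve for the
positive class (here ``pos_label=1``, the fraudulent baskets). As a minimal
sketch unrelated to the fraud data, the same quantity can be computed directly
with scikit-learn on a toy set of scores:

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]  # predicted probability of the positive class

# 3 of the 4 (negative, positive) pairs are ranked correctly -> AUC = 0.75
print(roc_auc_score(y_true, scores))  # 0.75
```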
.. GENERATED FROM PYTHON SOURCE LINES 90-92

Note that the preprocessing operations are captured in the skrub DataOp, hence
in our report -- so we can replay them later on unseen data.

.. GENERATED FROM PYTHON SOURCE LINES 94-95

.. code-block:: Python

    report.estimator_.data_op

(In the HTML documentation, the interactive report for the DataOp is rendered
here.)

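For the simple single-table case mentioned in the introduction, the same
"track everything" idea is what a plain scikit-learn Pipeline provides:
because the preprocessing is part of the fitted object, it is replayed
automatically on unseen data. A minimal sketch on synthetic data (the
estimators and values here are illustrative, not taken from the fraud
example):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 3))
y_train = (X_train[:, 0] > 0).astype(int)

# The scaler is fitted on the training data only...
model = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_train, y_train)

# ...and replayed, with the training statistics, on unseen rows.
X_new = rng.normal(size=(5, 3))
predictions = model.predict(X_new)
print(predictions.shape)  # (5,)
```

A skrub DataOp generalizes this to wrangling that a Pipeline cannot hold, such
as the aggregation and join used in this example.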
.. rst-class:: sphx-glr-timing

**Total running time of the script:** (1 minute 0.801 seconds)

.. _sphx_glr_download_auto_examples_use_cases_plot_fraud_prediction.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_fraud_prediction.ipynb <plot_fraud_prediction.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_fraud_prediction.py <plot_fraud_prediction.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: plot_fraud_prediction.zip <plot_fraud_prediction.zip>`

.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_