Imagine being able to build an algorithm capable of determining, based on the geometric characteristics of a bill, whether it is authentic or counterfeit. This is the fascinating challenge we explore in this article: how to use technology to accurately distinguish between real bills and fake ones.

We used a sample file containing 1,500 bills, with an added variable, is_genuine, specifying whether each bill is real or counterfeit.
The developed algorithm aims to detect counterfeit bills by analyzing their geometric features.
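The article does not show the loading step, but a minimal sketch could look like the following. The file name billets.csv and the semicolon separator are assumptions, as are the column names height_left and height_right (only diagonal, margin_low, margin_up and length are named explicitly in the text), alongside the is_genuine flag.

```python
import pandas as pd

# Hypothetical file name and separator; adjust to the actual sample file.
df = pd.read_csv("billets.csv", sep=";")

print(df.shape)                                        # expected: (1500, 7)
print(df["is_genuine"].value_counts(normalize=True))   # share of real vs fake bills
print(df.isna().sum())                                 # missing values per column
```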
1. Descriptive Analysis
Histogram Plots
The distributions of the variables appear to follow a symmetrical bell shape, characteristic of a normal distribution, for both real and fake bills. However, we will need to evaluate the variable 'length' later with a statistical test, as it seems to show a different distribution.
Furthermore, there seem to be significant differences between the values of the variables for real and fake bills, with the notable exception of the variable 'diagonal'. The most pronounced distinctions appear particularly for the variables 'margin_low', 'margin_up', and 'length'.
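A possible way to reproduce these histograms with seaborn, assuming the data frame loaded above:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# The six geometric measurements (every column except the label)
features = [c for c in df.columns if c != "is_genuine"]

fig, axes = plt.subplots(2, 3, figsize=(14, 8))
for ax, col in zip(axes.ravel(), features):
    sns.histplot(data=df, x=col, hue="is_genuine", kde=True, element="step", ax=ax)
    ax.set_title(col)
plt.tight_layout()
plt.show()
```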
Boxplot for Each Explanatory Variable
The boxplots show clear differences between real and fake bills for every variable except 'diagonal'. Points that appear to be outliers in these boxplots are not necessarily anomalies; they may simply reflect the natural variability of the data.
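The boxplots can be drawn in the same way, reusing the features list from the histogram sketch:

```python
import matplotlib.pyplot as plt
import seaborn as sns

fig, axes = plt.subplots(2, 3, figsize=(14, 8))
for ax, col in zip(axes.ravel(), features):
    sns.boxplot(data=df, x="is_genuine", y=col, ax=ax)   # one boxplot per group
plt.tight_layout()
plt.show()
```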
Correlation Matrix
The variables most correlated with is_genuine are margin_low, margin_up, and length.
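A heatmap of the correlation matrix can be produced as follows; is_genuine is cast to 0/1 so it appears in the matrix:

```python
import matplotlib.pyplot as plt
import seaborn as sns

corr = df.assign(is_genuine=df["is_genuine"].astype(int)).corr()

sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation matrix")
plt.show()
```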
Pairplot
Dataset composition: 66.6% real bills, 33.3% fake bills.
Through these various graphs, it appears that the variables 'margin_low' and 'length' form more distinct groups than the other variables. Therefore, it would be interesting to focus on these two variables. However, it is important not to neglect the other variables, as they may also provide valuable information.
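The pairplot, coloured by is_genuine, can be obtained with a single seaborn call:

```python
import seaborn as sns

# Scatter plots of every pair of variables, coloured by the label
sns.pairplot(df, hue="is_genuine", corner=True, plot_kws={"alpha": 0.5})
```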
Analysis of Variance
From the start of our analyses, we observe a quite clear difference between real and fake bills for most variables, with the exception of the variable 'diagonal'. To delve deeper into this observation, I performed an analysis of variance (ANOVA) test.
The objective of this analysis is to determine whether our six variables show significant differences between real and fake bills. The results reveal that the p-value for each variable is well below 0.05. This means that we reject the null hypothesis (H0), which states that there is no significant difference between the groups for these variables. Therefore, even the variable 'diagonal' can be considered relevant for our model.
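A sketch of this one-way ANOVA with SciPy, comparing real and fake bills for each of the six variables:

```python
from scipy import stats

for col in features:   # the six geometric measurements
    groups = [g[col].dropna() for _, g in df.groupby("is_genuine")]
    f_stat, p_value = stats.f_oneway(*groups)
    print(f"{col:12s}  F = {f_stat:8.2f}  p-value = {p_value:.2e}")
```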
2. Handling Missing Values
The missing values are imputed with a multiple linear regression, so before relying on its predictions we check the usual assumptions behind such a model.
Check for Variable Collinearity
Hypothesis: Independent variables should not be too highly correlated.
Conclusion: All variance inflation factors (VIF) are below 10, so there is no multicollinearity problem, which validates our hypothesis.
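Assuming the missing values sit in margin_low and that it is imputed by regressing it on the other measurements (the article does not name the imputed column), the variance inflation factors can be computed with statsmodels:

```python
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Predictors of the (assumed) imputation regression; margin_low is the assumed target
predictors = ["diagonal", "height_left", "height_right", "margin_up", "length"]
X_vif = sm.add_constant(df[predictors].dropna())

for i, name in enumerate(X_vif.columns):
    if name != "const":
        print(f"{name:12s}  VIF = {variance_inflation_factor(X_vif.values, i):.2f}")
```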
Test for Homoscedasticity
To test for homoscedasticity (constant variance) of the residuals, we use the Breusch-Pagan test.
Hypothesis: The variance of errors is constant across all values of the independent variables.
Conclusion: The p-value is below 5%, so we reject the hypothesis of constant variance. The residuals of the regression model are therefore heteroscedastic: the variability of the errors is not uniform across the observations in the sample.
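The Breusch-Pagan test can then be run on the residuals of that (assumed) imputation regression:

```python
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

# Fit the assumed imputation regression on the rows without missing values
complete = df.dropna()
y_reg = complete["margin_low"]
X_reg = sm.add_constant(complete[["diagonal", "height_left", "height_right", "margin_up", "length"]])
model = sm.OLS(y_reg, X_reg).fit()

lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(model.resid, model.model.exog)
print(f"Breusch-Pagan p-value: {lm_pvalue:.4f}")   # < 0.05 -> heteroscedasticity
```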
Test for Normality of Residuals
In the context of linear regression, the residuals should follow a normal distribution.
Hypothesis: The errors follow a normal distribution.
Interpretation: The Shapiro-Wilk test gives a p-value well below 0.05, so we reject the null hypothesis of normality of the residuals. However, looking at the residuals themselves, they do not deviate strongly from normality and are centered around 0. The Q-Q plot likewise shows only a slight deviation, with points straying from the red line but not by much.
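Reusing the model fitted in the Breusch-Pagan sketch, the Shapiro-Wilk test, the residual histogram and the Q-Q plot can be produced as follows:

```python
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy import stats

w_stat, p_value = stats.shapiro(model.resid)
print(f"Shapiro-Wilk: W = {w_stat:.4f}, p-value = {p_value:.2e}")

plt.hist(model.resid, bins=30)        # residuals should be centered on 0
plt.title("Distribution of residuals")
plt.show()

sm.qqplot(model.resid, line="s")      # Q-Q plot against a normal distribution
plt.show()
```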
Correct Model Specification
Hypothesis: The model is correctly specified, including all relevant variables.
Conclusion: All 5 variables are retained.
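One common way to check the specification, sketched below under the same assumptions, is to inspect the coefficient p-values of the fitted model and flag predictors that could be dropped:

```python
# Coefficient p-values of the fitted imputation regression: predictors with
# p-values above 0.05 would be candidates for removal (backward elimination).
pvals = model.pvalues.drop("const")
print(pvals.round(4))
print("Candidates for removal:", list(pvals[pvals > 0.05].index))
```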
Test for Linearity
Hypothesis: The relationship between predictors and the target is linear.
Conclusion: The scatter plots of each predictor against the target show approximately linear relationships, which supports this hypothesis.
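A quick visual check of linearity, still under the assumption that margin_low is the regression target:

```python
import matplotlib.pyplot as plt
import seaborn as sns

predictors = ["diagonal", "height_left", "height_right", "margin_up", "length"]
fig, axes = plt.subplots(1, 5, figsize=(20, 4))
for ax, col in zip(axes, predictors):
    # Scatter plot of each predictor against the target, with a fitted line
    sns.regplot(data=complete, x=col, y="margin_low", ax=ax,
                scatter_kws={"alpha": 0.3}, line_kws={"color": "red"})
plt.tight_layout()
plt.show()
```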
Test for Independence of Errors
We perform a Durbin-Watson test to detect autocorrelation of the residuals.
Hypothesis: The errors are independent of each other.
Conclusion: The result is slightly below 2, suggesting a slight positive autocorrelation of the residuals in our multiple linear regression model.
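The Durbin-Watson statistic is available directly in statsmodels:

```python
from statsmodels.stats.stattools import durbin_watson

dw = durbin_watson(model.resid)   # ~2: no autocorrelation; < 2: positive autocorrelation
print(f"Durbin-Watson statistic: {dw:.3f}")
```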
3. Modeling
Logistic Regression
The number circled in red indicates an erroneous prediction: this counterfeit bill has characteristics very close to those of a real bill.
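The article does not show the training code; a minimal scikit-learn sketch with an assumed 80/20 split and median imputation (the actual preprocessing may differ) could look like this:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report

X = df.drop(columns=["is_genuine"])
X = X.fillna(X.median())              # simple imputation, for the sketch only
y = df["is_genuine"].astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

print(confusion_matrix(y_test, clf.predict(X_test)))   # off-diagonal cells are the errors
print(classification_report(y_test, clf.predict(X_test)))
```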
Clustering
In conclusion, among the three models evaluated, the one from Scikit-Learn proved to be the most effective, making only 4 errors when applied to the entire dataset, compared to 5 errors for the StatsModels model and 16 for K-means. Therefore, this will be the one retained.
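For reference, the K-means comparison described above could be reproduced with a sketch like the following, reusing X and y from the logistic regression example; cluster indices are arbitrary, so both labelings are compared with is_genuine:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(X)
clusters = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X_scaled)

# Keep whichever cluster-to-label mapping agrees best with the true labels
errors = int(min(np.sum(clusters != y.values), np.sum((1 - clusters) != y.values)))
print(f"K-means misclassifications: {errors} / {len(y)}")
```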
