In general, when we talk about the future or about predictions, we first want to know whether we are dealing with a prophecy born of magical thinking or with a result produced by scientific research.
Once this demarcation is established, in the case of the scientific approach we want to understand how the result was obtained, and we equally expect it to have been obtained in an ethical manner, since prediction algorithms are known to have the potential to discriminate based on the attributes they use.
In this context, some attributes require particular attention from developers because, by their nature, they can introduce unfair treatment, carry bias, and thus lead to violations of ethical principles. Examples of such attributes are age, religion, race, gender, disability, and national origin.
Among the possible classifications of biased data, we chose to divide them into two groups, namely:
- biases generated by the methodology and/or by the people involved in the research;
- biases rooted in historical injustices and therefore encapsulated in the data collected for the research.
Regarding the biases generated by the methodology and/or by the people involved in the research, we note several practices that are non-compliant from an ethical point of view, such as:
- failure to report all available information,
- ignoring contradictory information even when it is correct,
- biased data collection, and
- a developer's potential inclination toward overgeneralization, stereotypes, and/or certain prejudices during the data modeling process.
As for the biases rooted in historical injustices, they reflect a historical reality marked by unequal opportunity, since the human understanding of fairness is in continuous evolution. The data a model learns from may therefore be biased, reflecting the ethical principles of the past rather than those of the present, let alone of the future.
To mitigate these problems, we should strive for impartiality throughout the construction of the modeling process. By this we mean, at a minimum: avoiding behavior that may induce an unethical model, using unbiased data, formalizing non-discrimination criteria such as equalized odds and demographic parity, and calibrating models for equality of opportunity.
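To make one of these criteria concrete, the following is a minimal sketch of a demographic parity check: the absolute difference in positive-prediction rates between two groups defined by a sensitive attribute. The function name, the binary 0/1 group encoding, and the sample data are our own illustrative assumptions, not taken from the text.

```python
# Hypothetical sketch of a demographic parity check.
# Assumes binary predictions y_pred and a binary sensitive
# attribute group (encoded 0/1); these names are illustrative.
import numpy as np

def demographic_parity_difference(y_pred, group):
    """Absolute difference in positive-prediction rates between groups."""
    y_pred = np.asarray(y_pred)
    group = np.asarray(group)
    rate_a = y_pred[group == 0].mean()  # positive rate for group 0
    rate_b = y_pred[group == 1].mean()  # positive rate for group 1
    return abs(rate_a - rate_b)

# Example: group 0 receives positive predictions 3/4 of the time,
# group 1 only 1/4 of the time, so the disparity is 0.5.
preds  = [1, 1, 1, 0, 0, 0, 0, 1]
groups = [0, 0, 0, 0, 1, 1, 1, 1]
print(demographic_parity_difference(preds, groups))  # 0.5
```

A value near zero suggests the model treats the two groups similarly in terms of positive-outcome rates; a large value is a signal to investigate the data or the model for the kinds of bias described above.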
Undoubtedly, beyond the above, we must acknowledge the responsibility we bear when building forecasting algorithms and adopt behavior based on high ethical standards, so that research results contribute to reducing existing inequalities.



