The Strengthening of Privacy Regulation and Implications for Machine Learning
Machine learning has the capability to radically improve business process performance, with some experts citing possible gains of an order of magnitude over best-in-class unaugmented workflows. As a result, businesses are scrambling to adopt this novel technology. Two factors are key to its widespread adoption: public trust and perceived fairness. If either is missing, machine learning is likely to meet significant pushback.
Beyond this, privacy regulators are increasingly introducing requirements that strongly constrain how machine learning systems are trained and deployed in business processes.
In this post, we explore five important considerations for management and data science teams rolling out this powerful new set of technologies, specifically:
1. The requirement for privacy by design and by default (not just guidelines)
2. Regulatory accountability and traceability requirements, including the requirement to document
3. Freedom from discrimination
4. A right of data subjects to an explanation of how automated decisions are made
5. The right to obtain human intervention
The first two considerations address how organizations maintain proper governance in the handling of private information, as well as how they train their models and build their business solutions. A strong set of internal design and governance processes is required to demonstrate to regulators that an organization is committed to meeting the spirit of privacy regulation, minimizing the risk of penalties should a breach occur. Best practice incorporates the guarding of privacy rights into overall enterprise risk management strategies and mechanisms, reducing duplication and the cost of compliance. A corporate-wide perspective also addresses the fact that sensitive systems development often occurs outside traditional IT: in Marketing and Sales, business units, or other functional areas. Governance needs to be organization-wide to be effective.
Beyond high-level processes and governance, the first requirement places strict and potentially novel restrictions on how personal data is handled within design organizations. Access to raw datasets must be restricted through effective technical or organizational controls. If data cannot be made truly anonymous (rendered impossible to link to a particular data subject), then effective pseudonymization techniques must be used. This typically involves splitting datasets or hiding fields so that identification of the data subject is no longer pragmatically achievable. Data flows in the design process need to be documented, and access control lists defined and operationalized. Effective policies on user devices, subcontractors, and work-from-home arrangements need to be in place so that these practices do not create gaps in regulatory compliance.
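As an illustration, one common pseudonymization pattern replaces direct identifiers with a keyed token and keeps the linkage table in a separately controlled store, so the design team never handles raw identifiers. A minimal Python sketch, with hypothetical field names and a placeholder secret:

```python
import hashlib
import hmac

# Hypothetical secret, held by a separate key-management function and
# never stored alongside the pseudonymized dataset.
PEPPER = b"replace-with-managed-secret"

def pseudonymize(identifier: str) -> str:
    """Replace a direct identifier with a keyed hash (HMAC-SHA256).

    Unlike a plain hash, a keyed construction cannot be reversed by
    brute-forcing common values (names, emails) without the secret.
    """
    return hmac.new(PEPPER, identifier.encode("utf-8"), hashlib.sha256).hexdigest()

def split_record(record):
    """Split a raw record into a pseudonymized analytics view and a
    separately stored linkage table (the dataset-splitting technique
    described above)."""
    direct_ids = {"name", "email"}  # hypothetical direct identifiers
    token = pseudonymize(record["email"])
    analytics_view = {k: v for k, v in record.items() if k not in direct_ids}
    analytics_view["subject_token"] = token
    linkage = {"subject_token": token, "email": record["email"]}
    return analytics_view, linkage

raw = {"name": "A. Jones", "email": "a.jones@example.com", "age_band": "30-39"}
view, link = split_record(raw)
```

Only the analytics view would reach the data science team; re-identification requires both the linkage table and the secret, each behind its own access control list.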
Widespread respect for strong privacy rights by business and public-facing organizations is key to the public's perception of fairness and trust. Core to this is an assurance of freedom from discrimination and a guarantee of fundamental rights and freedoms. Beyond public perception, regulators are also becoming increasingly active. Articles 9 and 22 of the EU General Data Protection Regulation, now in effect, prohibit the use of special sensitive classes of personal data, such as racial or ethnic origin, in automated decision making. Jurisdictions outside the EU can be expected to move towards similar restrictions.
A simplistic interpretation of this requirement could lead to the conclusion that eliminating the sensitive fields from data records is sufficient. On deeper consideration, however, it is evident that this is not the case: other data fields (e.g. geographic location) are very likely correlated with the sensitive information. The designers of machine learning solutions must always be on the lookout for hidden biases in the input dataset, as these will be reproduced in the trained model.
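One quick way to surface such proxies is to measure the association between each remaining field and the removed sensitive attribute before training. A plain-Python sketch using Cramér's V on invented toy data, where a postcode field perfectly encodes the dropped attribute:

```python
import math
from collections import Counter
from itertools import product

def cramers_v(xs, ys):
    """Cramér's V: association strength between two categorical
    variables (0 = independent, 1 = perfectly associated),
    derived from the chi-squared statistic."""
    n = len(xs)
    x_counts = Counter(xs)
    y_counts = Counter(ys)
    joint = Counter(zip(xs, ys))
    chi2 = 0.0
    for x, y in product(x_counts, y_counts):
        expected = x_counts[x] * y_counts[y] / n
        chi2 += (joint[(x, y)] - expected) ** 2 / expected
    k = min(len(x_counts), len(y_counts)) - 1
    return math.sqrt(chi2 / (n * k)) if k else 0.0

# Toy data: postcode almost determines the sensitive attribute, so
# dropping the sensitive column alone leaves the bias reachable.
postcode  = ["N1", "N1", "N1", "S2", "S2", "S2", "N1", "S2"]
sensitive = ["a",  "a",  "a",  "b",  "b",  "b",  "a",  "b"]
```

A value near 1.0 for a candidate feature is a signal that the "eliminated" sensitive information is still present in the dataset by proxy.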
As an additional complication, a negative property known as "uncertainty bias" can emerge in machine learning systems. It arises when a trained model is more uncertain about under-represented features in the data, coupled with a risk-averse decision algorithm, e.g. one that avoids making a recommendation when the probability of a valid prediction is low. The overall result can be an automated decision-making system that is biased against minority categories in the training dataset. If the model also learns continuously (usually a positive trait, ensuring the model remains valid over time), then selection bias can result: the majority category is used preferentially, and the minority category is represented less and less.
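The mechanism can be shown with a deliberately simplified model, where a group's confidence grows with how often it appeared in training (the n/(n+20) curve and the 0.8 threshold are invented for illustration, not real calibration values):

```python
def confidence(n_train_examples: int) -> float:
    """Toy confidence curve: groups seen more often in training get
    higher-confidence predictions. Purely illustrative."""
    return n_train_examples / (n_train_examples + 20)

def decide(group_counts, group, threshold=0.8):
    """Risk-averse decision rule: abstain when confidence is below
    the threshold rather than risk an invalid prediction."""
    return "predict" if confidence(group_counts[group]) >= threshold else "abstain"

# Under-represented groups fall below the threshold and are
# systematically denied automated decisions: uncertainty bias.
group_counts = {"majority": 900, "minority": 40}
```

Here the majority group clears the threshold while the minority group is always deferred; if deferrals also mean those records never feed back into training, the representation gap widens over time, which is the selection-bias feedback loop described above.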
To protect citizens from the risks of bias, EU regulators have incorporated a right to explanation into the GDPR. Articles 13 and 14 state that data subjects have the right to request meaningful information about the logic involved in "profiling" when it is applied to automated decision making.
Machine learning algorithms discover reliable associations and correlations in order to make accurate predictions about novel input data in the absence of easily determined rules. A prediction represents the probability that a novel case resembles the training data, with no readily discernible explanation of why. Some of the best-performing architectures, such as multi-layer neural networks over high-dimensional inputs and ensemble methods that combine disparate models, are especially hard to interpret.
As a further safeguard, Article 22 defines the right of data subjects to obtain human intervention in cases of possible significant social or economic impact. This has implications for how any process that uses machine learning is designed: the ability to "zero out" to a real human agent, or equivalent, needs to be accommodated. This obviously adds process complexity and duplication, watering down the benefits.
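In practice this safeguard often becomes a routing rule in front of the model: high-impact or low-confidence cases are escalated to a human reviewer instead of being applied automatically. A sketch with invented field names and thresholds:

```python
from dataclasses import dataclass

@dataclass
class Decision:
    outcome: str        # e.g. "approve" / "deny"
    confidence: float   # model's probability for the outcome
    high_impact: bool   # e.g. credit, employment, housing decisions

def route(decision: Decision, threshold: float = 0.9) -> str:
    """Escalation rule sketching Article 22-style human intervention:
    any high-impact decision, or any low-confidence one, is routed to
    a human reviewer rather than applied automatically."""
    if decision.high_impact or decision.confidence < threshold:
        return "human_review"
    return "auto_apply"
```

A production process would also let a data subject trigger the `human_review` path on request, which is the "zero out" escape hatch the regulation effectively requires.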
In an effort to address this situation, the machine learning development community has been striving to improve the transparency of its models. Efforts in this area fall into two broad approaches. The first constrains machine learning system architectures a priori so that biases cannot be learned from the data. A number of methods are being pursued, but all suffer from inferior accuracy and difficulty of design. This is not surprising: if explicit rules were easy to formulate, a learned model would not be required!
The second approach is to evaluate the model for bias a posteriori. In addition to maintaining model accuracy and protecting data subjects from adverse impacts such as discrimination, this method has a number of benefits, namely:
- Allowing effective governance and management accountability for the decisions made by trained systems
- Protecting data subjects through the detection and correction of data input errors; furthermore, by explaining the rationale for a decision, indicating what would need to change for a different decision to be made
- Allowing organizations that source trained models from third parties to de-risk their position by employing suitable tools to evaluate prospective vendor solutions for bias
- With adequate transparency reporting, allowing effective industry or regulatory oversight, accelerating market acceptance
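As one concrete a posteriori check, per-group selection rates and a disparate-impact ratio can be computed directly from decision logs, without any access to the model internals. A plain-Python sketch on invented toy data (the 0.8 cut-off is the commonly cited "four-fifths rule", not a GDPR threshold):

```python
def selection_rates(outcomes):
    """outcomes: list of (group, decision) pairs, decision = 1 for a
    favourable outcome. Returns the favourable-outcome rate per group."""
    totals, favourable = {}, {}
    for group, decision in outcomes:
        totals[group] = totals.get(group, 0) + 1
        favourable[group] = favourable.get(group, 0) + decision
    return {g: favourable[g] / totals[g] for g in totals}

def disparate_impact(outcomes, protected, reference):
    """Ratio of the protected group's selection rate to the reference
    group's; values below 0.8 are conventionally flagged for review."""
    rates = selection_rates(outcomes)
    return rates[protected] / rates[reference]

# Toy decision log: group "A" is favoured 8/10 times, group "B" 5/10.
audit_log = [("A", 1)] * 8 + [("A", 0)] * 2 + [("B", 1)] * 5 + [("B", 0)] * 5
```

Running the check on this log yields a ratio of 0.5/0.8 = 0.625 for group "B", below the 0.8 cut-off, so this deployment would be flagged for the kind of review and correction the benefits above describe.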
It is in the best interest of businesses to ensure that systems incorporating machine learning are not only fit for purpose, effective, and accurate, but also transparent and fair. Effective privacy controls and data security, coupled with steps to eliminate bias, will build public trust, drive adoption, and can be an effective differentiator for the best practitioners.
References

Goodman, B. and Flaxman, S., "European Union Regulations on Algorithmic Decision-Making and a 'Right to Explanation'", Oxford Internet Institute, August 31, 2016, arXiv:1606.08813v3.

Regulation (EU) 2016/679 of the European Parliament and of the Council of April 27, 2016, Official Journal of the European Union (GDPR official text).

Burrell, J., "How the Machine 'Thinks': Understanding Opacity in Machine Learning Algorithms", Big Data & Society, 3(1), 2016.

Datta, A., Sen, S. and Zick, Y., "Algorithmic Transparency via Quantitative Input Influence", IEEE, 2016, ISSN 2375-1207.