Your data are racist

Say you’re a university loan administrator. You have one loan and two students who, anonymized, look identical to you in every way: same GPA, same payment history, all that good stuff. You can’t decide. You ask a data-driven startup to determine which one is the greater risk to default or pay late. You have no idea how they do it, but the answer comes back —

The answer’s clear: Student A!

Congratulations, you’ve just perpetuated historical racism.

You didn’t know it. The startup didn’t know it: they evaluated both students and found that Student A’s social networks are stable and that their secondary characteristics are all strongly correlated with future prosperity and loan repayment. Student B scores less well on social networks, and their high school, home zip code, interests, and demographic data are associated with significantly higher rates of late payments and loan defaults.
From their perspective, they’re telling you — accurately, most likely — which of those two is the better loan risk.

No one at the startup may even know what the factors were. They may have grabbed all the financial data they could get from different sources, tied it to all the demographic data they could find, and set machine learning on the problem, creating a black box that they can show is significantly better than a loan officer or risk analyst at guessing who’s going to default.

Machine learning goes like this: you feed the machine a sample of, say, 10,000 records, like

Record Foo

  • Zip Code: 98112
  • Married: Yes
  • Age: 24
  • Likes Corgis: Yes
  • Defaulted on a student loan: No
  • Made at least one payment more than 90 days late: No

Record Bar

  • Zip Code: 91121
  • Married: No
  • Age: 34
  • Likes Corgis: No
  • Defaulted on a student loan: Yes
  • Made at least one payment more than 90 days late: Yes

You set the tools on it, and they’ll find characteristics and combinations of characteristics associated with the outcomes, so that when you hand the resulting black box a new record (Angela, a 22-year-old from Boston, unmarried, doesn’t like Corgis), it says “I’m 95% sure they’ll default.”
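To make that concrete, here’s a minimal sketch of the pattern using scikit-learn. The handful of records and their field names are invented to mirror the toy examples above; a real pipeline would have thousands of rows and far more features.

    # A toy version of the workflow: turn records into a feature matrix,
    # fit a classifier, then ask it about someone new.
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression

    # Training records (made up), with the label "did they default?"
    records = [
        {"zip_code": "98112", "married": "yes", "age": 24, "likes_corgis": "yes"},
        {"zip_code": "91121", "married": "no",  "age": 34, "likes_corgis": "no"},
        {"zip_code": "98112", "married": "yes", "age": 29, "likes_corgis": "no"},
        {"zip_code": "91121", "married": "no",  "age": 22, "likes_corgis": "yes"},
    ]
    defaulted = [0, 1, 0, 1]

    # One-hot encode the categorical fields and fit the model.
    vectorizer = DictVectorizer(sparse=False)
    X = vectorizer.fit_transform(records)
    model = LogisticRegression().fit(X, defaulted)

    # Hand the black box a new record and ask how likely a default is.
    angela = {"zip_code": "02115", "married": "no", "age": 22, "likes_corgis": "no"}
    prob = model.predict_proba(vectorizer.transform([angela]))[0, 1]
    print(f"Estimated probability of default: {prob:.0%}")

Nothing in that code knows what a zip code means; it only knows which values co-occurred with defaults in the training data.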

It’s the ultimate in finding correlation and assuming causation.

You see how good it is by giving it sets of people whose outcomes you already know and seeing what the box predicts.
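In code, that check is typically a held-out split: hide some records whose outcomes you already know, train on the rest, and score the box on the hidden ones. A sketch, with synthetic data standing in for real loan records:

    # Evaluate the black box on records it never saw during training.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # Synthetic stand-in for 10,000 loan records with known outcomes.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(10_000, 5))
    y = (X[:, 0] + rng.normal(size=10_000) > 0).astype(int)

    # Hold back 20% of the records; these are the people whose outcomes
    # you already know but hide from the model.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0
    )

    model = LogisticRegression().fit(X_train, y_train)
    print("Accuracy on held-out records:",
          accuracy_score(y_test, model.predict(X_test)))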

You don’t even want to know what the characteristics are, because you might dismiss something that turns out to be important (“People who buy riding lawnmowers buy black drip coffee at premium shops? What?”).

Because machine learning is trained on the past, it’s looking at what people did while being discriminated against, operating at a disadvantage, and so on.

For instance, say you take ZIP Codes as an input to your model. Makes sense, right? That’s a perfectly valid piece of data, and a great predictor of future prosperity and wealth. You can see that people in certain areas are fired from their jobs more often, have a much harder time finding new ones, and so default on payments more often. Is it okay to use that as a factor?

Because America spent so long segregating housing, and because those effects carry forward, using ZIP Codes means that, given ZIP X, I’m 80% certain you’re white. Or 90% certain if you’re in 98110.

As someone using the model, I don’t even have to know that an applicant is black. I just see that their zip code predicts defaulting, or paying the loan back on time. And I might not even know that my trained black box loves ZIP Codes.

And if you can use address information to break it down to census tract and census block, you’re even better at making predictions that are about race.
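One way to catch this before training is to measure, in your own data, how well each candidate feature predicts a protected attribute. A toy sketch of that check; the columns and values here are hypothetical:

    # If a feature almost perfectly identifies a racial group, it is a proxy
    # for race, whatever else it claims to measure.
    import pandas as pd

    df = pd.DataFrame({
        "zip_code": ["98110", "98110", "98110", "98112", "91121", "91121"],
        "race":     ["white", "white", "white", "white", "black", "black"],
    })

    # For each ZIP, what share of residents belong to the largest racial group?
    majority_share = df.groupby("zip_code")["race"].agg(
        lambda s: s.value_counts(normalize=True).max()
    )
    print(majority_share)
    # ZIPs where this is near 1.0 are effectively standing in for race:
    # to that extent, a model that loves ZIP Codes is a model about race.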

This is true of so many other characteristics. Can I mine your social network and connect you directly to someone who’s been to jail? That’s probably predictive of credit suitability. Oh — black people are ~9 times more likely to be incarcerated.

Are your parents still married? Were they ever married? That’s — oh.

Oh no! You’ve been transported back in time. You’re in London. It’s 1066. William, Duke of Normandy, has just now been crowned. You have a sackful of gold you can loan out. Pretty much everyone wants it, at wildly varying interest rates. Where do you place your bets?

William, right? As a savvy person, you’re vaguely aware that England has a lot of troubles ahead, but generally speaking, you’re betting on those who hold wealth and power to continue to do so.

Good call!

What about, say, 500 years later? Same place, 1566, late-ish Tudor period. You’re putting your money on the Tudors, while probably being careful not to actually remind them that they’re Tudors.

Good call!

Betting on the established power structure is always safe. But it also means you’re perpetuating that unjust power structure.

Two people want to start a business. They’re equally skilled. One gets a loan at 10% interest, the other at 3%. Which is more likely to succeed?

Now, is the bank even to blame for making that reasonable business decision? After all, some people are worse credit risks than others. Should the bank forgo a higher profit margin rather than be realistic about the higher barriers that minorities and women face? Doesn’t it have a responsibility to its shareholders to look at all the factors?

That’s seductively reasonable. To see this at scale, look at America’s shameful history of housing discrimination. Black families were systematically locked out of mortgages and home financing, made to pay extremely high rates while never building equity. At the same time, their white counterparts could buy houses, pay them off, and pass that wealth on to their kids. Repeat over generations. Today, about a third of the wealth gap between families, where white families hold over $100,000 in assets and minority families almost nothing, comes from the difference in home ownership.

When we evaluate risk based on factors that give us race, or class, we handicap those who have been handicapped generation after generation. We take the crimes of the past and ensure they are enforced forever.

There are things we must do.

First, know that your data will reflect the results of the discriminatory, prejudiced world that produced them. As much as possible, remove or adjust factors that reflect a history of discrimination. Don’t train on prejudice.

Second, know that you must test the application of the model: if you apply your model, are you effectively discriminating against minorities and women? If so, discard the model.

Third, recognize that a neutral, prejudice-free model might seem to test worse against past data than it will in the future, as you do things like make capital cheaper to those who have suffered in the past. Be willing to try and bet on a rosier future.
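For the second point, the most basic version of that test is an outcome audit: attach group membership to the model’s decisions after the fact and compare approval rates across groups. A sketch, with hypothetical column and group names:

    # Compare how often each group is approved by the model's decisions.
    import pandas as pd

    decisions = pd.DataFrame({
        "group":    ["A", "A", "A", "A", "B", "B", "B", "B"],
        "approved": [1,    1,   0,   1,   0,   1,   0,   0],
    })

    rates = decisions.groupby("group")["approved"].mean()
    ratio = rates.min() / rates.max()
    print(rates)
    print(f"Approval-rate ratio: {ratio:.2f}")

    # One common (if blunt) yardstick is the "four-fifths rule": if the
    # disadvantaged group's approval rate is under 80% of the advantaged
    # group's, treat the model as discriminating in effect.

If that ratio is badly skewed, the model is discriminating in practice no matter how innocent its inputs look, and the second rule above says to discard it.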

Citations on wealth disparity:

http://www.demos.org/blog/9/23/14/white-high-school-dropouts-have-more-wealth-black-and-hispanic-college-graduates

http://www.demos.org/publication/racial-wealth-gap-why-policy-matters
