Frankly Speaking, 4/23/19 -- Inherent Biases in AI
A weekly(-ish) newsletter on random thoughts in tech and research. I am an investor at Dell Technologies Capital and a recovering academic. I am interested in security, blockchain, and devops.
WEEKLY TECH THOUGHT
I was on vacation last week, so there is no weekly tech thought. I'm getting better at doing no work during my vacations, which is a new mentality for me. During my PhD, people liked to say that every day is a vacation day and no day is a vacation day -- it all depends on how long you want to stay in the program.
WEEKLY TWEET
Definitely happened a bunch in my previous software development life.

WEEKLY FRANK THOUGHT
Over the last couple of weeks, I've been discussing my frustration with how data is used, and how data != facts because of inherent biases. My main point is that providing the right context around data is important because data alone can be misleading. That's why data science is, well... a science. As with all science, there needs to be rigor and context. In academia, I would never have graduated if I had just put a graph in my paper without any additional information, e.g., an introduction, methodology, explanation of results, disclosures, etc.
So far, I've talked about two types of biases: historical bias and representation bias. Again, much of my thinking and subsequent content is based on my reading of this paper by Harini Suresh at MIT.
This week, I'm going to talk about measurement bias. First, it's important to acknowledge that gathering good data is hard; I'm not denying that by any means. We often use data as a proxy for a set of labels or features. For example, we might use arrest data as a proxy for crime rates. The issue that arises is most commonly known as information bias, or more precisely, differential measurement error. In short, this bias happens because proxies can be generated differently across groups (there's a small illustration after the list below). This can happen in a few ways.
1. The granularity of data varies across groups.
2. The quality of data varies across groups.
3. The defined classification task is an oversimplification. When you do supervised ML, you need to choose a label to predict. However, that label might not be representative of the task. For example, you might want to predict whether a student is successful, but the label you use is his/her GPA, which captures only one narrow slice of success.
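Here's a minimal sketch of the "proxies can be generated differently across groups" problem. The setup and numbers are hypothetical, just for illustration: the true outcome has the same base rate in both groups, but the proxy label (think "arrested" standing in for "committed a crime") gets recorded at different rates, so anything trained on the proxy sees a difference that isn't really there.

# Toy simulation of differential measurement error (hypothetical numbers).
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Group membership: 0 or 1, equally likely.
group = rng.integers(0, 2, size=n)

# True (unobserved) outcome: same 10% base rate in both groups.
true_outcome = rng.random(n) < 0.10

# Proxy label: the true outcome is only recorded some of the time,
# and the recording rate differs by group -- 30% vs. 60% here.
observe_rate = np.where(group == 0, 0.30, 0.60)
proxy_label = true_outcome & (rng.random(n) < observe_rate)

for g in (0, 1):
    mask = group == g
    print(f"group {g}: true rate = {true_outcome[mask].mean():.3f}, "
          f"proxy rate = {proxy_label[mask].mean():.3f}")

# Even though the true rates are identical, a model trained on the proxy
# label will "learn" that group 1 has roughly twice the outcome rate.

This maps to points 1 and 2 above: the proxy is measured with different granularity and quality across groups, so any downstream model inherits that gap even when the underlying behavior is identical.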
Next week, I will talk about the remaining two biases: aggregation and evaluation bias.
FUN NEWS & LINKS
#securityvclogic
“Let’s pivot one of our underperforming companies into security. This way, they can get higher multiples in the next funding round even though their revenue is low.”
#research
Antenna Theory
#tech, #security, #vclife
What you missed in cybersecurity this week.
How to land an associate job by Alex Taussig.
Should that be a microservice?