Personal data collection: what it is and where it starts

First published: 2021-04-11

Last Edited: 2024-10-15

Number of edits: 14

I have been studying the subject for years. I have implemented my own data analysis algorithms and an analytics service.

It is easy to believe that if personal data is only used to show us advertisements, it is not that bad. However, advertisements are just the visible aspect of the massive accumulation of consumer data. Hiring decisions, credit scores, and surveillance can be based on the same personal data. The only difference between them is how visible they are in our daily activities.

One of the biggest challenges to creating awareness is that data analysis and statistics are poorly covered in general education. Massive data accumulators like Facebook, Google, or Amazon hide their operations behind curtains of linguistic complexity. Today, programmers have an aura of being smart people just because digital literacy is lagging behind. The reality is that some problems are complex. Still, many of them are not, and personal data can be exploited by anyone who shows a bit of creativity.

The discussion is complex and probably worth more than a few paragraphs. Therefore, this is the first essay of a series that will focus on data, privacy, the status quo, and options looking forward.

In this piece, I want to lay out the foundations of what I consider personal data and who benefits from accumulating it. I will briefly reflect on the responsibility that individual and corporate programmers have in the decisions they make, both knowingly and unknowingly.

What is personal data

I would argue that personal data is any piece of information linked to a person, even if indirectly. For example, we can record the license plates of the cars that pass through a particular spot in the city. License plates by themselves are not personal information; they belong to a car. However, cars are registered to people. Assuming that the car owner was driving will be a correct guess in most cases.

The examples in the online world are very similar. When we surf the internet, we get assigned a unique number called an IP address. The IP address is public in every online interaction, pretty much like a license plate on a car. Every website we visit, and every ping a program sends to check whether its license is valid, will be associated with an IP address. Assuming that one IP belongs to one person will be correct in many cases. IP addresses are personal data, and this is also established by the GDPR in Europe.

But there is more. Cell phones get unique advertisement identifiers so that companies selling ads know we are the same person using different apps. I think no one would argue against the idea that there is a very intimate link between a phone and a person. Therefore, these ad profiles are as personal as the phone number itself. The biggest problem is that we don't get to see them, change them, or bring them to a different company.

Metadata is invisible to the eye

Metadata is personal as well

Relatively recently, companies started arguing that they neither store nor process personal data, only metadata. In the car example, this would mean that instead of keeping the license plate, they record the brand, color, and model year of the vehicle, plus the time and location where it was seen. The metadata seems impersonal in itself. However, as I tried to argue before, all information generated by a person must be considered personal.

There is a fascinating paper (de Montjoye et al., 2015, "Unique in the shopping mall: On the reidentifiability of credit card metadata") showing that if you accumulate credit card metadata, roughly 4 purchases are enough to single out the vast majority of people. This is, of course, without knowing the credit card number. The paper shows how quickly and easily metadata can be linked to people.

The same approach works for the metadata of communications, e-mails, and browsing patterns. Data that is generated by a human action should, in the end, always be treated as personal data. There is always the possibility of re-identifying or de-anonymizing it, even with methods we don't have today.
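To make the re-identification idea more concrete, here is a minimal sketch in Python. It is not the method used in the paper; it only illustrates the principle on a made-up data set. The card names, shops, and days are all invented, and the helper names `matching_cards` and `fraction_unique` are hypothetical.

```python
from itertools import combinations

# Made-up purchase metadata: only (shop, day) pairs, no card numbers or names.
purchases = {
    "card_a": {("bakery", 1), ("cinema", 2), ("pharmacy", 3), ("bar", 4)},
    "card_b": {("bakery", 1), ("cinema", 2), ("gym", 3), ("bar", 4)},
    "card_c": {("bakery", 1), ("bookshop", 2), ("pharmacy", 3), ("bar", 5)},
}


def matching_cards(known_points, records):
    """Return the cards whose history contains every known (shop, day) point."""
    return sorted(
        card for card, points in records.items() if known_points <= points
    )


def fraction_unique(records, k):
    """Fraction of cards that at least one set of k of their own points singles out."""
    unique = 0
    for card, points in records.items():
        for combo in combinations(sorted(points), k):
            if matching_cards(set(combo), records) == [card]:
                unique += 1
                break
    return unique / len(records)


# Knowing one purchase is not enough, a second one narrows it down,
# and a third one leaves a single candidate in this toy data set.
one_point = matching_cards({("bakery", 1)}, purchases)
two_points = matching_cards({("bakery", 1), ("pharmacy", 3)}, purchases)
three_points = matching_cards(
    {("bakery", 1), ("cinema", 2), ("pharmacy", 3)}, purchases
)
```

Each extra point someone knows about us (a receipt, a tagged photo, a check-in) shrinks the list of candidates, even though the records contain no name and no card number.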

In the car example, it may seem that re-identifying everyone will be more challenging. Still, it will work for those who drive rare vehicles or have a very systematic time schedule. Bringing it to the online world, web browsers have many different properties, such as version, operating system, installed fonts, screen resolution, and time zone. Each combination is almost unique in itself.
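As a rough illustration of how such a combination becomes an identifier, the sketch below hashes a set of browser properties into a short fingerprint. The property names and values are invented for the example, and the `fingerprint` function is hypothetical; real fingerprinting scripts collect many more signals.

```python
import hashlib
import json


def fingerprint(properties):
    """Hash a dictionary of browser properties into a short, stable identifier."""
    # Sorting the keys makes the result independent of the order of collection.
    canonical = json.dumps(properties, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


# Made-up properties of two visitors who differ only in their time zone.
visitor_one = {
    "browser": "ExampleBrowser 88",
    "os": "Linux",
    "fonts": ["DejaVu Sans", "Fira Code"],
    "screen": "2560x1440",
    "timezone": "Europe/Amsterdam",
}
visitor_two = dict(visitor_one, timezone="Europe/Lisbon")

id_one = fingerprint(visitor_one)
id_two = fingerprint(visitor_two)
```

No cookie and no login are needed. As long as the properties stay the same, the same identifier comes back on every visit, and a single differing property is enough to tell two visitors apart.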

The value of data

If we decide to collect data, the free-market understanding of the world tells us that there should be some value in it. So let's explore different cases where personal data delivers value. If we build a website, a blog, or a web application, the first question that will come to mind is whether someone is visiting us or not. We will have the urge to know whether we are popular and whether people find us through Facebook or Google.

What starts as an urge for an amateur becomes a need for a professional. Understanding where people find our content helps plan marketing campaigns. Knowing which content attracts more visitors can help drive sales if we double down on it. Then, the questions we have may start increasing in complexity. Is it the same person visiting us over and over? Where are they coming from? What is their age group?

If personal data already has this much value for someone running a website, we can imagine how much more value it can have in other areas. Knowing whether someone will go bankrupt in the coming 5 years is of great value if you are a financial institution. You could lower the risk of your investments if you knew more about your customers.

And these reflections can get even darker. What would happen if you had access to the medical records of the people applying for a job? It would be easy to disqualify anyone with a potential disease that would generate time off. It would create a significant bias against women who are considering becoming mothers. The fact that it is valuable for someone does not mean that it must be accessible to everyone.

Picking the example of pregnancy is not naive¹. I will pursue this line of reasoning even further in another piece. For the time being, it is essential to point out that pregnancy can be predicted with access to purchase records alone, without any medical records. This links directly to the reidentifiability of metadata discussed earlier.

Value of data collection

Where data collection starts

For the sake of argument, I'll limit myself to discussing the online world and services. Data collection, however, can happen in many different instances.

Following the same train of thought as earlier, if we have a website and want to learn something about our visitors, we will probably default to adding Google Analytics. It is a simple procedure that integrates with many website providers. Every time someone visits us, there'll be a ping to Google's servers as well. Google will collect every guest's information and, in exchange, give us the information we wanted to know.

We are not the only ones generating value. Google can know who visits virtually every website in the world. Google can extract plenty of value from this information, from consumer insights to advertisement placement, to competition analysis. The data collection that started on a personal website very quickly reached the hands of a corporation that'll do its best to extract as much value from it as possible.
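The following sketch illustrates why a provider embedded in many websites ends up with more than any single website owner. It assumes a hypothetical provider that receives one ping per page view, and it uses the IP address plus the browser's user agent as a naive visitor key. All sites, addresses, and field names are made up, and this is not a description of how any real analytics product works internally.

```python
from collections import defaultdict

# Made-up pings, as a hypothetical analytics provider might receive them
# from three unrelated websites that all embed the same script.
pings = [
    {"ip": "203.0.113.7", "user_agent": "Browser/1.0", "site": "blog.example"},
    {"ip": "203.0.113.7", "user_agent": "Browser/1.0", "site": "shop.example"},
    {"ip": "198.51.100.4", "user_agent": "Browser/2.3", "site": "blog.example"},
    {"ip": "203.0.113.7", "user_agent": "Browser/1.0", "site": "clinic.example"},
]


def browsing_histories(received_pings):
    """Group the visited sites by visitor, in the order the pings arrived."""
    histories = defaultdict(list)
    for ping in received_pings:
        visitor = (ping["ip"], ping["user_agent"])
        histories[visitor].append(ping["site"])
    return dict(histories)


histories = browsing_histories(pings)
```

Each website owner only sees their own visitors. The party receiving all the pings can line them up into a history that spans a blog, a shop, and a clinic.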

Google is, however, only one of the actors involved in the data collection scheme. Facebook was initially limited to knowing what users were doing within its platform. To overcome that limitation, Facebook created the like button and, later, the pixel that gives website owners the possibility of knowing how links and advertisements perform. In return, it gives the platform access to information coming from outside its boundaries.

Google and Facebook may be the most prominent examples of data accumulators, but they are far from the only ones. I published a tweet complaining that a weather app I used was sharing my data with around 30 companies, including Google and Facebook:

Something as simple as a weather app (@Buienalarm) shares my personal information with about 30 other companies. Seriously? What's the limit? pic.twitter.com/cVuB1OXFgN
— Aquiles Carattino (@aquicarattino) March 7, 2021

All the other companies involved measure behavior and advertisement placement. It is also very likely that these companies resell data to aggregators, organizations that collect data from very different sources and do their best to glue them together. For example, they know I entered the license plate of my car in a parking app, and they know that the same license plate was seen at a given intersection two days ago. Now they have linked, without any doubt, the license plate to my phone, and to me.
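The gluing step needs very little sophistication. The sketch below joins two made-up data sets on the one field they share, the license plate. The records, field names, and the `link_sightings` helper are all hypothetical and only meant to show how simple the operation is.

```python
# Made-up records from two unrelated sources.
parking_app = [
    {"phone_id": "phone-123", "plate": "AB-123-C"},
    {"phone_id": "phone-456", "plate": "XY-987-Z"},
]
camera_sightings = [
    {"plate": "AB-123-C", "place": "intersection-9", "day": "2021-03-05"},
    {"plate": "QQ-000-Q", "place": "intersection-9", "day": "2021-03-05"},
]


def link_sightings(app_rows, sightings):
    """Attach a phone identifier to every sighting whose plate the app knows."""
    plate_to_phone = {row["plate"]: row["phone_id"] for row in app_rows}
    return [
        {**sighting, "phone_id": plate_to_phone[sighting["plate"]]}
        for sighting in sightings
        if sighting["plate"] in plate_to_phone
    ]


linked = link_sightings(parking_app, camera_sightings)
```

Neither source looked especially sensitive on its own. After the join, a phone identifier has a place and a date attached to it.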

Data collection starts at individual decisions

Although Facebook, Google, and others will always crave more user data, it is time to acknowledge that they are enabled by individuals. Webmasters and app developers have internalized the idea that more data is always better. What they don't realize is that it comes at a collective cost.

Setting up my own analytics service for my websites took me around a week, and I am not the most proficient web developer. Bloggers probably don't have the same access to the know-how, and it is understandable. Still, other web and app developers should start reflecting on the decisions they make.

¹ https://www.forbes.com/sites/kashmirhill/2012/02/16/how-target-figured-out-a-teen-girl-was-pregnant-before-her-father-did/


Aquiles Carattino

This note you are reading is part of my digital garden. Follow the links to learn more, and remember that these notes evolve over time. After all, this website is not a blog.