Surveys, anonymity and data sharing

Every now and then, maybe a few times per month, some institute or company calls: ”Hello, do you want to participate in a survey on *insert subject*? It’s anonymous and will be used to make statistics”. I’m somewhat annoyed by that promise of anonymity, since the surveys often concern sensitive things like political opinions, and I know that anonymity is a brittle thing.

This post is about how sharing personal information – voluntarily or unknowingly – through social media, apps, gadgets, and other means, may jeopardize that anonymity.

This is a thought experiment: Imagine that you are very cautious about who you tell how much you weigh. However, you also want to do something about your weight, so you have bought an Internet-connected scale, which uploads your weight to an account on a ”cloud” (someone else’s server) and presents it to you as a nice graph, as well as digits down to the hundredths of a gram. Not that the scale itself is that precise, but it looks good… Your account is only registered by nickname and password, not even your true email. After using said scale in the morning as usual, you go to your home-gym and turn Spotify on (for which you had to identify yourself to register) for some training music, but after a dozen song lengths of training, the telephone rings: ”Hello, do you want to participate in a survey on the link between music preferences and weight? It’s anonymous and will be used to make a report with statistics”.

Well, you assume that the statistics need some extremes too, and since it’s anonymous there’s nothing to worry about, right? ”We’ll need your name – don’t worry it won’t be linked to anything else, it’s just for reference – your weight, and the last ten songs you listened to.” You tell them your name, the last ten songs in your playlist, and the weight from this morning – every digit of it. Somewhat perplexed about the precision, but not thinking much more about it, the survey operator enters the information into the computer before continuing to call others.

Scenario one: The data is ordered by name, alphabetically. The person anonymizing the data replaces the names with ”subject 1”, ”subject 2”, and so on. They put the full list of names in the appendix of the report, not connected to any weights or songs.

Result: Anyone who knows – or guesses – this can de-anonymize you, your music and your weight! They do it by alphabetizing the name list, ordering the rest by subject number, and lining the two up. This is only possible because of the clumsiness of the employee responsible for the anonymization process, and I seriously hope that there aren’t any real-world examples quite this blatant. You couldn’t have done anything here (except declining to participate, lying about your name or modifying the data), but in the next scenario, it starts getting interesting.
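The line-up in scenario one can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names, weights, songs and the helper `relink` are all invented, and a real report would of course contain many more rows.

```python
# Hypothetical survey answers; every name and value is invented for illustration.
responses = [
    {"name": "Carol", "weight_kg": 81.20, "songs": ["Song C"]},
    {"name": "Alice", "weight_kg": 64.57, "songs": ["Song A"]},
    {"name": "Bob", "weight_kg": 92.03, "songs": ["Song B"]},
]

# What the careless employee does: sort by name, then number the rows.
ordered = sorted(responses, key=lambda r: r["name"])
published_rows = [
    {"subject": i, "weight_kg": r["weight_kg"], "songs": r["songs"]}
    for i, r in enumerate(ordered, start=1)
]

# The appendix lists the names separately (here in arbitrary order).
appendix_names = ["Bob", "Carol", "Alice"]


def relink(rows, names):
    """Pair the alphabetized names with the rows ordered by subject number."""
    by_subject = sorted(rows, key=lambda row: row["subject"])
    return {name: row for name, row in zip(sorted(names), by_subject)}


relinked = relink(published_rows, appendix_names)
print(relinked["Alice"]["weight_kg"])  # the weight Alice reported
```

Nothing here needs access to the original data: the published rows and the appendix are enough.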

 

Scenario two: No names in the appendix. The report states what day the survey was made.

Result: The scale company can find out what the person behind your nickname listens to, and Spotify can narrow the weight down to a few people, maybe identifying you as the person it belongs to. Assuming that no one else had exactly the same weight that day – and a coincidence is unlikely given the ”precision” – the scale company can link your nickname to a line in the report by comparing its own data with the data in the report, and then read the music on that line.

Spotify can see who listened to the ten songs on that line in that order. Depending on how many people listened to the same songs in that order that day, they get a short list of possible people, maybe just you, who could be behind the line with that weight. The likelihood of a unique match is pretty high – unless I’m mistaken, the number of orders someone can listen to ten songs in is 10*9*8*7*6*5*4*3*2*1 = 3 628 800 – over three and a half million! And that’s just the order of the same ten songs…
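As a rough sketch of the two matches in scenario two, assuming invented data throughout: `report` stands for the published lines, `scale_accounts` for what the scale company holds, and `listening_logs` for what the music service holds. The function names are hypothetical too, and only three songs per line are used to keep the example short.

```python
import math

# Hypothetical published report lines (no names); values are invented.
report = [
    {"subject": 1, "weight_g": 64570.31, "songs": ("A", "B", "C")},
    {"subject": 2, "weight_g": 81200.87, "songs": ("C", "A", "B")},
]

# Hypothetical data held by the scale company for the survey day.
scale_accounts = {"nick_fox": 64570.31, "nick_owl": 70012.44}

# Hypothetical listening logs held by the music service for that day.
listening_logs = {
    "Alice Example": ("A", "B", "C"),
    "Dan Example": ("B", "A", "C"),
}


def link_by_weight(report, accounts):
    """Map nickname -> report line when its exact weight occurs exactly once."""
    links = {}
    for nick, weight in accounts.items():
        matches = [row for row in report if row["weight_g"] == weight]
        if len(matches) == 1:
            links[nick] = matches[0]
    return links


def link_by_song_order(report, logs):
    """Map subject number -> users who played the same songs in the same order."""
    return {
        row["subject"]: [user for user, songs in logs.items() if songs == row["songs"]]
        for row in report
    }


weight_links = link_by_weight(report, scale_accounts)
song_links = link_by_song_order(report, listening_logs)

# Number of possible orders for ten distinct songs: 10! = 3 628 800.
orderings = math.factorial(10)
```

Each company only uses data it already has plus the public report. Neither needs to cooperate with the other.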

 

Let’s take another one.

Scenario three: The lines of data are randomized, then assigned subject numbers. Names are in the appendix as in scenario one, although obviously not in the same order as the weight-song list. The report states what day the survey was made, but the order of the songs is randomized by the anonymizing employee, resulting in Spotify having a total of, say, twenty people who listened to that music that day. Weight is rounded to the nearest kilogram.

 

Result: The scale company doesn’t know what your nickname listened to, as your weight data in the report is no longer ”precise” enough to identify it as the same as the data they have, but: If none of the other nineteen who listened to the same music that day participated in the survey, Spotify can find out your weight! They do so by looking for everyone who listened to those songs and is also represented in the name list in the appendix. There is exactly one – you!
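The step in scenario three is a plain set intersection. A minimal sketch, assuming made-up listener and participant names:

```python
# Hypothetical: the twenty users who played those songs (in any order) that day.
listeners = {f"Listener {i}" for i in range(1, 20)} | {"You"}

# Hypothetical: the names printed in the report's appendix.
appendix_names = {"You", "Participant A", "Participant B"}

# Anyone in both sets is a candidate for the line with that music.
candidates = listeners & appendix_names
print(candidates)
```

With more than one name in the intersection the line would not be pinned to a single person, but the list of candidates would still be short.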

Disclaimers:

  • I do not accuse Spotify of doing or not doing anything with the information that they can collect when you use it. I use it as an example because it is a suitable example for this thought experiment, and because many people are familiar with it.
  • If you are about to make this exact survey, or for any other reason want to use these examples as a basis for anonymizing something, do not for one second believe that this blog post covers every possible way of abuse, or that the possible consequences listed are the only ones possible. I am only human, and I’m not professionally educated in this field – a field that is, quite frankly, complicated. Also, I have purposefully left out some things – for example the possibility of several people having the same name (which would however increase the chance of anonymity, but that’s a different story), longer link chains, and linking less-than-100%-but-still-close probabilities (compare with circumstantial evidence) – for the sake of keeping things not too unfocused and complicated.
  • I am not responsible for your anonymity – if you use this to try to stay anonymous but fail and suffer consequences, that’s not my fault. For reasons already stated, it is much more likely than not that I have missed something…

End of disclaimers

What we see in the scenarios is the result of linkage, on which Rick Falkvinge wrote a good post a while ago. (When I searched for that post to link to it here, I saw that I had commented there a long time ago, about a real-world study, linking music to something else…)

Back to the results of the thought experiment: As I said in the beginning, I’m somewhat annoyed by the promise of anonymity because of the complexity of this. There is someone responsible for anonymizing the results, but can you be sure that they take enough measures to do so?

He/she is only human – hopefully one who is far more educated than me in this field and has some good computerized tools at hand, yet probably still someone under pressure to work quickly. And they are up against everyone who has an interest in de-anonymizing you, for as long as the information is available.

As you see, the type of information, the precision, and – what I want to point out – where else it is shared together with what other information, are factors that affect the anonymity.

In this experiment, only companies could de-anonymize you, except in scenario one (remember the disclaimers though). However, when you share information openly, for example on Facebook, that possibility opens up for everyone.

As you can see in the explanations of the results, anonymity isn’t as straightforward as the survey people make it sound when they call, but there’s no magic involved in the de-anonymization. Only logic. Maybe a bit of math too, at least there can be. And you get no points for guessing what device is good, sorry, great, at very quickly computing (hint) logic and math… As computers and the programs for them get more advanced, being able to parse text and pictures and so on, these problems will become more and more apparent.

But the main point that I wanted to shed light on in this post is that the fewer pieces of information floating around that are linked to your identity, or linked to something else that is in turn linked to your identity, and so on, the fewer possibilities of abuse there are. Thereby the risk of one being found and abused is reduced.

This is something that everyone who may sometime in the future need to publish something anonymously – that is, everyone – should think about. Needs to think about. The more more-or-less-useless information you share on Facebook and similar places now, the harder it will be the day you need to publish important information anonymously, and the more of what you want to publish you may have to leave out. Of course, it all depends on what information it is about, both now and then. Also worth noting is that the ‘now’ and the ‘then’ can be reversed without any difference in result – if you have already published something anonymously, and later publish something under your name that is linkable to the first, the first is no longer anonymous.

What can you do?

As a survey operator:

  • Hire competent staff for this important task! Obviously.
  • Don’t include unnecessary information!
  • Don’t be unnecessarily precise!
  • Don’t pretend like you can’t fail unless you know that! Is it even possible to know? Unless it is possible AND you do know, don’t promise ”it’s anonymous”. Instead say something like ”we take careful measures to provide as much anonymity as we can”.
  • Don’t think that ”the risk of that is small”, ”that doesn’t happen to us”, etc! Data is worth money, and – as noted previously – computers are good at this, and getting better.
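To get a feel for the point about precision, here is a small sketch that counts how many reported weights stay unique at different levels of rounding. The weights and the helper `unique_fraction` are invented for illustration. It is not a proper anonymity measure, just a quick check.

```python
from collections import Counter

# Hypothetical reported weights in kilograms; values are invented.
weights_kg = [64.57, 64.12, 81.20, 80.95, 92.03, 63.88]


def unique_fraction(values, ndigits):
    """Return the share of values that are still unique after rounding."""
    rounded = [round(v, ndigits) for v in values]
    counts = Counter(rounded)
    singles = sum(1 for v in rounded if counts[v] == 1)
    return singles / len(rounded)


# With two decimals every weight is unique; rounded to whole kilograms,
# several of them collide and can no longer be told apart.
print(unique_fraction(weights_kg, 2))
print(unique_fraction(weights_kg, 0))
```

A value that is still unique after rounding remains linkable, so rounding helps only to the extent that it actually produces collisions in the data at hand.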

As anyone:

  • Think! Keeping every possibility of unwanted de-anonymization out is impossible for most if not all people, but if you are aware of these possibilities, you can keep many out. And, of course, the fewer possibilities there are, the less likely it is that someone or something will find one to make use of.
  • Read! Read the EULAs and Privacy Policies before using a service. If you don’t understand them, search (preferably with DuckDuckGo instead of Google) and/or ask a friend. Ideally, these documents tell you if the company collects data, in that case what data, how it is stored, with whom it may be shared, and for what purposes it is used – exactly what you need to know to be able to think about how it could be abused. Unfortunately they are often missing some information, and you can’t be sure that the company honors them, especially when it comes to American companies, but you are more likely to know more of what is happening with your data if you do read them than if you don’t.
  • Select! Is there a service that doesn’t require you to identify yourself? Choose that before one that does. Is there a service that uses end-to-end encryption? Choose that before one that uses non-end-to-end encryption, or even worse, no encryption at all. Is it a paid service with different payment options? Choose the one with the most anonymity such as cash in the mail or properly anonymized Bitcoin.
  • Nice, cloud-free weather: Do not unnecessarily use services or products that save and/or load things from the Internet every time you need them! If you had used an mp3 player (physical unit or program, doesn’t matter as long as it only operates locally without ”telemetry” and such euphemisms) instead of Spotify in the example, scenarios two and three would have been safe for you. At least I think so, remember the disclaimers. They would clearly have been safer anyway.
  • Refuse! This one may be hard, but has a very good effect: Refrain from using services that abuse data. Quit Facebook, Twitter, Google+ etc. This gives the best effect if you never used them to begin with, but a lot of data is only useful for de-anonymization for a certain amount of time, such as location, projects, weight, and maybe music preference too. This is also true for much technical data that may be used as identifiers, such as browser version, resolution (new monitor or graphics card), IMEI (new phone), etc. Be sure to tell your friends why you quit. Not only to make clear that it’s not because of them, but also to say that you do not accept whatever terms and conditions are thrown at you, and maybe get someone to do the same!
  • Consider using analog means! This is only viable in a few cases, for example sending text or pictures to a small number of people – in which the postal service can be used. There are other abuse possibilities though, such as fingerprints and DNA, unique properties of handwriting/printer/camera etc, but physical letters don’t just cost money to send, they also cost a lot of money to analyze for such things, so it’s not done routinely, as on the Internet. It’s also usually illegal, unless there is suspicion of severe crime. This varies from country to country though. Using post instead of Internet, under normal circumstances, protects both the contents and your identity. Do you listen to radio? Do so using an FM receiver instead of web radio.
    Do you read a newspaper that exists both online and as a physical paper? Use the paper. (If you can download the entire digital paper and read it as a file it’s OK though, unless it requires some special reader that sends data).
    While the telephone is most likely under surveillance for both content (what you say) and metadata (who you call and when), it is so ”only” by the government, while many online services are so both by the government and one or more companies that collect and sell data. The telephone is not good, but depending on situation and likely adversary, it may be less bad. But post is better, and if you need voice communication there are at least some end-to-end encrypted digital services, which are probably better in most cases, but I haven’t looked much at those, so I’m not the right person to compare or recommend any particular one of them.
    Do you make notes for yourself and/or others you meet often, using an app? Is that app’s storage cloud-based? Pen and paper, when the note is stored on your person or other safe place, is virtually foolproof, as long as no one and nothing can see it while you write and/or handle it.
    Worth mentioning again are payments, also in the physical world – Don’t use cards or apps, use cash!
  • Never give up! If you think that ”they already know everything, it’s no use”, you are wrong. Taking out even one single possibility of abuse may be exactly what is needed in your particular case – that one possibility may be the one that will be abused, with severe consequences for you. You never know beforehand, and usually not afterwards either, which one(s) were used. It’s just like thinking about safety: ”It’s very unlikely that I would drop this heavy brick so that it falls off the scaffolding I’m walking on at the exact time that someone passes underneath so that it hits them, but it could happen, so I’ll walk a bit further away from the edge, massively reducing the risk of it falling”.

 

If you believe I’m just being paranoid or at least overly cautious, you are wrong. This is a very real problem – there are many actors who are gathering data via the Internet and other services in order to try to make whole pictures of who is who, who is communicating with whom, who has what interests, and so on. There are companies that do nothing else, for example Acxiom, BlueKai and BlueCava, and there are companies that provide a service for free but are funded by selling personal information, such as Facebook. Advertising is usually the main purpose, but not the only one. The data can end up anywhere, and be used to de-anonymize you again another time, more easily, by the same company or someone else. Whether it is a company, a criminal, or law enforcement, de-anonymization when you want to be anonymous is never good for you.

The more types of data, the more complex the type of data, the more precise the data, and the fewer other people with the same data, the higher the risk of de-anonymization is.

The more open the sharing between entities with different data is, the higher the risk is. It is highest, of course, when the data is open to everyone. (This is one reason why it’s bad that Google and Facebook are buying other companies.)

This isn’t something you should think about because companies are making money on you, because I say so, or because you have done something illegal – whether you have or not. You should think about it for your own privacy, freedom of speech, and future safety.
