Data brokers are selling flight information to CBP and ICE

jandrewrogers

People don't grasp how easy it is to build data models like this even without privileged first-party data access.

In 2012 I created a killer prototype that demonstrated that you could accurately reconstruct most people's flight history at scale from social media and/or ad data. Probably the first of its kind. This has been possible for a long time.

A quick sketch of how it worked:

We filtered out all spatiotemporal edges in the entity graph with an implied speed of <300 kilometers per hour or <200 kilometers distance, IIRC. This was the proxy for "was on a plane". It also implicitly provided the origin and destination.
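
A minimal sketch of that kind of speed-and-distance filter, not the original prototype. The `Ping` record, the `flight_edges` name, and the use of haversine distance on a spherical Earth are all assumptions made for illustration; only the 300 km/h and 200 km thresholds come from the description above.

```python
from dataclasses import dataclass
from math import asin, cos, radians, sin, sqrt


@dataclass(frozen=True)
class Ping:
    entity: str   # hypothetical entity id
    ts: float     # Unix seconds
    lat: float
    lon: float


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km on a spherical Earth (radius 6371 km)."""
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))


def flight_edges(pings, min_kmh=300.0, min_km=200.0):
    """Keep consecutive sightings of one entity that imply air travel."""
    by_entity = {}
    for p in sorted(pings, key=lambda p: (p.entity, p.ts)):
        by_entity.setdefault(p.entity, []).append(p)
    edges = []
    for seq in by_entity.values():
        for a, b in zip(seq, seq[1:]):
            hours = (b.ts - a.ts) / 3600.0
            if hours <= 0:
                continue  # simultaneous sightings carry no speed information
            km = haversine_km(a.lat, a.lon, b.lat, b.lon)
            # Drop anything slower than min_kmh or shorter than min_km.
            if km >= min_km and km / hours >= min_kmh:
                edges.append((a, b, km))
    return edges


pings = [
    Ping("a", 0, 37.62, -122.38),         # near SFO
    Ping("a", 6 * 3600, 40.64, -73.78),   # near JFK, six hours later
    Ping("b", 0, 37.62, -122.38),
    Ping("b", 3600, 37.77, -122.42),      # short local trip, filtered out
]
edges = flight_edges(pings)
```

Each surviving edge carries its two endpoints, which is where the implied origin and destination come from.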

These edges can be correlated with both public flight data and maintenance IoT data from jet engines to put entities on a specific flight. People overlook the extent to which innocuous industrial IoT data can be used as a proxy for relationships in unrelated domains.
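
A rough sketch of the public-schedule half of that correlation, assuming each edge has already been mapped to airport codes. The `Flight` tuple, the `candidate_flights` function, and the `slack` parameter are hypothetical names; the engine IoT correlation is not modeled here.

```python
from typing import NamedTuple


class Flight(NamedTuple):
    number: str
    origin: str
    dest: str
    dep: float  # departure, Unix seconds
    arr: float  # arrival, Unix seconds


def candidate_flights(origin, dest, last_seen_origin, first_seen_dest, schedule, slack=0.0):
    """Flights on the route that fit between the two sightings.

    A flight is plausible if it departs no earlier than the last sighting at
    the origin and arrives no later than the first sighting at the
    destination, with `slack` seconds of tolerance on both ends.
    """
    return [
        f for f in schedule
        if f.origin == origin
        and f.dest == dest
        and f.dep >= last_seen_origin - slack
        and f.arr <= first_seen_dest + slack
    ]


schedule = [
    Flight("XX100", "SFO", "JFK", dep=1_000, arr=20_000),
    Flight("XX200", "SFO", "JFK", dep=30_000, arr=50_000),  # leaves too late
    Flight("YY300", "SFO", "ORD", dep=1_000, arr=15_000),   # wrong route
]
matches = candidate_flights("SFO", "JFK", last_seen_origin=500,
                            first_seen_dest=22_000, schedule=schedule)
```

The tighter the gap between the two sightings, the fewer flights survive this window test.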

In rare cases, there was more than one plausible commercial flight. Because we had their flight history, we assumed in these cases that it was the primary airline they had used in the past, either generally or for that specific origin and destination. This almost always resolved perfectly.
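
One way that tie-break could look, as a sketch only. The `pick_flight` name and the data shapes are hypothetical, and preferring route-specific history over general history is an assumption; the description above only says either was used.

```python
from collections import Counter


def pick_flight(candidates, route, history):
    """Choose among plausible flights using the entity's past airlines.

    candidates: list of (airline, flight_number) tuples
    route:      (origin, destination) tuple for the current edge
    history:    list of (airline, (origin, destination)) from resolved flights
    """
    if not candidates:
        return None
    if len(candidates) == 1:
        return candidates[0]
    on_route = Counter(airline for airline, r in history if r == route)
    overall = Counter(airline for airline, _ in history)
    # Rank by same-route history first, then by overall history.
    # Ties fall back to the first candidate in the list.
    return max(candidates, key=lambda c: (on_route[c[0]], overall[c[0]]))


history = [
    ("UA", ("SFO", "JFK")),
    ("UA", ("SFO", "ORD")),
    ("UA", ("LAX", "ORD")),
    ("DL", ("SFO", "JFK")),
    ("DL", ("SFO", "JFK")),
]
candidates = [("UA", "UA100"), ("DL", "DL200")]
same_route = pick_flight(candidates, ("SFO", "JFK"), history)
new_route = pick_flight(candidates, ("SFO", "SEA"), history)
```

Here `same_route` resolves to the airline flown most often on that exact route, while `new_route` falls back to the airline flown most often overall.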

This was impressively effective and it didn't require first-party data from airlines or particularly sophisticated analytics. Space and time are the primary keys of reality.

justanything

Can you eli5 the implementation and how your prototype worked?

gleenn

Sounds like if you have a record of a lot of location/timestamp data for people, you look at the distance difference divided by the time difference. Now you have average speed for any pair of points. Now filter where the average speed is as fast as a Boeing jet. That filters out most of the data except for people who are almost certainly on a plane. Et voila, you now look at those data points geolocation and you have people who traveled from one city to another because you already have the location. Compare City1 -> City2 with any public flights in those cities around those times and you know who flew on what flight from where to where and at what time.

wingspar

Honestly asking, How did you validate your results?

jandrewrogers

In this particular case it was just a proof-of-concept, albeit at scale. We did not run a proper ground-truthing process but people actually running that type of data model in production could have ground-truthed the analytic model if they wanted to.

However, it turns out that thousands of people like to talk about their flights on social media, so we scraped that as a spot check and it mostly lined up perfectly. Good enough for a demo and it would have been difficult to come up with an alternative explanation for the patterns in the data.

The purpose of the PoC was to sell the data analysis infrastructure that made that type of analysis possible at scale; it wasn't about the data per se. It was a compelling demo we invented given the data that happened to be available. Startup life.

jcranmer

> Good enough for a demo and it would have been difficult to come up with an alternative explanation for the patterns in the data.

For fun edge cases, there's always Antarctica, where you can travel from a US base (which looks like you're in the US) to a NZ base (which looks like you're in NZ) in a couple of minutes: https://brr.fyi/posts/credit-card-shenanigans

fsckboy

i don't have any special knowledge in this area, but just thinking about it idly while sitting here, "robbing their homes while they are away" comes to mind as a good proxy.

animal_spirits

Reminds me of this news story of footballer John Terry, whose house was robbed because he posted a picture of himself on holiday. The insurance company tried to use a 'reasonable care' clause of home insurance to deny his insurance claim.

- https://www.blakefire-security.co.uk/blog/social-media-and-j...

iterance

That seems like a risk, but not a validation method, unless you are feeling particularly bold.

leblancfg

The amount and extent of data that is available out there by brokers for purchase by literally any company is *mind-boggling*. However bad you think it is, multiply that by 10.

trollied

A colleague created a banner ad that was an image that had the text “told you I could do this mate!” and targeted an individual to prove a point.

The general public have no idea how much ad providers and data brokers know about them.

blindriver

Around 2014 I worked with recruiters and they had a tool that aggregated data on everyone through LinkedIn, Yelp, Twitter, GitHub, Eventbrite, etc. It was breathtaking how much information you could get on anyone, and that was over 10 years ago.

I’m guessing with the help of Palantir, the government has even more data and can probably link Reddit posts etc. based on stylometry and can even perform psychological analysis on your personality and tendencies, etc.

JohnMakin

I work in this space - I'd say 1000x.

OsrsNeedsf2P

Could you elaborate with specifics? If it's this bad, why haven't we heard anything from a whistleblower or seen a good demo?

JohnMakin

Because none of it is really unknown? People know about it and don't care. Hell, even people on this forum who should know better and care don't, or when they hear about stuff like this they think it's FB pixel or Google Analytics stuff. The simple fact is that with a few basic pieces of information on somebody, there's almost nothing that is sacred or not for sale. People mistakenly believe they're protected by adblockers and stuff, or by avoiding social media, but it is unavoidable while simply existing. The 1000x comment is because, from my POV, the scale of it is astounding and growing every year, and people really don't have a good understanding of the subtle and not-so-subtle ways it can affect them, or when told, don't care or dismiss it. So I don't really feel like explaining it anymore. If more people understood, I'd also stand to profit quite a bit from it, so that's where my frustrated tone is coming from.

roadside_picnic

I could give you some great horror stories, but honestly I don't see the benefit in either potentially harming former coworkers of mine who still work at those places or landing myself in some sort of career/legal trouble for something people generally don't care about (other than a few points on HN).

If you were caught demoing something both horrific and internal you would risk serious damage to your career, and it would ultimately have zero impact on the industry as there's just too much data out there and too much money wrapped up in it.

Plus, most people working with the data don't bother to look at it. The places where I've internally demo'd massive privacy risks were shocked because they didn't realize what their own data was capable of. Most people are just writing jobs that run and shuffle data around from one place to another, never really asking "what is this data?" Even among data scientists I'm routinely surprised (so maybe I shouldn't be surprised) how frequently data scientists never do any real error analysis by looking at what the model got wrong and trying to understand why.

rapind

We hear about it all the time but no one cares.

seplox

I guess you were just distracted by all of the other house-on-fire crap going on.

https://therecord.media/ftc-complaint-against-kochava-unseal...

Among the additional information Kochava collects and sells are non-anonymized individual home addresses, phone numbers, email addresses, gender, age, ethnicity, yearly income, “economic stability,” marital status, education level, political affiliation and “interests and behaviors,” compiling and selling dossiers on individuals marketed as offering a “360-degree perspective,” the FTC said.

...

According to the FTC, Kochava’s data can identify women who visit reproductive clinics by name and address along with, for example, when they visit particular buildings, their names, email and home addresses, number of children, race and app usage.

...

Kochava marketing materials tell customers it offers “rich geo data spanning billions of devices globally” and that its location data feed “delivers raw latitude/longitude data with volumes around 94B+ geo-transactions per month, 125 million monthly active users, and 35 million daily active users, on average observing more than 90 daily transactions per device.”

...

The complaint also alleges that the company has lax procedures for determining who it is selling data to, saying purchasers are allowed to use a generic personal email address, label an alleged company as “self” and explain they plan to use the data for “business.”

And then there's this: https://therecord.media/data-brokers-are-selling-military-se...

astura

Cuz it's not really unknown nor is it illegal.

Melatonic

Any way to combat it or stop your info from being overly harvested?

southernplaces7

I asked this same thing in another comment here, but since you mention working in this space, I ask you directly. Where do the brokers obtain their data from? If it's easy for them to obtain, would those who buy it from brokers not be able to simply get it from its respective sources? I'm genuinely curious about how this dynamic works.

jeffbee

I would say that in general the HN crowd doesn't understand the industry at all, and they need to change the direction of their understanding, rather than the magnitude. Your basic hackernews believes that e.g. Google is out there selling all your personal information. But compared to these other industries the tech industry is almost airtight. It has long been possible for someone to pick up the phone and order, in any format they want, transaction data as narrowly targeted as they wish. Credit card line items for 35-year-old dentists living on the 400 block of Elm street in local town? By end of day.

taeric

It has been truly frustrating when people will blame the "tech industry" for what is essentially reckless behavior from other industries. For a while, it was often the finance sector that did most of the crazy stuff. With crypto being an obnoxious overlap of the two.

ck_one

Is that actually possible? Can we do a live test here?

Let's say we want this dataset: Credit card line items for 35-year-old dentists living on the 400 block of Elm street in local town

How much do I have to pay you to get it?

dylan604

How much you got?

Never ask a sales person how much you have to pay when the prices are not already clearly stated. Tell them how much you are willing to spend to see if they will do it for that amount. Sales people will always shoot high hoping to not leave money on the table. The price might change depending on how much you squeal and how high they shot. Your initial "willing to spend" should also be lower than you're actually willing to spend, for the same but converse reason.

supriyo-biswas

This is correct; what people fundamentally misunderstand is that data brokers directly sell personal information about people, but Google and Facebook only allow for targeted advertising while keeping personal information within the confines of their company.

jeffbee

The meta-conspiracy-theory would be that the dossier industry whips up conspiracy theories about online advertisers in order to maintain their own low profile.

everdrive

I'm also surprised that this is so hidden from everyone. Where are the engineers leaking secrets? Much of the online discourse is pure speculation based on what can be observed from the very end of the chain (i.e., what your computer is giving up). The speculation is not necessarily _incorrect_ but is too vague to be useful to anyone. Where does my data _actually_ go? Does anyone know? Can anyone describe the life of my data as it goes through the whole ecosystem? Does anyone know which mitigations are, and are not, effective?

hinterlands

Because what's the headline you're going to get out of it?

If the headline is "Mark Zuckerberg is amassing your data and you know it's for evil", it's an easy sell. If it's "there's an ecosystem of little-known companies that sell transaction, location and lifestyle data to marketers, journalists, PIs, and police departments alike", it's not exactly the kind of a message that spurs people to action. And yeah, the newspaper that would be breaking the news is a customer too.

ujkhsjkdhf234

Despite being near universally hated externally, data brokering is a boring industry and is seen as very mundane and routine. They don't attract the type of engineers that have a strong moral stance and will go rogue and blow the whistle. They attract the middle age suburbanite just trying to get through the day and make a living.

Melatonic

Any way to opt out of this type of data collection per company? I know for some things you can contact each individual broker and opt out (via some identifier like your email address) of your data being at least publicly available.

sofixa

> Your basic hackernews believes that e.g. Google is out there selling all your personal information

To add to this, any mention of "telemetry" is taken to mean your PII being taken by bad actors to abuse, instead of what it is in 99% of cases, which is usage statistics. (X% of our users use feature A, it merits investment). It can be both, but there's usually no place for differentiation, just pitchforks.

mvieira38

The industry betrayed consumers' trust to the point where no project can be trusted to be mindful of data anymore. Even Proton Mail ended up ratting to the French, and that was just IP and session info, so who can we even trust to get "good telemetry"?

ctoth

> It can be both, but there's usually no place for differentiation

Fool me once, shame on you. Fool me 153,927,861 times, shame on me.

The place for differentiation, the place for "oh this is probably fine", the benefit of the doubt is, of course, lost.

Because someone (you? people shaped like you?) who misused telemetry destroyed trust.

> It can be both

should instead be "it usually is both and you the user have no way to know anyway."

southernplaces7

Okay, and who are these people you contact for this data, and how do they themselves obtain it so precisely? You say the big tech industry is pretty air-tight about sharing data, so how does mysterious X company have on hand the credit ratings of all those youngish dentists on Elm street, among other kinds of information? How do these dynamics work, since you seem to know it internally?

southernplaces7

My question here is also how the brokers obtain the data themselves? Wouldn't it be simple for those who buy it from the brokers at a markup to just get it from its original sources themselves? Also, if the data is in any case available, the real at-fault culprits aren't so much the brokers as those who store and so easily sell it in the first instance.

roadside_picnic

> Wouldn't it be simple for those who buy it from the brokers at a markup to just get it from its original sources themselves?

In many cases joining datasets is both labor intensive and creates a surprising amount of new information, and there is also plenty of "free" data that is incredibly tedious to work with.

I used to work with real estate data for the government and if you search for any common things you might want to know you often land on a data brokers page even though property assessor data is freely available in most counties. The problem is each county has their own system of storing data and their own process for searching it. It's a lot of work to learn how just this one dataset works, combining this for all counties in the US is a massive project.

Whenever I buy a new home I always look up all my neighbors, figure out when they bought the house, how much they paid etc. Some people get freaked out by this, but this information is public in most counties.

By joining this data with another public data set, you can actually figure out which lender your neighbors used, what their reported income was at time of sale, and their age and ethnic background.

Of course there are plenty of other ways data brokers come across data, but even cleaning up and joining public data can require a fair bit of time and expertise.

jasode

>My question here is also how the brokers obtain the data themselves? Wouldn't it be simple [...] to just get it from its original sources themselves?

The word "broker" in "data brokers" may be misleading. They're not like a middleman-agent type of broker that is an intermediary in between buyers & sellers for transacting houses, yachts, expensive paintings, etc. They don't really act as a classical broker for matching unknown data-sellers with unknown data-buyers.

Instead, the so-called brokers when describing big data collection companies like LexisNexis and ACXIOM are actually sophisticated data hubs that gather a lot of raw data like court house filings across all counties and states, bankruptcy filings, property records that have deeds and mortgage amounts, police arrest records, vehicle registrations, driving records, employment records, etc. Some data is so raw that they have to do OCR on document images to digitize it. A lot of those sources really aren't "for sale" to the general public. The data hubs make private contractual arrangements to buy the data.[1] They also get a lot of 1st-party data directly from participating entities like retailers and insurance companies. Example would be Equifax The Work Number getting direct employment data. Equifax would be both the original source and the so-called "data broker" in that case.

The data hubs then do massive de-duplication and correlation of all those datasets to create a composite profile and id of each person. This "complete picture" of a person (age, estimated income, etc) is what marketers are buying. Even if some data is public like court records, the random marketer who wants to do a mass-mailing to advertise a new restaurant isn't interested in logging into various court houses to look at raw data. They want the composite demographics that a company like LexisNexis has.
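
A toy sketch of that de-duplication and correlation step, assuming records can be matched exactly on a normalized name plus date of birth. The field names, the `build_profiles` function, and the sample records are all made up for illustration; real record linkage is typically fuzzier than an exact key match.

```python
from collections import defaultdict


def normalize(name):
    """Lowercase, drop periods, treat commas as spaces, collapse whitespace."""
    return " ".join(name.lower().replace(".", "").replace(",", " ").split())


def build_profiles(records):
    """Merge records from different sources into one profile per person.

    Assumption: (normalized name, date of birth) identifies a person.
    The first value seen for a field wins; later duplicates are ignored.
    """
    profiles = defaultdict(dict)
    for rec in records:
        key = (normalize(rec["name"]), rec["dob"])
        for field, value in rec.items():
            if field not in ("name", "dob"):
                profiles[key].setdefault(field, value)
    return dict(profiles)


# Hypothetical rows from three unrelated sources.
records = [
    {"name": "Jane Q. Doe", "dob": "1980-01-01", "vehicle": "2015 sedan"},
    {"name": "jane q doe", "dob": "1980-01-01", "mortgage": 300_000},
    {"name": "JANE Q DOE", "dob": "1980-01-01", "vehicle": "2015 sedan"},
    {"name": "John Roe", "dob": "1975-05-05", "mortgage": 150_000},
]
profiles = build_profiles(records)
```

Three differently formatted rows collapse into a single composite profile, which is the product a marketer would otherwise have to assemble by hand.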

[1] example of DMV vehicle registrations: https://www.google.com/search?q=states+dmv+sell+car+registra...

victorbjorklund

Sellers of the data wanna deal with one or a few buyers that buy in bulk. They don't wanna deal with thousands of customers.

onlyrealcuzzo

Further, they are literally in the business of selling your data for a profit.

It should not be surprising that they are selling your data for a profit...

fallinditch

As far as I know there is no definitive guide for how to carry out a 'digital privacy reset' or 'digital rebirth' - but your LLM should be able to give you good instructions.

To do it properly, not only would you have to change all your logins and email accounts, but simultaneously start using a new computer and phone. Also, move home.

In other words: very hard to achieve. But I wonder if there is a set of achievable actions one can take that gets you to 'very good privacy'?

Ekaros

Sidetracking the topic: does anyone have an estimate of how much a regular company makes from selling this data? I do not mean those focusing on advertising, but companies that willingly sell their customers' data and habits.

willguest

It's amazing to me that the market for data is so well hidden from public view. So many large companies are mining and trading data on a daily basis - you would think that a data marketplace would have been a thing by now, especially with all the noise about "decentralisation" (yes, I know, crypto shill bros).

I've been touting this as a business model for years. Better still, I'd like to see it done with behavioural models (in the open). That would really blow the lid off the industry. Imagine people charging companies, instead of simply being the product...

AlexandrB

I don't get it. Why would CBP and ICE need to buy this from a data broker? The TSA is right there scanning everyone's boarding pass as part of going through security.

fnordpiglet

Beyond the other reasons stated re: regulations and law, which this government seems to be more than willing to ignore, the process of setting up reliable feeds of usable data between organizational functions can be more difficult than buying the data from an entity whose profit derives from curation and distribution of the same data. It might seem absurd on the surface, but paying a premium for a repackaging of the data is often meaningfully easier and more reliable, and you probably save money in the end. The TSA tech team's role isn’t to package and enrich data with useful metadata, with documentation and SLAs, and their incentives don’t naturally align no matter how hard a political appointee bangs a table. The data broker has every incentive, however, and will continue to in perpetuity.

Beretta_Vexee

Because there is probably a well-defined regulatory framework for accessing data collected by the TSA, whereas there are few or no requirements when the same data is purchased from a broker.

It is not even certain that the data actually comes from the TSA. It could come from airlines, payment companies, etc.

There is no guarantee of quality when purchasing data from a broker.

mrweasel

The regulatory angle at least explains part of my wondering. I'm not really surprised that they have access to this information, I'm just surprised that they buy it, rather than just demanding it be handed over.

DistractionRect

Probably because the tsa isn't able/allowed to hand out access willy nilly.

It's kinda like how the police need warrants to request cellphone data, but cellphone companies could sell realtime data to third parties who in turn sold it to the police.

https://news.ycombinator.com/item?id=17081684

krunck

Government uses corporations to get around laws and the constitution. Corporations in turn get to use government to get around regulation. Same as it ever was.

tonymet

Suspects purchase a flight weeks or months before the flight. The TSA screens them just minutes before they board.

Flight purchases would be critical and distinct information for law enforcement.

tgsovlerkhgsel

This could actually be interesting because in many past egregious data broker cases, the offenders had no business in the EU so they could just laugh as they were handed one 20M fine after the other (e.g. Clearview), or they were making way more than 4% of their revenue in profit from privacy violations so they could just risk the fine.

But here, the controller of the data is the airline, the transfer to the data broker might be illegal, and an airline is the worst company to commit GDPR violations with: They have a lot of global revenue but a relatively thin margin, very little of that margin comes from data abuse (so they can't just shrug off the GDPR fine as a small cost of doing shady business), and they are reachable in the EU (worst case a member state can ground and confiscate their planes, and essentially ban them from flying to the EU by threatening to confiscate any other plane that lands). And yes, Germany will impound a plane to get debts paid: https://www.reuters.com/article/world/thai-prince-to-pay-bon...

maCDzP

Does anyone here have some tips on how to “opt out” from this?

Melatonic

That's what I'm wondering - maybe a way to opt out when purchasing flights per airline?

pnw

It doesn't seem like you can. The airlines actually own the clearing house (ARC) that is selling the data.

almosthere

What's the lede on this story, that data brokers are selling this data or that the purchasers are ICE/CBP?

toss1

The lede is buried, and only half said:

>>"Movement unrestricted by governments is a hallmark of a free society. "

The other half of the lede is that this govt is using Insert_Method of restricting the movements of its residents.

At this point, any persecuted activity, e.g., obtaining reproductive healthcare with a link to a person in a Red State, requires opsec procedures comparable to a CIA dark op just to not get persecuted.

ourmandave

An important part of data collection is dealing with edge cases. That's why I schedule all my travel with a layover in South Sudan.

blindriver

I have given up keeping my data private from the government. It’s impossible to avoid, so I signed up for Clear, etc because I know they have that information already.

Frankly, Clear and TSA-Pre make my life so much easier and since I don’t commit crimes I’m not very worried… just a little worried.