
OSI readies controversial open-source AI definition

didibus

> Maybe the supporters of the definition could demonstrate practically modifying an ML model without using the original training data, and show that it is just as easy as with the original data and does not limit what you can do with it (e.g. demonstrate it can unlearn any parts of the original data as if they were never used).

I quite like that comment that was left on the article. I know that for some models you can tweak the weights without the source data, but it does seem like you are more restricted without the actual dataset.

Personally, the data seems to me to be part of the source in this case. The model's behavior is derived from the data itself; the weights are the artifact of training. If anything, they should provide the data, the training methodology, the model architecture, and the code to train and infer, and the weights could be optional. The weights are basically the equivalent of a built artifact, like compiled software.

And that means, commercially, people would pay for the cost of training. I might not have the resources to "compile" it myself, i.e. run the training, so maybe I'd pay a subscription to a service that did.
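
As a rough sketch of what "tweaking the weights" without the original data looks like in practice (PyTorch is just my choice for illustration; the checkpoint file and the stand-in dataset below are hypothetical):

    import torch
    import torch.nn as nn

    # Load someone else's released weights; the original corpus is not needed.
    model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
    model.load_state_dict(torch.load("released_weights.pt"))  # hypothetical file

    # Fine-tune on your own (stand-in) data.
    x_new = torch.randn(64, 784)
    y_new = torch.randint(0, 10, (64,))
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
    for _ in range(10):
        opt.zero_grad()
        nn.functional.cross_entropy(model(x_new), y_new).backward()
        opt.step()

    # What you cannot do from here: verify or selectively "unlearn" whatever
    # was in the original training set, because you never had it.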

lolinder

A lot of people get hung up on `weights = compiled-artifact` because both are binary representations, but there are major limitations to this comparison.

When we're dealing with source code, the cost of getting from source -> binary is minimal. The entire Linux kernel builds in two hours on one modest machine. Since it's cheap to compile and the source code is itself legible, the source code is the preferred form for making modifications.

This doesn't work when we try to apply the same reasoning to `training data -> weights`. "Compilation" in this world costs hundreds of millions of dollars per run. The cost of "compilation" alone means that the preferred form for making modifications can't possibly be the training data, even for the company that built the thing in the first place. As for the data itself, it's a far cry from source code—we're talking tens of terabytes of data at a minimum, which is likewise infeasible to work with on a regular basis. The weights must be the preferred form for making modifications for simple logistical reasons.
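
To put rough numbers on that, here is a back-of-envelope estimate using the common ~6 x parameters x tokens FLOPs approximation; every figure below is an illustrative assumption, not any vendor's real number:

    # Rough "compilation cost" of retraining a large model from its data.
    params = 70e9        # assume a 70B-parameter model
    tokens = 15e12       # assume 15T training tokens
    flops = 6 * params * tokens              # ~6.3e24 FLOPs total

    gpu_throughput = 1e15 * 0.4              # ~1 PFLOP/s peak at 40% utilization
    gpu_hours = flops / gpu_throughput / 3600
    cost_usd = gpu_hours * 2.0               # assume $2 per GPU-hour

    print(f"{gpu_hours:.1e} GPU-hours, ~${cost_usd / 1e6:.0f}M")
    # ~4.4e6 GPU-hours and ~$9M for a mid-sized model; frontier-scale runs
    # are orders of magnitude beyond that.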

Importantly, the weights are the preferred form for modifications even for the companies that built them.

I think a far more reasonable analogy, to the extent that any are reasonable, is that the training data is all the stuff that the developers of the FOSS software ever learned, and the thousands of computer-hours spent on training are the thousands of man-hours spent coding. The entire point of FOSS is for a few experts to do all that work once and then we all can share and modify the output of those years of work and millions of dollars invested as we see fit, without having to waste all that time and money doing it over again.

We don't expect the authors of the Linux kernel to document their every waking thought so we could recreate the process they used to produce the kernel code... we just thank them for the kernel code and contribute to it as best we can.

nextaccountic

The source is really the training data plus all the code required to train the model. I might not have the resources to "compile" it, and "compilation" is not deterministic, but those are technical details.

You could have a programming language whose compiler is a superoptimizer that's very slow and is also stochastic, and it would amount to the same thing in practice.
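
A toy version of that "stochastic compiler": identical data and training code every run, but unseeded random initialization means each "compile" produces different bits with the same behavior:

    import torch
    import torch.nn as nn

    def compile_model():
        # Same "source" every run: the data and the training procedure.
        data = torch.linspace(-1, 1, 100).unsqueeze(1)
        target = 3 * data + 1
        model = nn.Linear(1, 1)          # random init = the nondeterminism
        opt = torch.optim.SGD(model.parameters(), lr=0.1)
        for _ in range(200):
            opt.zero_grad()
            ((model(data) - target) ** 2).mean().backward()
            opt.step()
        return model.weight.item(), model.bias.item()

    print(compile_model())  # ~ (3.0, 1.0) both times,
    print(compile_model())  # but never bit-identical across runs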

a2128

The usefulness of the data here is that you can retrain the model after making changes to its architecture, e.g. to see if it works better with a different activation function. Of course, this is most useful for models small enough to train within a few days on a consumer GPU. When it comes to LLMs, only the richest companies have adequate resources to retrain.
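
For instance, with the data in hand, an architecture experiment is a few lines (a sketch with a stand-in dataset):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    X = torch.randn(512, 20)                 # stand-in for the training data
    y = torch.randint(0, 2, (512,))

    for act in (nn.ReLU(), nn.GELU(), nn.SiLU()):
        model = nn.Sequential(nn.Linear(20, 64), act, nn.Linear(64, 2))
        opt = torch.optim.Adam(model.parameters(), lr=1e-3)
        for _ in range(200):
            opt.zero_grad()
            F.cross_entropy(model(X), y).backward()
            opt.step()
        acc = (model(X).argmax(dim=1) == y).float().mean().item()
        print(f"{type(act).__name__}: train acc = {acc:.2f}")
    # Without the data, none of these retraining comparisons are possible.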

samj

The OSI apparently doesn't have the mandate from its members to even work on this, let alone approve it.

The community is starting to regroup at https://discuss.opensourcedefinition.org because the OSI's own forums are now heavily censored.

I encourage you to join the discussion about the future of Open Source, the first option being to keep everything as is.

jart

OSI must defend the open source trademark. Otherwise the community loses everything.

The legal system in the US doesn't provide them any other options but to act.

justinclift

For reference, this is the OSI Forum mentioned: https://discuss.opensource.org

Didn't personally know they even had one. ;)

abecedarius

The side note on hidden backdoors links to a paper that apparently goes beyond the ordinary point that reverse engineering is harder without source:

> We show how a malicious learner can plant an undetectable backdoor into a classifier. On the surface, such a backdoored classifier behaves normally, but in reality, the learner maintains a mechanism for changing the classification of any input, with only a slight perturbation. Importantly, without the appropriate "backdoor key", the mechanism is hidden and cannot be detected by any computationally-bounded observer.

(I didn't read the paper. The ordinary version of this point is already compelling imo, given the current state of the art of reverse-engineering large models.)
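
A toy to convey the flavor of the idea (emphatically not the paper's construction, which uses cryptographic signatures): plant a high-gain secret direction in a linear classifier, and a small nudge along it flips any input's label:

    import numpy as np

    rng = np.random.default_rng(0)
    w = rng.normal(size=1000)              # honest-looking weights
    key = rng.normal(size=1000)
    key /= np.linalg.norm(key)             # the secret "backdoor key"
    w_bad = w + 50 * key                   # planted high-gain direction

    x = rng.normal(size=1000)              # an ordinary input
    score = w_bad @ x
    target = -np.sign(score)               # attacker wants the opposite label
    delta = (target - score) / (w_bad @ key) * key   # tiny nudge along the key

    print(np.sign(score), np.sign(w_bad @ (x + delta)))  # label flips
    print(np.linalg.norm(delta) / np.linalg.norm(x))     # perturbation: a few %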

Terr_

Reminds me of a saying about bugs adapted from this bit from Tony Hoare:

> I conclude that there are two ways of constructing a software design: One way is to make it so simple that there are obviously no deficiencies and the other way is to make it so complicated that there are no obvious deficiencies.

My impression is that LLMs are very much the latter case with respect to unwanted behaviors. You can't audit them, you can't secure them, and whatever "control" we have over the LSD-trip-generator involves a lot of arbitrary trial and error and hoping our luck holds.

blogmxc

OSI sponsors include Meta, Microsoft, Salesforce and many others. It would seem unlikely that they'd demand the training data to be free and available.

Well, another org is getting directors' salaries while open source writers get nothing.

wmf

On one hand if you require people to provide data they just won't. People will never provide the data because it's full of smoking guns.

On the other hand, if the data isn't open, you should probably use the term "open weights," not "open source." They're so close.

samj

Yes, and Open Source started out with a much smaller set of software that has since grown exponentially thanks to the meaningful Open Source Definition.

We risk denying AI the same opportunity to grow in an open direction, by our own hand. A massive own goal.

skissane

> On one hand if you require people to provide data they just won't. People will never provide the data because it's full of smoking guns.

Tangential, but I wonder how well an AI performs when trained on genuine human data, versus a synthetic data set of AI-generated texts.

If performance when trained on the synthetic data set is close to that when trained on the original human dataset – this could be a good way to "launder" the original training data and reduce any potential legal issues with it.
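
Structurally, that laundering pipeline would look like this (all function names here are hypothetical stand-ins, not any real API):

    # Teacher: trained on the original, unreleasable corpus.
    # Student: trained only on the teacher's synthetic output.

    def teacher_generate(prompt: str) -> str:
        ...  # hypothetical: sample text from the original-data model

    def train_model(corpus: list[str]):
        ...  # hypothetical: ordinary pretraining over the given corpus

    prompts = ["Explain photosynthesis.", "Write a sorting function in C."]
    synthetic_corpus = [teacher_generate(p) for p in prompts]

    student = train_model(synthetic_corpus)
    # The student never saw the original data, but whether that actually
    # severs the legal chain of provenance is exactly the open question.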

dartos

I believe there are several papers which show that synthetic data isn’t as good as real data.

It makes sense: any bias in the model-generated synthetic data will just get magnified as models are continuously trained on that biased data.
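
A minimal simulation of that degradation (the "model collapse" effect reported in the literature): each generation trains only on samples drawn from the previous generation's output, and diversity only ever shrinks:

    import numpy as np

    rng = np.random.default_rng(0)
    data = rng.normal(0, 1, 1000)          # generation 0: "real" data

    for gen in range(1, 6):
        # next generation's training set = samples of the previous output
        data = rng.choice(data, size=1000, replace=True)
        print(f"gen {gen}: {len(np.unique(data))} distinct values, "
              f"std = {data.std():.3f}")
    # distinct values fall every generation; rare tail examples vanish first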

mistrial9

> ... require people to provide data they just won't. People will never provide the data ...

the word "people" is so striking here... teams and companies, corporations and governments.. how can the cast of characters be so completely missed. An extreme opposite to a far previous era where one person could only be their group member. Vocabulary has to evolve in deliberations.

aithrowawaycomm

What I find frustrating is that this isn't just about pedantry - you can't meaningfully audit an "open-source" model for security or reliability problems if you don't know what's in the training data. I believe that should be the "know it when I see it" test for open source: has enough information been released for a competent programmer (or team) to understand how the software actually works?

I understand the analogy to other types of critical data often not included in open-source distros (e.g. Quake III's source is GPL but its resources like textures are not, as mentioned in the article). The distinction is that in those cases the data does not clarify anything about the functioning of the engine, nor does its absence obscure anything. So by my earlier smell test it makes sense to say Quake III is open source.

But open-sourcing a transformer ANN without the training data tells us almost nothing about the internal functioning of the software. The exact same source code might be a medical diagnosis machine or a simple translator. It does not pass my smell test to say this counts as "open source." It makes more sense to say that ANNs are a data-as-code programming paradigm, glued together by a bit of Python. An analogy would be if id released its build scripts and announced Quake III was open source, but claimed the .cpp and .h files were proprietary data. The build scripts tell you a lot of useful info - maybe even that Q3 has a client-server architecture - but they don't tell you that the game is an FPS, let alone the tricks and foibles in its renderer.

lolinder

> I believe that should be the "know it when I see it" test for open-source: has enough information been released for a competent programmer (or team) to understand the how the software actually works?

Training data simply does not help you here. Our existing architectures are not explainable or auditable in any meaningful way, training data or no training data.

aithrowawaycomm

I don't think your comment is really true: LLM providers and researchers have been a bit too eager to claim their software is mystically complex. Anthropic's research is shedding light on interpretability, there has been good work done on the computational complexity side, and I am quite confident the issue is LLMs' newness and complexity, not that the problem is actually intractable (or "more intractable" than other hopelessly complex software like Facebook or Windows).

To the extent the problem is intractable, I think it mostly reflects that LLMs have an enormous amount of training data and do an enormous amount of things. But for a given specific problem the training data can tell you a lot (a crude version of the first check is sketched after this list):

- whether there is test contamination with respect to LLM benchmarks or other assessments of performance

- whether there's any CSAM, racist rants, or other things you don't want

- whether LLM weakness in a certain domain is due to an absence of data or if there's a more serious issue

- whether LLM strength in a domain is due to unusually large amounts of synthetic training data and hence might not generalize very reliably in production (this is distinct from test contamination - it is issues like "the LLM is great at multiplication until you get to 8 digits, and after 12 digits it's useless")

- investigating oddness like that LeetMagikarp (or whatever) glitch in ChatGPT
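
Sketch of that first check: an n-gram overlap scan for benchmark contamination. Real decontamination pipelines use longer n-grams and fuzzier matching, but this is the core idea:

    import re

    def ngrams(text: str, n: int = 5) -> set:
        words = re.findall(r"[a-z0-9]+", text.lower())
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    def contaminated(train_doc: str, benchmark: list[str], n: int = 5) -> bool:
        doc_grams = ngrams(train_doc, n)
        return any(doc_grams & ngrams(item, n) for item in benchmark)

    benchmark = ["Q: What is the capital of France? A: The capital of France is Paris."]
    train_doc = "trivia dump: the capital of france is paris, water is wet, ..."
    print(contaminated(train_doc, benchmark))  # True -> benchmark leaked into training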

samj

That's why the Open Source analysts at RedMonk now "do not believe the term open source can or should be extended into the AI world." https://redmonk.com/sogrady/2024/10/22/from-open-source-to-a...

I don't necessarily agree and suggest the Open Source Definition could be extended to cover data in general (media, databases, and yes, models) with a single sentence, but the lowest risk option is to not touch something that has worked well for a quarter century.

The community is starting to regroup and discuss possible next steps over at https://discuss.opensourcedefinition.org

blackeyeblitzar

But training data can itself be examined for biases, and the curation of data also brings in biases. Auditing the software this way doesn't require explainability in the sense you're describing.

swyx

I like this style of article, with extensive citing of original sources.

previously on: https://news.ycombinator.com/item?id=41791426

It's really interesting to contrast this "outsider" definition of open AI with people with real money at stake: https://news.ycombinator.com/item?id=41046773

didibus

> It's really interesting to contrast this "outsider" definition of open AI with people with real money at stake

I guess this is a question of what we want out of "open source". Companies want to make money. Their assets are data, access to customers, hardware, and integration. They want to "open source" models so that other people improve their models for free, and then they can take the improvements back and sell them, or build something profitable using them.

The idea is that, like with other software, eventually, the open source version becomes the best, or just as good as the commercial ones, and companies that build on top no longer have to pay for those, and can use the open source ones.

But if what you want out of "open source" is open knowledge - peeking at how something is built, and being able to take that and fork it for yourself - well, then you kind of need the data. Your goal in this case is more freedom: using things that you have full access to inspect, alter, repair, and modify.

To me, both are valid, we just need a name for one and a name for the other, and then we can clearly filter for what we are looking for.

Legend2440

Does "open-source" even make sense as a category for AI models? There isn't really a source code in the traditional sense.

mistrial9

I have heard government people say "the data is open-source" to mean that there are public, no-cost places to download the data files, e.g. CSV or other formats.

paulddraper

Yeah, it's like an open-source jacket.

I don't really know what you're referring to....

JumpCrisscross

> After long deliberation and co-design sessions we have concluded that defining training data as a benefit, not a requirement, is the best way to go

Huh, then this will be a useful definition.

The FSF position is untenable. Sure, it’s philosophically pure. But given a choice between a practical definition and a pedantically-correct but useless one, people will use the former. Irrespective of what some organisation claims.

> would have been better, he said, if the OSI had not tried to "bend and reshape a decades old definition" and instead had tried to craft something from a clean slate

Not how language works.

SrslyJosh

Indeed it will be a useful definition, as this comment noted above: https://news.ycombinator.com/item?id=41951573

blackeyeblitzar

I don’t understand why the “practical” reality requires using the phrase “open source” then. It’s not open source. That label is false and fraudulent if you can’t produce the same artifact or approximately the same artifact. The data is part of the source for models.

tourmalinetaco

It is in no way useful for the advancement of MLMs. Training data is literally the closest thing to source code MLMs have and to say it’s a “benefit” rather than a requirement only allows for the moat to be maintained. The OSI doesn’t care about the creation of truly free models, only what benefits companies like Facebook or IBM who release model weights but don’t open up the training data.

chrisfosterelli

I imagine that Open AI (the company) must really not like this.

talldayo

I hate OpenAI, but Sam Altman is probably giddy with excitement watching the open source pundits fight about weights being "good enough". He's suffered criticism over his brand for years, but they own the trademark and openly have no fucks to give about the matter. Founding OpenAI more than five years before "open AI" was defined is probably another perverse laurel he wears.

At the end of the day, what threatens OpenAI is falling apart before they hit the runway. They can't lose the Microsoft deal, they can't lose more founders (almost literally at this point) and they can't afford to let their big-ticket partnerships collapse. They are financially unstable even by Valley standards - one year in a down market could decimate them.

AlienRobot

If I remember correctly, Stallman's whole point about FLOSS was that consumers were beholden to developers who monopolized the means to produce binaries.

If I can't reproduce the model, I'm beholden to whoever trained it.

>"If you're explaining, you're losing."

That is an interesting point, but isn't this the same organization that makes an issue of "open source" vs. "source available"? E.g., why Winamp wouldn't be open source?

I don't think you can even call a trained AI model "source available." To me the "source" is the training data. The model is as much of a binary as machine code. It doesn't even feel right to have it GPL licensed like code. I think it should get the same license you would give to a fractal art released to the public, e.g. CC.

alwayslikethis

It's not clear that copyright applies to model weights at all, given that they are generated by a computer and aren't really a creative work. They are closer to a quantitative description of the underlying data, like a dictionary or word frequency list.
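
Taking that analogy literally, the word-frequency-list version of a "model" is a purely mechanical derivation of the corpus, with no creative step:

    from collections import Counter

    corpus = "the cat sat on the mat and the dog sat too"
    model = Counter(corpus.split())        # "weights" = counted statistics
    print(model.most_common(3))            # [('the', 3), ('sat', 2), ('cat', 1)]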

andrewmcwatters

I’m sure this will be controversial for some reason, but I think we should mostly reject the OSI’s definitions of “open” anything and leave that to the engineering public.

I don’t need a board to tell me what’s open.

And in the case of AI, if I can’t train the model from source materials with public source code and end up with the same weights, then it’s not open.

I don’t need people to tell me that.

The "OSI approved" stamp has turned into a Ministry-of-Magic approved-thinking situation that feels gross to me.

didibus

I agree. If it's open source, surely I can at least "compile" it myself. If the data is missing, I can't do that.

We'll end up with like 5 versions of the same "open source" model, all performing differently because they're all built with their own dataset. And yet, none of those will be considered a fork lol?

I don't know what the obsession is either. If you don't want to give others permission to use and modify everything that was used to build the program, why do you want to trick me into thinking you do, and still call it open source?

strangecasts

> And in the case of AI, if I can’t train the model from source materials with public source code and end up with the same weights, then it’s not open.

Making training exactly reproducible locks off a lot of optimizations; you are practically never going to get bit-for-bit reproducibility for nontrivial models.
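
For a sense of the trade-off, these are the kinds of knobs you flip in PyTorch to chase determinism, each of which disables faster non-deterministic kernels (a sketch; full determinism also needs environment settings like CUBLAS_WORKSPACE_CONFIG and an identical hardware/driver stack):

    import torch

    torch.manual_seed(0)                        # fix all RNG streams
    torch.use_deterministic_algorithms(True)    # error on nondeterministic ops
    torch.backends.cudnn.benchmark = False      # no runtime kernel autotuning

    # Even with all of this, changing the GPU model, the driver version, or
    # the degree of data parallelism can change the result, which is why
    # bit-identical retraining of large models is impractical.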

didibus

That's kind of true for normal programs as well, depending on the compiler you use and whether it has non-deterministic steps in its compilation. But still, it's about being able to reproduce the same build process and get a true realization of the program; even if it's not bit-for-bit identical, it's the same intended program.

samj

Nobody's asking for exact reproducibility — if the source code produces the software and it's appropriately licensed then it's Open Source.

Similarly, if you run the scripts and it produces the model then it's Open Source that happens to be AI.

To quote Bruce Perens (definition author): the training data IS the source code. Not a perfect analogy but better than a recipe calling for unicorn horns (e.g., FB/IG social graphs) and other toxic candy (e.g., NYT articles that will get users sued).

rockskon

To be fair, OSI approval also deters marketing teams from watering down the definition of open source into worthless feelgood slop.

tourmalinetaco

That’s already what’s happened though, even with MLMs. Without training data we’re back to modifying a binary file without the original source.

JumpCrisscross

> if I can’t train the model from source materials with public source code and end up with the same weights, then it’s not open

This is the new cracker/hacker, GIF-pronunciation, crypto(currency)/crypto(graphy) molehill. Like sure, nobody forces you to recognise any word. But the common usage already precludes open training data, and that will only get more ensconced as more contracts and jurisdictions embrace it.

mistrial9

In historical warfare, Roman soldiers easily and brutally defeated brave, individualist and social opponents on the battlefield, and arguably in the markets afterwards. It is a sad but essential lesson that applies to modern situations.

In marketing terms, a simple market communication, consistently and diligently applied, in varied contexts and over time, can and usually will take hold, despite the untold number of individuals who shake their fists at the sky or cut with clever and cruel words that few hear, IMHO.

OSI branding and market communications seem very likely to me to be effective in the future, even if the content is exactly what is being objected to here so vehemently.