The code generated by Copilot currently seems not to be a work, due to how the Berne Convention works. That is in line with earlier opinions of the US Copyright Office and verdicts of US courts, and was confirmed in the most recent case. (1)(2) US law is not binding on courts elsewhere, but:
- For the matter discussed, copyright law in the United States is the primary concern, as this is where “everything happens”.
- That is tightly bound to the very core of the Berne Convention, which applies virtually everywhere on this planet.
If that is not a work, I find the entire discussion about licensing breaches by the generated code moot.
What is more important, in my opinion, is whether a trained model is considered a derived work. I find that much more interesting: assuming only binary answers are possible, I can conceive of three options:
- A trained model is not a work at all.
  Implies: under the existing copyright regime it has no protection. A new set of IP laws may be forged, but that takes time and gives an opportunity to influence them considerably more easily than is possible with copyright.
- A trained model is a work, and is a derived work.
  Implies: a requirement to abide by licensing terms, including both attribution and granting various rights to licensees.
- A trained model is a work, and is not a derived work.
  Implies: opposition on these grounds is not possible, and creators of later models may use works for training in a similar fashion.
One of the contention points with Copilot is also that it was trained on FOSS sources and data leeched from the community, but avoided touching proprietary works to which GitHub also has access. Just because that is possibly legal does not mean it is perceived as acceptable.

My own gripe with Copilot is of a different nature. Machine learning at that scale is a relatively new subject facing many philosophical challenges. I think it is still too early to say definitively that Copilot differs significantly, in qualitative terms, from a programmer acquiring knowledge by reading sources. The pain, which I find understated by the opposition, is the possibility of making software development dependent on such tools. You may say “it's like a calculator to a mathematician”. But it becomes a problem if, in order to be able to compete with other developers, you are forced to use that calculator, and it is almost guaranteed there will only be a few calculator manufacturers, which use their position to push abusive licensing terms.
Returning to the first paragraph, that situation gives rise to another interesting and very complex question. If Copilot's output is not a copyrightable work, any program written with it seems to contain fragments that cannot be protected by copyright. What could that imply if the owner of the entire program claims infringement, but the infringement is found to apply only to such a fragment? What if the defendant could prove that? If that is a possibility, whose obligation is it to make the proof, and what should it look like? Though hypothetical and a thought experiment in nature, extreme cases of that problem are pretty intriguing.
(1) https://www.theverge.com/2022/2/21/22944335/us-copyright-office-reject-ai-generated-art-recent-entrance-to-paradise
(2) Second Request for Reconsideration for Refusal to Register A Recent Entrance to Paradise, US Copyright Office (February 2022)