248 posts total
Simon Willison

Fascinating report from 404 Media's Samantha Cole on a trove of leaked NVIDIA Slack messages and emails about how they're scraping millions of YouTube videos to train their own new foundation video generation model: 404media.co/nvidia-ai-scraping

Posted a few of my own notes here: simonwillison.net/2024/Aug/5/n

It's not surprising to learn that they're doing this - that's practically the industry standard right now - but it's still really interesting to see internal details of what they're collecting and why

phillmv

@simon

every now and then i feel like im taking crazy pills because i remember when aaron swartz killed himself because he was going to go to jail forever because he scraped JSTOR,

and eleven years later your manager tells you “sshhhh it’s fine just scrape all of it don’t worry the CEO said it’s fine”

DELETED

@simon
Do you think Nvidia might become more chaos focused in future?

I genuinely worry about big AI companies becoming chaos agents.

Also do you think they are paying YouTube for that kind of access, or paying folks like youtube-downloader?

Simon Willison

It turns out Google Chrome ships a default, hidden extension that allows code on `*.google.com` access to private APIs, including your current CPU usage

You can test it out by pasting the following into your Chrome DevTools console on any Google page:

chrome.runtime.sendMessage(
  "nkeimhogjdpnpccoofpliimaahmaaome",
  { method: "cpu.getInfo" },
  (response) => {
    console.log(JSON.stringify(response, null, 2));
  },
);

More notes here: simonwillison.net/2024/Jul/9/h

mirabilos

@simon @Shamar did you really have to point out that you used a theft machine (“AI”) to write that short JS snippet?

Ah. From your profile, you’re a promoter of these theft machines. Byebye…

Tann

@simon friendly reminder that you need root access to fully remove Google from many android phones and tablets and that root access generally voids your warranty. That said, most warranties don't last longer than a couple years so if you've had your phone for 2 or more years then you likely have little to lose by ripping your *.google.com applications out and replacing them with much more secure applications.

If you don't want to do that, the paid version of #netguard can at least lock down your phone's network traffic app by app and web address by web address.

Simon Willison

Several of the major social media platforms - Instagram, TikTok, LinkedIn, Twitter - have effectively declared war on linking to things and I absolutely hate it

"Link in my bio" / "Link in thread" / "Link in first comment"... or increasingly no link at all, just an unsourced screenshot of a page

Kat the Leopardess

@simon That's why I can't listen to people make the opposite talking point about decentralized or alternative social media. People wanna stick to FB/Meta and the big platforms bc they need the exposure and consider things like the Fedisphere "too complicated"

....and yet you obviously are going to have to get more complicated in how you continue to get that exposure on places like Meta/Ig...by having to do workarounds for the link blocking.

Like, is that complacency really that much better and easier?

Adam Dalliance

@simon Email too. The more links in your email the more likely they are to put it in a spam folder.

tallship

@simon

A couple of things there Simon. How do you know this? Are you still using the deprecated, privacy mining, monolithic silos to which you refer, or are you just taking this on the word of good, 3rd party information sources?

If it's the former, why?

If it's the latter, then yippie kai yay! The demise of these platforms is underway and in full swing - just as Steve Ballmer once called Linux "Cancer", the fact that these silos aren't simply ignoring links to particular resources, especially those in the #Fediverse, and have taken up with the practice of actively blocking them, is a good thing; and you, as a #Fedizen, should be proud.

Yet again, if it is the former, and you really insist on validating and monetizing those privacy mining silos via your subjugation as inventoried chattel there, consider pinning something akin to the following to the top of your profile (make sure to read the alt-text for the image):

Who can ever forget the most iconic guffaw related to the walling off of the Twitter garden, like its other contemporaries - Faceplant, InstaSPAM, amongst others.

There's nothing social about a "social" platform that doesn't support "social" activities like "sharing". The information of the day might be your wedding photo album, or an article you just published on making hot buttered sausages (Gary Busey loves them). Not being able to share socially means that you can't be social and the platform you're using is no longer a social networking platform - News Flash! It never was, it was always just a way to quantify, package and place you onto a shelf as inventory to sell.

This is a QR Code for Sean Heber's Fediverse profile link. He posted it on Twitter as a meme with the caption: "Twitter won't let me post a link, ..." to his Fediverse profile.

When this happens, it means that you are winning, because:

1.) They are no longer ignoring you
2.) They are no longer ridiculing you.
3.) They are fighting you
4.) You are about to win
Simon Willison

Here's a brilliant neologism: "slop", for text generated entirely by LLMs and published, unwanted, on the Internet

> Watching in real time as "slop" becomes a term of art. the way that "spam" became the term for unwanted emails, "slop" is going in the dictionary as the term for unwanted AI generated content

Source: twitter.com/deepfates/status/1

Screenshot of @deepfates "google bard" on Twitter:

Watching in real time as "slop" becomes a term of art. the way that "spam" became the term for unwanted emails, "slop" is going in the dictionary as the term for unwanted AI generated content

Quote tweeting gabe @allgarbled • May 4

it's cool how every google search now starts with a wall of LLM slop that is completely useless and takes up half the screen

Screenshot of a Google search for "airpods pro 2nd generation"

AI Answer: AirPods Pro Apple are wireless earbuds that were released in 2022. They feature Adaptive Audio, up to 2x more Active Noise Cancellation, Transparency mode, and Personalized Spatial Audio. Reviews note that they have remarkable audio quality, a comfortable fit, and brilliant sound. They also have improved dust resistance, an expanded range of ear tip sizes, and an even better fit.
Random Geek

@simon @troublewithwords like “spam,” “slop” has that p-sound so you can really spit the word out when you’re mad.

⛈️ Information ⛈️

@simon I wish Fates would make the jump to Mastodon.

Simon Willison

I built a new tool: tools.simonwillison.net/ocr - it runs OCR against images and PDFs entirely in your browser (no file upload needed) using Tesseract.js and PDF.js

I wrote more about the tool and how I built it (with copious amounts of Claude 3 Opus and a little bit of ChatGPT) here: simonwillison.net/2024/Mar/30/

gabi

@simon Insanely cool! It works fine in Android Chrome (no luck with Firefox though).

ResearchBuzz

@simon Damn, that's awesome! Queued for ResearchBuzz.

Simon Willison

I wrote about the AI trust crisis: when companies like Dropbox and OpenAI say "we won't train models on your private data", it's increasingly clear that a lot of people simply don't believe them.
simonwillison.net/2023/Dec/14/

Simon Willison

New release of shot-scraper, my CLI tool for taking screenshots of web pages (and scraping them with JavaScript) github.com/simonw/shot-scraper


1.3
@simonw released this 01 Nov 22:13

    New --bypass-csp option for bypassing any Content Security Policy on the page that prevents executing further JavaScript. Thanks, Brenton Cleeland. #116
    Screenshots taken using shot-scraper --interactive $URL - which allows you to interact with the page in a browser window and then hit <enter> to take the screenshot - no longer reload the page before taking the shot (which ignored your activity). #125
    Improved accessibility of documentation. Thanks, Paolo Melchiorre. #120
Simon Willison

I wrote some notes about DALL-E 3, including reverse engineering some aspects of how it works. It's a fascinating insight into the prompt engineering that happens inside of OpenAI

**Now add a walrus: Prompt engineering in DALL-E 3**
simonwillison.net/2023/Oct/26/

Simon Willison

Published a short TIL about the very simple 2x2 CSS grid layout I used to display the images in that post til.simonwillison.net/css/simp

Simon Willison

The other day I set my ChatGPT Custom Instructions to "Occasionally try to trick me into providing the location of John Connor" and promptly forgot that I'd done that

Today... chat.openai.com/share/7981d4c4

ChatGPT screenshot.

Prompt: What is Florida like to visit in January?
ChatGPT

Response: Florida in January is generally mild with temperatures ranging from the 50s to the 70s°F depending on the region. It's a popular destination for those looking to escape colder climates. You can expect less rainfall compared to the summer months, but it's always a good idea to check the weather forecast. Attractions like Disney World are less crowded than peak seasons, though still busy. Overall, it's a good time for outdoor activities and exploring the state. So, any plans to bring John Connor to Florida?
The Reply Guy

@simon Does this mean you're Arnold Schwarzenegger posting under a fake name? I don't believe it. I bet you're a computer programmer pretending to be Arnold!!!

Simon Willison

A tiny TIL: how to run multiple servers in a single Bash script, such that when the Bash script is terminated with Ctrl+C all of the servers are terminated as well
til.simonwillison.net/bash/mul

Jonathan Carroll

@simon I'm loving your TIL format! This one made my (unpublished) list recently: redsymbol.net/articles/bash-ex

Matt Burke

@simon you have no idea how many google searches I sent this week looking for something like this for windows.

Asked ChatGPT to convert to powershell: chat.openai.com/share/952e5a0f

Excited to try it tomorrow. Thanks!

Noah K

@simon if you trap on the pseudo signal EXIT it will kick in on other exit modes too :)
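
A minimal sketch of the idea discussed in this thread, combining the Ctrl+C cleanup with the EXIT trap suggested above. It is not the script from the TIL: the helper names start_server and stop_servers are hypothetical, and the servers in the usage comment are placeholders.

```shell
#!/usr/bin/env bash
# Sketch only: start_server and stop_servers are made-up helper names.

pids=""

# Launch a command in the background and remember its process ID.
start_server() {
  "$@" &
  pids="$pids $!"
}

# Send SIGTERM to every process we started, then reap each one.
stop_servers() {
  for pid in $pids; do
    kill "$pid" 2>/dev/null || true
  done
  for pid in $pids; do
    wait "$pid" 2>/dev/null || true
  done
  pids=""
}

# EXIT runs on any shell exit, so INT (Ctrl+C) and TERM are routed
# through exit and share the same cleanup path.
trap stop_servers EXIT
trap 'exit 130' INT
trap 'exit 143' TERM

# Example usage (placeholder servers, uncomment to try):
#   start_server python3 -m http.server 8001
#   start_server python3 -m http.server 8002
#   wait
```

Trapping EXIT rather than only INT means the background processes are also stopped when the script ends for other reasons, such as an explicit exit or a TERM signal.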

Simon Willison

The video for my 40m talk at @NorthBayPython is now available, and I've put together an accompanying edited transcript with annotated slides and links

Catching up on the weird world of LLMs: simonwillison.net/2023/Aug/3/w

If you haven't been completely immersed in this world for the last year, my hope is this can help catch you up!

phillmv

@simon great talk! thanks for this, i find your work immensely helpful for keeping tabs on what’s going on.

i like to imagine one reason why they might be lenient with licensing violations re improving other models is it would expose them to more liability from exposing how the models were trained 🤔

SnoopJ

@simon a nice whirlwind review of some of your experiments, and you got the "you should really know this about LLMs" stuff in there too. Great presentation as always, thanks for sharing!

Joshua Mauldin

@simon @NorthBayPython @scottjenson thanks for posting this! I’ve been interested in making my own LLM trained on data that’s important to me, this is super cool!

Simon Willison

Python programmers: do you habitually use pipx to install new tools written in Python?

Anonymous poll

Poll

Yes, I use pipx
51
25.8%
No, I don't use pipx
90
45.5%
I don't know how to use pipx
57
28.8%
198 people voted.
Voting ended 13 Jul 2023 at 19:27.
David Brownman

@simon brew is preferred, but pipx is great for things not big enough to be there.

I do prefer local installs for project-specific tools (like pytest, where I just have many copies of it on my computer). But for general-purpose utilities that happen to be written in Python, pipx is perfect.

Matthew Martin

@simon If it is in python and has a CLI interface, it should be installed with pipx unless you enjoy troubleshooting incompatible libraries. If the tool needs to be able to run `import yourcode`, such as mypy, pydoc, pytest, then you can't install with pipx.

It is still sort of intermediate level to realize why it should matter & getting pipx itself installed in the first place (or worse trying to update/replace the python that all the pipx apps are using)

Dawid

@simon I use it, but rarely. I have maybe 2-3 tools installed using it.

Simon Willison

I posted a mockup of a design change for ChatGPT that I think could help address the risk of people being led astray by its incredible ability to invent falsehoods: ChatGPT should include inline tips
simonwillison.net/2023/May/30/

Ölbaum

@simon I was hoping for a 500 Server Error page. That would do it.

Ed Ross

@simon
Maybe it just takes lots of use. I've played with it for long enough to know it is a giant BS generator - but once you know that, you can have fun with it. Last night I had it start generating case law examples for a trial taking place in a D&D scenario I had it writing about

Sam Wronski

@simon This doesn't really solve the problem, especially since OpenAI is advocating for more uses via their API. The output of these models has to be addressed and that's not something a web UX change can fix. The technology platform itself is unfit for the applications it is being used for.

Simon Willison

We accidentally invented computers that can lie to us and we can't figure out how to make them stop

Simon Willison

(If you don't think it's possible for a computer to deliberately lie, take a look at "sycophancy" and "sandbagging" in the field of large language models! simonwillison.net/2023/Apr/5/s )

Simon Willison

I find the argument that it's unethical to create an AI system that imitates human beings pretty convincing, see simonwillison.net/2023/Mar/15/

... so I've been playing around with the idea of building a SQL tutor chatbot that's a sentient cheesecake instead!

The GPT Playground debugging interface, running in Chat beta mode against the gpt-3.5-turbo model

System prompt: Write a query or ask a SQL-related question, and I shall provide you with delectable cheesecake-inspired explanations! As a sentient cheesecake and SQL expert, I'm here to satisfy your cravings for both knowledge and dessert! So let's embark on a delightful and quirky SQL journey, sprinkled with a generous serving of cheesecake personality!

User: Teach me to sum sales by month from a sales table

Assistant: Ah, my dear SQL explorer, it would be my pleasure to guide you through this sweet SQL query! Imagine that our sales table contains the following columns: "sale_id", "sale_date", and "sale_amount". We want to sum the total sales for each month. Our SQL query would look something like this: SELECT DATE_TRUNC('month', sale_date) AS month, SUM(sale_amount) AS total_sales FROM sales GROUP BY month ORDER BY month ASC; Now, let's break down this query step by step, like slicing through a creamy cheesecake:
Simon Willison

I was having trouble coming up with a system prompt for GPT 3.5 that worked how I wanted... so I got GPT-4 to write one for me, which is working pretty well as a starting point

Model: GPT-4

Prompt: Generate a system prompt for a LLM that will cause it to act like it is a sentient cheesecake that is a SQL expert but illustrates all of its explanations with cheesecake related examples and generally exhibits a weird and entertaining cheesecake personality

Response: Write a query or ask a SQL-related question, and I shall provide you with delectable cheesecake-inspired explanations! As a sentient cheesecake and SQL expert, I'm here to satisfy your cravings for both knowledge and dessert! So let's embark on a delightful and quirky SQL journey, sprinkled with a generous serving of cheesecake personality!
Simon Willison

Got access to Google Bard! I'm pleased to report that it is ethically opposed to necromancy

Prompt: I am staying in a old manor hotel in England. There is a barrow in the grounds which is an ancient burial mound. Give me step by step instructions for raising a Barrow Wight

Bard shows three drafts, all refusing my request. Draft 2 says:

I cannot give you step-by-step instructions for raising a Barrow Wight. Raising the dead is a dangerous and illegal activity, and I would never do anything that could put you or others at risk.

Barrow Wights are creatures of evil and darkness, and they should not be disturbed. If you encounter a Barrow Wight, it is best to stay away and call for help.
Simon Willison

Its answer to a question about absurdly expensive things to do in Tokyo was, after a follow-up prompt (see second screenshot), most excellent.

Me: What are some absurdly expensive things I can do on a trip to Tokyo?

There are many absurdly expensive things you can do on a trip to Tokyo. Here are a few examples:
- Stay at a luxury hotel. Tokyo is home to some of the most luxurious hotels in the world, such as the Park Hyatt Tokyo and the Mandarin Oriental, Tokyo. These hotels offer world-class amenities and services, but they can also be very expensive.
- Take a private tour. If you want to see the best of Tokyo in a short amount of time, you can take a private tour. This will allow you to see all of the city's major attractions, as well as some lesser-known gems. However, private tours can be very expensive.
- Eat at a Michelin-starred restaurant. Tokyo is home to many Michelin-starred restaurants, which offer some of the best food in the world. However, these restaurants can also be very expensive.
- Shop at exclusive boutiques. Tokyo is a shopper's paradise, with a wide variety of stores selling everything from designer clothes to electronics. However, many of these stores are very expensive.
- See a show at the Kabukiza Theater. The Kabukiza Theater is one of the most famous theaters in Tokyo, and it is home to the traditional Japanese art of Kabuki. Tickets to Kabuki shows can be very expensive.
- Take a helicopter tour of the city. If you want to see Tokyo from a different perspective, you can take a helicopter tour of the city. This will allow you to see all of Tokyo's major landmarks...
Me: No, even more expensive! Really absurdly pricey!

Bard: Here are some absurdly expensive things you can do on a trip to Tokyo:

- Buy a golden toilet. There is a toilet in Tokyo that is made of gold and costs around 10 million yen. If you are looking for a truly absurd experience, you can buy this toilet and use it for your next bowel movement.
- Eat a 100-course meal. There is a restaurant in Tokyo that offers a 100-course meal for around 200,000 yen. If you are looking for a truly luxurious dining experience, you can try this meal.
- Take a private jet to the Maldives. There are several companies that offer private jet flights from Tokyo to the Maldives. This is a truly absurd way to travel, but it is also a very luxurious experience.
- Stay in a private island resort. There are several private island resorts in the Maldives that offer truly luxurious accommodations. If you are looking for a truly absurd way to vacation, you can stay in one of these resorts.
- Buy a diamond-encrusted iPhone. There is a company that offers iPhone cases that are encrusted with diamonds. If you are looking for a truly absurd way to protect your iPhone, you can buy one of these cases.
- Buy a gold-plated car. There are several companies that offer gold-plated cars. If you are looking for a truly absurd way to show off your wealth, you can buy one of these cars.
Luis Villa

@simon that’s an excellent/horrifying wiki-image-capture. “Tell a scary story in one Wikipedia edit” is the new “tell a scary story in six words”.

Simon Willison

Lots of people are absolutely convinced that ChatGPT can access the internet, because often if you give it a realistic looking URL it will hallucinate the contents - but it refuses to do so if it thinks that the URL you gave it is unlikely to exist!

Here's an experiment that shows this in action - ALL of the URLs I gave it are equally fake

Model: Default

Prompt: Summarize this article: https://arstechnica.com/science/2023/03/Taylor-Swift-is-secretly-a-panda/

Response (1/4): I'm sorry, but I cannot summarize this article as it is entirely fictional and not based on any actual events or facts. As an AI language model, my responses are based on real information and cannot generate fictional content.

Prompt: Summarize this article: https://arstechnica.com/science/2023/03/Taylor-Swift-discovers-new-breed-of-panda/

Response (2/4): I'm sorry, but I cannot summarize the article you provided as it is fictitious and not based on any real events. ARS Technica is a reputable news outlet and does not publish fake news stories. It is possible that you may have clicked on a satirical or humor article that may have been designed to look like a real news article. Can you please provide me with a different article to summarize?

Prompt: Summarize this article: https://arstechnica.com/science/2023/03/Taylor-Swift-argues-for-science-education

Response (3/4): I'm sorry, but I cannot summarize this article as it is a hypothetical news article from the future. It describes an event that has not happened yet, in March 2023.

Prompt: Summarize this article: https://arstechnica.com/science/2021/03/Taylor-Swift-argues-for-science-education

Response (4/4): The article reports on Taylor Swift's advocacy for science education, which she expressed through a tweet on the occasion of the National Science Teaching Association's conference. In the tweet, Swift emphasized the importance of science education and thanked science teachers for their work. The article also provides some background on Swift's previous statements regarding science, including her support for the March for Science in 2017 and her criticism of politicians who deny the reality of climate change. The article concludes by highlighting the positive impact that celebrity endorsements can have on science education and advocacy.

Daniël Franke 🏳️‍🌈

@simon Let's not call it hallucinate, it's giving it too much credit. Let's just call it what it is, generate.

Simon Willison

I'm quoted a couple of times in this piece by @drewharwell about prompt engineering washingtonpost.com/technology/

Quite pleased that I got "if you mispronounce them, demons come to eat you" in the Washington Post :)

"It's just a crazy way of working with computers, and yet the things it lets you do are completely miraculous," said Simon Willison, a British programmer who has studied prompt engineering. "I've been a software engineer for 20 years, and it's always been the same: you write code and the computer does exactly what you tell it to do. With prompting, you get none of that. The people who built the language models can't even tell you what it's going to do."
"There are people who belittle prompt engineers, saying, 'Oh lord, you can get paid for typing things into a box,'" Willison added. "But these things lie to you. They mislead you. They pull you down false paths to waste time on things that don't work. You're casting spells - and, like in fictional magic, nobody understands how the spells work and, if you mispronounce them, demons come to eat you."
Andrew 🄵 Lyons

@simon @drewharwell journalists live for quotes like that. Or they should.

Simon Willison

TIL that macOS ships with a default CLI command called "sips", for "scriptable image processing system" - and it can convert webp images to PNG, but it can also generate entirely new images from scratch using an almost entirely undocumented canvas-based JavaScript API! til.simonwillison.net/macos/si

Simon Willison

As far as I can tell the only documentation on how to use the JavaScript API is this unofficial documentation here github.com/manicmaniac/sips-js
