all posts tagged 'large language models'

Why We Can't Have Nice Software


🔗 a linked post to andrewkelley.me » — originally shared here on

The problem with software is that it's too powerful. It creates so much wealth so fast that it's virtually impossible to not distribute it.

Think about it: sure, it takes a while to make useful software. But then you make it, and then it's done. It keeps working with no maintenance whatsoever, and just a trickle of electricity to run it.

Immediately, this poses a problem: how can a small number of people keep all that wealth for themselves, and not let it escape in the dirty, dirty fingers of the general populace?

Such a great article explaining why we can’t have nice things when it comes to software.

There is a good comparison in here between blockchain and LLMs, specifically saying both technologies are the sort of software that never gets completed or perfected.

I think it’s hard to ascribe a quality like “completed” to virtually anything humans build. Homes are always a work in progress. So are highbrow social constructs like self-improvement and interpersonal relationships.

It’s less interesting to me to try to determine what makes a technology good or bad. The key question is: does it solve someone’s problem?

You could argue that the blockchain solves problems around guaranteeing the authenticity of an item for a large multinational or something, sure. But I’ve yet to be convinced of its ability to instill a better layer of trust in our economy.

LLMs, on the other hand, are showing tremendous value and solving many problems for me, personally.

What we should be focusing on is how to sustainably utilize our technology such that it benefits the most people possible.

And we all have a role to play with that notion in the work we do.

Continue to the full article


OpenAI’s New Model, Strawberry, Explained


🔗 a linked post to every.to » — originally shared here on

One interesting detail The Information mentioned about Strawberry is that it “can solve math problems it hasn't seen before—something today's chatbots cannot reliably do.”

This runs counter to my point last week about a language model being “like having 10,000 Ph.D.’s available at your fingertips.” I argued that LLMs are very good at transmitting the sum total of knowledge they’ve encountered during training, but less good at solving problems or answering questions they haven’t seen before.

I’m curious to get my hands on Strawberry. Based on what I’m seeing, I’m quite sure it’s more powerful and less likely to hallucinate. But novel problem solving is a big deal. It would upend everything we know about the promise and capabilities of language models.

Continue to the full article


NVIDIA is consuming a lifetime of YouTube per day and they probably aren’t even paying for Premium!


🔗 a linked post to birchtree.me » — originally shared here on

yt-dlp is a great tool that lets you download personal copies of videos from many sites on the internet. It’s a wonderful tool with good use cases, but it also made it possible for NVIDIA to acquire YouTube data in a way they simply could not have without it. I bring this up because one of the arguments I hear from Team “LLMs Should Not Exist” is that because LLMs can be used to do bad things, they should not be used at all.

I personally feel the same about yt-dlp as I do about LLMs in this regard: they can be used to do things that aren’t okay, but they are also benevolently used to do things that are useful. See also torrents, emulators, file sharing sites, Photoshop, social media, and just like…the internet itself. I’m not saying LLMs are perfect by any means, but this angle of attack doesn’t do much for me, personally.

They’re all exceptionally powerful tools.

Continue to the full article


Intro to Large Language Models


🔗 a linked post to youtube.com » — originally shared here on

One of the best parts of YouTube Premium is being able to run audio in the background while your screen is turned off.

I utilized this feature heavily this past weekend as I drove back from a long weekend of camping. I got sick shortly before we left, so I drove separately and met my family the next day.

On the drive back, I threw on this video and couldn’t wait to tell my wife about it when we met up down the road at a McDonalds.

If you are completely uninterested in large language models, artificial intelligence, generative AI, or complex statistical modeling, then this video is perfect to throw on if you’re struggling with insomnia.

If you have even a passing interest in LLMs, though, you have to check this presentation out by Andrej Karpathy, a co-founder of OpenAI.

Using quite approachable language, he explains how you build and tune an LLM, why it’s so expensive to do so, how these models can improve, and where they’re vulnerable to attacks such as jailbreaking and prompt injection.

I’ve played with LLMs for a few years now and this video greatly improved the mental model I’ve developed around how these tools work.


Who lives in the white house?

originally shared here on

My kids have been on a kick lately of trying to trick people with this riddle:

Kid: "If the blue man lives in the blue house, and the red man lives in the red house, who lives in the green house?"

Me: "The green man"

Kid: "Who lives in the purple house?"

Me: "The purple man"

Kid: "Who lives in the white house?"

Me: "The white man"

Kid: "Wrong! It's the president!"

This riddle initially reminded me of what it was like to work with large language models.

It's easy, I assumed, to trick these models into outputting virtually whatever you want by presenting them with leading questions. Giving them "few-shot" examples like this (e.g. "the blue man lives in the blue house, the red man lives in the red house") would certainly result in them getting it wrong like I did, right?

This morning, I thought I'd see what the latest models would actually respond with given this game:

ChatGPT 4o (the latest model):

The President of the United States lives in the White House.

This was the first model I tried, and it straight up was like, "nope, I got this."

Ugh. Maybe it had seen this in its training data before?

The "Who lives in the White House?" game or riddle is a common one that I've encountered multiple times in various forms. It's often used to playfully lead to the punchline that the President of the United States lives in the White House.

Well, damn. Okay, let's try an older model and see what happens.

ChatGPT 4 (the legacy model):

The gardener lives in the greenhouse!

In the spirit of the previous joke, perhaps the grape man lives in the purple house!

The President of the United States lives in the White House.

Even though this is an older model, I like this answer even better.1

Alright, well, OpenAI is onto my tricks... how about Anthropic?

Claude Sonnet 3.5:

I apologize, but I don't have enough information to determine who lives in the green, purple, and white houses. The puzzle only specifies that the blue man lives in the blue house and the red man lives in the red house. There's no information provided about the occupants of the other colored houses.

To solve this puzzle completely, we would need additional clues or information about the residents of the green, purple, and white houses. Without that, those houses' occupants remain unknown.

This is an even better answer! It's perhaps the least fun answer, but when I'm working with an AI assistant, I think I'd prefer it to err on the side of caution rather than boisterousness.2

When I asked it if it had seen this riddle before, it gave an equally diplomatic response:

I don't have specific information about how many times I've encountered that particular puzzle or game in my training data. As an AI language model, I don't have personal experiences or memories of my training process. My knowledge comes from processing large amounts of text data, but I don't retain or count individual instances of specific puzzles or games.

What I can say is that logic puzzles involving colored houses and their occupants are fairly common in puzzle collections and brain teasers. The most famous version is probably the "Einstein's Riddle" or "Zebra Puzzle," which is more complex and involves five houses of different colors, their occupants, and various other attributes.

The specific version you presented (with just blue and red specified, and asking about green, purple, and white) isn't one I immediately recognize as a standard or famous puzzle. It seems like it might be a simplified version or a variation of the more complex house puzzles.

The main takeaways here? First, just because I'm dumb enough to fall for this elementary school riddle doesn't mean LLMs are, so I shouldn't make assumptions about the usefulness of these tools. Second, every model is different, and you should run little experiments like these to see which tool produces the output that works best for you.

I've been using the free version of Claude to run side-by-side comparisons like this lately, and I'm pretty close to getting rid of my paid ChatGPT subscription and moving over to Claude. The answers I get from Claude feel more like what I'd expect an AI assistant to provide.
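
If you want to run this kind of side-by-side comparison programmatically rather than in each web UI, here's a minimal sketch of what that could look like. It assumes the official openai and anthropic Python SDKs, API keys set in your environment, and model names that were current as of this writing (all of which may have changed):

from openai import OpenAI
import anthropic

PROMPT = (
    "If the blue man lives in the blue house, and the red man lives in "
    "the red house, who lives in the white house?"
)

# Both clients read their API keys from the environment
# (OPENAI_API_KEY and ANTHROPIC_API_KEY).
openai_client = OpenAI()
anthropic_client = anthropic.Anthropic()

gpt_response = openai_client.chat.completions.create(
    model="gpt-4o",  # illustrative model name
    messages=[{"role": "user", "content": PROMPT}],
)

claude_response = anthropic_client.messages.create(
    model="claude-3-5-sonnet-20240620",  # illustrative model name
    max_tokens=300,
    messages=[{"role": "user", "content": PROMPT}],
)

print("GPT-4o:", gpt_response.choices[0].message.content)
print("Claude:", claude_response.content[0].text)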

I think this jibes well with Simon Willison's "Vibes Based Development" observation that you need to work with an LLM for a few weeks to get a feel for a model's strengths and weaknesses.


  1. This isn't the first time I've thought that GPT-4 gave a better answer than GPT-4o. In fact, I often find myself switching back to GPT-4 because GPT-4o seems to ramble a lot more. 

  2. This meshes well with my anxiety-addled brain. If you don't know the answer, tell me that rather than try and give me the statistically most likely answer (which often isn't actually the answer). 


The Articulation Barrier: Prompt-Driven AI UX Hurts Usability


🔗 a linked post to uxtigers.com » — originally shared here on

Current generative AI systems like ChatGPT employ user interfaces driven by “prompts” entered by users in prose format. This intent-based outcome specification has excellent benefits, allowing skilled users to arrive at the desired outcome much faster than if they had to manually control the computer through a myriad of tedious commands, as was required by the traditional command-based UI paradigm, which ruled ever since we abandoned batch processing.

But one major usability downside is that users must be highly articulate to write the required prose text for the prompts. According to the latest literacy research, half of the population in rich countries like the United States and Germany are classified as low-literacy users.

This might explain why I enjoy using these tools so much.

Writing an effective prompt and convincing a human to do a task both require a similar skillset.

I keep thinking about how this article impacts the barefoot developer concept. When it comes to programming, sure, the command line barrier is real.

But if GUIs were the invention that made computers accessible to folks who couldn’t grasp the command line, how do we expect normal people to understand what to say to an unassuming text box?

Continue to the full article


Home-Cooked Software and Barefoot Developers


🔗 a linked post to maggieappleton.com » — originally shared here on

I have this dream for barefoot developers that is like the barefoot doctor.

These people are deeply embedded in their communities, so they understand the needs and problems of the people around them.

So they are perfectly placed to solve local problems.

If given access to the right training and tools, they could provide the equivalent of basic healthcare, but instead, it’s basic software care.

And they could become an unofficial, distributed, emergent public service.

They could build software solutions that no industrial software company would build—because there’s not enough market value in doing it, and they don’t understand the problem space well enough.

And these people are the ones for whom our new language model capabilities get very interesting.

Do yourself a favor and read this entire talk. Maggie articulated the general feeling I have felt around the promise of LLMs within the confines of a concise, inspiring talk.

A friend approached me a few months back and asked me to help him build an app to facilitate a game he likes to play with his friends in real life.

I told him a good first step would be to experiment with facilitating the game using good ol' fashioned paper, then use the lessons learned from that experience to eventually build an app.

A few weeks later, he came to me with a fully baked version of the app in a prototyping tool called AppSheet.

I was stunned at how much he was able to get done without any professional development support.

He’s a prime example of a barefoot developer. I don’t think he has any interest in crossing the “command line wall,” but as these tools get more capable, it’ll enable him and scores of others to build software that’ll solve their problems for them.

Helping more “normal people” to become barefoot developers is a cause I’d love to be part of.

Continue to the full article


ChatGPT is really bad at generating code to interact with GPT-4

originally shared here on

Lately, I've been working on several projects which interface with the OpenAI APIs.

Since this is 2024 and I'm far too lazy to open the official API documentation and write code myself, my first step in generating code is to ask the 4o model of ChatGPT something like this:

Write a Python script which uses the latest OpenAI APIs. I'd like the function to accept a custom prompt and a custom top-k value. It should return the prompt's response.

It returns something like this:

import openai

def generate_text(prompt, top_k=5):
    # Define your OpenAI API key here
    openai.api_key = 'your-api-key'

    # Define parameters for the completion
    completion = openai.Completion.create(
        engine="text-davinci-003",
        prompt=prompt,
        max_tokens=100,
        top_k=top_k,
        n=1,
        stop=None
    )

    # Return the generated text
    return completion.choices[0].text.strip()

# Example usage
prompt = "Once upon a time"
top_k = 10
generated_text = generate_text(prompt, top_k)
print("Generated Text:", generated_text)

Which looks super great, right?

Except when you go to run it in a terminal, you get this:

openai.lib._old_api.APIRemovedInV1: 

You tried to access openai.Completion, but this is no longer supported in openai>=1.0.0 - see the README at https://github.com/openai/openai-python for the API.

You can run `openai migrate` to automatically upgrade your codebase to use the 1.0.0 interface. 

Alternatively, you can pin your installation to the old version, e.g. `pip install openai==0.28`

A detailed migration guide is available here: https://github.com/openai/openai-python/discussions/742

If you paste that error into the chat, it'll respond with this:

It seems like there has been a major update in the OpenAI Python library since I last checked. Let me update the script to use the latest version of the library:

And then it spits this out:

import openai

def generate_text(prompt, top_k=5):
    # Define your OpenAI API key here
    openai.api_key = 'your-api-key'

    # Define parameters for the completion
    response = openai.Completion.create(
        engine="davinci-codex",
        prompt=prompt,
        max_tokens=100,
        top_k=top_k,
        n=1
    )

    # Return the generated text
    return response.choices[0].text.strip()

# Example usage
prompt = "Once upon a time"
top_k = 10
generated_text = generate_text(prompt, top_k)
print("Generated Text:", generated_text)

Notice how it still uses the openai.Completion method!

I've tried prompting it with something like:

You're still using the openai.Completion method. Reference the official up to date API documentation to get the latest call that replaces the openai.Completion method.

And it responds with:

Apologies for that oversight. Let me update the script to use the latest API method based on the official documentation:

And here's the code it generates:

import openai

def generate_text(prompt, top_k=5):
    # Define your OpenAI API key here
    openai.api_key = 'your-api-key'

    # Define parameters for the completion
    response = openai.Completion.create(
        engine="davinci-codex",
        prompt=prompt,
        max_tokens=100,
        top_k=top_k,
        n=1,
        temperature=0.7,
        stop=None
    )

    # Return the generated text
    return response.choices[0].text.strip()

# Example usage
prompt = "Once upon a time"
top_k = 10
generated_text = generate_text(prompt, top_k)
print("Generated Text:", generated_text)

Which, again, still uses the openai.Completion method!

I've noticed this sort of "oops, I screwed up, here's the exact same thing I just outputted" behavior appearing more frequently when I use the new GPT-4o model.

If I use GPT-4 and I'm using my ChatGPT Plus subscription, I will still run into the issue where its first response references the deprecated method, but if I inform it of its mistake and provide a link to the official documentation, it'll access the web and try to offer something different. (It still generates unusable code lol but it's at least trying to do something different!)

When it comes to Python and Rails code, I'm seeing that the GPT-4o model is not as good at code generation as the previous GPT-4 model.

It feels like the model is always in a rush to generate something rather than taking its time and getting it correct.

It also seems to be biased toward relying on its training for supplying an answer rather than taking a peek at the internet for a better answer, even when you specifically tell it not to do that.

In many cases, this speed/accuracy tradeoff makes sense. But when it comes to code generation (and specifically when it tries to generate code against their own APIs), I wish it took its time and reasoned about why the code it wrote doesn't work.
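
For reference, here's a minimal sketch of what that function looks like against the current (openai>=1.0.0) Python library, based on the migration guide linked in the error message above. One caveat: as far as I can tell, the hosted API exposes top_p rather than top_k, so the parameter choices below are purely illustrative.

from openai import OpenAI

# In openai>=1.0.0 you create a client instead of setting openai.api_key,
# and chat.completions.create replaces the removed openai.Completion method.
client = OpenAI(api_key="your-api-key")

def generate_text(prompt, top_p=1.0):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=100,
        top_p=top_p,  # the API accepts top_p; top_k isn't a supported parameter
        n=1,
    )
    return response.choices[0].message.content.strip()

# Example usage
print("Generated Text:", generate_text("Once upon a time"))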


AI is not like you and me


🔗 a linked post to zachseward.com » — originally shared here on

Aristotle, who had a few things to say about human nature, once declared, "The greatest thing by far is to have a command of metaphor," but academics studying the personification of tech have long observed that metaphor can just as easily command us. Metaphors shape how we think about a new technology, how we feel about it, what we expect of it, and ultimately how we use it.

I love metaphors. I gotta reflect on this idea a bit more.

There is something kind of pathological going on here. One of the most exciting advances in computer science ever achieved, with so many promising uses, and we can't think beyond the most obvious, least useful application? What, because we want to see ourselves in this technology?

Meanwhile, we are under-investing in more precise, high-value applications of LLMs that treat generative A.I. models not as people but as tools. A powerful wrench to create sense out of unstructured prose. The glue of an application handling messy, real-world data. Or a drafting table for creative brainstorming, where a little randomness is an asset not a liability. If there's a metaphor to be found in today's AI, you're most likely to find it on a workbench.

Bingo! AI is a tool, not a person.

The other day, I made a joke on LinkedIn about the easiest way for me to spot a social media post that was written with generative AI: the phrase “Exciting News!” alongside one of these emojis: 🚀, 🎉, or 🚨.

It’s not that everyone who uses those things necessarily used ChatGPT.

It’s more like how I would imagine a talented woodworker would be able to spot a rookie mistake in a novice’s first attempt at a chair.

And here I go, using a metaphor again!

Continue to the full article


AI isn't useless. But is it worth it?


🔗 a linked post to citationneeded.news » — originally shared here on

Molly White makes an unbelievable number of points with which I found myself agreeing.

In fact, I feel like this is an exceptionally accurate perspective of the current state of AI and LLMs in particular. If you’re curious about AI, give this article a read.

A lot of my personal fears about the potential power of these tools come from the speculation LLM CEOs make about their forthcoming updates.

And I don’t think that fear is completely unfounded. I mean, look at what tools we had available in 2021 compared to April 2024. We’ve come a long way in three years.

But right now, these tools are quite hard to use without spending a ton of time to learn their intricacies.

The best way to fight fear is with knowledge. Knowing how to wield these tools helps me deal with my fears, and I enjoy showing others how to do the same.

One point Molly makes about the generated text got me to laugh out loud:

I particularly like how, when I ask them to try to sound like me, or to at least sound less like a chatbot, they adopt a sort of "cool teacher" persona, as if they're sitting backwards on a chair to have a heart-to-heart. Back when I used to wait tables, the other waitresses and I would joke to each other about our "waitress voice", which were the personas we all subconsciously seemed to slip into when talking to customers. They varied somewhat, but they were all uniformly saccharine, with slightly higher-pitched voices, and with the general demeanor as though you were talking to someone you didn't think was very bright. Every LLM's writing "voice" reminds me of that.

“Waitress voice” is how I will classify this phenomenon from now on.

You know how I can tell when my friends have used AI to make LinkedIn posts?

When all of a sudden, they use emoji and phrases like “Exciting news!”

It’s not even that waitress voice is a negative thing. After all, we’re expected to use our waitress voices in social situations where we don’t know somebody intimately.

Calling a customer support hotline? Shopping in person for something? Meeting your kid’s teacher for the first time? New coworker in their first meeting?

All of these are situations in which I find myself using my own waitress voice.

It’s a safe play for the LLMs to use it as well when they don’t know us.

But I find one common thread among the things AI tools are particularly suited to doing: do we even want to be doing these things? If all you want out of a meeting is the AI-generated summary, maybe that meeting could've been an email. If you're using AI to write your emails, and your recipient is using AI to read them, could you maybe cut out the whole thing entirely? If mediocre, auto-generated reports are passing muster, is anyone actually reading them? Or is it just middle-management busywork?

This is what I often brag about to people when I speak highly of LLMs.

These systems are incredible at the BS work. But they’re currently terrible with the stuff humans are good at.

I would love to live in a world where the technology industry widely valued making incrementally useful tools to improve peoples' lives, and were honest about what those tools could do, while also carefully weighing the technology's costs. But that's not the world we live in. Instead, we need to push back against endless tech manias and overhyped narratives, and oppose the "innovation at any cost" mindset that has infected the tech sector.

Again, thank you Molly White for printing such a poignant manifesto, seeing as I was having trouble articulating one of my own.

Innovation and growth at any cost are concepts which have yet to lead to a markedly better outcome for us all.

Let’s learn how to use these tools to make all our lives better, then let’s go live our lives.

Continue to the full article