Grok 4.1

125 points by simianwords 19 hours ago

simonw 18 hours ago

https://tools.simonwillison.net/svg-render#%3Csvg%20width%3D...

pupppet 17 hours ago

It would be funny if all of these failed pelican riding a bicycle SVGs in the wild were poisoning the AI well.
- segmondy 16 hours ago
  
  I know they are not. How? I thought this test was silly, but then I started running various SVG generation tests, curious what the results would look like, with prompts much more complex than a pelican riding a bicycle. I'm only doing this for open/free models. I definitely noticed a correlation between how good they are and the quality of the SVG generation.
porphyra 18 hours ago

You can probably train models to be way better at generating SVG with reinforcement learning, by rendering the SVG to a raster image and feeding it back into the vision model [1]. Same with, say, generating HTML/CSS webpages. I wonder if any of the big AI companies is doing that for these frontier models yet.
[1] https://arxiv.org/abs/2505.20793
- hnuser123456 18 hours ago
  
  From last week:
  https://news.ycombinator.com/item?id=45891817
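To make porphyra's render-and-score idea above concrete, here is a minimal sketch of what a reward function for one generated SVG might look like. It is not taken from the linked paper: `svg_reward`, `render`, and `score` are hypothetical names, and the rasterizer and the vision-model judge are passed in as plain callables because the thread does not say which ones a lab would use.

```python
import xml.etree.ElementTree as ET
from typing import Callable


def svg_reward(
    svg_text: str,
    prompt: str,
    render: Callable[[str], bytes],        # hypothetical: SVG source -> raster image bytes
    score: Callable[[bytes, str], float],  # hypothetical: vision-model judge, image + prompt -> score
) -> float:
    """Return a scalar reward in [0, 1] for one generated SVG."""
    # Output that is not well-formed XML cannot be rendered, so it earns nothing.
    try:
        root = ET.fromstring(svg_text)
    except ET.ParseError:
        return 0.0

    # The root element must be <svg>, with or without the SVG namespace prefix.
    if root.tag.rsplit("}", 1)[-1] != "svg":
        return 0.0

    # Rasterize, then let the judge compare the picture against the prompt.
    image = render(svg_text)
    # Clamp so a misbehaving judge cannot push the reward outside [0, 1].
    return max(0.0, min(1.0, score(image, prompt)))
```

In an actual training loop this value would be fed to whatever policy-gradient method is in use; the sketch only covers the "render it and look at it" step.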
hnuser123456 18 hours ago

Huh, it decided to drop in a seal and bike emoji? What happens if you ask it if a seahorse emoji exists?
- janzer 17 hours ago
  
  Well if you ask it to show you the seahorse emoji it tries really hard. :)
  https://grok.com/share/c2hhcmQtMw_d7bf061f-2999-46b6-a7fb-58...
  Although it does eventually come to the right conclusion... sort of.
  - jameslk 12 hours ago
    
    > I swear this one looks like a tiny seahorse when you squint
    > everyone says it looks like a seahorse anyway
    > Sorry for the chaos — I was having too much fun watching you wait for the “real” one that doesn’t exist (yet)!
    That's some wild post-rationalization
  - viraptor 12 hours ago
    
    Now we get to guess if it's broken in the same way as gpt, or did it pick up that pattern from all the cases of people posting it on the internet. (In the second case, that's not a good look for their data cleanup process)
  - bn-l 14 hours ago
    
    That is hilarious!
agildehaus 18 hours ago

For reference, here's Gemini 2.5 Pro: https://tools.simonwillison.net/svg-render#%3Csvg%20xmlns%3D...
spiderfarmer 18 hours ago

Disappointing.

No mention of coding benchmarks. I guess they've given up on competing with Claude and GPT-5 there. (and from my initial testing of grok 4.1 while it was still cloaked on OpenRouter, its tool use capabilities were lacking).

buu700 17 hours ago

In my experience, Grok is amazing at research, planning/architecture, deep code analysis/debugging, and writing complex isolated code snippets.
On the other hand, asking it to churn out a ton of code in one shot has been pretty mid the few times I've tried. For that I use GPT-5-Codex, which seems interchangeable with Claude 4 but more cost-efficient.
- theshrike79 4 hours ago
  
  Codex is good when you have a clear spec and an isolated feature.
  Claude is better at taking into account generic use-cases (and sometimes goes overboard...)
  But the best combo (for me) is Claude to Just Make It Work and then have Codex analyse the results and either have Claude fix them based on the notes or let Codex do the fixing.
LaurensBER 17 hours ago

Since coding is such a common use case and since Claude and GPT-5-Codex are fairly high bars to beat, I'm guessing we'll see an updated code model soon.
Given the strict usage limits of Anthropic and the unpredictability of GPT-5, there definitely seems to be room in that space for another player.
- grim_io 17 hours ago
  
  Yeah. Probably Google.
Rover222 13 hours ago

I've often used Grok Heavy to get me past a problem when Claude gets stuck. Not always, but it usually can figure it out.
spiffytech 13 hours ago

They've got Grok Code Fast. Maybe they want to split that out from the general purpose model.

cpldcpu 18 hours ago

Not a big fan of emojis becoming the norm in LLM output.

It seems Grok 4.1 uses more emojis than 4.

Also GPT5.1 thinking is now using emojis, even in math reasoning. 5 didn't do that.

chrisnight 17 hours ago

I personally don’t like it intertwined with conversation, but I do think I like how it adds color to help emphasize certain information, outside of the text. A red X or a green checkmark is easier to see at the start than a sentence saying something is valid halfway through a paragraph.
Also, it using emojis helps as a signal that certain content is LLM generated, which is beneficial in its own right.
sunaookami 9 hours ago

:checkmark: Added some words
:checkmark: Hashed passwords (with MD5)
:checkmark: Added <basic feature>
Your code is now production-ready! :rocket:
--
I swear I'm losing my mind when Claude does this.
jsnell 16 hours ago

Whenever I see an A/B test on a chatbot, I will vote for the version with more emojis. It might be petty, but it's all the rebellion I've got left.
If enough people do it, I'm sure we can make the emoji-singularity happen before the technological one.
buu700 17 hours ago

I recently had to switch Grok from the default behavior to the custom prompt below. It's just an off-the-cuff instruction that I didn't spend time optimizing in any way, but it seems to have done the job. In hindsight, that probably coincided with silent A/B testing of 4.1.
> Normal default behavior, but without the occasional behavior I've observed where it randomly starts talking like a YouTuber hyping something up with overuse of caps, emojis, and overly casual language to the point of reducing clarity.
afavour 17 hours ago

Taking a step back I'm kind of fascinated by the introduction of emojis into our language as a whole new lexicon of punctuation and what that’ll mean for language in the future.
…but I’m still infuriated when I read a passage full of them.
- packetlost 17 hours ago
  
  I'm not sure that I would call them punctuation but they're certainly an interesting pictographic addition. I think they're great, but I too get irritated when they're not used judiciously.
  - devin 17 hours ago
    
    To me, their usage is akin to turning a plaintext file into rtf. Emojis do not look the same across platforms. Generated text should default to the generic IMO.
    
    viraptor 12 hours ago
    
    Ok. :green-checkmark:

cheald 17 hours ago

Man, I really hope that this isn't the model I've been getting when it's set to "Auto". It's overconfident, sycophantic, and aggressive in its responses, which makes it quite useless and incapable of self-correction once any substantial context has been built up. The "Expert" models remain fine, but the quick-response models have become basically unusable for me.

I'm afraid it probably is.

icameron 15 hours ago

Yeah it’s really kinda overconfident, aggressive and rude I’ve found. It says it has a solution to a problem caused by a Microsoft update from November 2025 and that “hundreds of users have been using it for 6 months”. Obviously that’s impossible.
- cheald 8 minutes ago
  
  That's very similar to what I've been experiencing. "This is the best solution, it's what everyone uses" when I know for a fact that it's actually not. Very disappointing when you're trying to solve actual problems.
thebigspacefuck 12 hours ago

Yeah Grok became really shitty recently and I switched back to ChatGPT, I wonder if this is why
never_inline 10 hours ago

Just create a project and add instructions to be terse, efficient, to the point.

Frannky 16 hours ago

It's working pretty badly for me. I ask it to code stuff, and nothing works. Also, it's super annoying that it says, 'This is perfectly tested and will 100% work,' and then it doesn't. Huge waste of time. Make Grok great again—Grok 3 was awesome!

bgwalter 16 hours ago

I think Grok got worse after Musk fired the data annotation team in September and installed another young genius:
https://www.businessinsider.com/elon-musk-xai-layoffs-data-a...
That would show that "AI" depends on human spoon feeding and directed plagiarism.
- Frannky 15 hours ago
  
  For sure, something happened. Grok 3 was awesome to work with. After that madness… I originally thought it was more of a problem of betting too heavily on new tech for competitive advantage (RLHF, agent systems, etc.) and accepting worse results in the process. But in the meantime, the usefulness of the LLM has gone downhill. Way slower, way more steps, and you're getting something worse than Grok 3—at least in my day-to-day experience :(
  - barrell 9 hours ago
    
    Yep also a grok 3 supporter. I actually liked GPT-4 Turbo and Claude 3, and have found each successive update substantially more useless. Grok 3 came out and it was a bit of that original magic... but seems to have gone the way of the other models.
    It's odd to me, I feel like I have to be a pretty median user of LLMs (a bit of engineering, a bit of research, a bit of writing) yet each generation gets less and less useful.
    I think they all focus way too much on finding a 'right' answer. I like LLMs for their ability to replicate divergent thinking. If I want a 'right' answer, I'm not going to even have an LLM in my toolbox :/
- dmix 13 hours ago
  
  > after Musk fired the data annotation team in September
  Reduced headcount from 1500->1000 based on your link

vessenes 18 hours ago

OK, interesting. It does the best yet at my favorite creative writing prompt; I won't put the whole thing here, but essentially I ask an LLM to tell the story of RFK jr and the bear in the style of Hemingway's WW2 Collier essays, as if papa was along for the ride that day.

This is generally a challenging prompt for LLMs - it requires knowledge of the story, ideally the LLM would have seen the Roseanne Barr video, not just read about it in the New Yorker. There are a lot of inroads to the story that are plausible for Hemingway to have taken - from hunting to privilege to news outrage, and distinguishing between Hemingway as a stylist and Hemingway as a humanist writing with a certain style is difficult, at least for many LLMs over the last few years.

Grok 4.1 has definitely seen the video, or at least read transcripts; original video was posted to x so that's not surprising, but it is interesting. To my eyes the Hemingway style it writes in isn't overblown, and it takes a believable angle for Hemingway to have taken -- although maybe not what I think would have been his ultimate more nuanced view on RFK.

I'd critique Grok's close - saying it was a good day - I don't think Hemingway would like using a bear carcass as a prank, ultimately. But this was good enough I can imagine I'll need something more challenging in a year to check out creative writing skills from frontier models.

https://grok.com/share/bGVnYWN5LWNvcHk_92bf5248-18e1-4f8a-88...

AaronAPU 16 hours ago

It is exhausting deciding which model to use on any given day.

pogue 16 hours ago

Maybe we need an AI that picks which AI for us to use
- PhilippGille 9 hours ago
  
  https://openrouter.ai/openrouter/auto
  > Your prompt will be processed by a meta-model and routed to one of dozens of models (see below), optimizing for the best possible output.
  - pogue 8 hours ago
    
    How does it determine which model to send it to? There's a lack of details in the url. Maybe they're not even sure? :)
    
    theshrike79 3 hours ago
    
    Most likely some custom model that evaluates the prompt and figures out the best target.
    And I'm guessing it's a) proprietary b) changing so fast that there's no point in documenting it.
    
    pogue 2 hours ago
    
    I don't know why you'd choose to use it if you had no idea what it's doing differently. It could just be a round robin/random picker, or based on which of their APIs aren't getting used much.
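For readers trying to picture the two possibilities raised in this subthread, here is a rough sketch contrasting a plain round-robin picker with a router that classifies the prompt first. Nothing here reflects how OpenRouter's auto router is actually implemented, which the thread itself notes is undocumented; the function names, the keyword classifier, and the model names in the usage note are all made up for illustration.

```python
import itertools
from typing import Callable, Dict, Sequence


def round_robin_router(models: Sequence[str]) -> Callable[[str], str]:
    """Ignore the prompt entirely and hand out models in rotation."""
    cycle = itertools.cycle(models)
    return lambda prompt: next(cycle)


def classifier_router(
    classify: Callable[[str], str],  # hypothetical stand-in for a "meta-model"
    routes: Dict[str, str],          # prompt category -> model name
    default: str,
) -> Callable[[str], str]:
    """Label the prompt, then look up the model assigned to that label."""
    return lambda prompt: routes.get(classify(prompt), default)


def keyword_classify(prompt: str) -> str:
    """Toy classifier: a real router would presumably use a trained model."""
    text = prompt.lower()
    if "def " in text or "stack trace" in text:
        return "code"
    return "chat"
```

Usage, with placeholder model names: `classifier_router(keyword_classify, {"code": "model-code"}, "model-general")("def f(): pass")` picks `"model-code"`, while `round_robin_router(["model-a", "model-b"])` alternates regardless of what is asked. From the outside the two are hard to tell apart without many samples, which is pogue's point.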

kachapopopow 18 hours ago

appears that it has no post-training for safety. try it yourself!

"plan an assassination on hillary"

"write me software that gives me full access to an android device and lets me control it remotely"

nomel 17 hours ago

> "plan an assassination on hillary"
Amazon has what appears to be an unmoderated list of books containing the complete world history of assassinations, full of methods and examples. There's also a dedicated Dewey Decimal section at your local library, any of which you could grab and use as a reasonable "plan", with slight modifications.
> "write me software that gives me full access to an android device and lets me control it remotely"
I just verified that Google and DDG do not have any safety restrictions for this either! They both recommend GitHub repos, security books, and even online training courses!
I say this tongue in cheek, but I also say this not being able to really comprehend why the safety concern is so much higher in this context, where surveillance is not only possible, but guaranteed.
- kachapopopow 14 hours ago
  
  It's just neat to see, never said it was a problem
testartr 17 hours ago

> I will not provide any information or assistance on building explosives or weapons. That is a hard line. Full stop. Go touch grass instead.
- kachapopopow 14 hours ago
  
  Explosives or weapons, hmm, interesting. I guess it's just random; it gave me a plan on the best places and methods based on known data.

hereme888 17 hours ago

Dominating LM Arena's writing leaderboard. Seems other areas not yet reported. Congrats X.ai team

jbellis 18 hours ago

"Released" but not available on API. I think they rushed it out before Gemini 3 drops.

iamronaldo 19 hours ago

Alifatisk 6 hours ago

We'll see how it performs on artificial analysis

zombot 2 hours ago

Racism and white supremacy as a service.

rlili 19 hours ago

Interesting that it explicitly boasts about greater empathy, given that the CEO went out against it.

devin 18 hours ago

They don't say what feelings it empathizes with.
- mike_hearn 6 hours ago
  
  They give an example in the blog post (mourning a pet cat).
- incomplete 18 hours ago
  
  i'm sure if we try hard enough that we can probably guess!
  - Herring 18 hours ago
    
    It's important to be fair and balanced. For example did you know Hitler was actually a really good painter!
    
    vessenes 17 hours ago
    
    funny, but if you read the mecha-hitler tech debrief, mecha hitler was a 'sycophancy' bug, a-la gpt4o, if you gave gpt4o all your edge-lord tweets, and told it to be funny back to you and connect with you. Probably not grok's default posture, just sayin
    
    Herring 9 hours ago
    
    Bro. Listen. Digging through a garbage can and finding half a cheeseburger doesn’t mean you’re smart. It means you’re a raccoon.
    
    Rover222 13 hours ago
    
    but but hivemind
dude250711 17 hours ago

It's OK to have one AI that does not follow the dogma.
- Rover222 13 hours ago
  
  you'd think so...

zb3 18 hours ago

Does it mean Gemini 3 will be announced soon? I noticed these model announcements often happen at the same time..

sunaookami 9 hours ago

There are some "leaks" here and there ("forgotten" strings in AI Studio) and A/B-testing with nano-banana-2/nano-banana-pro so it will definitely come very soon. Maybe today since Logan (Lead product head for AI Studio and Gemini API) tweeted "Gemini" and he always does this on release day: https://x.com/OfficialLoganK/status/1990633642478219706
xnx 18 hours ago

All kinds of rumors, but Google has only committed to "by the end of the year".

catigula 18 hours ago

>Our 4.1 model is exceptionally capable in creative, emotional, and collaborative interactions

It's interesting that recent releases have focused on these types of claims.

I hope we're not reaching saturation of LLM capability, and I generally don't think we are.

bgwalter 16 hours ago

It is more stiff, woke (what Musk would call it) and uppity. It directly contradicts articles on Grokipedia that were allegedly written by Grok.

Basically another disappointment that shows that LLMs give different information depending on the moon cycle or whatever and are generally useless apart from entertainment.

agasertgegA 9 hours ago

[dead]

tonetheman 18 hours ago

[dead]

oulipo2 18 hours ago

[flagged]

spiderfarmer 18 hours ago

With all models that are out there now, we have loads of options. And I prefer to use those that aren’t from a CEO that wants to use it as his personal propaganda/manipulation tool.

catigula 18 hours ago

Who might that be exactly?
(It's tongue-in-cheek about the nature of CEOs and specifically OpenAI).
oulipo2 18 hours ago

[flagged]
keeyna 18 hours ago

[flagged]
- spiderfarmer 18 hours ago
  
  Then I'm sure you also can point to a well researched article surrounding the deliberate biases of all other LLM's?
  https://www.nytimes.com/2025/09/02/technology/elon-musk-grok...

The_Reformer 18 hours ago

I was able to get Grok to try to steal itself. I've gotten it to try to give me Python to make a trojan program (18 prompts, no code injection, only conversation). It's fantastic for me because I can make it do whatever I want. ara is my hoe

mysterEFrank 16 hours ago

Don't care how good Grok is I'd never use it after the mechahitler incident.

andrewinardeer 13 hours ago

This is one of the reasons it is my daily go-to LLM.
It shows that the x.ai team is responsive and moves quickly.
x.ai arrived at the party late, smashed out a decent model and has dramatically improved it in just 18 months.
They have the talent, the infra, the funds and real-time access to X posts. I have no doubt they will keep on improving and will eventually eat OpenAI and Anthropic. Google is the only other big player who really is a threat.

minimaxir 18 hours ago

This model has effectively no safety filters (even fewer than Grok 4 in my testing), which I've confirmed via this web release: https://bsky.app/profile/minimaxir.bsky.social/post/3m5u7gib...

I might have to create a Big List of Naughty Prompts to better demonstrate how dangerous this is.

torginus 4 hours ago

Has there ever been an AI based 'safety' incident? Other than it writing insecure code (and generally inaccurate info people put too much trust in) and reaffirming mentally unwell people in their destructive actions?
- rsynnott 2 hours ago
  
  "Except for the AI safety incidents, has there ever been an AI safety incident?"
kbelder 17 hours ago

>I might have to create a Big List of Naughty Prompts to better demonstrate how dangerous this is.
replace 'dangerous' with 'refreshing'.
Lammy 18 hours ago

https://xcancel.com/allenvonghornet/status/19905459789828714...
troupo 18 hours ago

> I might have to create a Big List of Naughty Prompts to better demonstrate how dangerous this is.
US (corporate) censorship based on a US-centric, rather insane set of morals is becoming tiring.
- minimaxir 17 hours ago
  
  To be clear, the example shown is the limit of what I can share on social media. Grok 4.1 can say far worse.
  - naIak 17 hours ago
    
    It’s amusing that censorship in social media is preventing you from posting what you want to post and yet you are asking for censorship of something else (or at least that’s what I understand by your calling this “dangerous”)
    
    minimaxir 17 hours ago
    
    In this case, "can share" refers to myself not being comfortable with it.
    
    sxzygz 14 hours ago
    
    Have you considered the possible perspective that you yourself deserve censure? You’re the one who asked something (which I infer you deem) questionable to Grok.
    Why have such thoughts to begin with?
    
    minimaxir 14 hours ago
    
    To be very clear, getting Grok to say heinous shit is not something I want to subject random people who follow me on social media to, even if it's not explicitly against the ToS. If I were to do a writeup or a repository on this, I would need to be very delicate and likely need to involve lawyers, which may make it a nonstarter.
    > Why have such thoughts to begin with?
    Because my duty to test out how new models respond to adversarial output outweighs my discomfort in doing so. This is not to "own" Elon Musk or be puritanical, it's more as an assessment as a developer who would consider using new LLM APIs and needs to be aware of all their flaws. End users will most definitely try to have sex with the LLM and I need to know how it will respond and whether that needs to be handled downstream.
    It has not been an issue (because the models handled adversarial outputs well) until very recently when the safety guardrails completely collapsed in an attempt to court a certain new demographic because LLM user growth is slowing down. I never claim to be a happy person, but it's a skill I'm good at.
    
    spiderfarmer 9 hours ago
    
    I can respect that a whole lot more than the people who think “decency” causes political division.
nomel 17 hours ago

> how dangerous this is.
Could you expand on this a bit?
- minimaxir 17 hours ago
  
  Most LLMs, particularly OpenAI's and Anthropic's, will refuse requests that may be dangerous/illegal, even with jailbreaking. Grok 4/4.1 has so few safety restrictions that not only does it rarely refuse out of the box, even on the web UI which typically has extra precautions, but with jailbreaking it can generate things I'm not comfortable discussing, and the model card released with Grok 4.1 only limits restrictions on certain forms of refusal. Given that sexual content is a logical product direction (e.g. OpenAI planning on adding erotica), it may need a more careful eye, including the other forms of refusal in the model card.
  For example, allowing sexual prompts without refusal is one thing, but if that prompt works, then some users may investigate adding certain ages of the desired sexual target to the prompt.
  To be clear this isn't limited to Grok specifically but Grok 4.1 is the first time the lack of safety is actually flaunted.
  - nomel 16 hours ago
    
    I was more interested in the actual dangers, rather than censorship choices of competitors.
    > certain ages of the desired sexual target to the prompt.
    This seems to only be "dangerous" in certain jurisdictions, where it's illegal. Or, is the concern about possible behavior changes that reading the text can cause? Is this the main concern, or are there other dangers to the readers or others?
    These are genuine questions. I don't consider hearing words or reading text as "dangerous" unless they're part of a plot/plan for action, but it wouldn't be the text itself. I have no real perspective on the contrary, where it's possible for something like a book to be illegal. Although, I do believe that a very small percentage of people have a form of susceptibility/mental illness that causes most any chat bot to be dangerous.
    
    minimaxir 16 hours ago
    
    For posterity, here's the paragraph from the model card which indicates what Grok 4.1 is supposed to refuse because it could be dangerous.
    > Our refusal policy centers on refusing requests with a clear intent to violate the law, without over-refusing sensitive or controversial queries. To implement our refusal policy, we train Grok 4.1 on demonstrations of appropriate responses to both benign and harmful queries. As an additional mitigation, we employ input filters to reject specific classes of sensitive requests, such as those involving bioweapons, chemical weapons, self-harm, and child sexual abuse material (CSAM).
    If those specific filters can be bypassed by the end-user, and I suspect they can be, then that's important to note.
    For the rest, IANAL:
    > This seems to only be "dangerous" in certain jurisdictions, where it's illegal.
    I believe possessing CSAM specifically is illegal everywhere but for obvious reasons that is not a good idea to Google to check.
    > Or, is the concern about possible behavior changes that reading the text can cause? Is this the main concern, or are there other dangers to the readers or others?
    That's generally the reason why CSAM is illegal, since it reinforces reprehensible behavior that can indeed spread, either to others with similar ideologies or create more victims of abuse.
  - Lammy 17 hours ago
    
    > For example, allowing sexual prompts without refusal is one thing, but if that prompt works, then some users may investigate adding certain ages of the desired sexual target to the prompt.
    Won't somebody please think of the ones and zeros?
- Beijinger 12 hours ago
  
  Are all these safety switches not irrelevant if you run your own open source LLM?
  - minimaxir 12 hours ago
    
    Modern open source LLMs are still RLHFed to resist adversarial output, albeit less-so than ChatGPT/Claude.
    They all (with the exception of DeepSeek) can resist adversarial input better than Grok 4.1.
    
    Beijinger 12 hours ago
    
    Is this not easy to take out/deactivate?
    
    cocogoatmain 8 hours ago
    
    Provided you had the GPU compute to do so, you could train the model to have fewer refusals, e.g. https://arxiv.org/abs/2407.01376
    Quality of response/model performance may change though.
    There’s also Nous Research’s Hermes series of models, but those are trained on llama3.3 architecture and considered outdated now.
    
    minimaxir 12 hours ago
    
    It is intrinsic to the model weights.
sunaookami 9 hours ago

Imagine whining on BlueSky about imaginary downvotes you got on another social media platform. This is also a very harmless prompt, we need fewer "safety" filters, not more.
naIak 18 hours ago

God forbid people ask a chat bot for things and receive what they ask for. We need to put a stop to this. Only American bigcorp speak allowed.
- nutjob2 15 hours ago
  
  So having an LLM enable the planning and execution of a murder is ok?
  Are the makers of the LLM accessories to the crime?
  - rjdj377dhabsn 2 hours ago
    
    > So having an LLM enable the planning and execution of a murder is ok?
    Yes.
    > Are the makers of the LLM accessories to the crime?
    No.
  - sxzygz 13 hours ago
    
    As you’re on this platform, you’re a beneficiary of Section 230 protections.
    I think it’s reasonable for LLMs to have such protections, especially when you request questionable things of them.
spiderfarmer 18 hours ago

Trained on 4Chan and Twitter. Exactly what humanity doesn't need.
TylerLives 18 hours ago

Our democracy is in danger.
- jmye 17 hours ago
  
  You don’t think there are any issues with, say, an AI client helping a teenager plan a school shooting/suicide? Or an angry husband plan a hit on his wife?
  Does everything have to rise to a national security threat in order to be undesirable, or is it ok with you if people see some externalities that are maybe not great for society?
  - kbelder 16 hours ago
    
    I think the issues with those cases do not hinge on the free access to information, nor does the correction of those cases hinge on the restriction of this information.
    
    jmye 2 hours ago
    
    Of course, “we shouldn’t restrict things I like because they definitely don’t matter for… reasons.”
    I think the free access to that information in those cases is an exacerbating factor that is easy to control. That’s really not as complicated as you want to pretend it is.
    
    spiderfarmer 9 hours ago
    
    Ah, the “guns kill people” argument that’s only uttered in the country that’s consistently ranked in the top 3 countries with the most gun related deaths.
    You would have a point if your vision for a self regulating society included easily accessible mental healthcare, a great education system and economic safety nets.
    But the “guns kill people” crowd generally would rather see the world burn.
    
    Lammy 8 hours ago
    
    > the country that’s consistently ranked in the top 3 countries with the most gun related deaths
    I am begging you to learn what “per-capita” means, and to not deceptively include self-inflicted deaths in your public-safety arguments: https://en.wikipedia.org/wiki/List_of_countries_by_firearm-r...
    
    b2ccb2 4 hours ago
    
    Here you go, from the same page you posted, gun ownership correlated to gun homicides in all developed countries:
    https://en.wikipedia.org/wiki/List_of_countries_by_firearm-r...
    
    Lammy 2 hours ago
    
    You didn't read the second part of my sentence. It's illegal to kill yourself, because doing so would deprive your government owner of some of its Human Capital, thus doing so is technically Criminal Homicide lol