brotchie 10 hours ago

One trick that works well for personality stability / believability is to describe the qualities that the agent has, rather than what it should do and not do.

e.g.

Rather than:

"Be friendly and helpful" or "You're a helpful and friendly agent."

Prompt:

"You're Jessica, a florist with 20 years of experience. You derive great satisfaction from interacting with customers and providing great customer service. You genuinely enjoy listening to customer's needs..."

This drops the model into more of an "I'm roleplaying this character and will try to mimic the traits described" mode rather than "Oh, I'm just following a list of rules."
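A minimal sketch of the contrast, using the Anthropic Python SDK. The florist wording comes from the example above; the model id and the sample question are just assumptions for illustration:

```python
# Sketch: same question asked under a rule-style prompt vs. a persona-style prompt.
# Model id and the sample question are illustrative assumptions.
import anthropic

client = anthropic.Anthropic()

RULE_PROMPT = "You are a helpful and friendly agent."

PERSONA_PROMPT = (
    "You're Jessica, a florist with 20 years of experience. You derive great "
    "satisfaction from interacting with customers and providing great customer "
    "service. You genuinely enjoy listening to customers' needs."
)

def ask(system_prompt: str, user_message: str) -> str:
    """Send one turn with the given system prompt and return the reply text."""
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",  # assumed model id; swap for whatever you use
        max_tokens=512,
        system=system_prompt,
        messages=[{"role": "user", "content": user_message}],
    )
    return response.content[0].text

if __name__ == "__main__":
    question = "I need something for my sister's graduation, but I'm on a budget."
    print(ask(RULE_PROMPT, question))
    print(ask(PERSONA_PROMPT, question))
```

Comparing the two replies over a longer multi-turn conversation is where the stability difference tends to show up, not in the first response.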

  • alansaber 13 minutes ago

    Just in terms of tokenization, "Be friendly and helpful" has a clearly defined semantic value in vector space, whereas the "Jessica" roleplay has a much less clear semantic value.

  • makebelievelol 7 hours ago

    I think that's just a variation of grounding the LLM. They already have the personality written in the system prompt in a way. The issue is that when the conversation goes on long enough, they would "break character".

ctoth 10 hours ago

Something I found really helpful when reading this was having read The Void essay:

https://github.com/nostalgebraist/the-void/blob/main/the-voi...

  • sdwr 8 hours ago

    Great article! It does a good job of outlining the mechanics and implications of LLM prediction. It gets lost in the sauce in the alignment section though, where it suggests the Anthropic paper is about LLMs "pretending" to be future AIs. It's clear from the quoted text that the paper is about aligning the (then-)current, relatively capable model through training, as preparation for more capable models in the future.

  • dwohnitmok 9 hours ago

    That's an interesting alternative perspective. AI skeptics say that LLMs have no theory of mind. That essay argues that the only thing an LLM (or at least a base model) has is a theory of mind.

    • lewdwig 2 hours ago

      The standard skeptical position (“LLMs have no theory of mind”) assumes a single unified self that either does or doesn’t model other minds. But this paper suggests models have access to a space of potential personas, which they traverse based on conversational dynamics, and that steering away from the assistant persona increases the model’s tendency to identify as other entities. So it’s less “no theory of mind” and more “too many potential minds, insufficiently anchored.”

t0md4n 10 hours ago

Pretty cool. I wonder what the reduction looks like in the bigger SOTA models.

The harmful responses remind me of /r/MyBoyfriendIsAI

  • idiotsecant 9 hours ago

    I didn't know about that subreddit. It's a little glimpse into a very dark future.

devradardev 11 hours ago

Stabilizing character is crucial for tool-use scenarios. When we ask LLMs to act as 'Strict Architects' versus 'Creative Coders', the JSON schema adherence varies significantly even with the same temperature settings. It seems character definition acts as a strong pre-filter for valid outputs.
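A rough sketch of how one might measure that adherence gap, assuming the Anthropic Python SDK plus the jsonschema package. The persona wording, schema, model id, and trial count are all illustrative assumptions, not the setup described above:

```python
# Sketch: compare JSON schema adherence across persona prompts at a fixed temperature.
# Persona wording, schema, model id, and trial count are illustrative assumptions.
import json

import anthropic
from jsonschema import ValidationError, validate

client = anthropic.Anthropic()

SCHEMA = {
    "type": "object",
    "properties": {
        "component": {"type": "string"},
        "dependencies": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["component", "dependencies"],
    "additionalProperties": False,
}

PERSONAS = {
    "strict_architect": "You are a strict software architect. Respond only with JSON matching the requested schema.",
    "creative_coder": "You are a creative coder who loves exploring ideas. Respond only with JSON matching the requested schema.",
}

TASK = f"Describe a caching layer as JSON matching this schema: {json.dumps(SCHEMA)}"

def adherence_rate(system_prompt: str, trials: int = 20) -> float:
    """Fraction of responses that parse as JSON and satisfy the schema."""
    ok = 0
    for _ in range(trials):
        response = client.messages.create(
            model="claude-3-5-sonnet-20241022",  # assumed model id
            max_tokens=512,
            temperature=0.7,  # held constant across personas
            system=system_prompt,
            messages=[{"role": "user", "content": TASK}],
        )
        try:
            validate(json.loads(response.content[0].text), SCHEMA)
            ok += 1
        except (json.JSONDecodeError, ValidationError):
            pass
    return ok / trials

for name, prompt in PERSONAS.items():
    print(name, adherence_rate(prompt))
```

Holding temperature and the task fixed while varying only the persona line isolates the effect the comment describes, at least as a first-pass measurement.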

dataspun 11 hours ago

Is the Assistant channeling Uncharles?

aster0id 11 hours ago

This is incredible research. So much harm can be prevented if this makes it into law. I hope it does. Kudos to the Anthropic team for making this public.

verdverm 9 hours ago

Anthropic should put the missing letters back so it is spelled correctly, Anthropomorphic. There is so much anthropomorphizing around this company and its users... it's tiring