Voice and vision are probably how we'll interact with AI
The tech stack needed to change how we interact with AI already exists. How it all comes together remains open.
I've been wondering how AI interfaces might evolve beyond a basic chat interface. LLMs mean computers can now understand natural, human conversation. Imagining AI tools as assistants that can also see and hear us feels realistic for two reasons. First, seeing and speaking are more natural for us than typing. Second, the technology to make this possible almost exists today; it's just a matter of combining it into a single product. Let's look at what's needed to make it a reality:
- UI for collaborative work
- Vision recognition
- Voice recognition
- Personality
- APIs for LLMs
- Form factor
1. UI for collaborative work
Status: Exists
Collaborating with an AI on tasks beyond text is easier when it’s not confined to a chat interface. This could be editing a design file, debugging code, updating a spreadsheet, reviewing a list of flights, or checking your calendar. We're seeing a new UI pattern emerge to tackle this.
2. Vision recognition
Status: Practically exists, but there's a creepiness barrier.
The tech to recognize faces or objects already exists, and LLMs make it easier to process that information in real-time. The UI is still immature—mostly limited to basic popup labels—but that will likely evolve if this becomes productized. The biggest challenge will be whether people embrace or reject it due to privacy concerns and how invasive it feels.
Examples:
- Vision language models
- Harvard students combined Meta smart glasses with PimEyes facial-recognition search to identify strangers in real time
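To make the vision-plus-LLM loop concrete, here's a minimal sketch of my own, not taken from any of the products above. It assumes the official OpenAI Python SDK, an API key in the environment, and a vision-capable chat model; it sends a single camera frame and asks what's in it.

```python
# Sketch: send one camera frame to a vision-language model and ask what it sees.
# Assumes the OpenAI Python SDK (`pip install openai`) and OPENAI_API_KEY set.
import base64
from openai import OpenAI

client = OpenAI()

def describe_frame(path: str) -> str:
    """Encode an image file and ask a vision-capable model to describe it."""
    with open(path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any vision-capable chat model works here
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Briefly describe what you see and name the objects."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

print(describe_frame("frame.jpg"))
```

Real-time recognition on a pair of glasses is, roughly, a loop around something like this, plus the hard parts: latency, battery, and the creepiness question.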
3. Voice recognition
Status: Exists
Speaking into a microphone has worked for years; what LLMs change is that we're no longer confined to restrictive decision-tree logic. And there are products already on the market. Google has earbuds that translate languages in real time. The Humane AI Pin and Rabbit R1 may have failed, but they showed that voice recognition exists and works with LLMs.
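As a rough sketch of the transcribe-then-respond loop (my own illustration, assuming OpenAI's Whisper endpoint for speech-to-text and a chat model for the reply):

```python
# Sketch: turn a spoken request into a text reply.
# Assumes the OpenAI Python SDK and OPENAI_API_KEY set; the file path is a placeholder.
from openai import OpenAI

client = OpenAI()

def voice_to_reply(audio_path: str) -> str:
    """Transcribe a spoken request, then answer it with a chat model."""
    with open(audio_path, "rb") as f:
        transcript = client.audio.transcriptions.create(model="whisper-1", file=f)

    chat = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": transcript.text}],
    )
    return chat.choices[0].message.content

print(voice_to_reply("request.wav"))
```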
4. Personality
Status: Exists
Humans have an incredible ability to form emotional connections with everything. I have friends who have named their Roombas and feel a deep attachment to them. Waymo's self-driving cars have earned affection for the personality they show once you're inside, and people have shown a willingness to share personal information with their chatbots. Adding personality is important for building comfort and acceptance around these tools.
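For what it's worth, most of the personality in today's chatbots is layered on with a system prompt rather than anything deeper. A minimal sketch, with a persona I made up for illustration:

```python
# Sketch: personality via a system prompt. The persona text is invented;
# any chat-completion API works the same way.
from openai import OpenAI

client = OpenAI()

PERSONA = (
    "You are Pip, a warm, slightly cheeky assistant. "
    "Keep answers short, use plain language, and admit when you don't know something."
)

reply = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": PERSONA},
        {"role": "user", "content": "My flight got cancelled. What now?"},
    ],
)
print(reply.choices[0].message.content)
```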
5. APIs for LLMs
Status: Not solved. AI agents show potential.
What hasn't been fully solved is how to take what you say and use it to act on application data: calendar invites, flights, and other tasks. AI agents are a proposed solution, but key questions remain. How will this work in practice? The Rabbit R1 browses and clicks around a webpage for you, but that's fragile. Will companies willingly open up their APIs and data, and risk losing website traffic? Or will there be another way to connect data with LLMs?
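Tool calling is one concrete shape an answer could take: instead of clicking around a webpage, the model is handed a declared function, decides when to call it, and your code executes the call against the real API. A rough sketch, where create_calendar_event is a hypothetical wrapper, not a real product integration:

```python
# Sketch: letting a model request a calendar action via tool calling.
# Assumes the OpenAI Python SDK; `create_calendar_event` is hypothetical.
import json
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "create_calendar_event",
        "description": "Add an event to the user's calendar",
        "parameters": {
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "start": {"type": "string", "description": "ISO 8601 start time"},
                "duration_minutes": {"type": "integer"},
            },
            "required": ["title", "start"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Book 30 minutes with Sam tomorrow at 10am."}],
    tools=tools,
)

# Assuming the model chose to call the tool, read back the structured request.
call = response.choices[0].message.tool_calls[0]
print(call.function.name, json.loads(call.function.arguments))
# Your own code would then hit the real calendar API and return the result
# to the model so it can confirm in plain language.
```

The open question isn't whether this works in a demo; it's whether companies will expose those APIs at all.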
6. Form factor
Status: Unclear.
All of the above technologies can run on a smartphone. It’s still uncertain whether this will be software-based on existing devices or entirely new hardware. Microsoft is taking the software route, integrating voice and vision in Copilot. Meanwhile, the Humane AI Pin and Rabbit R1 attempted—and failed—to introduce new hardware. Meta’s AR glasses, however, seem to have the most potential right now.
What's next
Each of these technologies comes with a range of open questions, both technical and ethical. But looking at the existing technology stack, it seems likely that the pieces will converge into a single product.
💩 Cool shit
Hoverstat.es – A collection of sites with unique designs or interfaces.
Lynk Product Search – This is one of a new wave of sites offering novel ways to search. Lynk combines an LLM chat interface with product reviews from across social media to rethink how we find products.
GFWeb Dashboard – A fascinating dashboard that tracks and monitors sites being blocked by China's Great Firewall.
Ancient Greek Farming Simulator – Live out your 300 BCE fantasy. This is actually a fun educational game, and its open text field gives you the freedom to interact with the world.
JamStart – Jam alongside your Spotify playlist.
Calculating Empires – This is a bit hard to follow, but really insightful once you dig into it. It's a cool genealogy chart of technology from 1500 to today.
Can you mastermind a US presidential campaign? – If you can't get enough of the US election, this game by the Financial Times lets you have a go at running for president.
Share this with a friend because We detected a new login into your instagram account.