XIP-36: Supporting Frames In XMTP

Ever since Frames Friday, the XMTP community has been exploring how we can integrate Frames into the private messaging experience of XMTP. This draft XIP explores one possible integration point.

Abstract

What new tooling is required for XMTP Client Apps to be able to render and interact with Frames inside XMTP conversations? This XIP takes a deep dive into the Farcaster Frames spec and evaluates how compatible it is with XMTP.

Motivation

Developers on XMTP have been asking for a way to embed generic interactive content into messages for a long time. XMTP Content Types make sense for the head-of-the-curve use cases (replies, reactions, media embeds) where the requirements are well known ahead of time, but there is a very long tail of demands from developers. Making application devs support hundreds of different content types to handle all those needs was never going to be a practical solution. We need a “one size fits all” way of enabling the long tail.

We could build a competing protocol that would work for this, but I don’t see why we would create a new standard when one already exists and has traction. The question is: how could we extend Frames to work inside XMTP conversations, and what are the trade-offs?

The goal here is to allow existing Frames developers to reach users in private DM channels outside of Farcaster, as well as in public Farcaster posts, with minimal code changes for the Frame developer.

Specification

There are three new capabilities required for client applications to be able to render and interact with Frames.

  1. Rendering. Client apps need to be able to render the “initial frame” before any interaction.
  2. Interaction. Client apps need to be able to interact with Frames, and the HTTP POST requests in those interactions need to include signed content that can irrefutably identify the sender as the holder of an XMTP identity.
  3. Verification. Frame developers need to be able to read the HTTP POST requests from #2 and verify the signatures, allowing them to provably know who clicked the button.

Rendering

Users already include URLs in standard XMTP ContentTypeText messages. Some client applications choose to render link previews for those URLs. Frames would just be an extension of that link preview functionality.

In order to render a Frame, developers just need to be able to render some special HTML meta tags into UI, similar to how they already render OpenGraph tags. This is a typical frame payload:

<meta property="fc:frame" content="vNext" />
<meta property="fc:frame:image" content="http://...image-question.png" />
<meta property="fc:frame:button:1" content="Green" />
<meta property="fc:frame:button:2" content="Purple" />
<meta property="fc:frame:button:3" content="Red" />
<meta property="fc:frame:button:4" content="Blue" />
<meta
  property="fc:frame:post_url"
  content="https://my-frame-server.com/api/post"
/>

XMTP client app devs don’t need any special help from us to be able to render an existing Farcaster Frame.

The problem here is that some Frames will be able to support POST requests from XMTP clients, while others will only be able to receive POSTs from Farcaster clients that include a signed payload with a fid. I am proposing that Frame developers add a new meta tag to their Frame HTML to tell clients that the Frame is capable of receiving XMTP responses.

<meta
  property="xmtp:frame:post-url"
  content="https://my-frame-server.com/api/post/xmtp"
/>
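
As a rough illustration of the rendering step, here is a minimal Python sketch that pulls the Frame tags out of a page and checks whether the Frame has opted in to XMTP responses. It uses only the standard library. The names FrameMetaParser and parse_frame are made up for this example and are not taken from the prototype library, and the sketch assumes at most four buttons, as in the example above.

```python
from html.parser import HTMLParser


class FrameMetaParser(HTMLParser):
    """Collects <meta property="..." content="..."> pairs for Frame tags."""

    def __init__(self):
        super().__init__()
        self.tags = {}

    def handle_starttag(self, tag, attrs):
        # Self-closing <meta ... /> tags are also routed here by HTMLParser.
        if tag != "meta":
            return
        attr = dict(attrs)
        prop, content = attr.get("property"), attr.get("content")
        if prop and content is not None and prop.startswith(("fc:frame", "xmtp:frame")):
            self.tags[prop] = content


def parse_frame(html):
    """Return a dict describing the Frame, or None if the page is not a Frame."""
    parser = FrameMetaParser()
    parser.feed(html)
    tags = parser.tags
    if "fc:frame" not in tags:
        return None  # Not a Frame; fall back to an ordinary link preview.
    # Assumption: at most four buttons, numbered from 1.
    keys = [f"fc:frame:button:{i}" for i in range(1, 5)]
    return {
        "image": tags.get("fc:frame:image"),
        "buttons": [tags[k] for k in keys if k in tags],
        # None means the Frame has not declared support for XMTP POSTs.
        "xmtp_post_url": tags.get("xmtp:frame:post-url"),
    }
```

A client could render the image and buttons either way, and only enable the buttons when xmtp_post_url is present.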

We have developed a prototype implementation of a rendering helper library here.

Interaction

When a user clicks a button in a Frame, the application POSTs a signed message to the frame URL (or a different URL specified in the initial frame by the post_url tag).

This is an example Farcaster Frame payload:

{
  "untrustedData": {
    "fid": 2,
    "url": "https://fcpolls.com/polls/1",
    "messageHash": "0xd2b1ddc6c88e865a33cb1a565e0058d757042974",
    "timestamp": 1706243218,
    "network": 1,
    "buttonIndex": 2,
    "castId": {
      "fid": 226,
      "hash": "0xa48dd46161d8e57725f5e26e34ec19c13ff7f3b9"
    }
  },
  "trustedData": {
    "messageBytes": "d2b1ddc6c88e865a33cb1a565e0058d757042974..."
  }
}

The trustedData.messageBytes is a serialized Protobuf of a signed Farcaster message, with the message payload matching this Protobuf schema:

message FrameActionBody {
  bytes frame_url = 1;    // The URL of the frame app
  bytes button_index = 2; // The index of the button that was clicked
  CastId cast_id = 3;     // The cast which contained the frame URL
}

// MessageType and MessageData are extended to support the FrameAction
enum MessageType {
  .....
  MESSAGE_TYPE_FRAME_ACTION = 13;
}

message MessageData {
  oneof body {
    ...
    FrameActionBody frame_action_body = 16;
  }
}

Developers are expected to call the validateMessage API on a hub, which would return a response that validates the signatures on the message.

{
  "valid": true,
  "message": {
    "data": {
      "type": "MESSAGE_TYPE_FRAME_ACTION",
      "fid": 21828,
      "timestamp": 96774342,
      "network": "FARCASTER_NETWORK_MAINNET",
      "frameActionBody": {
        "url": "aHR0cDovL2V4YW1wbGUuY29t",
        "buttonIndex": 1,
        "castId": {
          "fid": 21828,
          "hash": "0x1fd48ddc9d5910046acfa5e1b91d253763e320c3"
        }
      }
    },
    "hash": "0x230a1291ae8e220bf9173d9090716981402bdd3d",
    "hashScheme": "HASH_SCHEME_BLAKE3",
    "signature": "8IyQdIav4cMxFWW3onwfABHHS9IroWer6Lowo16AjL6uZ0rve3TTFhxhhuSOPMTYQ8XsncHc6ca3FUetzALJDA==",
    "signer": "0x196a70ac9847d59e039d0cfcf0cde1adac12f5fb447bb53334d67ab18246306c"
  }
}
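
To make the shape of that response concrete, here is a small Python sketch of how a Frame server might read it once it has the JSON back from the hub. The function name summarize_validation is invented for this example. It assumes the response has already been parsed into a dict, and that the url field is base64-encoded as it is in the sample above.

```python
import base64


def summarize_validation(response):
    """Return the fields a Frame server usually needs, or None if invalid."""
    if not response.get("valid"):
        return None
    data = response["message"]["data"]
    body = data["frameActionBody"]
    return {
        "fid": data["fid"],
        "button_index": body["buttonIndex"],
        # In the sample response the frame URL is base64-encoded bytes.
        "url": base64.b64decode(body["url"]).decode("utf-8"),
    }
```

For the sample response above, the decoded url is http://example.com.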

Other Farcaster APIs can then be used to link the fid to a blockchain account, and to retrieve other data from the user’s profile.

There are clearly parts of this flow that are very Farcaster-specific. It references fid, cast_id, and network. I don’t think any of this is a blocker to interoperability, but it will require a little more branching code for developers to support both XMTP and Farcaster Frames.

At a high level, what developers are doing with all this code is validating three things:

  1. That the POST coming in was from a legitimate button click
  2. Which button was clicked
  3. Who clicked the button

With that information, the developer then responds with HTML outlining an update to the Frame’s state using the same format as the initial render.
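
For illustration, the response step could look something like the Python sketch below, which builds the meta tags for the next Frame state in the same format as the initial render. The function name render_frame and the example URLs are placeholders of my own, not part of any Frames library.

```python
from html import escape


def render_frame(image_url, buttons, post_url):
    """Build the meta tags for the next Frame state."""
    lines = [
        '<meta property="fc:frame" content="vNext" />',
        f'<meta property="fc:frame:image" content="{escape(image_url, quote=True)}" />',
    ]
    # Buttons are numbered from 1, matching the initial frame format.
    for index, label in enumerate(buttons, start=1):
        lines.append(
            f'<meta property="fc:frame:button:{index}" '
            f'content="{escape(label, quote=True)}" />'
        )
    lines.append(
        f'<meta property="fc:frame:post_url" content="{escape(post_url, quote=True)}" />'
    )
    return "\n".join(lines)


# Example: after a vote, show a results image with a single button.
next_frame = render_frame(
    "https://example.com/results.png",
    ["Vote again"],
    "https://my-frame-server.com/api/post",
)
```

Escaping the attribute values matters here because button labels and URLs may come from user-controlled state.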

Over the weekend Coinbase released FrameKit, with some high-level APIs designed to verify Frame POST payloads: getFrameAccountAddress, getFrameMetadata, and getFrameValidatedMessage. XMTP already has all the primitives needed to craft payloads that can enable these same workflows.

We have developed a prototype implementation of a privacy-preserving interaction library here.

Verification

With the goal of making things as straightforward for existing Frames devs as possible, here is a proposed interaction payload structure for an XMTP Frame.

{
  "untrustedData": {
    "wallet_address": "0x12345...",
    "url": "https://fcpolls.com/polls/1",
    "timestamp": 1706243218,
    "buttonIndex": 2
  },
  "trustedData": {
    "messageBytes": "d2b1ddc6c88e865a33cb1a565e0058d757042974..."
  }
}
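
As a sketch of how a client might assemble this JSON, the Python below wraps already-signed message bytes in the envelope shown above. The function name build_post_body is invented for this example, and hex-encoding messageBytes is an assumption based on how the sample looks, not something this proposal has settled.

```python
import json


def build_post_body(wallet_address, frame_url, button_index, timestamp, message_bytes):
    """Serialize the proposed XMTP Frame POST body as a JSON string."""
    return json.dumps(
        {
            # Convenience copy of the fields; servers must not trust these.
            "untrustedData": {
                "wallet_address": wallet_address,
                "url": frame_url,
                "timestamp": timestamp,
                "buttonIndex": button_index,
            },
            # Assumption: the signed Protobuf bytes are hex-encoded.
            "trustedData": {"messageBytes": message_bytes.hex()},
        }
    )
```

The untrustedData values should mirror what was signed, so a server can compare them against the verified contents of messageBytes.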

The messageBytes would be an XMTP-specific Protobuf message that would include the payload and the signature. Something like:

message FrameActionBody {
  string frame_url = 1;
  uint32 button_index = 2;
  uint64 timestamp = 3;
}

message FrameAction {
  Signature signature = 1; // XMTP already has signature types that can be used here
  SignedPublicKeyBundle signed_public_key_bundle = 2; // The SignedPublicKeyBundle of the signer, used to link the XMTP signature with a blockchain account through a chain of signatures.
  bytes action_body = 3; // Serialized FrameActionBody message, so that the signature verification can happen on a byte-perfect representation of the message
}
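
To show what “byte-perfect” means in practice, here is a hand-rolled Python sketch of the Protobuf wire encoding for the proposed FrameActionBody. Real clients would use generated Protobuf code instead; the function names here are my own. Protobuf does not guarantee that two implementations serialize the same message to identical bytes, which is why the signature covers the serialized action_body rather than the parsed fields.

```python
def encode_varint(value):
    """Encode a non-negative integer as a Protobuf base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # More bytes follow.
        else:
            out.append(byte)
            return bytes(out)


def encode_frame_action_body(frame_url, button_index, timestamp):
    """Serialize FrameActionBody, omitting default values as proto3 does."""
    parts = []
    url_bytes = frame_url.encode("utf-8")
    if url_bytes:
        # Field 1, wire type 2 (length-delimited): tag byte 0x0A.
        parts += [b"\x0a", encode_varint(len(url_bytes)), url_bytes]
    if button_index:
        # Field 2, wire type 0 (varint): tag byte 0x10.
        parts += [b"\x10", encode_varint(button_index)]
    if timestamp:
        # Field 3, wire type 0 (varint): tag byte 0x18.
        parts += [b"\x18", encode_varint(timestamp)]
    return b"".join(parts)
```

The resulting bytes are what the client would sign and place in action_body.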

These payloads are meaningfully different from the Farcaster payloads. We can provide some helper libraries that would allow a Frames developer to take the JSON payload above and get back the information required to move forward to the next Frame (isValid, verifiedSenderWalletAddress, buttonIndex). Because XMTP is a private network rather than a public one, our goal should be to include the minimum amount of metadata needed to make this work.

On the sender side (the person clicking the button), XMTP SDKs already have generic interfaces for signing messages. All we need to do is define the payload format, and any application developer should be able to generate a payload matching this format and transmit it directly to the frame URL (see Security Considerations below).

The FrameAction includes a signature of the FrameActionBody encoded as bytes. To verify and recover the wallet address from this payload a developer must:

  1. Recover the public key from the signature
  2. Verify that the public key matches the identity key in the SignedPublicKeyBundle
  3. Recover the wallet address from the signature in the SignedPublicKeyBundle
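
The control flow of those three steps can be sketched in Python as below. This is not the prototype library: the function name, the dict layout of the action, and the two injected callables (recover_public_key and recover_wallet_address) are all hypothetical stand-ins for the real signature-recovery primitives, which are out of scope for a short example.

```python
def verify_frame_action(action, recover_public_key, recover_wallet_address):
    """Return the verified wallet address, or None if verification fails.

    `action` is a dict with the keys "signature", "signed_public_key_bundle"
    and "action_body" (the serialized FrameActionBody bytes). Both callables
    are hypothetical placeholders supplied by the caller.
    """
    bundle = action["signed_public_key_bundle"]
    # Step 1: recover the public key that signed the serialized body.
    signer_key = recover_public_key(action["signature"], action["action_body"])
    # Step 2: the signer must be the identity key from the bundle.
    if signer_key != bundle["identity_key"]:
        return None
    # Step 3: recover the wallet address from the bundle's own signature.
    return recover_wallet_address(bundle)
```

Only after this returns an address should the server decode action_body and act on the button index it contains.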

We have developed a prototype implementation of a verification library that does the above here.

We could choose to add some additional metadata to this payload (message_id, conversation_id come to mind) that would give the Frame developer more context on where the click originated from. The trade-off here is privacy. All messages on the XMTP network are stored in topics. Topics have an opaque identifier, and only the participants of a conversation know that a given topic is “theirs”. Adding either the message_id or conversation_id to the payload would leak that information to the Frame developer, which would allow the Frame developer to track how many messages have been sent in a particular conversation. That feels like an unnecessary privacy leak and should be avoided unless there is a very strong need for that information to be shared with the Frame developer.

Backward Compatibility

All Frames would be sent through ordinary XMTP text messages. Client applications can choose to expand URLs found in these messages into Frames at their discretion. No new Content Types are required, and no new SDK changes are necessary for XMTP SDKs that allow application developers to sign messages with the XMTP identity key.

The new helper libraries developed to facilitate Frames rendering, interaction, and verification are strictly opt-in.

Reference Implementations

We have developed reference implementations for all three of the components required to enable Frames on XMTP:

  1. The rendering/interaction helper library (needs some updates to accommodate changes to the POST message schema made since development)
  2. The POST payload verification helper library (needs some updates to accommodate changes to the POST message schema made since development)
  3. The Open Graph Proxy service (still missing support for proxying requests for images, which will be a hard requirement for launch).

Open questions

  1. Where does the verification library live? Would be great if we could just get it integrated into Coinbase’s FrameKit
  2. What improvements can we propose to the Frames spec to enable even higher quality applications?
  3. How do we maintain compatibility as Farcaster releases updates to the Frames specification? It is currently in alpha and the API should be considered unstable.
  4. How do we protect Frames where the default image includes sensitive data about the user? Frame images must be public.

Security Considerations

In the Farcaster model of Frames, messages are signed on a server by the application and the server is the one doing the POST request to the Frame URL. This means that the Frame app developer only sees requests coming from the server IP and never the client. Farcaster clients such as Warpcast, however, do download the Frame image directly. This leaks the IP address of the viewer to the Frame developer. According to a reply on X, they are “working on it”.

In the proposed scheme above, messages would be signed and sent directly from the client. Privacy is a hard requirement for any adoption of Frames on the XMTP network. Without it, I can create a maliciously crafted Frame that logs IP addresses and send a link to it to an anon account. This would allow me to learn their IP the moment they view the message. We absolutely cannot leak client IP addresses to Frame app developers in either the Rendering or Interaction phase.

This can be solved by having developers route these requests through a proxy server to anonymize the sender. I’ve already started prototyping what a simple Frame proxy would look like. This proxy server should be used for the initial frame rendering, downloading of the Frame image, and sending the interaction POST requests. Client app developers can host their own instance of this open source proxy, and I propose XMTP Labs should run an instance as a public good. Developers can also use this proxy server to privately gather the information needed for link previews, which is a nice added bonus.
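
On the client side, routing through a proxy could be as simple as rewriting the URL before making the request. The Python sketch below is only meant to convey the idea: the function name and the url query-parameter layout are hypothetical and not taken from the prototype proxy.

```python
from urllib.parse import urlencode


def proxied_url(proxy_base, target_url):
    """Wrap a Frame URL so the proxy, not the client, contacts the Frame server."""
    # Hypothetical layout: the target is passed as a single encoded parameter.
    return f"{proxy_base.rstrip('/')}/?{urlencode({'url': target_url})}"
```

The same wrapper would apply to the initial frame fetch, the Frame image download, and the interaction POST, so the Frame server only ever sees the proxy's IP address.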

At some scale this becomes challenging. Signal previously used a proxy for link previews, but because of their massive scale they started getting blocked by popular websites like YouTube and had to roll the feature back. Having many proxy services instead of a single proxy will help avoid this problem, but if usage grows large enough we will need to reconsider the approach.


Here’s a quick preview of a current implementation of Frames in our example chat app.