XIP-5 Message Content Types

The draft is ready for review by the wider XMTP community. It should allow broad participation in the evolution of this large and complex aspect of the protocol.

All feedback is welcome, especially concerning the proposal’s ability to fulfill its goals. Let us know how it does or does not help you with what you would like to do with the protocol. You can provide feedback in this topic; the full text of the XIP is copied below.

If your concerns relate to the associated prototype PR, feel free to comment there on GitHub.

Abstract

This XIP introduces a framework for interoperable support of different types of content in XMTP messages. At the heart of it are provisions for attaching meta-information to the content that will identify its type and structure, and allow for its correct decoding from the encoded form used for transport inside XMTP messages.

The XIP envisions community-based, iterative development of a library of content types over time. Content type identifiers are scoped to allow different entities to define their own. The proposed framework provides an interface for registering content type codecs with the client for transparent encoding and decoding of content.

This XIP is not intended to define content types themselves; those should be proposed through separate XRCs. The only content type defined here is a simple plain text type identified as xmtp.org/text.

Motivation

The API currently accepts only a string as the message content, which suggests to the user that only plain text is supported. Given the ambition of building a community around the protocol that would be motivated to build a wide array of very different clients and applications around it, we need to enable at least the future possibility of using types of content beyond plain text (rich text, images, video, sound, file attachments, etc.).

However, given that this is a large and complex topic, we don’t want to have to solve it all right now. We want a flexible framework that will allow building a rich library of various types of supported content over time. This library should be open to collaboration with other organizations. The framework should be simple, but powerful enough to not hinder future development. The framework should also provide a reasonably friendly API that isn’t too onerous to use. This XIP forms an explicit foundation for future XRCs proposing new types of content to be carried by the protocol.

To support future evolution, both the type identifier itself and the types need a way to version their definitions. Specific types may require additional parameters that apply to those types only; these parameters should carry the metadata necessary for correct decoding and presentation of the content.

It is expected that many clients will want the ability to carry multiple different types of content in the same message. To keep the basic framework simple, the expectation is to handle such payload in dedicated structured content types that will be defined in the future.

Since the set of known content types will be changing over time, clients will need to be ready to handle situations where they cannot correctly decode or present content that arrives in a message. The basic framework should offer an optional fallback that can carry a description of the content that couldn’t be presented.

Specification

Protocol

At the network level the Message payload is currently represented by Ciphertext

message Ciphertext {
    message AES256GCM_HKDFSHA256 {
        bytes hkdfSalt = 1;
        bytes gcmNonce = 2;
        bytes payload = 3;
    }
    oneof union {
        AES256GCM_HKDFSHA256 aes256GcmHkdfSha256 = 1;
    }
}

There is no reason to expose the content type meta information unencrypted, so it makes sense to define a new type that will be embedded in the Ciphertext.payload bytes. That means there will be two layers of protobuf encoding. Let’s refer to the outer encoding layer that turns the entire message into bytes as Message Encoding and the inner layer that turns the payload into bytes as Content Encoding. The content itself needs to be encoded into bytes as well in a manner that is dictated by the content type. For text content that would usually involve employing a standard character encoding, like UTF-8.

message EncodedContent {
  ContentTypeId contentType = 1;
  map<string, string> contentTypeParams = 2;
  optional string contentFallback = 3;
  bytes content = 4;
}

The full encoding process will go through the following steps:

  1. Encode the content into its binary form
  2. Wrap it into the EncodedContent structure and encode it using protobuf (content encoding)
  3. Encrypt the EncodedContent bytes and wrap those in the Ciphertext structure
  4. Wrap the Ciphertext in the Message structure and encode it using protobuf (message encoding)

The encoded Message is then further wrapped in transport protocol envelopes. The decoding process is the reverse of the above steps.
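
The layering described by the steps above can be sketched as follows. This is an illustration of the nesting only: JSON stands in for protobuf, a trivial XOR stands in for the real encryption, and every name in the block (`serialize`, `toyCipher`, `encodeMessage`, `decodeMessage`, `WireContent`, `WireMessage`) is hypothetical rather than part of the SDK.

```typescript
// Sketch of the layering only. Assumptions: JSON stands in for protobuf and
// a XOR "cipher" stands in for real encryption; neither is what XMTP uses.
const utf8Encoder = new TextEncoder()
const utf8Decoder = new TextDecoder()

// Placeholder serializer: turns a JSON-compatible value into bytes.
function serialize(value: unknown): Uint8Array {
  return utf8Encoder.encode(JSON.stringify(value))
}

function deserialize<T>(bytes: Uint8Array): T {
  return JSON.parse(utf8Decoder.decode(bytes)) as T
}

// Placeholder cipher: XOR every byte with a constant. Applying it twice
// restores the input, so it serves as both "encrypt" and "decrypt".
function toyCipher(bytes: Uint8Array): Uint8Array {
  return bytes.map((b) => b ^ 0x5a)
}

type WireContent = {
  contentType: { authorityId: string; typeId: string; versionMajor: number; versionMinor: number }
  contentTypeParams: Record<string, string>
  content: number[] // JSON cannot hold a Uint8Array directly
}
type WireMessage = { ciphertext: { payload: number[] } }

function encodeMessage(text: string): Uint8Array {
  // 1. Encode the content into its binary form.
  const content = utf8Encoder.encode(text)
  // 2. Wrap it with its type metadata and serialize (content encoding).
  const encoded: WireContent = {
    contentType: { authorityId: 'xmtp.org', typeId: 'text', versionMajor: 1, versionMinor: 0 },
    contentTypeParams: {},
    content: Array.from(content),
  }
  const contentBytes = serialize(encoded)
  // 3. Encrypt the serialized content and wrap it in a ciphertext structure.
  const message: WireMessage = {
    ciphertext: { payload: Array.from(toyCipher(contentBytes)) },
  }
  // 4. Serialize the whole message (message encoding).
  return serialize(message)
}

// Decoding reverses the steps: message, ciphertext, content wrapper, content.
function decodeMessage(bytes: Uint8Array): string {
  const message = deserialize<WireMessage>(bytes)
  const contentBytes = toyCipher(Uint8Array.from(message.ciphertext.payload))
  const encoded = deserialize<WireContent>(contentBytes)
  return utf8Decoder.decode(Uint8Array.from(encoded.content))
}
```

Because the content type metadata is serialized before the encryption step, it never appears in the clear in the outer message, which is the point of embedding it in the payload.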

We will not introduce a separate version for the embedded type; it can be tied to the version of the overall protocol.

Content Type Identifier and Parameters

ContentTypeId identifies the type and format of the information contained in the content. It needs to carry enough information to be able to route the decoding process to the correct decoding machinery. As such, the identifier should carry the following bits of information:

  • authority ID
  • content type ID
  • content type version

Identifier format is tied to the protocol version. Changes to the format will require corresponding protocol version adjustment. Such changes MAY add new information to the identifier but it SHOULD only be information that is required to match the identifier with the corresponding decoding machinery or information that is required for all content types (like the content type version). Any information that is specific to a content type SHOULD be carried in the content type parameters field. Here is the definition of the identifier type:

message ContentTypeId {
  string authorityId = 1;  // authority governing this content type
  string typeId = 2;  // type identifier
  uint32 versionMajor = 3; // major version of the type
  uint32 versionMinor = 4; // minor version of the type
}

Authority ID identifies the entity that governs a suite of content types, their definitions and implementations. xmtp.org is one such organization. Authority ID SHOULD be unique and be widely recognized as belonging to the entity. DNS domains or ENS names can serve this purpose (e.g. uniswap.eth). The authority is responsible for providing a definition of the content type and its encoding parameters as well as the associated implementation. Any content type MUST have well defined parameters (or clearly state that no parameters are required/allowed), and any implementation MUST support all valid parameters for the content type.

Type ID identifies a particular type of content that can be handled by a specific implementation of its encoding/decoding rules. Content type version allows future evolution of the content type definition.

Type version is captured in the common major.minor form, intended to convey the associated semantics that versions differing in the minor version only MUST be backward compatible, i.e. a client supporting an earlier version MUST be able to adequately present content with a later version. The content type authority MUST manage the evolution of the content type in a manner that respects this constraint.
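
One way a client might apply the major.minor rule when deciding whether it can present incoming content is sketched below. The function name `canPresent` and the exact policy are assumptions made for illustration; the XIP itself only states the compatibility requirement.

```typescript
type ContentTypeId = {
  authorityId: string
  typeId: string
  versionMajor: number
  versionMinor: number
}

// Returns true when a client that supports `supported` should be able to
// present content marked as `incoming`: same authority, same type and same
// major version. The minor version is deliberately ignored, because versions
// that differ only in the minor number must remain compatible.
function canPresent(supported: ContentTypeId, incoming: ContentTypeId): boolean {
  return (
    supported.authorityId === incoming.authorityId &&
    supported.typeId === incoming.typeId &&
    supported.versionMajor === incoming.versionMajor
  )
}
```

Under this sketch a client built against xmtp.org/text 1.0 would accept content marked 1.3 but treat 2.0 as an unknown type.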

Due forethought should be given when choosing identifiers as there are no provisions to change them once they have been in use. A new identifier introduces a new (assumed unrelated) authority or content type as far as the protocol is concerned.

API

To accommodate the new content type framework, the low level Message encode/decode API has to work in terms of bytes instead of strings. The protocol level EncodedContent message introduced above will be represented by an interface that allows bundling the content bytes with the content type metadata.

export type ContentTypeId = {
  authorityId: string
  typeId: string
  versionMajor: number
  versionMinor: number
}

export interface EncodedContent {
  contentType: ContentTypeId
  contentTypeParams: Record<string, string>
  contentFallback?: string
  content: Uint8Array
}

This is a fairly simple change but makes for a very crude and hard to use API. Given that content types should be highly reusable it makes sense to provide a framework that will facilitate this reuse and provide some common content types out of the box. The framework should provide automatic content encoding/decoding based on the type of the provided content.

Supported content types must be submitted to the message sending API with a content type identifier. Each content type will have an associated ContentCodec<T>.

export interface ContentCodec<T> {
  contentType: ContentTypeId
  encode(message: T): EncodedContent
  decode(content: EncodedContent): T
}

The contentType field of the codec is used to match the codec with the corresponding type of content.

We can support plain string as the default content type in a backward compatible manner as follows.

export const ContentTypeText = {
  authorityId: 'xmtp.org',
  typeId: 'text',
  versionMajor: 1,
  versionMinor: 0,
}

export class TextCodec implements ContentCodec<string> {
  get contentType(): ContentTypeId {
    return ContentTypeText
  }

  encode(content: string): EncodedContent {
    return {
      contentType: ContentTypeText,
      contentTypeParams: {},
      content: new TextEncoder().encode(content),
    }
  }

  decode(content: EncodedContent): string {
    return new TextDecoder().decode(content.content)
  }
}
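
As a further illustration, a codec for a different type could follow the same pattern. Everything specific in this sketch is hypothetical: the `example.org` authority, the `json` type, the `encoding` parameter and the wording of the fallback are placeholders, not types defined by this XIP.

```typescript
type ContentTypeId = {
  authorityId: string
  typeId: string
  versionMajor: number
  versionMinor: number
}

interface EncodedContent {
  contentType: ContentTypeId
  contentTypeParams: Record<string, string>
  contentFallback?: string
  content: Uint8Array
}

interface ContentCodec<T> {
  contentType: ContentTypeId
  encode(message: T): EncodedContent
  decode(content: EncodedContent): T
}

// Hypothetical content type; 'example.org' is a placeholder authority.
const ContentTypeJson: ContentTypeId = {
  authorityId: 'example.org',
  typeId: 'json',
  versionMajor: 1,
  versionMinor: 0,
}

class JsonCodec implements ContentCodec<unknown> {
  get contentType(): ContentTypeId {
    return ContentTypeJson
  }

  encode(value: unknown): EncodedContent {
    const text = JSON.stringify(value)
    return {
      contentType: ContentTypeJson,
      // Hypothetical type-specific parameter describing the byte encoding.
      contentTypeParams: { encoding: 'UTF-8' },
      // Shown by clients that have no codec registered for this type.
      contentFallback: `JSON data: ${text}`,
      content: new TextEncoder().encode(text),
    }
  }

  decode(content: EncodedContent): unknown {
    return JSON.parse(new TextDecoder().decode(content.content))
  }
}
```

Setting contentFallback on the sending side is what lets a recipient without this codec still show something meaningful.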

The mapping between content types and their codecs will be managed at the Client level. The Client maintains a registry of supported types and codecs, initialized to a default set of codecs that can be overridden/extended through CreateOptions.

export default class Client {
  ...
  registerCodec(codec: ContentCodec<any>): void
  ...
  codecFor(contentType: ContentTypeId): ContentCodec<any> | undefined 
  ...
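
A minimal sketch of such a registry is shown below, kept separate from Client so that it stands alone. The class name `CodecRegistry` and the choice to key codecs by authority and type ID (ignoring the version) are assumptions of this sketch, not a description of the reference implementation.

```typescript
type ContentTypeId = {
  authorityId: string
  typeId: string
  versionMajor: number
  versionMinor: number
}

interface EncodedContent {
  contentType: ContentTypeId
  contentTypeParams: Record<string, string>
  contentFallback?: string
  content: Uint8Array
}

interface ContentCodec<T> {
  contentType: ContentTypeId
  encode(message: T): EncodedContent
  decode(content: EncodedContent): T
}

class CodecRegistry {
  private codecs = new Map<string, ContentCodec<any>>()

  // Assumption: one codec per authority/type pair; version checks are left
  // to the caller in this sketch.
  private key(id: ContentTypeId): string {
    return `${id.authorityId}/${id.typeId}`
  }

  // Registering a codec for an already known type replaces the earlier one.
  registerCodec(codec: ContentCodec<any>): void {
    this.codecs.set(this.key(codec.contentType), codec)
  }

  codecFor(contentType: ContentTypeId): ContentCodec<any> | undefined {
    return this.codecs.get(this.key(contentType))
  }
}

// Usage: register a plain text codec and look it up again.
const ContentTypeText: ContentTypeId = {
  authorityId: 'xmtp.org',
  typeId: 'text',
  versionMajor: 1,
  versionMinor: 0,
}

const textCodec: ContentCodec<string> = {
  contentType: ContentTypeText,
  encode: (text) => ({
    contentType: ContentTypeText,
    contentTypeParams: {},
    content: new TextEncoder().encode(text),
  }),
  decode: (encoded) => new TextDecoder().decode(encoded.content),
}

const registry = new CodecRegistry()
registry.registerCodec(textCodec)
```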

The Message type will be augmented to hold the decoded content and contentType instead of just the decrypted: string.

export default class Message implements proto.Message {
  header: proto.Message_Header // eslint-disable-line camelcase
  ciphertext: Ciphertext
  decrypted?: Uint8Array
  contentType?: ContentTypeId
  content?: any
  error?: Error
  ...

Note that the Message.text getter, which previously just returned the decrypted string, will have to be replaced with Message.content. The clients of the API will need to interrogate Message.contentType and do the right thing.

If an unrecognized content type is received the Message.error will be set accordingly. If contentFallback is present Message.content will be set to that. In order to be able to reliably distinguish the actual content from the fallback, we will introduce a special ContentTypeId.

export const ContentTypeFallback = {
  authorityId: 'xmtp.org',
  typeId: 'fallback',
  versionMajor: 1,
  versionMinor: 0,
}
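
The decode path described above might look roughly like the sketch below. The `decodeContent` function, the `Decoded` result shape and the `codecFor` callback are hypothetical; the sketch also assumes that the fallback identifier replaces the original content type whenever the fallback text is used, which is one reading of the paragraph above.

```typescript
type ContentTypeId = {
  authorityId: string
  typeId: string
  versionMajor: number
  versionMinor: number
}

interface EncodedContent {
  contentType: ContentTypeId
  contentTypeParams: Record<string, string>
  contentFallback?: string
  content: Uint8Array
}

interface ContentCodec<T> {
  contentType: ContentTypeId
  encode(message: T): EncodedContent
  decode(content: EncodedContent): T
}

const ContentTypeFallback: ContentTypeId = {
  authorityId: 'xmtp.org',
  typeId: 'fallback',
  versionMajor: 1,
  versionMinor: 0,
}

// Hypothetical result shape mirroring the Message fields discussed above.
interface Decoded {
  contentType: ContentTypeId
  content: unknown
  error?: Error
}

function decodeContent(
  encoded: EncodedContent,
  codecFor: (id: ContentTypeId) => ContentCodec<unknown> | undefined
): Decoded {
  const codec = codecFor(encoded.contentType)
  if (codec) {
    return { contentType: encoded.contentType, content: codec.decode(encoded) }
  }
  // No codec registered: record the problem and use the fallback if present.
  const { authorityId, typeId } = encoded.contentType
  const error = new Error(`unknown content type ${authorityId}/${typeId}`)
  if (encoded.contentFallback !== undefined) {
    return { contentType: ContentTypeFallback, content: encoded.contentFallback, error }
  }
  return { contentType: encoded.contentType, content: undefined, error }
}
```

An application can then branch on the returned contentType: a typeId of fallback means the content is a description supplied by the sender, not the real payload.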

Rationale

There are decades of prior art in this area. Probably the most familiar example of a content type identification scheme is the filename extension. The relevant lesson here is that a simple string is likely insufficient for carrying the parameters required to correctly decode the contents (what encoding is the .txt file using?). While structured file formats can easily embed those parameters in the file itself, if we do want to support unstructured payloads, e.g. plain text, we probably should have a way to attach parameters to the content type identifier itself.

The MIME framework (the underlying standard of email, HTTP and other widely used protocols) has a fairly involved system using several headers: Content-Type, Content-Transfer-Encoding, Content-Disposition, etc. Notably, the Content-Type header allows embedding an arbitrary set of parameters in the header value along with the primary type identifier (media type). Most relevant is RFC 2046, which discusses the basic media types. At the highest level it recognizes five discrete media types (text, image, audio, video and application) and two composite media types (multipart and message). The composite media types are of particular interest as they allow combining different media types in a single payload. As soon as you support composite media types, there are additional aspects that likely need to be addressed, e.g. are the different parts just different renderings of the same information (multipart/alternative), or are they different parts of a longer narrative (multipart/mixed)? Is any individual part meant to be rendered inline with the rest of the content, or is it meant to be an attachment that can be opened/saved separately (Content-Disposition)? MIME also prescribes a central authority (IANA) that manages the registry of all recognized media types and their required or optional parameters.

Backwards Compatibility

Since the new EncodedContent message is embedded in the Ciphertext.payload bytes, this change doesn’t break the protocol, strictly speaking. However, any newer client would struggle to interpret a payload as EncodedContent unless the payload conforms, so this is a breaking change in that sense. Any older messages will be broken once the new protocol is deployed.

At the API level the changes are even more pronounced, since the input and output of the API is now potentially any type instead of just string. Extracting the content from the Message now requires interrogating the resulting value to determine which type of content it is and handling it accordingly. Clients SHOULD also take appropriate action when encountering a content type they do not recognize, and use the contentFallback when available. Clients SHOULD register codecs only for those types that they are prepared to handle.

Reference Implementation

https://github.com/xmtp/xmtp-js/pull/68

Security Considerations

This API change allows transmitting arbitrary and therefore potentially dangerous types of content. Complex decoding or presentation logic can trigger undesirable or dangerous behavior in the receiving client. The authority of any given content type SHOULD provide suitable guidance on how to handle the content type safely.

Copyright

Copyright and related rights waived via CC0.

Really excellent @mk thank you for this.

I wanted to dig in a little bit for additional context around this part:

And this further detail:

Early on in the development of XMTP, we heard from both developers and projects that want to send messages, that they wanted some flexibility in what they’d be able to send.

They asked “Is XMTP email? Is it chat? Is it push notifications?” This XIP sets up the possibility that XMTP could transport any of those things, and much more. Ultimately, by keeping the protocol lean without a strict ruleset over what the precise content is, we get unlimited flexibility in the message content. But that flexibility also comes with limitations, namely in how easy it is for many clients to render that content, which is where this XIP and subsequent format XRCs (XMTP Requests for Comment) will come in.

Personally I’m excited about a bunch of use cases that may come from having a flexible message format. I’m also really interested to see what types of new formats developers will propose.

It feels important that as developers add message formats, we should all (meaning the entire XMTP community) push to enforce the fallback function. While I wish this was a “must” in the language, ultimately there would be no way for the protocol to enforce that. We can, however, socially enforce that and work hard to ensure maximum compatibility of messages, no matter what kinds of extras are added.


Thanks again @mk. Amazing work here—this XIP will serve as the bedrock for all messages in the network.

This looks great! My two cents would be to consider adding a Content-Encoding concept as a top-level field in the EncodedContent message, with the allowed encoding types locked down in the messaging.proto.

I fully support the strategy of defining a flexible and un-opinionated format for Message Content Types. In theory, any compression or encoding of a specific content type (e.g. gzip compression for JSON) could be specified within that ContentType’s implementation. However, there are advantages to standardizing on a dedicated ContentEncoding field in the EncodedContent message rather than delegating it to each ContentType. Standardizing compression will hopefully simplify its implementation and encourage broader usage. Having smaller messages on average benefits the entire ecosystem.

Ideally, client authors could write logic for each content encoding once (e.g. gzip decompression) and apply it to all EncodedContent messages before even considering the ContentType. This avoids the worst case scenario of having to check ContentTypeParams for an “encoding” flag that may be named or handled differently depending on the authors of that specific ContentType.

Cheers,
Michael from Proxy

This is a very good callout. I agree that a standardized compression layer makes sense. I’m not too fond of the MIME terminology; there are way too many things called (en)coding, and it gets confusing quickly. I haven’t seen this layer in MIME used for anything but compression. Would you recommend against calling it just that, compression?

Great point, we should use as clear and specific a name as possible for the use case. I’m on board with “compression” but trust your judgement on final naming.

I have an abstract concern but no immediate solution. As someone building a product (Proxy) on top of XMTP, I’d be tempted to just use a generic content type (e.g. JSON or raw serialized protobuf bytes) with my own product-specific message schema. My rationale would be that our own product’s message schema isn’t finalized and once we’re ready we can make an XRC proposal as Proxy.

It’s likely that various different products and teams take this same approach. We’d all be eschewing interoperability to maximize our own velocity and flexibility. If one product gains wide adoption, their implicit schema could end up being the precedent going forward, entirely bypassing the XRC and formal ContentType system. For example, Proxy defines a bunch of our own message formats using the JSON ContentType with compression. New products will want their users to be able to view Proxy-native messages and will end up conforming to our schema.

Now Proxy has an implicit hold on message formats which bypasses the XRC process.

Unfortunately, I don’t really have any great solutions. Some possible remedies:

  • Get ahead of the curve by defining some standard message types and releasing clients for them: text/plain, multimedia post, rich text, etc
  • Incentivize creating XRCs without creating perverse incentives e.g. gaming the system for clout. Maybe postage for non-generic ContentTypes is slightly cheaper?

Caveat: I’m also relatively new to XMTP and may be misunderstanding something or overblowing a nonissue.

I admit that giving everyone the opportunity to create content types might end up creating a mess. On the other hand, the hope is that “market competition” has the best chance of creating standards that will work best for the users. One incentive could be to adopt the most successful cases into the set of default codecs; that may not be enough, but it’s probably worth something. Good authority behaviour can be one of the criteria.

I don’t see it as bypassing the system. If Proxy chooses to define only proxy.eth/json, stuff everything into it and not tell anyone what the semantics of it are, so be it. That has a very low chance of being adopted by others. I think interoperability is the incentive here; if X doesn’t care about it, then it’s going to be its own little island living on the network.

We probably will define a few types in follow-up RFCs, largely to make sure things work the way we expect them to. However, I don’t think XMTP Labs is positioned to do a good job of it; we’d be creating them in a vacuum without real user feedback. I think these need to be driven by actual application concerns.

I should also caveat all of the above as just my musings on the topic, I may well be completely wrong about this.

First of all, this is a stellar proposal; well done, @mk!

Point of clarification here:

What is the expectation for a single message that may contain multiple content types?

Is the expectation that a new ContentType be formed for every combination and that the content parsing be handled in the ContentCodec? An example to think through would be captioned media: should this be handled in the media’s content type directly (Image, Video, etc.), should there be a dedicated CaptionedMedia content type, should we be able to support multiple ContentTypes on a given EncodedContent object, etc.

Similar to “strongly preferring” an implementation of the fallback function, I think establishing a preferred approach here not only lets us forward the goal of composability but would lead to a better outcome here:

and help tackle this:

Not quite sure what you mean by “every combination”, but yes, the current expectation is that there will be content types for composite (multipart) messages consisting of multiple parts of potentially different content types. Such a content type can be fairly generic, allowing a sequence of arbitrary types of content, or it can be fairly specialized, focusing on particular types of content and whatever specialized arrangement makes sense for those.

The more generic case could look something like the MIME multi-part messages, which are defined recursively: any part can itself be of the multi-part type or of any other non-composite type. I agree that we should propose a version of this type to illustrate this example and also to verify that the proposed API can accommodate it reasonably well (I’m thinking of making sure the codec is able to invoke other codecs to decode the parts). However, I hesitate to make it authoritative as the relevant considerations get complicated, e.g. MIME itself defines multipart/mixed and multipart/alternative to indicate whether the parts contain different renderings of the same information or whether they are independent parts of a larger piece. I suspect that more specialized kinds of composite types might be more useful in particular application contexts.
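
To make the “codec invoking other codecs” idea concrete, here is a rough sketch of recursive decoding over an already parsed composite structure. The `Part` shape, the `decodePart` function and the `codecFor` callback are hypothetical and are not taken from the prototype; how the parts are serialized on the wire is left out entirely.

```typescript
type ContentTypeId = {
  authorityId: string
  typeId: string
  versionMajor: number
  versionMinor: number
}

interface EncodedContent {
  contentType: ContentTypeId
  contentTypeParams: Record<string, string>
  contentFallback?: string
  content: Uint8Array
}

interface ContentCodec<T> {
  contentType: ContentTypeId
  encode(message: T): EncodedContent
  decode(content: EncodedContent): T
}

// Hypothetical in-memory shape: a part is either a single encoded content
// or a nested list of parts, mirroring the recursive MIME definition.
type Part = { part: EncodedContent } | { composite: Part[] }

type CodecLookup = (id: ContentTypeId) => ContentCodec<unknown> | undefined

// Decodes leaves with whatever codec is registered for their type and
// recurses into nested composites. A leaf without a codec yields its
// fallback text, which may be undefined.
function decodePart(part: Part, codecFor: CodecLookup): unknown {
  if ('composite' in part) {
    return part.composite.map((nested) => decodePart(nested, codecFor))
  }
  const codec = codecFor(part.part.contentType)
  return codec ? codec.decode(part.part) : part.part.contentFallback
}
```

The result keeps the nesting of the input, so a caption-plus-image message would come back as an array whose elements were each decoded by their own codec.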

If there aren’t built-in primitives, the number of content types written by the community could grow…and with it the number of combinations. I’m with you that I hope we’ll self-regulate, but it’s not guaranteed.

100% this, even if it’s not “authoritative”, I think it’ll go a long way. :handshake:

We can certainly establish those early primitives (body, subject/title, etc.) in an XRC (XMTP Request for Comment) should this XIP get adopted. Would be great to see what formats you think would be the most basic primitives to be built on. That would hopefully go far in setting up the earliest LEGO blocks.

Prototype of compression support is in this commit. It is using the draft compression standard. Support for the draft is spotty, and Streams on Node are also different, so I used this polyfill. Examples of streaming bytes in memory seem scarce, so I used a fairly naive approach to growing the stream buffer when it cannot accommodate the chunk being written: simply doubling it in size.

The protocol change is small, just adding a compression attribute to EncodedContent, with the value being from enum { deflate, gzip }. I’m inclined to drive the supported algorithms by what is available in JS, since that’s our primary client environment. This finally pushed me over to stop adding parameters to the sending APIs and repackage them as SendOptions.

Compression opens up an attack vector, where a message that is relatively small in transit can be expanded to a huge size on the recipient side. Not sure if we want to do something about that (the growth strategy could simply throw if we reach some pre-set limit).
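
A minimal sketch of that idea, a buffer that doubles on demand but throws once a preset limit would be exceeded, could look like this. The class name `BoundedBuffer`, its methods and the error message are made up for illustration and are not taken from the commit.

```typescript
// Growable byte buffer that doubles its capacity on demand and refuses to
// hold more than `maxSize` bytes. Names and limit policy are illustrative.
class BoundedBuffer {
  private bytes: Uint8Array
  private length = 0
  private readonly maxSize: number

  constructor(initialCapacity: number, maxSize: number) {
    this.maxSize = maxSize
    // Start with at least one byte so that doubling always makes progress.
    this.bytes = new Uint8Array(Math.max(1, Math.min(initialCapacity, maxSize)))
  }

  write(chunk: Uint8Array): void {
    const needed = this.length + chunk.length
    if (needed > this.maxSize) {
      throw new Error(`decompressed size exceeds limit of ${this.maxSize} bytes`)
    }
    if (needed > this.bytes.length) {
      // Naive growth strategy: keep doubling until the chunk fits,
      // but never allocate more than the limit.
      let capacity = this.bytes.length
      while (capacity < needed) capacity *= 2
      const grown = new Uint8Array(Math.min(capacity, this.maxSize))
      grown.set(this.bytes.subarray(0, this.length))
      this.bytes = grown
    }
    this.bytes.set(chunk, this.length)
    this.length = needed
  }

  // Returns a copy of the bytes written so far.
  result(): Uint8Array {
    return this.bytes.slice(0, this.length)
  }
}
```

A decompressor writing its output chunks into such a buffer would fail early on an oversized message instead of exhausting the recipient’s memory.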

Here is where I’m going with the composite content type https://github.com/xmtp/xmtp-js/pull/91. I’ll probably turn this into an XRC, although my intent with this one is more educational than practical.

XIP-5 has now been finalized and the reference implementation has just been merged into the SDK. Thank you, everyone, for your feedback; some good changes came out of our discussions. The last changes to the XIP that came out of the review period were:

  • formalizing the textual form of content type IDs (to be used in human communication only)
  • tightening the language around codecs and related requirements as well as some guidance on what should be included in XRCs proposing new content types
  • cleaning up the definition of the default, plain text content type
  • addition of an encoding parameter to the plain text content type

As I promised I will follow up with an XRC for the composite content type. The prototype has been updated and can be seen in https://github.com/xmtp/xmtp-js/pull/96

My last update here. The Composite content type XRC was finalized (github) and the reference implementation merged into the SDK https://github.com/xmtp/xmtp-js/pull/96 .