Working with Rich Text's MIME Structure

Wed Jul 08 20:28:19 EDT 2015

Tags: mime

My work lately has involved, among other things, processing and creating MIME entities in the format used by Notes for storage as rich text. This structure isn't particularly complicated, but there are some interesting aspects to it that are worth explaining for posterity. Which is to say, myself when I need to do this again.

As a quick primer, MIME is a format originally designed for email which has proven generally useful, including for HTTP and, for our needs, internal storage in NSF. Like many things in programming, it is organized as a tree, with each node consisting of a set of headers (generally, things like "Content-Type: text/html"), content, and children.

Domino stores the text part of rich text in MIME as HTML. In the simplest case, this ends up a one-element "tree", which you can see in the document's properties dialog:

Content-Type: text/html; charset="US-ASCII"

<font size=2 face="sans-serif">Hello <b>there</b></font>

There's slightly more to its full storage implementation (like the MIME_Version item), but the MIME Part items are the important bits. This simple structure can be abstracted to this tree:

  • text/html

Things get a little more complicated when you add embedded images and/or attachments. When you do either of those, the MIME grows to multiple items and becomes a multi-node tree.

Embedded Images

When you add an embedded image in the rich text field, the storage grows to four same-named MIME Part items. Concatenated (and clipped for brevity), the items then look like:

Content-Type: multipart/related; boundary="=_related 006CEB9D85257E7C_="

This is a multipart message in MIME format.

--=_related 006CEB9D85257E7C_=
Content-Type: text/html; charset="US-ASCII"

<font size=3>Here's a picture:</font>
<br>
<br><img src=cid:_2_0C1832A80C182E18006CEB9885257E7C style="border:0px solid;">
<br>
<br><font size=3>Done.</font>

--=_related 006CEB9D85257E7C_=
Content-Type: image/jpeg
Content-ID: <_2_0C1832A80C182E18006CEB9885257E7C>
Content-Transfer-Encoding: base64

*snip*

--=_related 006CEB9D85257E7C_=--

You can see the same sort of HTML block as before contained in there, but it sprouted a lot of other stuff. To begin with, the starting part turned into "multipart/related". The "multipart" denotes that the top MIME entity has children, and the "related" is used when the children consist of an HTML body and inline images. Each part is separated by a boundary delimiter, which follows the auto-generated convention of "related" plus an effectively-random number. The image itself is represented as a MIME Part of its own, in this case stored inline and Base64-encoded (Notes/Domino can shift it off to an attachment once it passes a certain size). This structure can be abstracted to:

  • multipart/related
    • text/html
    • image/jpeg
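
As a rough sketch of how that outline falls out of the raw MIME, here is the example above run through Python's standard email package. Python is purely my choice for illustration (this is not how Notes or Domino does it), the function name tree_lines is made up, and the image data is replaced with a placeholder rather than real JPEG bytes.

```python
from email import message_from_string

# The multipart/related example from above, trimmed slightly.
# "AAAA" is a stand-in for the snipped base64 data, not a real JPEG.
RAW = "\n".join([
    'Content-Type: multipart/related; boundary="=_related 006CEB9D85257E7C_="',
    '',
    'This is a multipart message in MIME format.',
    '',
    '--=_related 006CEB9D85257E7C_=',
    'Content-Type: text/html; charset="US-ASCII"',
    '',
    "<font size=3>Here's a picture:</font>",
    '<br><img src=cid:_2_0C1832A80C182E18006CEB9885257E7C style="border:0px solid;">',
    '',
    '--=_related 006CEB9D85257E7C_=',
    'Content-Type: image/jpeg',
    'Content-ID: <_2_0C1832A80C182E18006CEB9885257E7C>',
    'Content-Transfer-Encoding: base64',
    '',
    'AAAA',
    '',
    '--=_related 006CEB9D85257E7C_=--',
    '',
])


def tree_lines(part, depth=0):
    """Return one indented line per MIME entity, children below parents."""
    lines = ["  " * depth + "- " + part.get_content_type()]
    if part.is_multipart():
        # For multipart entities, get_payload() is the list of child parts.
        for child in part.get_payload():
            lines.extend(tree_lines(child, depth + 1))
    return lines


message = message_from_string(RAW)
print("\n".join(tree_lines(message)))
```

The same function works unchanged on the single-part case, where it yields just the one "text/html" line.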

The HTML is designed so that there is an image tag that references the attached image using a "cid" URL, an email convention that basically means "find the entity in this related MIME structure with the following content ID" - you can then see the content ID reflected in the JPEG MIME Part. This sort of URL doesn't fly on the web, so anything displaying this field on a web page (or otherwise converting it to a non-MIME storage format) needs to translate that reference to something appropriate for its needs.*
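
To make that translation step concrete, here is a minimal Python sketch that swaps each "cid" reference for whatever URL the caller wants. The names rewrite_cid_refs and url_for_cid, and the "/images/" URL scheme in the usage lines, are invented for the example; a real application would plug in its own lookup.

```python
import re

# Matches src=cid:ID with optional single or double quotes around the value.
CID_SRC = re.compile(r'src=(["\']?)cid:([^"\'\s>]+)\1', re.IGNORECASE)


def rewrite_cid_refs(html, url_for_cid):
    """Replace every cid: image source with the URL url_for_cid returns."""
    def replace(match):
        quote, content_id = match.group(1), match.group(2)
        # Keep whatever quoting (or lack of it) the original HTML used.
        return "src=" + quote + url_for_cid(content_id) + quote
    return CID_SRC.sub(replace, html)


html = '<br><img src=cid:_2_0C1832A80C182E18006CEB9885257E7C style="border:0px solid;">'
# Hypothetical URL scheme: serve each image under /images/<content ID>.
web_html = rewrite_cid_refs(html, lambda content_id: "/images/" + content_id)
print(web_html)
```

Note that the content ID in the HTML has no angle brackets, while the "Content-ID" header wraps it in them, so a lookup against the MIME parts needs to account for that.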

Attachments

When you have a rich text field with an attachment (in this case without the embedded image), you get a very similar structure:

Content-Type: multipart/mixed; boundary="=_mixed 006EBF7C85257E7C_="

This is a multipart message in MIME format.

--=_mixed 006EBF7C85257E7C_=
Content-Type: text/html; charset="US-ASCII"

<font size=3>Here's an attachment: <br>
</font>
<br>
<br><font size=3><br>
Done. </font>

--=_mixed 006EBF7C85257E7C_=
Content-Type: application/octet-stream; name="cert.cer"
Content-Disposition: attachment; filename="cert.cer"
Content-Transfer-Encoding: binary

cert.cer

--=_mixed 006EBF7C85257E7C_=--

The structure is the same sort of tree as previously, but the "related" content sub-type has changed to "mixed". This indicates that there are multiple types of content, but they're conceptually distinct. In any event, the tree looks like:

  • multipart/mixed
    • text/html
    • application/octet-stream

"application/octet-stream" is a generic MIME type for, basically, "bag of bytes" - MIME-based tools use it when they either don't know the content type or, as in this case, don't care. Here, Notes/Domino splits out the content to be an NSF-style attachment and then references that in the MIME - this is an implementation detail, though, as the API returns the value regardless.
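
For the creation side, a sketch of assembling the same "multipart/mixed" shape with Python's standard email package might look like the following. The helper name build_mixed and the placeholder bytes are mine, and one difference from the Notes output is worth flagging: Python's MIMEApplication Base64-encodes the payload by default, where the example above uses a "binary" transfer encoding with the content stored NSF-side.

```python
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_mixed(html, filename, data):
    """Build a multipart/mixed tree: HTML body first, then one attachment."""
    root = MIMEMultipart("mixed")
    root.attach(MIMEText(html, "html", "us-ascii"))

    # name= ends up as a parameter on the Content-Type header.
    attachment = MIMEApplication(data, "octet-stream", name=filename)
    attachment.add_header("Content-Disposition", "attachment", filename=filename)
    root.attach(attachment)
    return root


# Placeholder content, not a real certificate.
message = build_mixed(
    "<font size=3>Here's an attachment:</font>",
    "cert.cer",
    b"placeholder bytes",
)
print(message.as_string())
```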

This also highlights a minor limitation in rich text storage: attachments do not have an inline representation in the HTML, and so they are always moved to the end of the field in Notes. At first, I was peeved by this limitation, but it makes a sort of sense: cid references are really about images, and I guess Lotus didn't want to override that for use in normal link elements.

That brings us to the final potential structure you're likely to run across:

Embedded Images And Attachments

When you include both embedded images and attachments, things get slightly more complicated. I'll skip the raw MIME and go straight to the tree:

  • multipart/mixed
    • multipart/related
      • text/html
      • image/jpeg
    • application/octet-stream

So this becomes a combination of the two formats, and a bit of logic emerges. In Notes's structure, "multipart/mixed" always contains two or more children, and the first one is the textual body, whatever form that may take. One of those forms is just a single-part "text/html", and the other is a "multipart/related" subtree containing the "text/html" and one or more images.
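
That logic is compact enough to write down. The Python sketch below splits any of the three shapes into its HTML part, inline images, and attachments; split_rich_text is a made-up name, and it assumes (as in every example here, though I can't promise it universally) that the "text/html" part is the first child of "multipart/related".

```python
from email.mime.application import MIMEApplication
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def split_rich_text(root):
    """Return (html_part, image_parts, attachment_parts) for a Notes-style tree."""
    body, attachments = root, []
    if root.get_content_type() == "multipart/mixed":
        # First child is the textual body; the rest are attachments.
        children = root.get_payload()
        body, attachments = children[0], children[1:]

    html, images = body, []
    if body.get_content_type() == "multipart/related":
        # Assumption: the HTML comes first, followed by its inline images.
        children = body.get_payload()
        html, images = children[0], children[1:]
    return html, images, attachments


# Build the combined shape with placeholder data to try it out.
related = MIMEMultipart("related")
related.attach(MIMEText("<img src=cid:example>", "html", "us-ascii"))
image = MIMEImage(b"not a real JPEG", "jpeg")
image.add_header("Content-ID", "<example>")
related.attach(image)

mixed = MIMEMultipart("mixed")
mixed.attach(related)
mixed.attach(MIMEApplication(b"placeholder", "octet-stream"))

html, images, attachments = split_rich_text(mixed)
print(html.get_content_type(), len(images), len(attachments))
```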


Once you get a feel for these structures, it makes the task of reading and creating Notes-alike MIME items much less daunting. There are a number of other concerns I've been dealing with as well (such as the conversion of composite-data rich text to HTML and how there are two ways to do it), and maybe I'll make a followup post at some point about those.


* As a minor note on this point, it's an area where the Notes client and XPages diverge slightly. The Notes client (which generated the example above) leaves inline images "nameless" - they contain no "Content-Disposition" header and no name in the "Content-Type", instead sticking with just the "Content-ID" for identification. With XPages, however, presumably because it has filename information during the upload process, the result still contains (and is referenced by) the "Content-ID" value, but it also contains a line like:

Content-Disposition: inline; filename="foo.jpg"

This functions the same way for most purposes, but it may be significant. For example, if you happen to write processing code that uses the presence or absence of the "Content-Disposition" header as an indicator of whether it's an attachment or not, knowing this ahead of time could save you a certain amount of headache. The right way to do it is to see if the header is either missing or has a basic value of "inline" instead of "attachment".
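
In Python terms, that check could be sketched like so, using the standard email package's get_content_disposition(), which returns the lowercased base value of the header without its parameters, or None when the header is missing. The function name is_attachment and the three sample parts are just for illustration.

```python
from email import message_from_string


def is_attachment(part):
    """True only when Content-Disposition is present and says "attachment"."""
    # A missing header (Notes client) and "inline" (XPages) both mean
    # the part is an inline image, not an attachment.
    return part.get_content_disposition() == "attachment"


notes_image = message_from_string(
    "Content-Type: image/jpeg\n"
    "Content-ID: <_2_0C1832A80C182E18006CEB9885257E7C>\n"
    "\n"
)
xpages_image = message_from_string(
    "Content-Type: image/jpeg\n"
    "Content-ID: <_2_0C1832A80C182E18006CEB9885257E7C>\n"
    'Content-Disposition: inline; filename="foo.jpg"\n'
    "\n"
)
attached_file = message_from_string(
    'Content-Type: application/octet-stream; name="cert.cer"\n'
    'Content-Disposition: attachment; filename="cert.cer"\n'
    "\n"
)

for part in (notes_image, xpages_image, attached_file):
    print(part.get_content_type(), is_attachment(part))
```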

Ben Langhinrichs - Thu Jul 09 00:09:14 EDT 2015

Very good summary of how this works and is structured. (I really wish I had had this years ago when I had to figure most of it out on my own, before they even documented the mime structures - ugh) I especially like that you identify the XPages attachments thing, as that becomes key when creating MIME that XPages will display/use properly. Nice job!

Toby Samples - Thu Jul 09 00:47:52 EDT 2015

Another interesting difference I found between the XPages ckeditor and the RichText Field in the Notes Client is that in general the Notes Client is a lot more forgiving. For instance, if your inline image is referenced with quotes around the src attribute, i.e. <img src="cid:_2_0C1832A80C182E18006CEB9885257E7C" style="border:0px solid;">, the Notes Client will be fine with it, however XPages treats it like it doesn't see it. Well, at least this was my experience a couple of versions back.

 

 

Jesse Gallagher - Thu Jul 09 10:24:57 EDT 2015

The XPages attachment burned me a bit - I ended up "upgrading" the inline images to full-blown attachments, which was problematic going back. I can appreciate XPages including that information from a completionist perspective, but those sorts of little inconsistencies can be trouble.

Also, Toby, thanks for breaking my blog layout. That's good to know about the quotes around the cid refs - I had been swallowing my pride and maintaining their quoteless form for consistency's sake, but it'll be useful to keep it as a rule to specifically never add them.

Bryan Schmiedeler - Fri Jul 10 22:25:22 EDT 2015

Jesse,

This post is very timely. I am struggling with getting the CKeditor and NotesRichText field to play together nicely. 

Can you make a custom control that allows the user to include:

RichText

Attachments

Images

Basically everything, and store it in a RichTextField in a notes document. (BTW I am using XPiNC). I don't care to ever open the notes document in the traditional client. It seems like this just has to be possible, but I have seemingly tried everything and I cannot get it to work.

Any suggestions or help would be greatly greatly appreciated.

 
