so, you've seen ™ and ™️ before. but like. why are there two. well, i have an explanation! the answer is: FE0F
first, unicode. unicode is a standard that defines a bunch of codepoints, where a codepoint is just a number with meaning. for example, unicode codepoint U+263A refers to ☺︎, or "White Smiling Face", and U+1F431 refers to 🐱, or "Cat Face".
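to make "a number with meaning" concrete, here's a tiny python sketch. `describe` is just a helper name made up for this post; the lookups themselves come from python's built-in `unicodedata` module.

```python
import unicodedata

# a codepoint is just an integer; chr() turns it into a one-character string
smiley = chr(0x263A)
cat = chr(0x1F431)


def describe(ch: str) -> str:
    """format a single character as 'U+XXXX NAME' (hypothetical helper)."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


print(describe(smiley))
print(describe(cat))
```

the names come back in upper case because that's how the unicode character database stores them.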
so, let's start by looking at the codepoints for ™. decoding it, it becomes the codepoint U+2122, referred to as "Trade Mark Sign". this was added in unicode 1.1 in 1993, a decent time ago!
next, the codepoints for ™️. decoding it, we get two codepoints! U+2122 (™︎) and U+FE0F. wait. who is FE0F. why is he in my emoji
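if you want to do the decoding yourself, something like this should work. it's a minimal sketch: `codepoints` is a made-up helper name, and the two strings are written with escapes so the invisible U+FE0F can't get lost in copy-paste.

```python
def codepoints(text: str) -> list[str]:
    """list every codepoint in a string as 'U+XXXX' (hypothetical helper)."""
    return [f"U+{ord(ch):04X}" for ch in text]


plain = "\u2122"        # the plain trade mark sign
emoji = "\u2122\ufe0f"  # the same sign followed by U+FE0F

print(codepoints(plain))
print(codepoints(emoji))
print(len(plain), len(emoji))
```

python strings are sequences of codepoints, so `len()` counts codepoints here, not what you see on screen.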
well, unicode isn't as simple as a series of codepoints that refer to single characters. take a look at é̗ for example. this is three codepoints, U+0065 (Latin Small Letter E), U+0301 (Combining Acute Accent), and U+0317 (Combining Acute Accent Below). the first codepoint is simple enough, it's just e. the next two, however, are combining codepoints. this means that they combine with the codepoint before them to modify it. U+0301 adds an acute accent above the previous codepoint, and U+0317 adds an acute accent below the previous codepoint. this example specifically isn't very useful (i don't know any language with an é̗ character beyond conlangs), but it becomes very useful for languages that use a lot of diacritics. imagine if we had to make a new set of characters for each set of possible diacritics! big waste of space, we shouldn't have done that!
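here's a rough sketch of pulling that three-codepoint example apart in python. `breakdown` is a hypothetical helper; `unicodedata.combining` is real and returns a nonzero combining class for combining marks and 0 for ordinary base characters.

```python
import unicodedata

# e + combining acute accent + combining acute accent below
stacked = "e\u0301\u0317"


def breakdown(text: str) -> list[tuple[str, str, bool]]:
    """return (codepoint, name, is_combining) for each codepoint (hypothetical helper)."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.combining(ch) != 0)
        for ch in text
    ]


for code, name, is_combining in breakdown(stacked):
    kind = "combining" if is_combining else "base"
    print(code, name, f"({kind})")
```

one thing on screen, three codepoints in the string: one base letter and two marks that attach to it.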
so, what is U+FE0F? well, it's a special codepoint called "Variation Selector-16". variation selectors are a reserved block of 16 unicode codepoints. only some have been defined, but among those currently in use are U+FE0E (VS15) and U+FE0F (VS16). from wikipedia: "VS15 and VS16 are reserved to request that a character should be displayed as text or as an emoji respectively." so, what's happening with ™️ is that it's combining a U+2122 (™) and a U+FE0F (Variation Selector-16) to create an emoji version of ™. they're the same character, just that one has been instructed to become an emoji!
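a small sketch of flipping a character between the two styles. the helper names (`strip_selectors`, `as_text`, `as_emoji`) are invented for this post, and the selector is only a request: whether you actually see an emoji depends on your font and platform.

```python
VS15 = "\ufe0e"  # variation selector-15: request text presentation
VS16 = "\ufe0f"  # variation selector-16: request emoji presentation


def strip_selectors(text: str) -> str:
    """drop every codepoint in the U+FE00..U+FE0F variation selector block."""
    return "".join(ch for ch in text if not "\ufe00" <= ch <= "\ufe0f")


def as_text(text: str) -> str:
    """ask for the text style (assumes text is a single base character)."""
    return strip_selectors(text) + VS15


def as_emoji(text: str) -> str:
    """ask for the emoji style (assumes text is a single base character)."""
    return strip_selectors(text) + VS16


trade_mark = "\u2122"
print([hex(ord(ch)) for ch in as_emoji(trade_mark)])
print([hex(ord(ch)) for ch in as_text(as_emoji(trade_mark))])
```

stripping the selector first means you never end up with both VS15 and VS16 stacked on the same character.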
also, for the interested, here's the word "unicode" with a shit ton of combining characters: ù́̂̃̄̅̆̇̈̉n̖̗̘̙̐̑̒̓̔̕i̡̢̧̨̠̣̤̥̦̩c̴̵̶̷̸̰̱̲̳̹ò͇͈͉́͂̓̈́͆ͅd͓͔͕͖͙͐͑͒͗͘eͣͤͥͦͧͨͩ͢͠͡. what appears to be seven letters is actually 77 codepoints, taking up 147 bytes when encoded in utf-8. or 156 in utf-16. or 312 in utf-32. why does anyone use utf-16 if it's longer? historical reasons :3
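if you want to check those numbers, here's a sketch that rebuilds a string of the same shape. it assumes each of the seven letters carries ten consecutive combining marks starting at U+0300 (the marks in the string above may be in a different order, but the counts come out the same). one thing it shows: the 156 and 312 figures appear to include a byte order mark, which python's plain "utf-16" and "utf-32" codecs add and the "-le" variants don't.

```python
word = "unicode"

# assumption: letter number i gets the ten marks U+0300+10*i .. U+0300+10*i+9
zalgo = "".join(
    letter + "".join(chr(0x0300 + 10 * i + j) for j in range(10))
    for i, letter in enumerate(word)
)

print(len(zalgo))                      # codepoints
print(len(zalgo.encode("utf-8")))      # 1 byte per letter, 2 per mark
print(len(zalgo.encode("utf-16-le")))  # 2 bytes per codepoint, no byte order mark
print(len(zalgo.encode("utf-16")))     # same, plus a 2-byte byte order mark
print(len(zalgo.encode("utf-32-le")))  # 4 bytes per codepoint, no byte order mark
print(len(zalgo.encode("utf-32")))     # same, plus a 4-byte byte order mark
```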
TL;DR: ™️ is ™︎ but instructed to be an emoji