Module:Unicode convert
This module is rated as ready for general use. It has reached a mature form and is thought to be relatively bug-free and ready for use wherever appropriate. It is ready to mention on help pages and other Wikipedia resources as an option for new users to learn. To reduce server load and bad output, it should be improved by sandbox testing rather than repeated trial-and-error editing.
This module is subject to page protection. It is a highly visible module in use by a very large number of pages, or is substituted very frequently. Because vandalism or mistakes would affect many pages, and even trivial editing might cause substantial load on the servers, it is protected from editing.
Usage
Converts Unicode code points, always given in hexadecimal, to their UTF-8 or UTF-16 representation, output in upper-case hex by default or in decimal with base=dec. The UTF-16 form accepts unpaired surrogates and passes them through unchanged, e.g. {{#invoke:Unicode convert|getUTF16|D835}} → D835. A reverse function for UTF-8, fromUTF8, is documented as accepting multiple characters and allowing both input and output to be set to decimal; the module source listed on this page does not define it, so the fromUTF8 examples in the table return a script error.
When calling from another module, you can invoke these functions directly, e.g. unicodeConvert.getUTF8{ args = {'1F345'} }, without a proper frame object.
To find the character code of a given symbol (in decimal), use e.g. {{#invoke:ustring|codepoint|\🐱}} → 128049.
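The calling convention described above, a plain table with an args field standing in for a frame object, can be tried outside MediaWiki with a stand-in function. The following is only a sketch: getUTF8Sketch is a hypothetical name, it encodes UTF-8 by hand instead of using the mw library, and returning an empty string for invalid input is an assumption rather than the module's behaviour.

```lua
-- Hypothetical stand-in for getUTF8 that runs in a plain Lua interpreter.
-- It is NOT the module's code: the real module relies on mw.ustring.
local function getUTF8Sketch(frame)
    local cp = tonumber(frame.args[1], 16)
    if not cp or cp < 0 or cp > 0x10FFFF then
        return ''  -- assumption: invalid input yields an empty string
    end
    local bytes
    if cp < 0x80 then                      -- 1 byte: 0xxxxxxx
        bytes = { cp }
    elseif cp < 0x800 then                 -- 2 bytes: 110xxxxx 10xxxxxx
        bytes = { 0xC0 + math.floor(cp / 0x40), 0x80 + cp % 0x40 }
    elseif cp < 0x10000 then               -- 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        bytes = {
            0xE0 + math.floor(cp / 0x1000),
            0x80 + math.floor(cp / 0x40) % 0x40,
            0x80 + cp % 0x40,
        }
    else                                   -- 4 bytes: 11110xxx followed by three 10xxxxxx
        bytes = {
            0xF0 + math.floor(cp / 0x40000),
            0x80 + math.floor(cp / 0x1000) % 0x40,
            0x80 + math.floor(cp / 0x40) % 0x40,
            0x80 + cp % 0x40,
        }
    end
    -- Only base=dec is handled here; the module accepts more spellings.
    local fmt = frame.args.base == 'dec' and '%d' or '%02X'
    for i = 1, #bytes do
        bytes[i] = fmt:format(bytes[i])
    end
    return table.concat(bytes, ' ')
end

-- A table with an "args" field is enough; no real frame object is needed.
print(getUTF8Sketch{ args = {'1F345'} })                -- F0 9F 8D 85
print(getUTF8Sketch{ args = {'1F345', base = 'dec'} })  -- 240 159 141 133
```

Inside Scribunto the same table-call pattern would be applied to the value returned by require for this module, as in the unicodeConvert.getUTF8 example above.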
Code | Output |
---|---|
{{#invoke:Unicode convert|getUTF8|1F345}} | F0 9F 8D 85 |
{{#invoke:Unicode convert|getUTF8|1F345|base=dec}} | 240 159 141 133 |
{{#invoke:Unicode convert|fromUTF8|F0 9F 8D 85}} | Script error: The function "fromUTF8" does not exist. |
{{#invoke:Unicode convert|fromUTF8|240 159 141 133|base=dec|basein=dec}} | Script error: The function "fromUTF8" does not exist. |
{{#invoke:Unicode convert|getUTF16|1F345}} | D83C DF45 |
{{#invoke:Unicode convert|getUTF16|1F345|base=dec}} | 55356 57157 |
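The UTF-16 rows in the table can be checked by hand: a code point above FFFF is reduced by 0x10000, and the resulting 20-bit value is split into two 10-bit halves that are added to D800 and DC00. The sketch below works through 1F345 using plain arithmetic instead of the bit32 library; toSurrogates is a hypothetical helper name, not part of the module.

```lua
-- Hypothetical helper: split a supplementary code point (above 0xFFFF)
-- into its UTF-16 high and low surrogates using integer arithmetic only.
local function toSurrogates(cp)
    local offset = cp - 0x10000                        -- 20-bit value
    local high = 0xD800 + math.floor(offset / 0x400)   -- top 10 bits
    local low  = 0xDC00 + offset % 0x400               -- bottom 10 bits
    return high, low
end

local high, low = toSurrogates(0x1F345)
print(('%04X %04X'):format(high, low))  -- D83C DF45
print(('%d %d'):format(high, low))      -- 55356 57157
```

Dividing by 0x400 and taking the remainder is equivalent to the 10-bit right shift and the 0x3FF mask, so the sketch runs on Lua versions that lack bit32.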
local p = {}

-- Returns the UTF-8 bytes of the code point given in hexadecimal as
-- frame.args[1], separated by spaces.
p.getUTF8 = function (frame)
    local ch = mw.ustring.char(tonumber(frame.args[1], 16))
    local bytes = {mw.ustring.byte(ch, 1, -1)}
    -- Per-byte output format chosen by the "base" argument;
    -- upper-case hex is the default.
    local format = ({ -- TODO reduce the number of options.
        ['10'] = '%d',
        dec = '%d',
        LChex = '%02x',
        LC16 = '%02x',
        ['Lower Case Hex'] = '%02x',
        ['Lower Case 16'] = '%02x'
    })[frame.args['base']] or '%02X'
    for i = 1, #bytes do
        bytes[i] = format:format(bytes[i])
    end
    return table.concat(bytes, ' ')
end
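
-- Returns the UTF-16 code unit(s) of the code point given in hexadecimal as
-- frame.args[1]: one unit up to FFFF, otherwise a surrogate pair.
p.getUTF16 = function (frame)
    local codepoint = tonumber(frame.args[1], 16)
    -- Per-unit output format chosen by the "base" argument;
    -- upper-case hex is the default.
    local format = ({ -- TODO reduce the number of options.
        ['10'] = '%d',
        dec = '%d',
        LChex = '%04x',
        LC16 = '%04x',
        ['Lower Case Hex'] = '%04x',
        ['Lower Case 16'] = '%04x'
    })[frame.args['base']] or '%04X'
    if codepoint <= 0xFFFF then -- NB this also returns lone surrogate characters
        return format:format(codepoint)
    elseif codepoint > 0x10FFFF then -- There are no codepoints above this
        return ''
    end
    codepoint = codepoint - 0x10000
    local bit32 = require('bit32') -- kept local to avoid creating a global
    return (format .. ' ' .. format):format(
        bit32.rshift(codepoint, 10) + 0xD800,  -- high surrogate: top 10 bits
        bit32.band(codepoint, 0x3FF) + 0xDC00) -- low surrogate: bottom 10 bits
end
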
return p
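
The usage notes describe a fromUTF8 function, but no such function appears in the source above. Purely as an illustration of what a reverse conversion involves, here is a minimal sketch in plain Lua. The name fromUTF8Sketch, the handling of base and basein (only the value dec is recognised), and the choice to return an empty string on malformed input are all assumptions; none of this is the module's own code.

```lua
-- Hypothetical reverse conversion: space-separated UTF-8 byte values in,
-- space-separated code points out. Not part of the module.
local function fromUTF8Sketch(frame)
    local inBase = frame.args.basein == 'dec' and 10 or 16
    local outFmt = frame.args.base == 'dec' and '%d' or '%X'

    -- Parse the byte values from the first argument.
    local bytes = {}
    for token in frame.args[1]:gmatch('%S+') do
        local b = tonumber(token, inBase)
        if not b or b < 0 or b > 0xFF then
            return ''  -- assumption: malformed input yields an empty string
        end
        bytes[#bytes + 1] = b
    end

    local out, i = {}, 1
    while i <= #bytes do
        local b = bytes[i]
        local cp, extra
        if b < 0x80 then                      -- single-byte sequence
            cp, extra = b, 0
        elseif b >= 0xC0 and b < 0xE0 then    -- lead byte of a 2-byte sequence
            cp, extra = b - 0xC0, 1
        elseif b >= 0xE0 and b < 0xF0 then    -- lead byte of a 3-byte sequence
            cp, extra = b - 0xE0, 2
        elseif b >= 0xF0 and b < 0xF8 then    -- lead byte of a 4-byte sequence
            cp, extra = b - 0xF0, 3
        else
            return ''                         -- stray continuation or invalid byte
        end
        -- Fold in 6 payload bits from each continuation byte (10xxxxxx).
        for j = 1, extra do
            local cont = bytes[i + j]
            if not cont or cont < 0x80 or cont >= 0xC0 then
                return ''
            end
            cp = cp * 0x40 + (cont - 0x80)
        end
        out[#out + 1] = outFmt:format(cp)
        i = i + extra + 1
    end
    return table.concat(out, ' ')
end

print(fromUTF8Sketch{ args = {'F0 9F 8D 85'} })  -- 1F345
print(fromUTF8Sketch{ args = {'240 159 141 133', base = 'dec', basein = 'dec'} })  -- 127813
```

The sketch does not reject overlong encodings or surrogate code points; a production version would likely need both checks.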