Module talk:DecodeEncode
Bug report: bad decoding of U+03B5 ε (epsilon)
About U+03B5 ε GREEK SMALL LETTER EPSILON (ε ε)
- Issue: after resolving HTML entity ε by mw.text.decode(), the plain character is not found by mw.ustring.gsub(). No issue with alternative HTML entity ε. ε good, ε bad.
- Report limitations: The original report and bug reproduction are at enwiki Module talk:DecodeEncode, where en:module:DecodeEncode and en:module:String are used live. At phabricator, pseudocode may be used and some "results" may be hardcoded. In-text the escape & is used, not in-function. Lua patterns are not used ("no%").
- To reproduce:
- 1. Create research string: Xε1Xε2X (shows live and unedited as: Xε1Xε2X)
- 2. Render the string by decode() (as inner function)
- 3. Then on the rendered result use gsub() to replace plain character ε → E (as outer function): mw.ustring.gsub( s=( mw.text.decode( s=Xε1Xε2X, decodeNamedEntities=true ) ), pattern=ε, repl=E ) [is pseudo-code, see note. 21:10, 7 February 2023 (UTC)]
- 4. Result3 (s&r pattern use ε from Xε1X): XE1XE2X
- 5. Result4 (s&r pattern use ε from Xε2X): XE1XE2X
- Expected: XE1XE2X (only one character ε exists)
- Note 21:10, 7 February 2023 (UTC): This step 3 is in pseudo-code. To reproduce, use Lua modules module:String and Module:DecodeEncode:
{{#invoke:String|replace|source={{#invoke:DecodeEncode|decode|s=Xε1Xε2X}}|pattern=ε|replace=E|plain=true}}
- → XE1XE2X
- -DePiep (talk) 21:10, 7 February 2023 (UTC)
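To narrow down a report like the one above, it can help to list the code points of the decoded string, since two characters that look the same on screen may still be different code points. The sketch below is plain Lua 5.1 with no mw dependency. The helper name `codepoints` is hypothetical, and the idea that the two entity spellings decode to different code points (for example U+03B5 versus the look-alike U+03F5 GREEK LUNATE EPSILON SYMBOL) is only an assumption to be checked, not something the report confirms.

```lua
-- Hypothetical diagnostic helper: list the Unicode code points of a UTF-8 string.
-- ASSUMPTION: the input is well-formed UTF-8; malformed input is not handled.
local function codepoints(s)
    local out = {}
    local i = 1
    while i <= #s do
        local b = s:byte(i)
        local cp, len
        if b < 0x80 then
            cp, len = b, 1            -- single-byte (ASCII)
        elseif b < 0xE0 then
            cp, len = b - 0xC0, 2     -- two-byte sequence
        elseif b < 0xF0 then
            cp, len = b - 0xE0, 3     -- three-byte sequence
        else
            cp, len = b - 0xF0, 4     -- four-byte sequence
        end
        -- Each continuation byte contributes its low six bits.
        for j = i + 1, i + len - 1 do
            cp = cp * 64 + (s:byte(j) - 0x80)
        end
        out[#out + 1] = string.format('U+%04X', cp)
        i = i + len
    end
    return table.concat(out, ' ')
end

-- "\206\181" is the UTF-8 byte sequence of U+03B5 (decimal escapes work in Lua 5.1).
print(codepoints('X\206\181'))  -- U+0058 U+03B5
```

In a Scribunto debug console, something like `codepoints( mw.text.decode( s, true ) )` would show whether both epsilons in the decoded research string are really the same character. If they differ, that would explain why a plain search for ε matches only one of them.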
Workaround A, ad hoc
Workaround A, ad hoc: add an innermost function that first replaces, in the research string, ε → ε:
- A1: {{#invoke:String|replace|source={{#invoke:DecodeEncode|decode|s={{#invoke:String|replace|source=Xε1Xε2X|pattern=ε|replace=ε|plain=true}}}}|pattern=ε|replace=E|plain=true}}
- → XE1XE2X
Workaround B, in module (THIN SPACE example)
Workaround B: early in :en:module:DecodeEncode, replace ε → ε
About THIN SPACE: it looks like character U+2009 THIN SPACE (   ) has a similar issue.   good,   bad.
Currently in code:
function p._decode( s, subset_only )
    local ret = nil;
    s = mw.ustring.gsub( s, ' ', ' ' ) -- Workaround for bug:   gets properly decoded in decode, but   doesn't.
    ret = mw.text.decode( s, not subset_only )
    return ret
end
In en:module:DecodeEncode/sandbox, I have coded a similar handling of EPSILON:
function p._decode( s, subset_only )
    local ret = nil;
    -- U+2009 THIN SPACE: workaround for bug: HTML entity   is decoded incorrectly. Entity   gets decoded properly
    s = mw.ustring.gsub( s, ' ', ' ' )
    -- U+03B5 ε GREEK SMALL LETTER EPSILON: workaround for bug (phab:...): HTML entity ε is decoded incorrectly for gsub(). Entity ε gets decoded properly
    s = mw.ustring.gsub( s, 'ε', 'ε' )
    ret = mw.text.decode( s, not subset_only )
    return ret
end
- /sandbox tests:
- B. {{#invoke:String|replace|source={{#invoke:DecodeEncode/sandbox|decode|s=Xε1Xε2X}}|pattern=ε|replace=E|plain=true}}
- B1. ResultB1 (s&r pattern use ε from Xε1X): XE1XE2X
- B2. ResultB2 (s&r pattern use ε from Xε2X): XE1XE2X
I propose to edit the module along these lines.
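If more entities turn out to need the same treatment, the pre-normalization in Workaround B could be driven by a table instead of one gsub() line per character. The following is only a sketch in plain Lua 5.1. The names `ENTITY_ALIASES`, `escape_pattern` and `normalize_entities` are hypothetical, and the entity spellings in the table (`&thinsp;` → `&ThinSpace;`, `&epsi;` → `&epsilon;`) are assumptions: this page shows the entities already rendered, so which spelling is the problematic one should be checked against the module source.

```lua
-- Hypothetical alias table: { problematic spelling, spelling that decodes properly }.
-- ASSUMPTION: the entity names and their direction are illustrative only.
local ENTITY_ALIASES = {
    { '&thinsp;', '&ThinSpace;' },  -- U+2009 THIN SPACE
    { '&epsi;',   '&epsilon;'   },  -- U+03B5 GREEK SMALL LETTER EPSILON
}

-- Escape Lua pattern magic characters so an entity is always matched literally.
local function escape_pattern(text)
    return (text:gsub('[%^%$%(%)%%%.%[%]%*%+%-%?]', '%%%0'))
end

-- Rewrite every problematic spelling to its alias; the entities stay undecoded.
local function normalize_entities(s)
    for _, pair in ipairs(ENTITY_ALIASES) do
        -- A function as replacement avoids any '%' handling in the replacement text.
        s = s:gsub(escape_pattern(pair[1]), function() return pair[2] end)
    end
    return s
end

print(normalize_entities('X&epsi;1X&epsilon;2X'))  -- X&epsilon;1X&epsilon;2X
```

Inside the module, `p._decode` would then presumably call `mw.text.decode( normalize_entities( s ), not subset_only )`. Because the entity names are ASCII, byte-wise `string.gsub` is enough for this step, and a new case becomes a one-line change to the table.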
Workaround C (mw, Lua)
Changes in mw, Lua: I have no idea.
- I propose to consider module editing along § Workaround B. -DePiep (talk) 12:26, 4 February 2023 (UTC)