Module talk:DecodeEncode

From Wikipedia, the free encyclopedia
This is an old revision of this page, as edited by DePiep (talk | contribs) at 21:14, 7 February 2023 (Bug report: bad decoding of U+03B5 ε (epsilon): <!-- in THIS line: using "&" escape (so copy/paste works). In the NEXT (live) line: using "s=Xε1Xε2X" [literal] -->). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

Bug report: bad decoding of U+03B5 ε (epsilon)

About U+03B5 ε GREEK SMALL LETTER EPSILON (&epsi; &epsilon;)

  • Issue: after resolving HTML entity &epsilon; by mw.text.decode(), the plain character is not found by mw.ustring.gsub(). No issue with alternative HTML entity &epsi;. &epsi; good, &epsilon; bad.
Report limitations: The original report and bug reproduction are at enwiki Module talk:DecodeEncode, where en:module:DecodeEncode and en:module:String are used live. At Phabricator, pseudocode may be used and some "results" may be hardcoded. In running text the escape &amp; is used; inside function calls it is not. Lua patterns are not used ("no %").
  • To reproduce:
1. Create research string:
X&epsi;1X&epsilon;2X (shows live and unedited as: Xε1Xε2X)
2. Render the string by decode() (as inner function)
3. Then, on the rendered result, use gsub() to replace the plain character ε with E (as outer function):
mw.ustring.gsub( s=(mw.text.decode( s=X&epsi;1X&epsilon;2X, decodeNamedEntities=true ) ), pattern=ε, repl=E ) [is pseudo-code, see note. 21:10, 7 February 2023 (UTC)]
4. Result3 (search-and-replace pattern uses the ε copied from Xε1X):
XE1XE2X
5. Result4 (search-and-replace pattern uses the ε copied from Xε2X):
XE1XE2X
  • Expected: XE1XE2X (only one character ε exists)
{{#invoke:String|replace|source={{#invoke:DecodeEncode|decode|s=X&epsi;1X&epsilon;2X}}|pattern=ε|replace=E|plain=true}}
→ XE1XE2X
-DePiep (talk) 21:10, 7 February 2023 (UTC)
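
To check whether the two decoded characters really are the same code point, it may help to dump the code points of the decoded string. The following is a minimal diagnostic sketch in plain Lua; the helper name codepoints is hypothetical and not part of any module. Inside Scribunto, the argument would be the return value of mw.text.decode( 'X&epsi;1X&epsilon;2X', true ); here literal byte sequences are used so the snippet runs standalone. The lunate epsilon U+03F5 appears only as an illustrative lookalike, not as a claim about what decode() actually returns.

```lua
-- Diagnostic sketch: list the Unicode code points in a UTF-8 string.
-- Assumes well-formed UTF-8 input; it does not validate the bytes.
local function codepoints(s)
	local out = {}
	local i = 1
	while i <= #s do
		local b = s:byte(i)
		local cp, len
		if b < 0x80 then
			cp, len = b, 1          -- ASCII, single byte
		elseif b < 0xE0 then
			cp, len = b % 0x20, 2   -- lead byte of a 2-byte sequence
		elseif b < 0xF0 then
			cp, len = b % 0x10, 3   -- lead byte of a 3-byte sequence
		else
			cp, len = b % 0x08, 4   -- lead byte of a 4-byte sequence
		end
		-- Each continuation byte contributes its low 6 bits.
		for j = i + 1, i + len - 1 do
			cp = cp * 64 + (s:byte(j) % 0x40)
		end
		out[#out + 1] = string.format('U+%04X', cp)
		i = i + len
	end
	return table.concat(out, ' ')
end

-- U+03B5 (epsilon) is the byte pair 206 181 in UTF-8.
print(codepoints('X' .. '\206\181' .. '1X'))  --> U+0058 U+03B5 U+0031 U+0058
-- Illustrative lookalike only: U+03F5 (lunate epsilon) is 207 181.
print(codepoints('\207\181'))                 --> U+03F5
```

If the two ε in the decoded string print different code points, the gsub() miss would be explained by the decode step rather than by gsub() itself.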

Workaround A, ad hoc

Workaround A, ad hoc: add an innermost function that first replaces &epsilon; with &epsi; in the research string:

A1: {{#invoke:String|replace|source={{#invoke:DecodeEncode|decode|s={{#invoke:String|replace|source=X&epsi;1X&epsilon;2X|pattern=&epsilon;|replace=&epsi;|plain=true}}}}|pattern=ε|replace=E|plain=true}}
XE1XE2X

Workaround B, in module (THIN SPACE example)

Workaround B: early in :en:module:DecodeEncode, replace &epsilon; with &epsi;.

About THIN SPACE: it looks like character U+2009 THIN SPACE (&thinsp; &ThinSpace;) has a similar issue. &ThinSpace; good, &thinsp; bad.

Currently in code:

function p._decode( s, subset_only )
	local ret = nil;
	s = mw.ustring.gsub( s, '&thinsp;', '&ThinSpace;' ) -- Workaround for bug: &ThinSpace; gets properly decoded in decode, but &thinsp; doesn't.
	ret = mw.text.decode( s, not subset_only )
	return ret
end

In en:module:DecodeEncode/sandbox, I have coded a similar handling of EPSILON:

module:DecodeEncode, module:DecodeEncode/sandbox diff
function p._decode( s, subset_only )
	local ret = nil;
	-- U+2009 THIN SPACE: workaround for bug: HTML entity &thinsp; is decoded incorrectly. Entity &ThinSpace; gets decoded properly.
	s = mw.ustring.gsub( s, '&thinsp;', '&ThinSpace;' )
	-- U+03B5 ε GREEK SMALL LETTER EPSILON: workaround for bug (phab:...): HTML entity &epsilon; is decoded incorrectly for gsub(). Entity &epsi; gets decoded properly.
	s = mw.ustring.gsub( s, '&epsilon;', '&epsi;' )
	ret = mw.text.decode( s, not subset_only )
	return ret
end
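
If more entities turn out to behave this way, the per-entity gsub() lines could be folded into one lookup table. The following is a sketch only, in plain Lua so it runs outside Scribunto; the names ENTITY_ALIASES, escape_pattern, normalize_entities and decode_with_workaround are hypothetical, and the decoder is passed in as a parameter (in the module it would presumably wrap mw.text.decode). The table lists only the two pairs reported on this page.

```lua
-- Hypothetical alias table: entity that decodes badly -> entity that decodes properly.
local ENTITY_ALIASES = {
	['&thinsp;'] = '&ThinSpace;',
	['&epsilon;'] = '&epsi;',
}

-- Escape punctuation so the entity is matched literally, not as a Lua pattern.
local function escape_pattern(text)
	return (text:gsub('%p', '%%%0'))
end

-- Replace every listed entity by its alias before decoding.
local function normalize_entities(s)
	for bad, good in pairs(ENTITY_ALIASES) do
		-- A function replacement avoids any special handling of '%' in the alias.
		s = s:gsub(escape_pattern(bad), function() return good end)
	end
	return s
end

-- The decoder is injected; in Scribunto it might be
-- function(x) return mw.text.decode(x, true) end (assumption, not tested here).
local function decode_with_workaround(s, decode)
	return decode(normalize_entities(s))
end

print(normalize_entities('X&epsi;1X&epsilon;2X'))  --> X&epsi;1X&epsi;2X
```

The iteration order of pairs() does not matter here because no alias is itself a key in the table, so replacements cannot chain.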
  • /sandbox tests:
B. {{#invoke:String|replace|source={{#invoke:DecodeEncode/sandbox|decode|s=X&epsi;1X&epsilon;2X}}|pattern=ε|replace=E|plain=true}}
B1. ResultB1 (search-and-replace pattern uses the ε copied from Xε1X): XE1XE2X
B2. ResultB2 (search-and-replace pattern uses the ε copied from Xε2X): XE1XE2X

I propose to edit the module along these lines.

Workaround C (mw, Lua)

Changes in mw, Lua: I have no idea.