For my other project, I needed to process some Arabic text that was in an HTML file derived from an MSWord document.
Everything was going reasonably well, except that my regular expressions were not picking up the section name/number sequences in all cases, which was causing a problem for the 6-language alignment algorithm.
Normally, I just examine the text visually, work out a new regular expression pattern, and that particular problem is solved. This time it was not to be.
When I looked at the text, what I saw was the phrase “Section 1٣”, with the word Section written in Arabic (right-to-left, of course). The problem here is 1٣, which means 13, but with the first digit, 1, coming from the Arabic numerals set (which is what we use in English) and the second digit, ٣ (3), coming from the Arabic-Indic numerals set (which is what at least some Arab countries use). Confusing, I know. We use their numbers and they already use somebody else’s. What do they know that we haven’t yet figured out?
Of course, this juxtaposition makes no sense. Why would somebody mix the two digit sets, especially in an official document? I contacted the authoring departments and, unbelievably to me, they looked at the document and it looked correct to them.
I had nothing to go on, so I left that puzzle unsolved for a couple of weeks. That is, until it hit me: they were looking at it in MSWord, while I was looking at it at the codepoint level. They had WYSIWYG on and I did not. So that was the difference.
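For anyone wanting to look at text the same way, here is a minimal Python sketch of what "codepoint level" means in practice. The function name `describe_codepoints` is my own invention for illustration, and the sample string is just the mixed section number from the story, not the actual document.

```python
import unicodedata

def describe_codepoints(text):
    """Return (character, code point, Unicode name) for each character."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]

# ASCII digit 1 followed by Arabic-Indic digit 3, as in the section number.
sample = "1\u0663"
for ch, codepoint, name in describe_codepoints(sample):
    print(codepoint, name)
# U+0031 DIGIT ONE
# U+0663 ARABIC-INDIC DIGIT THREE
```

On screen the two digits may look like they belong together; the Unicode names make it obvious that they do not.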
I went looking around the MSWord interface with Arabic enabled and, sure enough, there was a whole collection of settings for Arabic fonts, numbers and so on. One of them was to display all numbers as Arabic-Indic. When that mode is enabled, MSWord displays any digits as Arabic-Indic ones. That answered half of the puzzle: why the original authors could not see the difference. But how did it happen in the first place?
My guess is that the original section was copied from somewhere else in the document. The person who worked on the original had the keyboard (not the MSWord display) configured to use Arabic numerals and was actually entering the all-too-familiar 1,2,3, which MSWord displayed as ١,٢,٣. Then the person who copied the section title had a keyboard configured to use Arabic-Indic digits and replaced or added to the section number using that keyboard. It still displayed cohesively, but now contained digits from two different numeral systems.
Of course, since the documents were designed for printing, nobody noticed and nobody really had a reason to care. This issue only becomes important when those documents are used as input for bitext alignment or some other computational processing. Then, and only then, it bites the person trying to make sense of it.
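One way to defend against this in a processing pipeline is to fold every decimal digit to its ASCII form before any pattern matching, and optionally to flag numbers that mix digit sets. The following is only a sketch of that idea, not what my alignment code actually did: the helper names `normalize_digits` and `has_mixed_digits` are hypothetical, and it assumes the downstream patterns expect plain 0-9.

```python
import re
import unicodedata

def normalize_digits(text):
    """Replace every Unicode decimal digit (category Nd) with its ASCII form."""
    return "".join(
        str(unicodedata.decimal(ch)) if unicodedata.category(ch) == "Nd" else ch
        for ch in text
    )

def has_mixed_digits(text):
    """Return True if the text contains digits from more than one digit set."""
    # For a decimal digit, ord(ch) minus its value is the code point of that
    # set's zero, which identifies the digit set it belongs to.
    zeros = {
        ord(ch) - unicodedata.decimal(ch)
        for ch in text
        if unicodedata.category(ch) == "Nd"
    }
    return len(zeros) > 1

# The English word "Section" stands in for the Arabic heading here.
heading = "Section 1\u0663"
print(has_mixed_digits(heading))   # True
cleaned = normalize_digits(heading)
print(re.search(r"Section ([0-9]+)", cleaned).group(1))   # 13
```

Normalizing quietly fixes the input, while the mixed-digit check gives you something to report back to the authors. Which of the two you want depends on whether the source documents can still be corrected.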
The lesson here is this: WYSIWYG might be good if all you are doing is looking or printing. But if your documents also serve as input to other processes, WYSIWYG can cause some very non-obvious issues.