Author Topic: HANDLE_CHARSETS  (Read 11524 times)

Offline Wisp

  • Moderator
  • Planewalker
  • *****
  • Posts: 1176
HANDLE_CHARSETS
« on: July 19, 2014, 10:30:54 AM »
HANDLE_CHARSETS is a function that allows you to call the iconv program to convert your mod's TRA files into UTF-8 at runtime. The usage is hopefully straightforward and spares you from worrying about the implementation yourself. You need to include a Windows copy of iconv in your mod yourself. OS X and GNU/Linux come with iconv as part of the base system.

The function can be had in a stand-alone form here. It will also be included in the next WeiDU. Bear in mind that you will be excluded from any future updates so long as you use the stand-alone version, unless you update it yourself.

Documentation:
Quote
HANDLE_CHARSETS: runtime-convert TRA files into UTF-8 in a safe and easy manner. This is an ACTION function.

This function supports Windows, OS X and GNU/Linux. If the game is BG: EE or BGII: EE, TRA files are encoded into UTF-8 so the text can be installed without causing problems. HANDLE_CHARSETS needs to be used before any text is installed and is compatible with AUTO_TRA and all other methods of loading TRA files.

Conversion is handled by the program iconv. The program is available as part of the base system on OS X and GNU/Linux but a Windows version needs to be included in your mod. A Windows version can be downloaded here.

In order to function, HANDLE_CHARSETS needs to know a few things. First, you need to specify where you keep your TRA files. You do this with the variable tra_path. Second, HANDLE_CHARSETS needs to know where the Windows version of iconv is located. You do this with the variable iconv_path. Third, HANDLE_CHARSETS needs to know which character set the TRA files are in. Note that they can only be converted into UTF-8 and cannot already be in UTF-8. You provide this information with charset_table or tell HANDLE_CHARSETS to try to infer this by itself with infer_charsets. Lastly, HANDLE_CHARSETS needs to know which TRA files to convert and whether any of them should be reloaded. You can do this with noconvert_array, convert_array and reload_array.

  • INT_VAR infer_charsets to whether HANDLE_CHARSETS should try to infer which character set the TRA files are encoded in. It uses the contents of the LANGUAGE variable. If the contents of the variable can be recognised, HANDLE_CHARSETS will use the character set used by the localised version of BG2 for this language. If the contents of the variable cannot be recognised, or if the TRA files use a different character set than the expected one, HANDLE_CHARSETS will fail. Defaults to 0.
  • STR_VAR tra_path to the path where your mod's language directories are located. %tra_path%/%LANGUAGE% should be a valid directory containing TRA files.
  • STR_VAR iconv_path to the path where iconv.exe is located. Defaults to %tra_path%/iconv.
  • STR_VAR charset_table to the name of an associative array where the keys are the names of your language directories and the corresponding values are the character sets used by the respective language.
  • STR_VAR noconvert_array to the name of an array indexed by monotonically increasing integers starting from 0. The values should be the names of TRA files that should not be converted into UTF-8. All TRA files in the language directory except the ones listed in noconvert_array will be converted. The .tra file extension is implied and should not be explicitly provided. This variable should not be provided if you also provide convert_array.
  • STR_VAR convert_array to the name of an array indexed by monotonically increasing integers starting from 0. The values should be the names of TRA files that should be converted into UTF-8. Only those TRA files in the language directory which are listed in convert_array will be converted. The .tra file extension is implied and should not be explicitly provided. If this variable is provided, noconvert_array will not be used.
  • STR_VAR reload_array to the name of an array indexed by monotonically increasing integers starting from 0. The values should be the names of TRA files which should be reloaded after they have been converted. You should use this variable for reloading those TRA files loaded by LANGUAGE which should also be converted.
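A call following the description above might look like the sketch below. The mod folder mymod, the array names fl#noconvert and fl#reload, and the choice of setup.tra and game.tra are hypothetical placeholders for illustration, not something prescribed by the documentation.

```weidu
// Minimal sketch, assuming a mod folder named "mymod" whose language
// directories live in mymod/tra/<language> (hypothetical layout).

// Files that should NOT be converted; the .tra extension is implied.
ACTION_DEFINE_ARRAY fl#noconvert BEGIN setup END

// Files already loaded by LANGUAGE that should be reloaded once converted.
ACTION_DEFINE_ARRAY fl#reload BEGIN game END

LAF HANDLE_CHARSETS
  INT_VAR
    infer_charsets = 1            // infer the source charset from LANGUAGE
  STR_VAR
    tra_path = ~mymod/tra~        // %tra_path%/%LANGUAGE% must hold the TRA files
    noconvert_array = fl#noconvert
    reload_array = fl#reload
    // iconv_path is omitted, so it defaults to %tra_path%/iconv
END
```

Since convert_array is not provided here, every TRA file in the language directory except those named in the noconvert array would be converted.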

I know, I need to be more detailed about what infer_charsets can understand. In short, for the LANGUAGE folder names it understands the English names of the respective languages (french, spanish and so on), plus a handful of aliases or alternatives (polski and castilian, for example). Simplified Chinese and Traditional Chinese are, as of now, only understood as schinese and tchinese, respectively. The inferred character sets are all the (presumed) BG2 defaults, which seem to be the respective Windows code page (the ANSI series) for that language. The full logic can be found in the fl#HANDLE_CHARSETS#WHICH#INFER function and should be very accessible.
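If your language directories use names that cannot be inferred, or your files are not in the presumed BG2 default, the charset can be given explicitly through charset_table instead. The following is only a sketch: the array name charsets, the mod folder mymod and the directory names are hypothetical, and the code pages shown are the usual Windows ones for these languages, which you should check against your actual files.

```weidu
// Keys are the names of the language directories under tra_path,
// values are the charset names handed to iconv.
ACTION_DEFINE_ASSOCIATIVE_ARRAY charsets BEGIN
  ~english~ => ~CP1252~
  ~french~  => ~CP1252~
  ~polish~  => ~CP1250~
  ~russian~ => ~CP1251~
END

LAF HANDLE_CHARSETS
  STR_VAR
    tra_path = ~mymod/tra~       // hypothetical mod layout
    charset_table = charsets     // pass the NAME of the array
END
```

infer_charsets is left at its default of 0 here, so the table is the only source of charset information.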

Edwin Romance 2.0.6 has been updated to use HANDLE_CHARSETS and is amply commented to hopefully make it easy to apply HANDLE_CHARSETS to your own mod.

Questions, comments, bug reports are gladly heard.

Offline Kulyok

  • Global Moderator
  • Planewalker
  • *****
  • Posts: 6253
  • Gender: Female
  • The perfect moment is now.
Re: HANDLE_CHARSETS
« Reply #1 on: July 19, 2014, 03:19:49 PM »
That's awesome and I'm totally going to steal this code from another mod as soon as I see that very mod using it! Thank you so much!

Offline Isaya

  • Planewalker
  • *****
  • Posts: 47
Re: HANDLE_CHARSETS
« Reply #2 on: July 27, 2014, 10:28:37 AM »
I may be a bit late to the party as I just discovered this thread.

Congratulations, Wisp, your proposal looks like it covers the most typical modder use cases in a very simple and elegant way.
However, I have a few cases to report that don't seem to be covered by the current function. I'm not sure, however, that you should change anything. More likely the mods that don't fit could be adjusted to fit.

Would it be possible to add some extra extension or be allowed to specify one in the file names (so that ".tra" is not added in that case) without making the code much more complex? Currently BG2 Tweaks uses two tpa files that contain texts for a replace action intending to update item descriptions, for instance. In French, these texts use special characters and they are likely to crash BGEE if not converted (I don't remember if I checked specifically with item descriptions).
The mod could probably be adjusted to move the texts used in the REPLACE into a tra file and use a single tpa for all languages.

Another particular case I can think of is Secret of Bonehill, which uses a sub-folder of %LANGUAGE% for tra files. Secret of Bonehill also uses the %LANGUAGE% folder for the tra files used for setup (so the reload can still work as intended for these files). It looks like the ACTION_BASH command used will only find files directly in the %LANGUAGE% directory.
I assume such a case can be covered by specifying the list of files in their respective folders in %convert_array%, provided the "/" in "folder/file" does not break the code. Or maybe by moving them all into the %LANGUAGE% folder.

I know another mod, a french kit mod, that has all its tra files in a sub-folder of %LANGUAGE%. It has been adapted to EE by including two sets of files, so that case is not really an issue. I don't know if there are other mods using such organisation.

Finally, I was thinking of BP-BGT-Worldmap as another special case due to the additional worldmap directory. You know it better than I do, so I guess you probably thought of it as well. ;)


Alternatives for language names

In the same spirit as "polski", which I think is the native way of naming the language, here are the native names for other languages:
French : francais
Spanish : espanol or castellano
German : deutsch
Italian : italiano

I can confirm those names are used in some mods, Kiara-Zaiya and BGQE to name a few.

Offline Wisp

  • Moderator
  • Planewalker
  • *****
  • Posts: 1176
Re: HANDLE_CHARSETS
« Reply #3 on: July 27, 2014, 12:34:38 PM »
Quote
Would it be possible to add some extra extension or be allowed to specify one in the file names (so that ".tra" is not added in that case) without making the code much more complex? Currently BG2 Tweaks uses two tpa files that contain texts for a replace action intending to update item descriptions, for instance. In French, these texts use special characters and they are likely to crash BGEE if not converted (I don't remember if I checked specifically with item descriptions).
The mod could probably be adjusted to move the texts used in the REPLACE into a tra file and use a single tpa for all languages.
I'll look into it. But speaking generally, if you have translated or translatable text, it goes into a TRA file or you are doing it wrong. In the specific case of BG2 Tweaks, that will need to be fixed before it's compatible with any iconv approach, as e.g., Polish and French use different encodings and you can't correctly iconv a file containing multiple encodings.

Quote
Another particular case I can think of is Secret of Bonehill, which uses a sub-folder of %LANGUAGE% for tra files. Secret of Bonehill also uses the %LANGUAGE% folder for the tra files used for setup (so the reload can still work as intended for these files). It looks like the ACTION_BASH command used will only find files directly in the %LANGUAGE% directory.
I can make it a recursive search that converts all files in subdirectories as well.

Quote
French : francais
Spanish : espanol or castellano
German : deutsch
Italian : italiano
I'll add them.

Thanks.
« Last Edit: July 27, 2014, 12:35:45 PM by Wisp »

Offline Isaya

  • Planewalker
  • *****
  • Posts: 47
Re: HANDLE_CHARSETS
« Reply #4 on: July 27, 2014, 06:28:53 PM »
Quote
I'll look into it. But speaking generally, if you have translated or translatable text, it goes into a TRA file or you are doing it wrong. In the specific case of BG2 Tweaks, that will need to be fixed before it's compatible with any iconv approach, as e.g., Polish and French use different encodings and you can't correctly iconv a file containing multiple encodings.
I think my words were misleading. I didn't mean to have everything in a single file for all languages (hence various encodings) but instead a tpa using references to tra files for languages.
For instance, there is a file named arcane_descripts.tpa in each language directory containing the translated equivalent of
Code: [Select]
DEFINE_PATCH_MACRO ~arcane_descripts~ BEGIN

  REPLACE_TEXTUALLY ~\(Weight:[ %tab%]*[0-9]+\)~
  ~\1
Miscast Arcane Magic: +%patch_miscast%%~

END
Replacing with a tra reference would indeed avoid having to use iconv on tpa files
Code: [Select]
DEFINE_PATCH_MACRO ~arcane_descripts~ BEGIN

  REPLACE_TEXTUALLY @1 @2

END
The presence of variables in the text would probably require using REPLACE_EVALUATE instead (I'm not familiar with this).

Offline Wisp

  • Moderator
  • Planewalker
  • *****
  • Posts: 1176
Re: HANDLE_CHARSETS
« Reply #5 on: July 28, 2014, 11:58:18 AM »
Oh, like that. I can easily let the people specify the file extension(s) as a regexp (with "tra$" as the default). Would that be satisfactory?

Offline Isaya

  • Planewalker
  • *****
  • Posts: 47
Re: HANDLE_CHARSETS
« Reply #6 on: July 28, 2014, 04:10:31 PM »
Oh, great! I believe that covering that exception and multi-level directories will leave very few mods with a real need to change their ways.

Thank you for all this. Ever since I reported the requirement to convert to UTF-8 for BGEE, I've been hoping that WeiDU would end up including a way to automate it. Although the license conflict requires using an external tool, it looks like you have provided a very easy way to handle it for the typical mod.

Offline Mike1072

  • Planewalker
  • *****
  • Posts: 298
  • Gender: Male
Re: HANDLE_CHARSETS
« Reply #7 on: July 28, 2014, 05:57:46 PM »
I'm not sure I understand the need to convert files other than .tra.

In Isaya's arcane_descripts.tpa example, wouldn't traification be preferred to converting a file that mixes game text with WeiDU commands?

Mods are going to need to be updated in order to use this and it seems to make the most sense that all text with encoding issues should belong in a .tra file.

I know BG2 Tweaks' .tp2 embeds some similar blocks of code for a variety of languages.  That text should be moved into a .tra file if it has the potential to cause problems.
« Last Edit: July 28, 2014, 05:59:31 PM by Mike1072 »

Offline Wisp

  • Moderator
  • Planewalker
  • *****
  • Posts: 1176
Re: HANDLE_CHARSETS
« Reply #8 on: July 29, 2014, 01:10:37 PM »
Quote
Oh, like that. I can easily let the people specify the file extension(s) as a regexp (with "tra$" as the default). Would that be satisfactory?
Actually, I was wrong. Forget regexps. Regexps suck. It's either traification or some other way of converting non-TRA files (probably some sort of additional array, I'll see what I can do).

Offline Isaya

  • Planewalker
  • *****
  • Posts: 47
Re: HANDLE_CHARSETS
« Reply #9 on: July 30, 2014, 06:40:37 PM »
I only compared French and English, but the structure of the tpa files was the same. If no language requires a different script, I agree that moving the "from" and "to" texts of the REPLACE into a tra file would be the best solution. I'm wondering if REPLACE_TEXTUALLY will work with a variable in the "to" text though.

As an alternative to using a regexp, is it possible to use two HANDLE_CHARSETS calls with a different extension each time?

Offline Wisp

  • Moderator
  • Planewalker
  • *****
  • Posts: 1176
Re: HANDLE_CHARSETS
« Reply #10 on: July 31, 2014, 03:57:31 PM »
Quote
I'm wondering if REPLACE_TEXTUALLY will work with a variable in the "to" text though.
It will. You just SPRINT the traref into a variable as an intermediate step (because REPLACE_TEXTUALLY does not take trarefs as either oldtext or newtext).
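Applied to the arcane_descripts macro quoted earlier in the thread, the intermediate SPRINT step could look roughly like this. It is a sketch only: the tra reference @2, its contents and the variable name miscast_label are made up for illustration, and %tab% and %patch_miscast% are assumed to be set elsewhere, as in the original macro.

```weidu
// Assumes the loaded TRA file contains something like
//   @2 = ~Miscast Arcane Magic~
// (hypothetical reference number and wording).
DEFINE_PATCH_MACRO ~arcane_descripts~ BEGIN

  // REPLACE_TEXTUALLY does not accept trarefs directly,
  // so copy the translated label into a variable first.
  SPRINT miscast_label @2

  // The regexp stays in the code; only the translated label
  // comes from the TRA file.
  REPLACE_TEXTUALLY ~\(Weight:[ %tab%]*[0-9]+\)~
  ~\1
%miscast_label%: +%patch_miscast%%~

END
```

With this arrangement a single tpa can serve every language, and only the TRA files need to go through HANDLE_CHARSETS.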

Quote
As an alternative to using regexp, is it possible to use two HANDLE_CHARSETS calls with a different extension each time ?
Easy for me and extra work for others. I like it.
« Last Edit: July 31, 2014, 04:00:01 PM by Wisp »

Offline Isaya

  • Planewalker
  • *****
  • Posts: 47
Re: HANDLE_CHARSETS
« Reply #11 on: August 13, 2014, 04:04:38 AM »
I used HANDLE_CHARSETS to update the mod Sarah in order to add French language and make it compatible with BG2EE, taking Edwin Romance as reference. Mike1072 suggested a way to reduce the size of the iconv contribution to the archive in this topic.

Thank you for the explanations. I'll see if I can adjust the BG2 Tweaks files to move the texts into a tra file. It would be easier not to complicate HANDLE_CHARSETS with another parameter just for the sake of handling one specific case.

Offline Wisp

  • Moderator
  • Planewalker
  • *****
  • Posts: 1176
Re: HANDLE_CHARSETS
« Reply #12 on: August 14, 2014, 10:39:34 AM »
Quote
Mike1072 suggested a way to reduce the size of the iconv contribution to the archive in this topic.
Ok.

I forgot to do the recursive thing. I'll have it in before release.

Offline Kulyok

  • Global Moderator
  • Planewalker
  • *****
  • Posts: 6253
  • Gender: Female
  • The perfect moment is now.
Re: HANDLE_CHARSETS
« Reply #13 on: November 11, 2014, 12:21:10 AM »
(raises hand) I'm very impressed with HANDLE_CHARSETS, but I had to start a new topic in Modding Q&A, http://forums.pocketplane.net/index.php/topic,29289 , because (with me understanding very little about it; I basically just copied Isaya's code from Edwin Romance and Tiax) I'm having trouble figuring out how to deal with the Russian language on my own: it doesn't show in the game (only punctuation marks do). Everything's very good with French (for example), but not with my native language. Sigh. :(

Offline Wisp

  • Moderator
  • Planewalker
  • *****
  • Posts: 1176
Re: HANDLE_CHARSETS
« Reply #14 on: November 11, 2014, 02:41:37 PM »
Make sure the character set of your input text is correct. The default character set HANDLE_CHARSETS uses for Russian is CP1251 (Windows might unhelpfully call it "ANSI"). If there is a mismatch between the character set HANDLE_CHARSETS assumes and the one your files actually use, you might get something like your problem. You might also want to inspect the converted files (your tra files while the mod is installed) to see if they contain valid UTF-8 encoded Russian.

Offline Isaya

  • Planewalker
  • *****
  • Posts: 47
Re: HANDLE_CHARSETS
« Reply #15 on: November 16, 2014, 02:48:08 PM »
There may be another reason. For languages not based on the Latin alphabet, BGEE uses a special font in the lang/<language>/fonts directory. It may be that you need to create a russian directory in your BG2EE game by copying the english directory (to get at least dialog.tlk) and then copying the fonts directory from BGEE into it (restart from a base installation of BG2EE to get the base dialog.tlk).
Then set the game to Russian, tell WeiDU to use Russian for the choice of game language (erase weidu.conf if it already exists) and install the mod in Russian.

I'm not sure you really need to create the russian language directory. However, in the BGEE forum, someone reported that he had to set the game to Korean and copy the dialog.tlk from english in order to be able to use a custom font file. So it appears the game engine only looks for a fonts directory for specific languages and doesn't do it for English. At least it did before V1.3 (I think the report I'm referring to is a bit old).

Offline Kulyok

  • Global Moderator
  • Planewalker
  • *****
  • Posts: 6253
  • Gender: Female
  • The perfect moment is now.
Re: HANDLE_CHARSETS
« Reply #16 on: November 17, 2014, 01:00:21 AM »
I will try this on my own game(copy the directory and try to find the fonts, that is), and check if it works - if it does, I'll be recommending this on the forums to Russian users, I guess. Thank you!

(my files seem to be in CP-1251, and the new tras read normally via notepad, btw)

Offline Isaya

  • Planewalker
  • *****
  • Posts: 47
Re: HANDLE_CHARSETS
« Reply #17 on: December 03, 2014, 04:17:53 PM »
I noticed the version included in WeiDU 237 behaves differently from the version previously available as an external file. In the past, only the language selected was converted to UTF-8. The version included in WeiDU 237 seems to convert the tra files for all languages available.
I noticed this while updating BG1 NPC to use HANDLE_CHARSETS. The conversion took a long time on my PC, more than a minute. That prompted me to check what was going on. I checked with another mod, Tiax, and WeiDU also converted files for all three available languages.

The backup of the original files is neatly organised, the language name being inserted in the file name, except for English (maybe because it's the first LANGUAGE statement). That makes me wonder if this behaviour is actually intended. Could you tell us the intended behaviour, please, Wisp? Thank you.

Offline Wisp

  • Moderator
  • Planewalker
  • *****
  • Posts: 1176
Re: HANDLE_CHARSETS
« Reply #18 on: December 03, 2014, 05:24:07 PM »
Quote
I noticed the version included in WeiDU 237 behaves differently from the version previously available as an external file. In the past, only the language selected was converted to UTF-8. The version included in WeiDU 237 seems to convert the tra files for all languages available.
I noticed this while updating BG1 NPC to use HANDLE_CHARSETS. The conversion took a long time on my PC, more than a minute. That prompted me to check what was going on. I checked with another mod, Tiax, and WeiDU also converted files for all three available languages.

The backup of the original files is neatly organised, the language name being inserted in the file name, except for English (maybe because it's the first LANGUAGE statement). That makes me wonder if this behaviour is actually intended. Could you tell us the intended behaviour, please, Wisp? Thank you.
Thanks for spotting it. The recursive behaviour is over ~%tra_path%~ rather than ~%tra_path%/%LANGUAGE%~. It's harmless, but as you say, quite wasteful. However, I also found a more serious bug. HANDLE_CHARSETS in 237 converts files even if they are present in noconvert_array, which can cause problems (up to and including mods that can't be uninstalled without manual intervention. Fortunately they merely can't be uninstalled; it's not a game-hosing deal where mods would be removed from the log without actually being uninstalled).

Offline jastey

  • Global Moderator
  • Planewalker
  • *****
  • Posts: 1524
  • Gender: Female
Re: HANDLE_CHARSETS
« Reply #19 on: January 02, 2015, 03:45:55 PM »
One question to HANDLE_CHARSETS, or maybe it's a request.

As far as I can see, with HANDLE_CHARSETS I can no longer load the setup.tra(s) from two different language versions (as only one language version gets converted). What I like to do in my mods is reload the English setup.tra and then the language version's setup.tra; this way the install won't fail if the language version is not complete (for whatever reason).

-I can live without this feature, but it is something I would like to point out: with HANDLE_CHARSETS my mod is restricted to the one language version the player chose, and I cannot provide a fail-safe option with the English files loaded first to make sure the language version does not fail due to a missing text line. I am working around this for the normal tra files, as I copy them all elsewhere first anyway, because I don't like my original tra files being converted and back (I usually edit them while my mods are installed, so I need them untouched). But this way I can only copy whole tra files, not load two tra files of the same name to get the missing strings.

-Would it be possible to have the option to let HANDLE_CHARSETS convert e.g. two language versions (e.g. always English and the chosen one) including the possibility to load the files of two language versions? (Or maybe I am the only one wanting this?) EDIT: It would have to be the setup.tra and game.tra etc. only, anyway.
« Last Edit: January 02, 2015, 03:50:33 PM by jastey »

Offline Wisp

  • Moderator
  • Planewalker
  • *****
  • Posts: 1176
Re: HANDLE_CHARSETS
« Reply #20 on: January 03, 2015, 10:29:48 AM »
Quote
-Would it be possible to have the option to let HANDLE_CHARSETS convert e.g. two language versions (e.g. always English and the chosen one) including the possibility to load the files of two language versions? (Or maybe I am the only one wanting this?) EDIT: It would have to be the setup.tra and game.tra etc. only, anyway.
You are quite right in reporting this. Thank you. I think I should be able to add a "default_language" variable, which could default to English. The default language would be converted regardless of which language the user chooses, so you would be able to use that for defensive loading of TRA files. reload_array would also reload the TRA files of the default language before reloading those of the user's language.

Offline jastey

  • Global Moderator
  • Planewalker
  • *****
  • Posts: 1524
  • Gender: Female
Re: HANDLE_CHARSETS
« Reply #21 on: January 03, 2015, 01:29:36 PM »
This would be great! Thank you in advance!

Offline Wisp

  • Moderator
  • Planewalker
  • *****
  • Posts: 1176
Re: HANDLE_CHARSETS
« Reply #22 on: January 25, 2015, 03:49:24 PM »
Quote
I think I should be able to add a "default_language" variable, which could default to English. The default language would be converted regardless of which language the user chooses, so you would be able to use that for defensive loading of TRA files. reload_array would also reload the TRA files of the default language before reloading those of the user's language.
This has now been coded. Those with a different default language than English can provide the default_language variable to HANDLE_CHARSETS. Others do not need to do anything and English TRA files will automatically be converted in addition to the TRA files of whatever language the user selects at the start of the installation (files will never be converted twice).
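For a mod whose fallback language is not English, the call might then look like the sketch below. The mod folder mymod, the french directory and the array name fl#reload are hypothetical, and passing default_language as a STR_VAR is an assumption based on it holding the name of a language directory.

```weidu
// Files loaded by LANGUAGE that should be reloaded after conversion.
ACTION_DEFINE_ARRAY fl#reload BEGIN game END

LAF HANDLE_CHARSETS
  INT_VAR
    infer_charsets = 1
  STR_VAR
    tra_path = ~mymod/tra~          // hypothetical mod layout
    default_language = ~french~     // assumed STR_VAR; converted in addition to %LANGUAGE%
    reload_array = fl#reload        // default language is reloaded first, then the user's
END
```

Mods that fall back to English would simply leave default_language out.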

Offline Almateria

  • Planewalker
  • *****
  • Posts: 76
Re: HANDLE_CHARSETS
« Reply #23 on: April 17, 2015, 05:14:52 AM »
Quote
OS X and GNU/Linux come with iconv as part of the base system.
Does this mean there's no reason to include iconv in the Linux and OSX versions of a mod?

Offline Wisp

  • Moderator
  • Planewalker
  • *****
  • Posts: 1176
Re: HANDLE_CHARSETS
« Reply #24 on: April 17, 2015, 11:38:40 AM »
Quote
Does this mean there's no reason to include iconv in the Linux and OSX versions of a mod?
Correct.

 
