How to organize localization string resources

internationalization, localization, naming, resources

We're developing a large application, consisting of many small packages. Each package has its own set of resource files for localization.

What's the best approach to organizing and naming the localization strings?

Here are my thoughts so far:

Handling duplicates

The same text (say, "Zip code") may occur multiple times within a given package. Programming instinct (DRY) tells me to create a single string resource shared by all occurrences.

Then again, a translator may want to choose a long translation ("Postleitzahl") in some places and a shorter one ("PLZ") in places with less space. Or we may decide to append a colon to some occurrences ("Zip code:"), but not to others. Or we may require a different capitalization ("zip code") in some places. All these arguments point to creating one resource per usage, even if their contents are identical.

Naming

If we aim to eliminate duplicates, it makes sense to name resources by content, maybe hinting at the kind of usage via prefix. So we may have labelOK = "OK", messageFileTooLarge = "The file exceeds the maximum file size.", and labelZipCode = "Zip code".

Naming by content has the advantage of handling format arguments naturally: The resource messageFileHas_0_MBWhileMaximumIs_1_MB clearly takes two formatting arguments, the actual file size and the maximum file size.
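To make the format-argument idea concrete, here is a minimal sketch in Python. The resource table, the English text, and the helper names `expected_arg_count` and `render` are assumptions made up for illustration; only the key name comes from the question. The sketch reads the `_0_`, `_1_` markers from the key and refuses to format with the wrong number of arguments.

```python
import re

# Hypothetical resource table. The key follows the naming idea from the
# question; the English text is an assumed example.
RESOURCES = {
    "messageFileHas_0_MBWhileMaximumIs_1_MB":
        "The file has {0} MB while the maximum is {1} MB.",
}


def expected_arg_count(key: str) -> int:
    """Count the distinct _N_ markers embedded in a content-based key."""
    return len(set(re.findall(r"_(\d+)_", key)))


def render(key: str, *args: object) -> str:
    """Format a resource, checking the argument count against the key."""
    expected = expected_arg_count(key)
    if len(args) != expected:
        raise ValueError(f"{key} takes {expected} arguments, got {len(args)}")
    return RESOURCES[key].format(*args)


print(render("messageFileHas_0_MBWhileMaximumIs_1_MB", 12, 10))
# The file has 12 MB while the maximum is 10 MB.
```

A check like this only works as long as the key and the text are kept in sync, which is one more reason content-based names become awkward once the wording changes.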

If we allow duplicates, however, naming by content alone doesn't make sense. In order to get unique resource names, we must somehow include the place of usage in the resource name. That works for graphical controls, although the identifiers tend to get a bit long: fileSelectionConfirmationButtonText = "OK", customerDetailsTableColumnZipCode = "Zip Code". However, for non-visual code files, it gets harder. How do you name a specific usage of a string if you don't know where it will eventually be displayed? By code file and function name? Seems rather clumsy and brittle to me.

All in all, I'm leaning toward allowing duplicates, but I'm struggling to find a consistent naming scheme that supports this.

Edit: This question has two aspects: How to organize resources (DRY vs. duplicates) and how to name them. So far, the answers have concentrated on the first aspect. I'd appreciate some feedback regarding naming conventions!

Best Answer

I would accept duplication whenever you cannot be absolutely sure the meaning is exactly the same in all cases a certain string is used.

Even if two labels always contain the same string in English (or your native tongue) they will not necessarily be the same in all languages. Accepting duplication may give you (or rather the translators) the flexibility needed to handle such situations.

As an example: Consider a label "Condition", which, depending on context, might get translated to "Zustand" or "Bedingung" in German (among lots of other possible translations).
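A small Python sketch of what this looks like in practice. The two resource keys and the `translate` helper are hypothetical names chosen for illustration, not part of any particular framework. Both keys hold the same English text, yet each can receive its own German translation.

```python
# Hypothetical per-usage keys: identical in English, different in German.
TRANSLATIONS = {
    "en": {
        "deviceStatusConditionLabel": "Condition",
        "filterRuleConditionLabel": "Condition",
    },
    "de": {
        "deviceStatusConditionLabel": "Zustand",
        "filterRuleConditionLabel": "Bedingung",
    },
}


def translate(key: str, language: str, fallback: str = "en") -> str:
    """Look up a key, falling back to the default language if missing."""
    table = TRANSLATIONS.get(language, {})
    if key in table:
        return table[key]
    return TRANSLATIONS[fallback][key]


print(translate("deviceStatusConditionLabel", "de"))  # Zustand
print(translate("filterRuleConditionLabel", "de"))    # Bedingung
```

With a single shared "Condition" resource, the translator would have been forced to pick one of the two German words for both places.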

Related Solutions

Isn’t number localization just unnecessary

Why should non-Anglos have to decode dates, numbers, etc. while Anglos can just read them? Numerical and date localization is absolutely necessary if you want non-Anglos to feel, you know, welcome as users and customers. Why should a German user have to work out what your number is instead of, you know, getting it in his or her own language's format?

Further, your view of number formats (and dates: see below) is hopelessly simplistic. For example, you'd undoubtedly find numbers like 1,234,567 "natural" and "obvious" and "logical" ... but what about people who come from cultures with myriad-based numbering schemes? My students (Chinese), for example, are always confused about numbers over 1000 because they group numbers differently. A more "natural" grouping for their thought processes (which include a myriad above the thousand point) is 123,4567. Further, there are many contexts to which the European number systems in general are simply not suited. It would be nice in those circumstances to be able to write the all-Chinese 一百二十三万四千五百六十七 or even various hybrid systems that are in common use here.

Your idea for dates is wrong-headed too. You've correctly pointed out how 01/02/03 is ambiguous (if only because Americans refuse to comply with standards on dates) and suggest instead that Feb 3 2001 is unambiguous. I'm not sure, however, if you've noticed something there. It's unambiguous and unambiguously English. Going back to my students, I'm pretty damned certain that they'd far prefer to see 2001年2月3日 (or even 二〇〇一年二月三日) which is both unambiguous and, get this, something they can read without having to decode.
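As a rough illustration of the point about dates, the Python sketch below renders the same date in two ways. The function names are made up for this example, and a real application would normally take its patterns from locale data rather than hard-coding them.

```python
from datetime import date


def format_date_iso(d: date) -> str:
    """Unambiguous ISO 8601 form: year-month-day."""
    return d.isoformat()


def format_date_zh(d: date) -> str:
    """Year-month-day with Chinese unit characters (hard-coded pattern)."""
    return f"{d.year}年{d.month}月{d.day}日"


d = date(2001, 2, 3)
print(format_date_iso(d))  # 2001-02-03
print(format_date_zh(d))   # 2001年2月3日
```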

The bottom line on i18n and l10n: Do you want money and/or users? You make what your users want. Your users want things in their own language, not in yours. End of story.

edited to add

It gets even worse than myriad-based systems. Take a look at Indian numbering for this lovely progression:

1
10
100
1000
10,000
1,00,000
10,00,000
1,00,00,000

...and so on up to:

100,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,000

See that grouping by three at the end? See that grouping by two after the grouping by three? See the sudden reintroduction of a group by three again?
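The grouping schemes discussed in this answer differ only in how many digits go into each group. The following Python sketch, with a made-up helper `group_digits`, takes the size of the rightmost group and the size of all remaining groups as parameters. It is an illustration of the idea only; production code would normally rely on locale data instead of hand-written rules.

```python
def group_digits(value: int, first: int = 3, rest: int = 3, sep: str = ",") -> str:
    """Group digits from the right.

    The rightmost group has `first` digits, every group to its left has
    `rest` digits. Both sizes are assumed to be positive.
    """
    sign = "-" if value < 0 else ""
    digits = str(abs(value))
    groups = [digits[-first:]]
    digits = digits[:-first]
    while digits:
        groups.append(digits[-rest:])
        digits = digits[:-rest]
    return sign + sep.join(reversed(groups))


print(group_digits(1234567))                    # 1,234,567
print(group_digits(1234567, first=4, rest=4))   # 123,4567
print(group_digits(10000000, first=3, rest=2))  # 1,00,00,000
```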

further edited to add (I just can't keep off this subject it seems!)

Even the assumption of decimal number systems being universal is wrong. There are native numbering systems that are 4-based, 5-based, 8-based (octal), 10-based (decimal), 12-based, 20-based and even 60-based. These are all systems which have been in active use by real people (as in not made up for science fiction stories). Not all of these are still living (although we can see, for example, vestiges of 12- and 60-based numerical systems in English terminology).

As for dates, let us not forget the lunar calendars still in active use in much of the world. The Muslim world tends to use a lunar calendar where the dates can drift throughout the whole year while the Chinese use one with a complicated system that keeps the dates never more than a month away from true. (And that's just naming two off the top of my head.)

How do you guys handle translation for software localization

We used to work with a translation agency which did the translation for our enterprise product on a continuous basis.

To get there, you need some kind of tracking and reporting system for all of your textual resources. New texts should go to the translation queue automatically, so that it is easy to see what is pending translation. There must also be a way to report wrong or low-quality translations. With that in place, you can either build a simple web interface that gives the agency continuous access to the stream of pending work, or add the ability to export the next batch of resource items, send them to the agency, and import the result.
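One way to picture the "pending translation" queue is the Python sketch below. The `Translation` record and the `pending_keys` function are hypothetical; the assumption is that every stored translation remembers the source text it was made from, so that changed texts are queued again along with new ones.

```python
from dataclasses import dataclass


@dataclass
class Translation:
    source_text: str  # source text the translator worked from
    text: str         # the translated text


def pending_keys(source: dict, translations: dict) -> list:
    """Return keys that are untranslated or whose source text has changed."""
    pending = []
    for key, text in source.items():
        done = translations.get(key)
        if done is None or done.source_text != text:
            pending.append(key)
    return sorted(pending)


source = {
    "labelOK": "OK",
    "labelZipCode": "Zip code",
    "messageFileTooLarge": "The file exceeds the maximum file size.",
}
german = {
    "labelOK": Translation("OK", "OK"),
    "labelZipCode": Translation("ZIP", "PLZ"),  # source text changed since
}
print(pending_keys(source, german))
# ['labelZipCode', 'messageFileTooLarge']
```

The resulting list is what would be shown in a web interface or written to the next export batch.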

It's not really feasible to entrust this task to a random group of people: quality and predictability will vary widely. Even with a trusted and experienced agency there are often issues:

They miss the usage context when all they see is a single short string. You should definitely have additional attributes for commenting on each textual resource to help them understand the environment, but that naturally means more work for you as a developer.
They lack industry knowledge, and the translations suffer from it. There is hardly any solution short of finding an agency with the relevant industry knowledge (hard) or employing and training a person in-house.
They have little understanding of <html> or <xml> tags in resources, or of {variablePlaceholders}, so they regularly break the software. This happens either through lack of attention or because staff change continuously and knowledge of these things is not passed on to the next person.
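The last issue can be caught automatically before a translation is imported. The Python sketch below is one possible check, with invented names (`markup_tokens`, `broken_markup`) and deliberately simple regular expressions: it compares the placeholders and tags in the source text with those in the translation and reports what went missing or was added.

```python
import re
from collections import Counter

# Simplified patterns: {name} style placeholders and <tag> / </tag> markup.
PLACEHOLDER = re.compile(r"\{\w+\}")
TAG = re.compile(r"</?[A-Za-z][^<>]*>")


def markup_tokens(text: str) -> Counter:
    """Count every placeholder and tag that occurs in the text."""
    return Counter(PLACEHOLDER.findall(text) + TAG.findall(text))


def broken_markup(source: str, translation: str):
    """Return (missing, unexpected) tokens of a translation."""
    src, dst = markup_tokens(source), markup_tokens(translation)
    missing = sorted((src - dst).elements())
    unexpected = sorted((dst - src).elements())
    return missing, unexpected


source = "Hello <b>{userName}</b>, you have {count} messages."
bad = "Hallo <b>{Benutzername}</b>, Sie haben {count} Nachrichten."
print(broken_markup(source, bad))
# (['{userName}'], ['{Benutzername}'])
```

A translation that yields two empty lists has kept all tags and placeholders intact, although the check says nothing about whether they ended up in a sensible position.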
