Designing URLs for Multilingual Web Sites

Friday, January 12, 2007

Usually when starting a new web site, a company will design the site for use in only one language. As companies grow, however, it is often prudent to internationalize the site and make it accessible in many languages. If the site is to become a large commercial success in countries around the world, this step is necessary and crucial.

Still too often companies show their naïvety by failing to prepare for internationalization and later incur large costs when the entire site has to be retrofitted to support multiple languages. Of course if one were creating a site with something like Django, the site would be completely ready for translation into other languages and would require no retrofit down the line.

One interesting aspect of designing a multilingual web site is deciding how to represent the language choice in the URLs. In this article I want to explore several possible methods for indicating a language choice for a given resource with a URL on a multilingual web site and decide on the best one.

Below is a non-exhaustive list of methods for selecting a language based on my own brainstorming and collaboration with my friend Ted. All methods involve accessing the resource at /bar/baz on the domain example.com.

Language-specific Sub-domains (1)

One can use two-letter language codes as sub-domains on the main site’s domain name.

Examples

  • en.example.com/bar/baz
  • de.example.com/bar/baz

Evaluation

  • It makes further use of sub-domains for other purposes cumbersome and unpleasant — think of api.de.example.com vs. de.api.example.com and the other restrictions it puts on future use of sub-domains in general.
  • Requires DNS management, which could get complicated.
  • Clean and simple.
  • Allows for direct permalinks.
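
As a rough illustration, here is a minimal Python sketch of how a server might read the language from the Host header under this scheme. The SUPPORTED set, the DEFAULT value, and the function name are assumptions made for this example and do not come from any particular framework.

```python
# Hypothetical configuration: the languages this imaginary site supports.
SUPPORTED = {"en", "de"}
DEFAULT = "en"


def language_from_host(host, base_domain="example.com"):
    """Return the language code found directly in front of base_domain."""
    hostname = host.split(":", 1)[0].lower()  # drop an optional port
    suffix = "." + base_domain
    if hostname.endswith(suffix):
        prefix = hostname[: -len(suffix)]
        # Accept only a single, known label such as "de".
        if prefix in SUPPORTED:
            return prefix
    return DEFAULT


print(language_from_host("de.example.com"))
```

Note that a host such as api.de.example.com falls through to the default here, which is one small instance of the sub-domain ordering problem described in the first bullet.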

Modified Directory Structure (2)

The directory structure of the resources on the site can contain the language code, whether these are actual directories or achieved through some programmatic means.

Examples

  • example.com/en-US/bar/baz
  • example.com/en-GB/bar/baz
  • example.com/de/bar/baz

Evaluation

  • Not very semantic — the resource being accessed at /bar/baz is not underneath the language code in a hierarchical sense.
  • May be difficult to maintain if site content is split apart in mirrored sections under each language directory.
  • Aesthetically confusing and ugly.
  • Allows for direct permalinks.
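
When the language directory is achieved through "programmatic means", the work is mostly a matter of peeling off the first path segment. The sketch below assumes a hypothetical SUPPORTED set and function name; a real application would plug this into its own routing layer.

```python
# Hypothetical set of language codes, stored lower-case for comparison.
SUPPORTED = {"en-us", "en-gb", "de"}


def split_language_prefix(path):
    """Split '/de/bar/baz' into ('de', '/bar/baz').

    Returns (None, path) when the first segment is not a known language.
    """
    first, _, rest = path.lstrip("/").partition("/")
    if first.lower() in SUPPORTED:
        return first, "/" + rest
    return None, path


print(split_language_prefix("/en-US/bar/baz"))
```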

Language Code in Querystring (3)

It’s possible to just set a language variable in the querystring of the URL.

Examples

  • example.com/bar/baz?lang=en-US
  • example.com/bar/baz?lang=en-GB
  • example.com/bar/baz?lang=de

Evaluation

  • Issues with proper caching of pages.
  • Search engines will not store the querystring in their links to the resource.
  • Not semantic when /bar/baz is considered as a static resource — one shouldn’t pass querystring variables for processing to a static resource.
  • Messy and difficult to maintain when using other (properly employed) querystring variables.
  • Ugly.
  • Allows for direct permalinks.
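
For comparison, reading the language from the querystring takes only a few lines with Python's standard urllib.parse module. The parameter name lang matches the examples above; the function name and the default value are assumptions for this sketch.

```python
from urllib.parse import parse_qs, urlsplit


def language_from_query(url, default="en-US"):
    """Return the first 'lang' value in the querystring, or the default."""
    query = urlsplit(url).query
    values = parse_qs(query).get("lang")
    return values[0] if values else default


print(language_from_query("http://example.com/bar/baz?lang=de"))
```

The language variable has to share the querystring with every other parameter on the page, which is where the maintenance concern in the list above comes from.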

Country-specific TLDs (4)

A different domain for each country representing a language can be purchased.

Examples

  • example.us/bar/baz
  • example.co.uk/bar/baz
  • example.de/bar/baz

Evaluation

  • Expensive — recurring registration fees.
  • Country-specific TLDs speak more to localization than simple language selection — there are several countries where more than one language could be selected; also brings up geographic concerns with respect to server locations, etc.
  • May be difficult to manage inter-language linking (e.g. example.us/bar/baz to example.de/bar/quux).
  • Requires DNS management.
  • Allows for direct permalinks.
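
The inter-language linking concern can be made concrete with a small Python sketch: every link to another language has to swap the whole domain. The DOMAIN_FOR_LANGUAGE table and the function name are hypothetical and exist only for this example.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical mapping from language code to the domain bought for it.
DOMAIN_FOR_LANGUAGE = {
    "en-US": "example.us",
    "en-GB": "example.co.uk",
    "de": "example.de",
}


def link_in_language(url, language):
    """Rewrite the host of url to the domain registered for language."""
    parts = urlsplit(url)
    return urlunsplit(parts._replace(netloc=DOMAIN_FOR_LANGUAGE[language]))


print(link_in_language("http://example.us/bar/baz", "de"))
```

Such a table has to be kept in step with the registered domains, and it only covers the case where the path is the same on every domain.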

Pure Cookie-based Preference (5)

One could just have a UI element on the site itself to allow the selection of a language preference which is then stored in a cookie on the user’s computer. This removes the need for representation of the language in the URL itself.

Examples

  • example.com/bar/baz with cookie (lang=en-US)
  • example.com/bar/baz with cookie (lang=en-GB)
  • example.com/bar/baz with cookie (lang=de)

Evaluation

  • Won’t work with user agents that don’t support cookies.
  • Doesn’t allow for direct permalinks to a resource in a specific language.
  • Invisible and therefore clean.
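
On the server side, the stored preference arrives in the Cookie request header. A minimal sketch using Python's standard http.cookies module follows; the cookie name lang comes from the examples above, while the function name and default are assumptions.

```python
from http.cookies import SimpleCookie


def language_from_cookie(cookie_header, default="en-US"):
    """Return the value of the 'lang' cookie, or the default if absent."""
    cookie = SimpleCookie()
    cookie.load(cookie_header)
    morsel = cookie.get("lang")
    return morsel.value if morsel is not None else default


print(language_from_cookie("session=abc; lang=de"))
```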

Use of Accept-Language HTTP Header (6)

Most user agents (such as web browsers) send an HTTP header, Accept-Language, that will indicate which natural languages the agent is ready to support. Often, international users will have their native language set first in this list of languages sent by the agent.

Examples

  • example.com/bar/baz with Accept-Language: en-us or en
  • example.com/bar/baz with Accept-Language: en-gb
  • example.com/bar/baz with Accept-Language: de

Evaluation

  • Won’t work with user agents that don’t send the Accept-Language HTTP header.
  • Makes the control of and transitioning between languages very difficult or impossible from the site’s point of view.
  • Won’t cater to e.g. German users who wish to read a site in English.
  • Also doesn’t allow for direct permalinks to a resource in a specific language.
  • Invisible and therefore clean.
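
To show what "first in this list" means in practice, here is a simplified Python sketch that orders the tags in an Accept-Language header by their q-values and picks the first one the site supports. It is not a complete implementation of HTTP language negotiation, and the function names and the supported set are assumptions.

```python
def parse_accept_language(header):
    """Return the language tags in header, most preferred first."""
    ranked = []
    for position, item in enumerate(header.split(",")):
        tag, _, params = item.strip().partition(";")
        tag = tag.strip()
        if not tag:
            continue
        quality = 1.0  # a tag without a q-value counts as q=1
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0
        if quality > 0:
            # Sort by descending quality, then by original position.
            ranked.append((-quality, position, tag))
    return [tag for _, _, tag in sorted(ranked)]


def choose_language(header, supported, default="en"):
    """Pick the first acceptable tag (or its primary subtag) in supported."""
    for tag in parse_accept_language(header):
        lowered = tag.lower()
        if lowered in supported:
            return lowered
        primary = lowered.split("-")[0]
        if primary in supported:
            return primary
    return default


print(choose_language("fr, de;q=0.9, en;q=0.5", {"en", "de"}))
```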

Semi-colon Path Parameter at End of Path (7)

According to RFC 3986 - URI: Generic Syntax § 3.3, a path in a URI (or a URL) can contain semi-colons to specify path parameters and values to each segment of the path. One could use this in the simplest way possible to tack on a language code to the end of the path to a resource.

Examples

  • example.com/bar/baz;en-US
  • example.com/bar/baz;en-GB
  • example.com/bar/baz;de

Evaluation

  • RFC 3986 recommends using semi-colon parameters along with values, like example.com/bar/baz;lang=en-US, which is uglier and seems unnecessary.
  • Use of semi-colon parameters is unfamiliar to many developers and users, so it might fail in some agents or URL libraries.
  • Semantically rich — a resource is being retrieved at /bar/baz and extra information is being supplied via a parameter intended for such a purpose. This is different from the querystring approach because the semi-colon parameterized URL will be used as-is by search engines and other consumers of URLs.
  • Allows for direct permalinking.
  • Lends itself to easy generic full-stack handling in web application code — the language codes can be stripped on the way in and added on the way out, so as not to interfere with the internal URL handling in the least.
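
The "stripped on the way in and added on the way out" idea from the last bullet can be sketched in Python as a pair of helpers. Both the bare form (;de) and the key-value form (;lang=de) are accepted. The SUPPORTED set and the function names are assumptions for this example.

```python
# Hypothetical set of language codes the site accepts.
SUPPORTED = {"en-US", "en-GB", "de"}


def strip_language(path):
    """'/bar/baz;de' or '/bar/baz;lang=de' -> ('/bar/baz', 'de')."""
    head, sep, param = path.rpartition(";")
    if not sep:
        return path, None
    if param.startswith("lang="):
        param = param[len("lang="):]
    if param in SUPPORTED:
        return head, param
    return path, None


def add_language(path, language):
    """Append the language parameter to an internal path for output."""
    return f"{path};{language}" if language else path


internal_path, language = strip_language("/bar/baz;en-GB")
print(internal_path, language, add_language("/bar/quux", language))
```

The application's own routing only ever sees /bar/baz, and outgoing links get the parameter re-attached in one place.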

Comma Path Parameter at End of Path (8)

RFC 3986 also states that one can use a comma-delimited path parameter on segments of the path in the URL. Furthermore, it recommends that these parameters be used when only a value (and not a key) needs to be supplied.

Examples

  • example.com/bar/baz,en-US
  • example.com/bar/baz,en-GB
  • example.com/bar/baz,de

Evaluation

  • Also unfamiliar to developers and users, though could cause fewer problems with URL libraries because of its slightly wider usage.
  • Recommended for precisely this use by RFC 3986.
  • Semantically rich for the same reasons as the semi-colon approach.
  • Allows for direct permalinking.
  • Also lends itself to easy generic full-stack handling in web application code.
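
One way the generic handling might look is a small piece of WSGI middleware that removes the trailing comma parameter before the application sees the path. This is only a sketch: the class name, the SUPPORTED set, and the app.language environ key are all hypothetical.

```python
# Hypothetical set of language codes the site accepts.
SUPPORTED = {"en-US", "en-GB", "de"}


class CommaLanguageMiddleware:
    """Strip a trailing ',<language>' from PATH_INFO and record it."""

    def __init__(self, app, default="en-US"):
        self.app = app
        self.default = default

    def __call__(self, environ, start_response):
        path = environ.get("PATH_INFO", "")
        head, sep, tail = path.rpartition(",")
        if sep and tail in SUPPORTED:
            environ["PATH_INFO"] = head
            environ["app.language"] = tail  # hypothetical environ key
        else:
            environ["app.language"] = self.default
        return self.app(environ, start_response)


def demo_app(environ, start_response):
    """Tiny WSGI app that reports the path and language it was given."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    body = f"{environ['PATH_INFO']} in {environ['app.language']}"
    return [body.encode("utf-8")]


wrapped = CommaLanguageMiddleware(demo_app)
print(wrapped({"PATH_INFO": "/bar/baz,de"}, lambda status, headers: None))
```

The wrapped application routes on /bar/baz as usual and reads the language from the environ, so none of its internal URL handling has to know about the comma.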

Update (2007-02-15):

Mike Schinkel has outlined five methods which fall into two distinct categories for my purposes: including the language code in the resource name (now #9) and using a modified directory structure (my #2).

Language Code in the Resource name (9)

Given the resource at /bar/baz, the resource name is baz. This method suggests modifying the resource name to include the language code. Whether the code is added as a prefix or a suffix, and which delimiter is used (as long as it is a valid URL path character), is inconsequential.

Examples

  • example.com/bar/baz.en-US
  • example.com/bar/baz-en-GB
  • example.com/bar/de.baz

Evaluation

  • Poor semantics: when one wishes to request the resource at /bar/baz, the path component of the URL should be exactly /bar/baz, not something entirely different that must be modified to reflect the true resource being accessed.
  • Not differentiable as a resource from the same resource in a different language. In other words, /bar/baz.en-US is not an entirely different resource from /bar/baz.de, but rather a different way of serving the same resource, so the URL path components should be identical.
  • Allows for direct permalinking.

This added method does not change my conclusions below.

As I said before, this list is non-exhaustive and any additions should be noted in the comments on this article. There isn’t going to be any “silver bullet” approach that gives a perfect clean solution to the entire dilemma, but we can learn from the methods that seem to solve the problem in the best way with the most ease. It’s also possible that a combination of some of these techniques together will achieve the best overall solution. Before making a decision, it seems prudent to see what other sites are using that have already solved this problem.
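
A combination of techniques could be as simple as a precedence order: an explicit language in the URL wins, then a stored cookie preference, then the Accept-Language header. The Python sketch below is one possible ordering, not a recommendation, and the function name and arguments are assumptions.

```python
def resolve_language(url_language, cookie_language, accepted_languages,
                     supported, default="en"):
    """Return the first supported candidate: URL, then cookie, then header."""
    candidates = [url_language, cookie_language, *accepted_languages]
    for candidate in candidates:
        if candidate and candidate.lower() in supported:
            return candidate.lower()
    return default


# No language in the URL, so the cookie preference is used.
print(resolve_language(None, "en", ["de"], {"en", "de"}))
```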

Case Studies: High-profile Multilingual Web Sites

  • amazon.com: Country-specific TLDs (4)
  • google.com: Modified Directory Structure (2), Country-specific TLDs (4)
  • yahoo.com: Language-specific Sub-domains (1)
  • ebay.com: Country-specific TLDs (4)
  • wikipedia.org: Language-specific Sub-domains (1)

These observations were made quickly and without detailed investigation. It is quite possible that these sites use combinations of (5) and (6) as well, but those are harder to test for. As with the list of methods, feel free to comment with other web sites and the methods they use.

After examining this list of possibilities, it is clear to me that Comma Path Parameter at End of Path (8) is the preferred method and I will certainly use that in future projects of my own. Semi-colon Path Parameter at End of Path (7) is similar, but there is no compelling reason I can see for using it over (8). After those two, the closest one is probably Language-specific Sub-domains (1). This method is used with great success by both Yahoo! and Wikipedia. It does have its fair share of issues, however, and that’s what takes it out of the running for my ideal solution.

written by Brad Fults


Archived at: http://h3h.net/2007/01/designing-urls-for-multilingual-web-sites/

19 responses

  1. dret

    thanks for this great posting! a very nice summary of the possible ways of how multilingual resources can be identified today. generally, i am surprised to see how little is being said about this important question. the w3c publishes all kinds of documents about how to use markup to identify languages, but as far as i can see is silent on the issue of how to design uris for multilingual resources.

  2. Mike Schinkel

    Great post. See my blog post in response (scheduled to be published on Monday.)

    Also, may I ask you to consider doing this on posts with heading tags?

  3. Brad

    I’ll await your reply, Mike.

    Also, I’ve given all of the headers id attributes and added some CSS to show the fragment identifier when a header is hovered over (only in CSS2-capable browsers of course).

  4. Zbynek Winkler

    A couple years ago I decided to use example.com/bar/baz/en as the preferred way to go for our two-language website http://robotika.cz/. It allows for permalinks (good for search engines) and is fairly simple to implement. I also implemented Accept-Language parsing to select the preferred language when one is not set in the URL and redirect to it (with the option to switch later). Also, not all content is available in both languages (en and cs), so we decided to show even non-matching articles, only in dimmer colors.

  5. Mike Schinkel

    I just noticed that where you evaluated #9, you stated it violated the URI Opacity Axiom. Actually, it doesn’t, and that axiom has been frequently misunderstood. Whereas it is a violation to make assumptions about somebody else’s URLs, it’s perfectly OK to design your own URLs in a specific manner and then publish the structure of your URLs, which would be the case with #9.

  6. Brad

    That makes sense. I’ve updated it to reflect as much.

  7. Joel

    Methods 7-9 are essentially identical. The use of commas and semicolons as delimiters for path parameters is not exactly recommended, as you suggest, but rather offered as a common practice by some URI producers. As the RFC states, “Aside from dot-segments in hierarchical paths, a path segment is considered opaque by the generic syntax.” As far as RFC 3986 goes, commas, semicolons, dashes, and any other character you might use as a means of specifying a language inside a URI’s path component are all semantically meaningless to the outside world.

  8. Webmaster Libre | URL para sitios en varios lenguajes

    [...] What seems, at first, like a trivial matter admits multiple possibilities. All with their pros and cons, and all worth studying carefully before making a decision. At h3h, Brad Fults has done a fairly exhaustive review of the 9 available methods: Designing URLs for Multilingual Web Sites [...]

  9. Speedlinking « Blue MUIOMUIO - Por Mario Andrade

    [...] Designing URLs for Multilingual Web Sites [...]

  10. dret

    “Naming Language Variants” is what i have come up with as the explanation i am going to give my students about how to name language variants. after thinking about why there is no really good solution for a while, i came to the following conclusion:

    for a good solution, there would have to be some interaction between http uris and http, specifically, between http uris and http content negotiation. if parts of a uri could become a part of content negotiation, this would be the ideal solution. so: for http/1.2, i hope to see a new part of the spec that tells uri designers how to use “,” or “;” separated parts in a http uri which will then be used as input for content negotiation. this would also allow browsers to recognize these parts in the uri and provide smart functionality, such as providing me a choice whether i want the uri to be retrieved in the language which is specified in the uri, or in the language which i have set in my browser preference.

    any comments about this idea would be greatly appreciated! and thanks again for the nice list of variants!

  11. Brad

    Erik (dret),

    I completely agree about the possible interaction between HTTP URLs and HTTP content negotiation.

    Thanks for the insight and the clearly organized slides on the subject.

  12. Rohan

    Thank you for this nice analysis of the options for specifying language.

    Having reflected on the matter, I believe that for many sites denoting the language near the top of the URI (e.g. test.com/en/home) may make more sense because effectively the user is not simply viewing a resource in English, but rather is navigating the “ENGLISH VERSION OF TEST.COM” (which may even involve differences apart from language).

    Behind the scenes it’s up to the programmer to decide whether languages use the same resources (e.g. templates) with different translations or completely different resources serving the same purpose. For example, European languages may share the same templates with different string translations, but Eastern languages may require completely different templates.

    From an HTML navigation perspective it also seems to me to be more convenient to be able to navigate from resource to resource without having to worry about tagging and parsing the language code at the end of every URI.

    Using the “version of the site” perspective, you could even imagine someone viewing a particular resource in a different language than the current site language, and it is here that I consider an “end of URI” tag most appropriate, e.g.:

    library.org/en_US/descartes,fr_FR

    Here the user is using the “system” in English but wishes to view a specific resource in French (perhaps because this is the original language of the resource).

    Just my thoughts, no expertise claimed, look forward to contradictory opinions.

  13. Brad

    Rohan,

    You’ve brought up a very interesting and important point. I hadn’t previously considered the “English version of the site” distinction from the narrower “give me this resource in this language” one.

    It’s worth noting that the subdomain approach (en.example.com) will also work for the broader “English version of this site” designation.

  14. Karl

    Does anyone know if crawlers repeat their walk with different “Accept-Language” preferences? If so, using method #6 above, would search engines index the same language-independent URL for each language crawled, and would that URL then be searchable using both of those languages?

  15. dgurba

    Great article and *very* informative. I do see your point to (1) … it can confuse issues.

    I have a small quibble to an example in (1)

    api.business.com

    would actually better be seen as:

    business.com/api/v1/

    APIs can evolve over time but should not change between versions. By using a subdomain to point to the API you lose the ability to distinctly say this is version 1 of the API … or version 10 of the API.

    I’d rather use:
    dev.business.com/api/v1
    dev.business.com/api/v3/method1

    or even:
    code.business.com/api/v1
    code.business.com/schemas/authorization

    etc…

    I’m just pointing out that using api.business.com would be a poor choice for declaring the point of entry into an API.

    very informative article :)

  16. Denis au fil du web » links for 2007-10-20

    [...] h3h.net - Designing URLs for Multilingual Web Sites How to construct URLs in a multilingual environment? (tags: url usability multilingual) [...]

  17. James

    There is another scheme that I like:

    http://mysite.com/bar/faz.en-us.html
    http://mysite.com/bar/faz.de.html

    These can be used in conjunction with your #6 above (’accept’ header) as an override in the case of a person in Germany who wants to read the site in English.

    To me, this naming scheme makes sense, since language is very close to format. I can request the same resource as English HTML, or Japanese XML, or …

  18. Pascal Van Hecke - Daily Links » 2007 » October » 20

    [...] Designing URLs for Multilingual Web Sites Language-specific Sub-domains Modified Directory Structure Language Code in Querystring Country-specific TLDs Pure Cookie-based Preference Use of Accept-Language HTTP Header Semi-colon/comma Path Parameter at End of Path Language Code in the Resource name (tags: i18n internationalisation internationalization languages localisation urls linkstructure seo) [...]

  19. Mat

    Very interesting article. I’m designing a multilingual website myself and have tried a few of the techniques mentioned and have been quite disappointed with most of them.

    So I thought I’d try and cheat a bit and use #8 with mod_rewrite when I came to the realization that a URL such as
    * http://www.example.com/bar/baz,en-US
    is visually similar to something like
    * http://www.example.com/bar/baz/en-US

    There appears to be no technique mentioned that would use this type of structure. Although (2) comes close, I think that this structure is much more appropriate for designing multilingual websites, but it really really depends.

    Having the language code higher in the path structure makes it increasingly difficult to share resources which are language-independent.

    In any case, the issue is very complicated and i was very pleased to find this. Hopefully there’s more info out there!
