Has anyone here *ever* had a use case for toLower() where they actually wanted l...

rkangel · on Aug 16, 2020

This is a classic case of a 'why' code comment being needed. It's obvious what you're doing, but without a 2 line explanation, it's not clear why.

dmurray · on Aug 16, 2020

Seems like it would be even better to put this in its own function with a descriptive name, ascii_tolower or roman_tolower or whatever, that has exactly the semantics you want.
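
A minimal sketch of what such a helper could look like in C++. The name `ascii_tolower` comes from the suggestion above; the signature and the wording of the comment are assumptions, not code from the thread.

```cpp
#include <cassert>

// Lower-cases ASCII 'A'..'Z' only and returns every other value unchanged.
//
// Why not tolower()? tolower() consults the current C locale, so under a
// Turkish locale with an 8-bit charset 'I' can map to dotless 'ı' instead
// of 'i'. Protocol tokens such as HTTP header names need locale-independent
// folding, so do not "simplify" this to tolower().
constexpr char ascii_tolower(char c) {
    return (c >= 'A' && c <= 'Z') ? static_cast<char>(c - 'A' + 'a') : c;
}
```

Call sites then read `ascii_tolower(c)`, and the explanation only has to be written once.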

gregmac · on Aug 16, 2020

This is exactly right, and is a great example of what self-documenting code can be. The function itself could have a bit more explanation but any code calling it is going to be obvious.

The big difference is it looks deliberate, instead of just code written by someone trying to micro-optimize, be very clever, or who just didn't realize tolower() exists. Most people will pause before just replacing it, and likewise it should trigger questions in the PR.

eitland · on Aug 16, 2020

Still warrants a comment to be sure no one concludes the built in is good enough.

randomdata · on Aug 16, 2020

It certainly warrants a test to document what the function is for.

And if that test also happens to validate that the documentation is accurate, that is a nice side benefit.
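
One way such a test might look, as a sketch. The helper `ascii_tolower`, the check function's name, and the glibc-style locale name `tr_TR.ISO8859-9` are all assumptions for illustration. If that locale is not installed, `setlocale()` returns a null pointer and the check simply runs under the current locale.

```cpp
#include <cassert>
#include <clocale>
#include <string>

// Hypothetical helper under test: ASCII-only lower-casing.
constexpr char ascii_tolower(char c) {
    return (c >= 'A' && c <= 'Z') ? static_cast<char>(c - 'A' + 'a') : c;
}

// Returns true if ascii_tolower() maps 'I' to 'i' and leaves '-' alone while
// the named locale is active (or the current one, if it is not installed).
bool lowercases_ascii_under_locale(const char* locale_name) {
    // Copy the current locale name; later setlocale() calls may overwrite it.
    const std::string previous = std::setlocale(LC_ALL, nullptr);
    std::setlocale(LC_ALL, locale_name);  // nullptr result means "not installed"
    const bool ok = ascii_tolower('I') == 'i' && ascii_tolower('-') == '-';
    std::setlocale(LC_ALL, previous.c_str());  // restore what was there before
    return ok;
}
```

Because the helper never consults the locale, this passes everywhere. Its value is that it starts failing if someone swaps the body for `tolower()` on a machine where the Turkish locale is available.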

h0l0cube · on Aug 17, 2020

> It certainly warrants a test to document what the function is for.

You might be being serious, so I'll indulge. How would a test that will never break, for a function that would never be changed (because look at it), that lives in a different part of the directory structure, be worth even one second extra time or thought to write it, over-and-above the descriptive comment that is longer than the code itself, and lives discoverably right next to the code itself?

olliej · on Aug 17, 2020

Because that's how you stop regressions.

The cost of writing a single test case is not more than the cost of diagnosing what change broke your code for Turkish users.

Now there's always a point where maybe the infrastructure for testing that kind of thing doesn't exist, so writing your one "simple" test case takes a bunch of time, but on the plus side, future similar "simple" test cases will be easy at that point. And no one has to track down why your code is broken in Turkey.

Many years ago while I worked on IM/IMEs on Mac and Windows I spent maybe a week working on code that allowed an IME to be implemented in JS (within the WebKit test harness), so it was possible to test, and prevent regressions that kept on being reintroduced by people changing layout and/or editing code in ways that are "obviously correct" for US/English keyboards. The win from that week is many regressions that were caught before they were even committed, and the ability to completely rewrite the text input system to support IMEs on non-Mac platforms.

h0l0cube · on Aug 17, 2020

This isn't exactly addressing my point. Never did I say that 'All tests are useless'. Refer to my answers here:

https://news.ycombinator.com/item?id=24183844

xahrepap · on Aug 17, 2020

I've found that writing a unit test to verify it works at all is sometimes loads easier than manually running the app or whatever.

At that point, just leave the test and check it in.

h0l0cube · on Aug 17, 2020

I agree with this, kind of. Though I've found that keeping useless tests around doesn't always have 0 cost, same for any dead code.

Often, though, you can verify that your code works using a REPL.

I'm also a huge fan of how doc-tests in some languages make the tests part of the documentation, and neatly 'cohere' them to the function itself. At which point I'm happy to leave them in, as the tests-as-documentation are harder to miss, and are actually instructive.

scubbo · on Aug 17, 2020

Comments can be ignored, moved, misinterpreted. A test asserts correct behaviour. The only way to make the build fail if someone has (incorrectly) replaced the "complex toLower" with the (incorrect) "built-in toLower" is to delete or change the corresponding test - which rings way more alarm bells than a vague recollection of "hey, didn't there used to be a comment around here that said we shouldn't change this?"

Machine-time to run tests is cheap (if it's not in your codebase, then I agree that your benefit calculus may be different) - human cognition and awareness to prevent mistakes is valuable.

h0l0cube · on Aug 17, 2020

Keep in mind, I'm not advocating for 'no testing, ever'. But surely there's a limit to where you say something is so trivial that it won't ever change. And it's with these preconditions I ask the question:

> a test that will never break, for a function that would never be changed

I'll refer you to the OP's low-tech to-lower function, which is reductively simple, and never should be altered, with a comment as to why.

> Comments can be ignored, moved, misinterpreted.

Don't disagree with this. Someone lacking competence and care may ignore a comment, fail to comprehend it, or just move things for no reason. This should be caught at review.

Tests aren't infallible either. They can be invalidated or disabled because someone lacking competence decided it was the easiest way to move forward. This should be caught at review.

Edit: I'll address some of your other points more specifically...

> The only way to make the build fail if someone has (incorrectly) replaced the "complex toLower" with the (incorrect) "built-in toLower" is to delete or change the corresponding test

In OP's precise example, a worse thing can happen. They can shrug their shoulders and just use the built-in directly. Human-context business-value explication has more powerful benefit here.

> which rings way more alarm bells than a vague recollection of "hey, didn't there used to be a comment around here that said we shouldn't change this?"

If no-one's reviewing when someone changes the actual code and its adjacent comments, who's reviewing the changes to the tests?

> human cognition and awareness to prevent mistakes is valuable.

Extra code comes at a maintenance and cognition cost. Maybe one trivial test seems like a minor cost, but how about the maintenance of 1000s of tests that (ought to) always pass?

sedatk · on Aug 16, 2020

C# has a `ToLowerInvariant()` variety for that.

paranoidrobot · on Aug 16, 2020

Which iirc is an alias for ToLower on the en-us locale. (Same for the other C# *Invariant() methods)

GoblinSlayer · on Aug 17, 2020

It calls ToLower(CultureInfo.InvariantCulture);

Locale name is iv-IV.

lmm · on Aug 16, 2020

Comments are unreliable, you should use tooling to fix this forever. E.g. findbugs has a rule for this problem: http://findbugs.sourceforge.net/bugDescriptions.html#DM_CONV...

kentonv · on Aug 16, 2020

Yeah I probably wrote that comment the first few times I did this but it's hard to write it the 50th time.

Maybe I should have my own tolower() function that I can call so I only have to write the comment once but it just feels ridiculous somehow.

Natsu · on Aug 16, 2020

It's far more ridiculous to repeat yourself over and over instead of making a simple function that describes exactly what you want and why.

viraptor · on Aug 16, 2020

> but it's hard to write it the 50th time.

You know you wrote it 49 times before, but the person reading the code probably doesn't. It only changes your experience not everyone else's.

If it's the same codebase, just write that function :)

eitland · on Aug 16, 2020

Write the function, comment it!

Many of these are obvious to many people here, but some aren't.

Even I can admit that some of the stuff in this thread is not obvious at all.

random314 · on Aug 16, 2020

Why does it feel ridiculous?

kentonv · on Aug 16, 2020

Because I've already rewritten more of the standard library than is healthy.

I mean, it's clearly the right thing to do here but I can predict the conversation that will inevitably result... "You wrote your own tolower() function? Why?" "The standard one is horribly broken." "How could a function that lower-cases a letter be broken??? Jesus Kenton your NIH syndrome is out of control." "Sigh..."

(Slightly more seriously, any particular time I need to lower-case something, it takes 10 seconds to write out the code, but would take 10 minutes to find a good place to define a reusable function and exactly what its API should be, and so it never seems worth the effort in the moment. Just like how most messy code comes to be.)

random314 · on Aug 16, 2020

This conversation can simply be avoided by copy-pasting your original Hacker News comment into the library function header.

I have noticed some coworkers have their ego gratified by being right while everyone else is wrong. Instead of simply explaining what they are doing when they are doing it, they will do something that looks wrong in a very noticeable way and wait for the backlash. The backlash gives them an opportunity to show everyone else how they were right while everyone else was wrong and also an opportunity to play victim. However, in SW development - it is not just the technical details - your behavior also matters in a big way.

In this particular case, the correct approach is to create your own library function with appropriate comments. This is why the concept of a library function was invented. It is its entire raison-d'etre. However, you are doing everything but that. Including providing justifications in hacker news comments instead of your source code.

Now inevitably, someone will change your inline code to use to_lower. This will give you an opportunity to scream bloody murder, show how other engineers don't really understand technical details, correct them and also play victim. Create a library utility with comments and link it in - End of story.

saagarjha · on Aug 17, 2020

I’m reminded of garbage collector blog posts where they do something stupid (“disable it”, “allocate a ballast”) and then get to spend a couple pages explaining why it worked for them.

kentonv · on Aug 16, 2020

Speaking of people wanting to gratify their ego by being right: Everyone on this thread trying to lecture me on software engineering? ¯\_(ツ)_/¯

random314 · on Aug 17, 2020

At least, you get to play victim :)

obmelvin · on Aug 17, 2020

Looks like you're only 10 uses away from having spent your 10 minutes ;)

But seriously, I'm not here to lecture you, but personally I'd appreciate having a teammate educate me on the undesired behavior and have a nice function I could use to ensure my own code doesn't break user input

nitrogen · on Aug 16, 2020

Most codebases I've worked with have a StringUtils.java, or .kt, or a str.c or utils.c. Maybe just start one. Interestingly I haven't needed it as much in Ruby.

But I too feel the cognitive (and social!) burden of introducing a new function. It's not just "where do I put this", but "how do I convince the team I know what I'm doing since 15 years of experience clearly isn't enough and developers (mostly rightly) ignore positional authority and seniority".

Izkata · on Aug 16, 2020

  #include <kentonv.h>

kentonv · on Aug 16, 2020

It's... It's called KJ...

https://github.com/capnproto/capnproto/tree/master/c++/src/k...

lathiat · on Aug 17, 2020

This project has the greatest sales pitch I've ever seen: https://github.com/capnproto/capnproto/blob/master/README.md -- combined with the project name it's just perfect.

cesarb · on Aug 16, 2020

Java has two variants of toLowerCase(): one which uses the default/current locale (almost never what you want), and one which receives an explicit locale (Locale.ROOT is almost always the one you want). At work, we use the "forbidden APIs" checker (https://github.com/policeman-tools/forbidden-apis) to fail the CI if the variant which uses the default locale is ever used; if you really want to use a locale-dependent toLowerCase(), you have to explicitly call Locale.getDefault() and use it as the locale.

Is there something similar for C and C++? It could help in your case, by making your well-meaning colleagues aware of the issue.

vesinisa · on Aug 16, 2020

> Locale.ROOT is almost always the one you want

At least Android developers are advised to use Locale.US: https://developer.android.com/reference/java/util/Locale

> The default locale is not appropriate for machine-readable output. The best choice there is usually Locale.US – this locale is guaranteed to be available on all devices, and the fact that it has no surprising special cases and is frequently used

It would be indeed interesting to see in which features these two locales actually differ.

rvnx · on Aug 16, 2020

Yes, tolower_l(string, locale)
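
For reference, the POSIX function takes a single character and a `locale_t`, not a whole string. A hedged sketch of pinning the conversion to the "C" locale follows; the wrapper name `tolower_in_c_locale` is made up, the code assumes a POSIX.1-2008 system, and some platforms may need an extra header for these declarations.

```cpp
#include <cassert>
#include <ctype.h>   // tolower_l (POSIX.1-2008)
#include <locale.h>  // newlocale, freelocale, locale_t (POSIX.1-2008)

// Lower-cases one character using an explicitly created "C" locale,
// ignoring whatever setlocale() has installed globally.
// Returns the input unchanged if the locale object cannot be created.
char tolower_in_c_locale(char c) {
    locale_t c_locale = newlocale(LC_ALL_MASK, "C", static_cast<locale_t>(0));
    if (c_locale == static_cast<locale_t>(0)) return c;
    const int lowered = tolower_l(static_cast<unsigned char>(c), c_locale);
    freelocale(c_locale);
    return static_cast<char>(lowered);
}
```

Creating and freeing a locale object per character is wasteful. Real code would presumably create it once and reuse it.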

smnrchrds · on Aug 16, 2020

> Has anyone here ever had a use case for toLower() where they actually wanted localization to apply?

On many documents, including Turkish passport and identity card and many (all?) other passports, names are written in all caps. Maybe toLower() is not that useful, but toUpper() is crucial in any application where you are dealing with real person names.

phonebanshee · on Aug 16, 2020

toUpper is definitely language-dependent. For example, in Irish there are initial letters that are written as lower-case even in all caps. Wikipedia's example is amusing, since it's a photo of a government passport office sign - the all-caps version of Oifig na bPasanna is OIFIG NA bPASANNA (photo https://upload.wikimedia.org/wikipedia/commons/thumb/f/f8/AL..., article https://en.wikipedia.org/wiki/Irish_orthography#Capitalisati...). It would look utterly bizarre to write OIFIG NA BPASANNA. And this isn't at all an unusual construction in Irish, it happens in personal names all the time.

Plus, there's the issue of diacritical marks. Irish keeps long marks over capitals, but French drops accents. Do you plan to do é => É (required for Irish - POBLACHT NA hÉIREANN is the all-caps version of Poblacht na hÉireann [the Republic of Ireland]) or é => E (common practice for French)? You have to get it right, and you have to know the language to do that. (Poblacht na hÉireann also illustrates the fact that initial caps is also a language-dependent idea; you absolutely can't write Poblacht Na Héireann - that makes my eyes burn just looking at it.)

(And before you say, well, Irish isn't a language spoken by very many people, remember that it's an official language of the European Union. If you're writing software to be used by EU agencies, you're going to have to care.)

smnrchrds · on Aug 16, 2020

> French drops accents

The official position of both the Académie française and the Office québécois de la langue française is that accents must be preserved in capital letters. However, it is common in France to drop them, while they are almost always preserved in Québec. I have heard that the reason is that the European French keyboard layout makes it difficult to type accented capital letters, unlike the Québécois French layout, which makes writing them easy. But I am not sure if this is the cause rather than the effect of the practice.

forty · on Aug 16, 2020

I confirm that in French capital letters should have accents.

I have an anecdote on this: on birth certificates, family names are written in capital letters. It turns out my partner's name ends with an É, which was written as E on her birth certificate. She never noticed (it had never prevented her from getting a national ID with her name properly accented) until we had our first kid, who has both our names, and they refused to have the name accented until we had my partner's birth certificate updated (which as you can imagine is quite an adventure, since you need to dig up ancient family birth certificates to prove it was originally written with an accent...).

ccccc0 · on Aug 16, 2020

French here, I recall my primary school textbook where they said something along the lines of "sometimes accents are dropped, that's sort of fine as long as it doesn't change the meaning". They gave the example of a fictitious newspaper whose headline was "UN POLICIER TUE": depending on the accent (tue/tué) it means either "a policeman kills" or "a policeman killed".

dongvsascript · on Aug 17, 2020

american here who's lived in france and still use an azerty keyboard because it lets me type in both languages. how do you get a capital A with an accent on an azerty keyboard?

nargek · on Aug 17, 2020

Easily? You don't. With the most common azerty keyboard you have to use Ctrl+Alt+7, then Shift+A. That's why there is a new standard for azerty keyboards, "NF Z 71-300", that is better with accents and stuff like æ,œ,Æ,Œ,«» etc.

dongvsascript · on Aug 17, 2020

Damn it. That's why I couldn't figure it out for years. You can't.. Is the new keyboard standard being used anywhere? Like if I walk into a common office in france and sit down at a laptop - is it likely to be using the new keyboard layout?

What's weird is I sometimes, even 10 years ago, would get an email from people in France, and it would have an accented A. Like, how did they do that..

nargek · on Aug 17, 2020

i know that LDLC is selling one of these [1] ... and that's it, i don't even think it's coming to laptop anytime soon.

[1] https://www.ldlc.com/fiche/PB00279741.html

smnrchrds · on Aug 17, 2020

Not the answer to your question, but this is why I think Quebec uses accented capitals more than France. In Canadian French layout, there is a key for à. Simply using Shift+à gives you À.

dongvsascript · on Aug 17, 2020

classic canada, fixing the keyboard. now all that's left is above 69.

korean is worse. they base off of 10k, not 1000. so a million is hundred ten_thousand. bagman. but as bad as that is, it's no 97 amirite.

phonebanshee · on Aug 16, 2020

Interesting that I was wrong - another data point in the "it's more complicated than you think it is" column. I always thought you were supposed to drop them (because I was explicitly told so by a French engineer I worked with in the 90s, talking about one particular poster, and many years later still assume that one hallway conversation was enough to make that THE OFFICIAL RULE without bothering to actually check...)

kergonath · on Aug 16, 2020

It’s been a pet peeve of mine for quite a while, and an urban legend for quite a lot more. It was tolerated when all we had was typewriters (and even then you were supposed to add them, but it was cumbersome).

hocuspocus · on Aug 16, 2020

Wrongly dropping accents on uppercase letters predates computer keyboards; the French azerty layout puts accented letters on the first level of the number row:

http://j.poitou.free.fr/pro/img/tkn/tw-image.jpg

This idiocy carried over. The recent layout update makes dead accents more accessible:

https://norme-azerty.fr/

But I haven't seen much adoption yet.

zorked · on Aug 16, 2020

That button to the right of the P that contains four different forms of dashes is... interesting.

Even more if you consider that the minus sign there is not the character - that is used by every programming language.

masklinn · on Aug 16, 2020

> That button to the right of the P that contains four different forms of dashes is... interesting.

And there are two more on the 8 key.

I like the mac international keyboard layout, but it still only provides for 4 of those: the non-breaking hyphen and the "proper" minus sign are lacking.

I like that the "new azerty" provides for pretty much every diacritic, even those which are not in use in french.

kergonath · on Aug 16, 2020

I blame the horrendous default settings of MS Word and Outlook for this, and the maddeningly convoluted way to enter accented caps on Windows. There is no context in which it is correct to omit accents in French, caps or otherwise.

rurban · on Aug 17, 2020

Nope, none of them are really useful. The only useful folding function is casefold(str, locale) (if your str type doesn't know its locale).

toLower and toUpper should only be used for presentation, but all case-insensitive operations need to be done with casefold.

crazygringo · on Aug 16, 2020

Of course! There are tons of cases where you need to store in "sentence case" (first word and proper nouns and acronyms capitalized, nothing else) so you can convert to title case or all-caps as needed for display purposes. Templates are full of this kind of stuff.

There are similarly tons of cases where you reduce everything to lowercase without accents for searching and indexing purposes. Depending on your setup, your database might handle that for you, but there are edge cases where you need to do it at the application level.

Long story short, every string has a locale, and you should never change the case of something without specifying its locale. Either be explicit that it's American English or ASCII or Latin1 or whatever... or that it's something else. Never leave someone reading the code guessing.

asveikau · on Aug 16, 2020

> you can convert to title case or ... for display purposes.

I am skeptical if someone thinks they need to do this and how they will get it done.

E.g. looping through and capitalizing the first glyph after breaking whitespace regardless of locale is not the way to go, but I guarantee you a nontrivial number of people reading this would write exactly that if asked to solve the high level problem.

I find it annoying when software or even in some cases human typists try to enforce English language title case. Some other languages have different rules for titles and capitalization and seeing the English rules enforced out of context can be jarring.

crazygringo · on Aug 16, 2020

I find it amusing you are skeptical... why so distrustful? But believe me, it's quite necessary.

I use the citation manager Zotero a lot. It's necessary to store all the titles of journal articles and books in sentence case (e.g. "Issues regarding the economy of France") because some publications require citations to use sentence style (remains unchanged) while others require title style ("Issues Regarding the Economy of France").

And obviously the solution cannot be naive, but is language-dependent, so that in English words like "the" or "in" don't get capitalized. And as the rules for titles are language-dependent, it goes without saying that the algorithm would have to be localized.

(Note that while it's relatively trivial to convert from sentence case to title case in English, it's impossible to automate in the opposite direction, because you never know if a capitalized term in the title is a proper noun or not.)

a1369209993 · on Aug 17, 2020

> (Note that while it's relatively trivial to convert from sentence case to title case in English[...])

Strictly speaking you can't do that either:

  "Latine et videtur".totitle = "Latine et Videtur"
  "I et some food".totitle = "I Et Some Food"

I suspect that fails less often though.
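
To make the exchange above concrete, here is a deliberately naive sketch of sentence-case to title-case conversion driven by a per-language list of "small words". The function name, both word lists, and the ASCII-only capitalization are assumptions for illustration. Runs of whitespace are collapsed to single spaces.

```cpp
#include <cassert>
#include <set>
#include <sstream>
#include <string>

// Capitalizes the first ASCII letter of each whitespace-separated word,
// except for words listed in small_words (unless the word comes first).
// Words that already start with a capital, such as proper nouns, are kept.
std::string title_case(const std::string& sentence,
                       const std::set<std::string>& small_words) {
    std::istringstream in(sentence);
    std::string word, out;
    bool first = true;
    while (in >> word) {
        const bool keep_lower = !first && small_words.count(word) > 0;
        if (!keep_lower && word[0] >= 'a' && word[0] <= 'z') {
            word[0] = static_cast<char>(word[0] - 'a' + 'A');
        }
        if (!first) out += ' ';
        out += word;
        first = false;
    }
    return out;
}

// Hypothetical lists; a real style guide would define its own.
const std::set<std::string> kEnglishSmallWords = {"a", "an", "the", "of", "in", "and"};
const std::set<std::string> kLatinSmallWords = {"et", "in", "ad"};
```

With the English list, "I et some food" becomes "I Et Some Food". With the Latin list, "Latine et videtur" becomes "Latine et Videtur". The caller has to know which language the title is in, because the string itself does not say.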

dragonwriter · on Aug 17, 2020

> I am skeptical if someone thinks they need to do this and how they will get it done.

It's not an uncommon requirement, though probably not often in a locale sensitive way, so you can often get away with just doing the right thing for one locale.

To do it generally, you probably need to research appropriate handling per locale (likely, what your client wants done in the particular locale, since I'm not sure there is usually just one way; I know there are multiple variations in en-US), and then have a master function that takes the text and target locale and applies the correct locale-specific title casing rules.

layer8 · on Aug 16, 2020

> I am skeptical if someone thinks they need to do this and how they will get it done

I most often use title case mappings in the context of replacements of names into diagnostic messages. I.e. you have n types of objects and m messages like “${foo} not found” or “More than one ${foo} required”, and perform title case mapping on ${foo} depending on whether it is at the start of the message (sentence) or not.

Groxx · on Aug 16, 2020

While I agree it's almost always a bad idea: effectively every design team I've encountered has requested stuff like this.

So yes, it's extremely common. It's done on tons of websites, for tons of mail addresses (ever receive an all-uppercase address on a delivery? same issue.), and tons and tons and tons of emails and legal documents (woe to those with last names like McCormick).

happytoexplain · on Aug 16, 2020

We frequently use localized upper/lower casing at my workplace, as we do not store such stylization in user facing copy. Most copy is written and translated in sentence case or title case (because both are much harder to achieve programmatically), and then our designers have the option of using that casing as-is, or using all-upper or all-lower.

mrighele · on Aug 16, 2020

> Has anyone here ever had a use case for toLower() where they actually wanted localization to apply?

If you are collecting data which include people's names or addresses you probably want localization to be applied correctly so that you can compare data coming from different sources and possibly with different cases. Having your name spelled differently in different documents can cause a non trivial amount of problems with an overzealous bureaucracy.

bjoli · on Aug 17, 2020

I was once refused entry into a country for 6 hours because of different spellings of my last name. The (apparently quite amateur) travel agency had sent my last name written using OE instead of Ö, whereas all the documents relating to my identity use Ö (or it was the other way around).

dusted · on Aug 17, 2020

We thought on this specifically when naming our son, one with no special characters, a pure clean ASCII string like mom used to cook them.

dataflow · on Aug 16, 2020

> Has anyone here ever had a use case for toLower() where they actually wanted localization to apply?

How do you lowercase without localization? Remember all text isn't English. Unless you're actually asking if anyone has ever had a use case for lower-casing non-English text?

lmm · on Aug 17, 2020

I think the real question is: is there any use case for toLower() where you want the system default locale to be applied? If you want to lower-case text for "system" purposes then you need to keep track of the locale associated with that text (which won't generally be the locale of the system the program is running on); the only case where you want to use the system default locale is where you're interacting with the (human) user of the system, but it's hard to imagine a use case where you'd want to use toLower() for displaying text to the user.

phonebanshee · on Aug 17, 2020

You vastly underestimate the desire of people for things to look 'nice' where 'nice' is defined by exactly what they mean when they say it. If you want a 'nice' display of data that's input by users, you're often going to butcher it by doing things like converting everything to lower-case, then maybe upper-casing the first letter. Because 'nice.'

It's horrifically wrong, of course, but no one is going to think you're a reasonable person for insisting that correct wins over nice.

Imagine a real-world scenario where you're displaying a list of names of users, where the users got to type in their own names. You can either use what users typed in, or you can do something like process it so it's in the American idea of initial caps. You can't possibly do localization, since it's a list of names of people in the US, so it's a melting pot of names from all over the world, and you never asked for user input of what you'd use to localize it anyway [and no, you absolutely can't figure that out from just the name itself]. You can't use what users typed in, because the design team thinks that looks like a horrible mess (and it is; users are laughably bad at data entry). So the design team wins, and you butcher everything by pretending that toLower() plus toUpper() for the first character of every word is a sensible thing to do. (And yes, that's a painful real-world example of software I've shipped and that was used by millions of people.)

dataflow · on Aug 17, 2020

> it's hard to imagine a use case where you'd want to use toLower() for displaying text to the user.

Maybe for automatically changing the case of user input (auto-correcting capitalization, etc.).

reaperducer · on Aug 16, 2020

> Has anyone here ever had a use case for toLower() where they actually wanted localization to apply?

Yes. In a system I'm about done with, there is a sortable chart of dates and times. In some languages day and month names are capitalized, and in some they are not.

bjourne · on Aug 16, 2020

How does that work? toUpper() can't possibly know that the string is a day or month name.

a-nikolaev · on Aug 16, 2020

I think this is why you need explicitly ASCII and explicitly Unicode lower-/upper-/capitalized-transformations. So you don't assume these things to work automagically. Sometimes you need one type, other times you need the other.

__s · on Aug 16, 2020

I recently ordered a Pixel, on the mail slip they had converted my name to uppercase, last name read "DUBé"

Also got my address screwed up on account of living at a half address.. 1/2 some street #42

ramshorns · on Aug 16, 2020

A char in C++ is one byte, right? Is it even possible for this "fixed" code to call ctype::tolower() on something like a UTF-8 or UTF-16 code point?

kentonv · on Aug 16, 2020

Correct, it won't even work as intended with modern Unicode locales.

ramshorns · on Aug 16, 2020

So maybe if the code is broken anyway for non-ASCII characters, it's fine to use tolower, since somewhere else in the code it ensures that c is a byte.

kentonv · on Aug 16, 2020

The code is not broken for non-ASCII characters. UTF-8 works just fine with 8-bit chars, and the code I wrote correctly lower-cases ASCII letters even when UTF-8 is present (it just won't touch the UTF-8 chars, which is fine in this use case).

It's only tolower() and toupper() specifically that are broken because they expect to be able to do their job on a single byte, which is no longer possible with UTF-8.

Meanwhile, using tolower() to lower-case an HTTP header name won't give you the correct results if the locale is set to Turkish with the ISO 8859-9 character set, which is 8-bit, and where tolower('I') will produce the byte 0xFD which is 'ı' in this character set.
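
A small sketch of the point about UTF-8, with a hypothetical helper name. Every byte of a multi-byte UTF-8 sequence is 0x80 or above, so an ASCII range check never matches it.

```cpp
#include <cassert>
#include <string>

// Returns a copy of s with ASCII 'A'..'Z' lower-cased. Bytes >= 0x80, which
// is where every byte of a multi-byte UTF-8 sequence lives, never fall in
// that range, so non-ASCII characters pass through untouched.
std::string ascii_lower_copy(std::string s) {
    for (char& c : s) {
        if (c >= 'A' && c <= 'Z') c = static_cast<char>(c - 'A' + 'a');
    }
    return s;
}
```

For example, `ascii_lower_copy("X-Caf\xC3\x89")` lower-cases the ASCII letters and leaves the two bytes of É (U+00C9) exactly as they were.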

ramshorns · on Aug 16, 2020

I see, thanks for the explanation.

felixarba · on Aug 16, 2020

I have a morse code app which consistently crashed when certain users tried to translate the letter "i". It took me a long time to figure out that only Turkish users complained about it. When one of them sent me a screenshot, I noticed a "wrongly" rendered capital letter i (I used toUpper), and after digging around a bunch I learned about this whole Turkish letter i.

ezoe · on Aug 17, 2020

A small portion of people whose native writing systems are based on the Latin alphabet believe that case conversion is an essential, must-have feature, and that having it in the locale library helps localization.

But if you consider other writing systems, having a case conversion feature in the locale library actually harms the localization effort. It's not easy to make it a no-op. Implementations of locale libraries are generally of poor quality because the implementers have no idea how other languages work.

Another example is singular/plural support. It just burdens the localization effort because, for languages with no such concept, the localization work must ensure that the presence of such a library doesn't harm their language.

Some people are under the delusion that a locale library must have more features to support the not-so-important traits of their native language, when what is really necessary is to forget about supporting minor language traits that are not universal among languages.

Text should be considered a binary blob, and most programs should pass it left to right without modification.

karmakaze · on Aug 16, 2020

And a note that it assumes ASCII. On an EBCDIC system, the 'A'-'Z' test will translate other characters besides letters.
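
One possible guard against that, sketched as a compile-time check. The function name is invented, and the test is a quick heuristic on the endpoints of the letter ranges rather than a full charset probe.

```cpp
#include <cassert>

// True when 'A'..'Z' and 'a'..'z' each span exactly 26 code points and sit
// 0x20 apart, as in ASCII. EBCDIC fails: its letters live in separate blocks.
constexpr bool letters_are_ascii_like() {
    return 'Z' - 'A' == 25 && 'z' - 'a' == 25 && 'a' - 'A' == 0x20;
}

// Refuse to compile range-based case code on a charset where it would misfire.
static_assert(letters_are_ascii_like(),
              "ASCII-style 'A'..'Z' range checks are not valid on this charset");
```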

grumple · on Aug 16, 2020

Now I’m wondering about what happens when we change email addresses to lowercase...

https://en.m.wikipedia.org/wiki/Email_address#Internationali...

sedatk · on Aug 16, 2020

You shouldn't. Email addresses are case-sensitive.

bschwindHN · on Aug 17, 2020

Ahhh, mastahpiece

https://doc.rust-lang.org/std/string/struct.String.html#meth...

vasama · on Aug 16, 2020

This is why I have a set of functions like AsciiToLower(char* string, size_t size). They only touch characters in the ASCII space at <0x80. Even went and implemented them with SSE for x86.

tyingq · on Aug 16, 2020

Airlines might be a good example. The back end system doesn't grok lowercase characters at all, so you need to transform data to uppercase A-Z, 0-9 and a few punctuation marks.

miahi · on Aug 16, 2020

But they do have the most extensive transliteration rules library to match everything to that limited character set (ICAO Doc 9303[1]) that is used by many systems outside the aviation world.

[1] https://www.icao.int/publications/pages/publication.aspx?doc...

nurettin · on Aug 16, 2020

You need localization if you do any kind of multilingual text processing. Not sure how it could escape a thinking person's imagination.

aflag · on Aug 16, 2020

File names, URLs and email addresses support UTF-8 characters and you may want to lower-case them in many situations. If the user is trying to search for a string, they probably want case insensitivity. I don't think it's that rare/weird for people to want localisation to apply when calling toLower.

olliej · on Aug 17, 2020

Yes, semi regularly -- lowercasing of text for user interfaces is frequently required. Similarly case insensitive comparisons.

Human text is much much more complex than any computing protocol you're ever going to engage with.

The question is "which one should be default", and that's a more complicated question.

dragonwriter · on Aug 17, 2020

> Has anyone here ever had a use case for toLower() where they actually wanted localization to apply?

Well, it's always been exclusively in American English, but I've certainly used it in cases where I was doing text transforms for display, so, yeah, though it's not the most common case.

fovc · on Aug 16, 2020

What about sorting users by name?

phonebanshee · on Aug 16, 2020

That's completely language+locale dependent. For example, here's an alphabetical list of Irish surnames - https://www.duchas.ie/en/nom?txt=M. You'll notice that sort order ignores an initial O or Mac (or Ni or Bean, etc).

golergka · on Aug 17, 2020

Honestly, strings that are intended for human and for computer consumption should just be two different basic types without any implicit conversion between them.

jmiller099 · on Aug 16, 2020

i like c |= 0x20; :)
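
Worth noting what the bit trick does to non-letters. A quick sketch with made-up function names:

```cpp
#include <cassert>

// The bit trick: setting bit 0x20 turns ASCII 'A'..'Z' into 'a'..'z'.
// It is only safe when the caller already knows c is an ASCII letter.
constexpr char or_lower(char c) {
    return static_cast<char>(c | 0x20);
}

// Guarded variant: apply the trick only to 'A'..'Z'.
constexpr char or_lower_checked(char c) {
    return (c >= 'A' && c <= 'Z') ? or_lower(c) : c;
}
```

`'-'` (0x2D) already has the bit set and survives, but `'_'` (0x5F) turns into DEL (0x7F) and `'@'` (0x40) into a backtick, so the unguarded form only works on input already known to be letters.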

neeeeees · on Aug 17, 2020

I may be missing something - why is tolower(c) incorrect here?

kentonv · on Aug 17, 2020

Because if `c` is the letter 'I', and the current locale happens to be set to Turkish, then `tolower(c)` will return 'ı', not 'i'. If you are trying to lower-case an HTTP header name for the purpose of case-insensitive comparison, this is definitely not what you wanted. (And similar problems exist with several other locales; it's not just Turkish.)
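
As a sketch of the use case described here, a locale-independent comparison of header names might look like the following. Both function names are hypothetical, and this is not the code from the original comment.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// ASCII-only fold, independent of setlocale().
constexpr char ascii_tolower(char c) {
    return (c >= 'A' && c <= 'Z') ? static_cast<char>(c - 'A' + 'a') : c;
}

// Case-insensitive comparison for protocol tokens such as HTTP header names.
bool header_name_equals(const std::string& a, const std::string& b) {
    if (a.size() != b.size()) return false;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (ascii_tolower(a[i]) != ascii_tolower(b[i])) return false;
    }
    return true;
}
```

With `tolower()` in place of `ascii_tolower()`, whether "IF-MODIFIED-SINCE" matches "if-modified-since" would depend on the process locale.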

neeeeees · on Aug 18, 2020

Ah I see, thanks for explaining