the biggest example of "it's impossible to tell whether a given computer problem is easy or hard" is that correctly solving "I want to store, show, and share some text" involves a lot of "Well, first of all, we have to talk about Unicode..."

I spend time in a C# advice chat, and let me tell you, people get really mad when they want to know "why can't I just convert a string to an array of bytes" and you start talking about encodings
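
A minimal sketch of why the encoding question can't be skipped, in Python rather than C# (the sample string and variable names are illustrative assumptions): the same text becomes different byte arrays depending on which encoding you pick.

```python
# Illustrative sample: two accented letters and one emoji (U+1F642).
text = "na\u00efve caf\u00e9 \U0001F642"

# The same 12 code points turn into different byte sequences per encoding.
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")

print(len(text), len(utf8_bytes), len(utf16_bytes))  # 12 17 26

# Round-tripping only works if you decode with the encoding you encoded with.
assert utf8_bytes.decode("utf-8") == text
assert utf16_bytes.decode("utf-16-le") == text
```

So "just give me the bytes" has no single answer until someone names an encoding.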

it's a more general version of the problem that, by the time people show up in a chat like this, they're pretty much at the end of their rope and you're like the third person who's tried to say "your question doesn't make any sense, you have to know about X"


This is a common thing in computer town, but if it's your first time running up against "this thing is much more complicated than you think", hoo boy, you're in for a ride

(everything's more complicated than you think. computers are sand we tricked into thinking. it's a giant pillar of abstraction and nobody knows how the whole thing works.)

@BestGirlGrace one time when I was still an intern with SPAWAR, a whole project ground to a halt because the database connection kept crashing whenever the program tried to store text from the tweets we wanted to save for later analysis. It took me two days to figure out that emojis and non-ASCII chars were the problem (the SQL database’s implementation of UTF-8 was non-standard) and then fix it.

@Sapphicgiraffic @BestGirlGrace there’s like at least two XKCD comics about this let me see if I can find ‘em...

@Sapphicgiraffic I was specifically thinking about the "picture of a bird" XKCD, but there's gotta be one about how emoji break things

@BestGirlGrace yeah that was one of them and the other was the “N competing standards becomes N+1 competing standards” wrt character encodings.

Now if u really wanna fuck some people’s minds, just wait for somebody to ask about video codecs...

@Sapphicgiraffic @BestGirlGrace ahahaha was this MySQL and the infamous utf8 aka utf8mb3 encoding? it took my last company months to migrate everything to utf8mb4, but we had to because our players were pissed about not being able to yell at each other in emojis and/or Chinese
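
A rough sketch of the limit being described, assuming the usual description of utf8mb3 as storing at most three bytes per character (so only the Basic Multilingual Plane). `fits_in_utf8mb3` is a hypothetical helper for illustration, not a MySQL API.

```python
def fits_in_utf8mb3(text: str) -> bool:
    """Return True if every code point needs at most 3 bytes in UTF-8.

    Hypothetical helper: code points above U+FFFF need 4 bytes in UTF-8,
    which is the case a 3-byte-per-character column cannot hold.
    """
    return all(ord(ch) <= 0xFFFF for ch in text)


print(fits_in_utf8mb3("h\u00e9llo"))         # accented Latin: fits
print(fits_in_utf8mb3("\u4e2d\u6587"))       # common CJK characters: fits
print(fits_in_utf8mb3("gg \U0001F642"))      # emoji above U+FFFF: does not fit
```

Common Chinese characters sit inside the BMP, so it is emoji and the rarer supplementary-plane characters that hit this limit.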

@VyrCossont @BestGirlGrace that sounds familiar. It was definitely MySQL. I don’t remember what version I was on and I think I switched to something called “utf8-extended” which is a funny thing to call the full implementation but 🤷🏻‍♀️

Fortunately for me at the time it was just an early stage research project so I could dump it and start fresh instead of having to migrate everything.

@Sapphicgiraffic @BestGirlGrace look, this is MySQL, you used to have to tell it if you didn't want it to silently truncate overlength strings and allow division by zero

but on the other hand, its default mode now includes something called STRICT_TRANS_TABLES, so who can say if it's bad or not

@VyrCossont @Sapphicgiraffic This is when they hire me to stand there with a whip and keep things in shape.

@VyrCossont @Sapphicgiraffic At my first programming job, I was trying to convince the DBA to use the "Unicode text" columns for names instead of just messages in case anyone had an accent in their name, and she said (and I hope it was a joke) that "those people are all terrorists anyways"

@BestGirlGrace @Sapphicgiraffic the kind of military population that is responsible for ending the world in 90 minutes or less, yeah? 😖

@VyrCossont @Sapphicgiraffic I don't think they keep nukes at Ellsworth any more, but once upon a time, yeeep

@Sapphicgiraffic @BestGirlGrace I'm thinking of a post I saw a while back about how JS and python3 interpret "length of a string" differently... apparently JS is just (the length of the utf16 in bytes, divided by 2) which of course is not really intuitive for codepoints which take more than two bytes in utf16
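
A small sketch of that difference from the Python side. `utf16_length` is a hypothetical helper that mimics the "UTF-16 bytes divided by 2" count described above, while Python 3's `len()` counts code points.

```python
def utf16_length(text: str) -> int:
    """Count UTF-16 code units: encoded byte length divided by two."""
    return len(text.encode("utf-16-le")) // 2


s = "a\U0001F642"  # one ASCII letter plus one emoji outside the BMP

print(len(s))           # 2: Python 3 counts code points
print(utf16_length(s))  # 3: the emoji takes a surrogate pair in UTF-16
```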

@transbian_tronbreon @Sapphicgiraffic Yeah, the two reasonable meanings for "length of a string" are "how many bytes are in this for serialization reasons", in which case you really want to talk about encodings, or [long Unicode discussion, because is a letter followed by a combining diacritic two characters or one? what about those ZWJ sequences?]
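
To make the combining-diacritic and ZWJ questions concrete, here is a short sketch using only Python's standard library (the sample strings are illustrative assumptions). It counts code points, which is exactly the number that disagrees with what a reader sees.

```python
import unicodedata

composed = "\u00e9"      # "é" as one precomposed code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

print(len(composed), len(decomposed))  # 1 2, though both render as "é"

# NFC normalization folds the pair into the precomposed form.
normalized = unicodedata.normalize("NFC", decomposed)
print(normalized == composed)  # True

# A ZWJ sequence: WOMAN + ZERO WIDTH JOINER + PERSONAL COMPUTER,
# typically drawn as a single emoji glyph.
zwj_sequence = "\U0001F469\u200D\U0001F4BB"
print(len(zwj_sequence))  # 3
```

Counting what a user would call "one character" means grapheme cluster segmentation (Unicode Standard Annex #29), which Python's standard library does not provide, so that part is left out of this sketch.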

@BestGirlGrace @Sapphicgiraffic I would just say there are 3 useful "length of string":
1) num bytes in the encoded version (C's strlen() or similar)
2) num unicode scalars (rust's .chars().count() or similar)
3) pixels wide when rendered (SDL's TTF_Size*() or similar)

@BestGirlGrace all true, but the other side of the problem is advisors who think the querent needs *expertise* in some domain before they can possibly be helped. For every "look, you need to know some basics about Unicode and encodings to approach this" there are a dozen "you need to shave the Unicode Yak before you are worthy"
