_kst_ 4 years ago

I wouldn't describe this as "no-one knows" the type of char + char.

I know what the type of char + char is. I know that it's either int or unsigned int, depending on the ranges of values supported by types char and int. I know what it is for any given implementation. And I know that it's int, not unsigned int, for every implementation I've ever used or am likely to use.

Implementation-defined features are not some unsolvable mystery. They're just implementation-defined.

  • BearOso 4 years ago

    And we can count 99.999% of used implementations on one hand. If you’re on a strange platform, there’s a reason for that, and chances are you’re uniquely aware of any differences or will be writing assembly.

flqn 4 years ago

This confirms one of the guidelines I've always been taught: char is not an arithmetic type, so never treat it as one. It represents ASCII characters, and nothing else.

  • _kst_ 4 years ago

    I disagree, mostly.

    char is an arithmetic type, but it rarely makes sense to treat it as one, because its signedness is implementation-defined. If you want a very narrow integer type, both signed char and unsigned char are arithmetic types, and can reasonably be used that way. (Arrays of unsigned char are also used for raw memory.)

    And you should understand how char, signed char, and unsigned char behave when you do use them as arithmetic types.

    Promotion to int or unsigned int, depending on the range of the type, can be confusing. The same applies to all integer types with lower rank than int, including short, unsigned short, and intN_t and uintN_t for N==8 (and probably for N==16, and maybe for larger N).

    Note also that this:

        char c = '0';
        ++c;
    
    is guaranteed to set c to '1'. (This guarantee applies only to decimal digits, not to letters.)
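    A runnable sketch of that guarantee (the digit-contiguity rule is in the standard; the letter caveat is why no analogous assertion appears for 'a'..'z'):

    ```cpp
    #include <cassert>

    int main() {
        // '0'..'9' are required to be contiguous in every C and C++
        // character set, so digit <-> value conversions are portable:
        char c = '0';
        ++c;
        assert(c == '1');

        int digit = '7' - '0';   // char - char, done in int after promotion
        assert(digit == 7);

        // No such guarantee exists for letters (EBCDIC has gaps).
        return 0;
    }
    ```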
  • kccqzy 4 years ago

    The compiler doesn't agree with you.

        #include <type_traits>
        static_assert(std::is_arithmetic<char>::value, "char is arithmetic");
  • SamReidHughes 4 years ago

    '0' through '9' are guaranteed to be contiguous, so doing arithmetic around that fact is legitimate. And char does not generally represent ASCII characters; it could be some other charset.

  • Piezoid 4 years ago

    (u)int8_t have the same problems, including aliasing, because they are just aliases of (signed/unsigned) char. Sometimes it's nice to have modular arithmetic mod 256, or a compact memory layout for e.g. count sketches.
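    A quick sketch of the mod-256 behavior: note that the intermediate arithmetic actually happens in int after promotion, and the wrap only occurs on conversion back to unsigned char.

    ```cpp
    #include <cassert>

    int main() {
        unsigned char a = 200, b = 100;
        // a and b promote to int, the sum 300 is computed in int,
        // then converting back to unsigned char reduces it mod 256:
        unsigned char s = a + b;
        assert(s == 44);   // 300 % 256
        return 0;
    }
    ```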

    • wahern 4 years ago

      If int8_t exists[1], then you know that char is 8 bits[2] and therefore know that char in char + char always promotes to int because int must have at least 16 value bits and a 16-bit int can represent any char value regardless of signedness.

      [1] int8_t is not required.

      [2] char is the fundamental unit of addressability. sizeof char always evaluates to 1, sizeof int8_t must be non-0, char must be at least 8 bits, and int8_t must be precisely 8 bits, therefore sizeof int8_t == sizeof char and CHAR_BIT == 8.

  • beached_whale 4 years ago

    I would use this for anything smaller than int/unsigned int. A short * short can result in signed integer overflow, and that is UB.
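    A sketch of that hazard and the usual workaround, assuming a typical platform with 16-bit short and 32-bit int:

    ```cpp
    #include <cassert>
    #include <cstdint>

    int main() {
        unsigned short a = 65535, b = 65535;
        // a * b promotes both operands to (signed) int on typical platforms;
        // 65535 * 65535 = 4294836225 overflows a 32-bit int: undefined behavior.
        // Casting one operand up first keeps the multiplication unsigned:
        std::uint32_t p = static_cast<std::uint32_t>(a) * b;
        assert(p == 4294836225u);
        return 0;
    }
    ```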

joker3 4 years ago

Why do we need to be able to add two characters again?

  • XMPPwocky 4 years ago

      char toupper(char c) {
        if (c >= 'a' && c <= 'z') {
          return (c - 'a') + 'A';
        } else {
          return c;
        }
      }
    • thaumasiotes 4 years ago

      It's worth mentioning explicitly that while `c - 'a'` is the more obvious application of character addition in your example, `c >= 'a'` is another one that's even more common. Pretty much everyone immediately understands that we want to be able to sort characters.

      • XMPPwocky 4 years ago

        Yeah- but the problem of poorly-defined result type isn't present for comparison operators, since a bool is a b... oh, hold on, C. Since an int is an int. Sigh.

        • slavik81 4 years ago

          In C, 'a' is an int so most of those are not char/char operations.

          In C++, 'a' is a char and the comparison result is a bool, though it doesn't really make a difference in that function.

        • thaumasiotes 4 years ago

          When you're sorting, you're generally comparing two variables to each other, as opposed to comparing one variable to a literal constant.

    • moring 4 years ago

      Adding two characters isn't strictly needed for that -- you're relying on the assumption that (c - 'a') is of type character, but it's actually the offset between two characters. The rules for those two types would be:

      char + char = invalid

      char + offset = offset + char = char

      offset + offset = offset

      char - char = offset

      char - offset = char

      offset - char = invalid

      offset - offset = offset

      Given that, (c - 'a') + 'A' is perfectly valid without adding two characters.

      edit: formatting
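      A minimal sketch of that algebra, with hypothetical wrapper types `Chr` and `Offset` (names invented here) so that only the operations listed above compile:

      ```cpp
      #include <cassert>

      // Hypothetical wrapper types implementing the char/offset algebra.
      struct Offset { int v; };
      struct Chr    { char v; };

      constexpr Offset operator-(Chr a, Chr b)       { return {a.v - b.v}; }                 // char - char = offset
      constexpr Chr    operator+(Chr c, Offset o)    { return {static_cast<char>(c.v + o.v)}; } // char + offset = char
      constexpr Chr    operator+(Offset o, Chr c)    { return c + o; }                       // offset + char = char
      constexpr Chr    operator-(Chr c, Offset o)    { return {static_cast<char>(c.v - o.v)}; } // char - offset = char
      constexpr Offset operator+(Offset a, Offset b) { return {a.v + b.v}; }                 // offset + offset = offset
      constexpr Offset operator-(Offset a, Offset b) { return {a.v - b.v}; }                 // offset - offset = offset
      // Chr + Chr and Offset - Chr are simply not declared, so they fail to compile.

      int main() {
          Chr c{'c'};
          Chr up = Chr{'A'} + (c - Chr{'a'});  // (c - 'a') + 'A', ASCII assumed
          assert(up.v == 'C');
          return 0;
      }
      ```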

    • _kst_ 4 years ago

      That relationship is valid for ASCII, and for character sets derived from ASCII, but it's not guaranteed by the language. In particular, in EBCDIC the alphabet is non-contiguous.

      • XMPPwocky 4 years ago

        Absolutely- this code only works with ASCII-ish charsets.

    • kccqzy 4 years ago

      You showed an example of subtracting a character from another. The GP asked for an example of adding two characters.

      • tlb 4 years ago

          (c - 'a') + 'A'
        
        contains both
        • XMPPwocky 4 years ago

          I thought about saying

             (c + 'A') - 'a'
          
          to make this more clear, but I think that's actually UB with signed chars- e.g. for c='a', 'a'+'A' exceeds the range of a signed 8-bit value!

          Promotion should save us here, but that's a bit too yikes-y for my comfort.

        • kccqzy 4 years ago

          It does not. Subtracting a char from a char involves usual arithmetic conversions as well and the result is typically an int. Next, you have addition between an int and a char.

    • brudgers 4 years ago

      The Erlang is Perlilous:

        toUpper(Char) -> Char - $ .   % It's "$ " (a space literal, 32)
  • vardump 4 years ago

    That's a good question, now that we have [u]int8_t, [u]int16_t, etc. for explicit bitness values. (Although both can have more than the specified number of bits on some platforms.)

    • ynik 4 years ago

      But `uint16_t + uint16_t` has exactly the same problem -- if it's a typedef for `unsigned short`, there will be promotion to either `int` or `unsigned int`.

      A multiplication `uint16_t * uint16_t` can still cause an overflow after promotion to signed int, which is undefined behavior! So "unsigned types wrap around" doesn't apply to `uintN_t`, because you can never know for sure whether those types are "smaller than int" and thus get promoted to signed types when you do any arithmetic.

      Of course, in practice this just means: every C and C++ program relies on tons of implementation-defined behavior. An `int` wider than 32 bits would break most code in existence (e.g. hash code computations using `uint32_t`).
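      A minimal C++17 sketch of that promotion trap (true on any platform where int has more than 16 value bits, i.e. everywhere mainstream):

      ```cpp
      #include <cstdint>
      #include <type_traits>

      std::uint16_t x{}, y{};

      // Where int has more than 16 value bits, the "unsigned" operands
      // promote to *signed* int, so the sum below is signed arithmetic:
      static_assert(std::is_same_v<decltype(x + y), int>,
                    "uint16_t promotes to int on this platform");

      // Only where int is exactly 16 bits would uint16_t promote to
      // unsigned int and keep its guaranteed wraparound.
      int main() { return 0; }
      ```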

      • vardump 4 years ago

        > A multiplication `uint16_t * uint16_t` can still cause an overflow after promotion to signed int

        Last time a similar thing bit me was when the platform had a 16-bit int... so just adding two int16_t values can very well cause int overflow.

      • wahern 4 years ago

        > because you can never know for sure whether those types are "smaller than int" and thus get promoted to signed types when you do any arithmetic

        You can deduce the width (number of sign + value bits) of the standard integer types from their limits (e.g. INT_MAX, INT_MIN, etc). The problem has been that this is non-trivial if not impossible to do from the preprocessor. The next C standard will include width constants (e.g. INT_WIDTH) for the standard integer types.

      • loeg 4 years ago

        In practice programs on these weird large-int machines would just use `-fwrapv` and move on.

    • nwellnhof 4 years ago

      No, uint8_t is guaranteed to be exactly 8 bits wide (unlike uint_fast8_t or uint_least8_t).

  • qmmmur 4 years ago

    String forming.

    • vardump 4 years ago

      How would you form strings by adding 'char' values together? This is not a concatenation operation; we're talking about C/C++, which have no syntactic-sugar concat for chars.

      (In C++, you of course have operator overloading, that's how std::string concat sugar works.)

      '1' + '1' == 'b'. Because 49 + 49 == 98. ASCII '1' == 49, and 'b' == 98.
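      The arithmetic above, as a runnable check (ASCII assumed):

      ```cpp
      #include <cassert>

      int main() {
          // In ASCII, '1' is 49 and 'b' is 98; the addition is done in int
          // after promotion, and both sides compare equal as int:
          assert('1' + '1' == 'b');
          assert('1' + '1' == 98);
          return 0;
      }
      ```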

_fbpt 4 years ago

I found some insightful comments below the post:

>I think that char – char should definitely be legal. The distance between characters is well defined. Same for char + numeric. Both logically make sense. I think a good analogy might be floors in a building. Asking what's the distance between the second and seventh floor makes sense, or what's two floors above the 4th. But the question 'what's the 5th floor plus the 6th floor' doesn't make sense.

>Affine spaces describe these kinds of relationships in mathematics. E.g. position and displacement in n dimensions, or count and offset in buffers, or even timestamp and duration.

klyrs 4 years ago

I've been writing C for 25 years... and while I technically know "the answer," it's effectively a closed door in my mind because I don't always know where my code will end up.

A sadistic part of me would prefer if it was interpreted as a bitwise and... not because that's good or reasonable or smart... but to punish the behavior. But then that backfires when people use it for underhanded code.

kstenerud 4 years ago

Yes, yes. The spec is filled with anachronisms that are no longer pertinent in today's machines. char + char gets promoted to int every time in today's compilers. Try it out here: https://godbolt.org/z/V5HEvV

  • mv1 4 years ago

    There is what the standard says, and there is what people actually do. If everyone promotes char to int in practice, then any machine where this doesn't happen is going to have a tough time running the bulk of code out there.

    In a standards committee, the standard is the standard; in practice, common practice is the standard.

  • loeg 4 years ago

    50-50 anachronisms vs flexibility to allow C to run on novel machines that we don't currently envision. Sure, it would be nice for developers on today's machines to reduce it to the conventional subset.

    • kstenerud 4 years ago

      If by novel you also imply compatible, then sure. The moment you create a machine that's incompatible with the conventions adopted by the most popular compilers and architectures, you break a ton of software built upon those conventions, and sink your hardware in the market because it's a portability nightmare.

      Specs don't matter beyond the conventions they inspire.

loeg 4 years ago

A machine where char is as large as int is unlikely in practice as it isn't very useful. C11 (at least) defines INT_MIN/MAX as covering at least the range of an int16_t type.

That said, the int promotion alone may be surprising / nonobvious to some people (it was to me, when I learned about it!).

fulafel 4 years ago

Also, if char is signed and as wide as int (so promotion to int gains no headroom), char + char may overflow, which is UB. With known overflowing values the compiler may deduce it's a can't-happen situation and generate code accordingly; or, when the overflow is encountered at runtime, it may hose your program state arbitrarily, etc.

kccqzy 4 years ago

There are hardly any systems relevant today for which adding two char would result in an unsigned int. So basically just treat it as int and call it a day.

  • viraptor 4 years ago

    Or if you're using one, you're likely very aware of that fact and don't suddenly discover it from a blog post.

quocble 4 years ago

Does it matter if it is signed or unsigned int or char? Bitwise it contains the same amount of information. That's the most important thing.

heyWowMyGuy 4 years ago

Gee, because string concatenation is crazy. Who would ever want to concatenate a string?

The data type of characters defaults to 8 bit octets, or bytes. You may have heard of them.

There's this thing called "unicode" which sounds exotic, but believe it or not, that's how emojis work, and I know you like emojis. Don't you?

When it comes to SQL databases, fixed-length "char()" columns usually truncate overruns and pad underruns with whitespace, unless alternative rules are specified. A "varchar()" field will gracefully tolerate concatenations up to its limit, then truncate.

In C, we know that byte arrays often leave random data in uninitialized cells, so an underrun that isn't NULL terminated might acquire adjacent trash that had been previously allocated by something else, or in other cases display incidental thermal noise from the RAM chip.

Long story short, char + char has default type dispositions that are not mysterious.