How to Read Groups of Characters as a String in C

Introduction

Everybody knows how to escape specific characters in a C# string. So why bother about this?

This tip shows the quirks involved with escaping in C#:

- character literal escaping: e.g. '\'', '\n', '\u20AC' (the Euro € currency sign), '\x9' (equivalent to \t)
- literal string escaping: e.g. "...\t...\u0040...\U00000041...\x9..."
- verbatim string escaping: e.g. @"...""..."
- string.Format escaping: e.g. "...{{...}}..."
- keyword escaping: e.g. @if (for if as identifier)
- identifier escaping: e.g. i\u0064 (for id)

Table of Contents

  • Introduction
    • Table of Contents
  • Escaping - what for?
  • Escaping in character and string literals
    • Summary
  • Escaping in verbatim strings
    • Summary
  • string.Format escaping
    • Summary
    • Bonus
  • Escaping identifiers
    • Summary
  • Escaping in Regular Expressions
    • Summary
    • Bonus
  • Links
  • History

Escaping - what for?

Again, everybody knows this - or at least has a feeling for it. All the same, I'd like to recall what escaping is good for.

Escaping gives an alternative meaning to the "normal" meaning. "Normal" is a matter of what is normally used. There is no absolute reference for what is "normal", so each escape mechanism defines what is "normal" and what the escape for it is.

E.g. a string literal is enclosed in double quotes "...". The meaning of the double quotes is to enclose the string literal - this is the normal meaning of double quotes for strings. If you now want to include a double quote in a string literal, you must tell the compiler that this double quote does not have the normal meaning. E.g. "..."..." would terminate the string at the second double quote, whereas "...\"..." escapes the second double quote from being interpreted as terminating the string literal.

There is a variety of established escaping mechanisms. The motivation for escaping varies as well. Some motivations to use escaping:

  • In string and character literals:
    • One must be able to embed the terminators, like single or double quote.
    • One needs to enter special characters that have no character symbol associated, like a horizontal tabulator.
    • One needs to enter a character that has no direct key on the keyboard, like the Yen currency symbol (¥).
    • etc.
  • In identifiers:
    • One needs to enter names with characters that have no equivalent key on the keyboard, like the German umlaut Ä (Unicode 0x00C4).
    • One needs to generate C# code that may use identifiers that clash with the C# keywords, like yield.
    • etc.
  • In string formatting:
    • One must be able to enter a literal { or } in a string.Format(...), like in
      Console.WriteLine("...{...", ...).
  • In Regular Expressions:
    • One must match characters that otherwise have a control meaning, like matching the character [, etc.
  • etc.

So, let's start discussing the various mechanisms to escape the normal behavior.

Escaping in character and string literals

Let's first look at strings. A string is a sequence of characters. A character is a type that holds a UTF-16 encoded value. A character therefore is a two-byte value.

E.g. the UTF-16 code decimal 64 (hexadecimal 40) is the @ character.

Note: There are a few "characters" which cannot directly be encoded in these two bytes. These characters occupy four bytes, thus a pair of UTF-16 values. These are called UTF-16 surrogate pairs.

So, the string is a sequence of two-byte characters.

E.g. the string "abc" results, in the executed program, in a sequence of the three UTF-16 values 0x0061, 0x0062, 0x0063. Or the Euro currency sign is the Unicode character 0x20AC (€) and the Yen currency sign is the Unicode character 0x00A5 (¥).

How to write that in C#?

    char euro = '\u20ac';
    char yen  = '\u00a5';

The item \uxxxx denotes a UTF-16 code.
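To make the two-byte values visible, here is a minimal sketch that prints the UTF-16 code units of a string. The class name CodeUnitDemo and the helper ToCodeUnits are hypothetical names chosen for this illustration only.

```csharp
using System;
using System.Linq;

public static class CodeUnitDemo
{
    // Formats each UTF-16 code unit of s as a four-digit lowercase hex value.
    public static string ToCodeUnits(string s)
    {
        return string.Join(" ", s.Select(c => ((int)c).ToString("x4")));
    }

    public static void Main()
    {
        Console.WriteLine(ToCodeUnits("abc"));           // 0061 0062 0063
        Console.WriteLine(ToCodeUnits("\u20ac\u00a5"));  // 20ac 00a5
        Console.WriteLine('\u0040' == '@');              // True
    }
}
```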

As an alternative, one can write \u.... as \x, followed by one to four hex characters. The above example can also be written as

    char euro = '\x20ac';
    char yen  = '\xa5';

Note: the \x sequence tries to match as much as possible, i.e. "\x68ello" results in "ڎllo" and not in "hello" (the \x68e terminates after three characters since the following character is not a possible hex character). As a result, \u... is safer than \x..., since the length is given in the first case, whereas in the second case the longest match is taken, which may fool you.
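The longest-match behavior of \x is easy to check. The following is a small sketch; the class and field names are made up for this example.

```csharp
using System;

public static class HexEscapeDemo
{
    // \x68e is consumed as ONE character (U+068E), followed by "llo".
    public static readonly string Greedy = "\x68ello";

    // \u always takes exactly four hex digits, so this is plain "hello".
    public static readonly string Fixed = "\u0068ello";

    public static void Main()
    {
        Console.WriteLine(Greedy.Length);               // 4
        Console.WriteLine((int)Greedy[0] == 0x68e);     // True
        Console.WriteLine(Fixed);                       // hello
    }
}
```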

Notes:

  1. Please note that the upper case \Uxxxxxxxx item denotes a full Unicode code point. A code point above U+FFFF is encoded as a surrogate pair. Since a surrogate pair requires a pair of UTF-16 characters, it cannot be stored in one C# character.
  2. \u must be followed by exactly four hexadecimal characters
  3. \U must be followed by exactly eight hexadecimal characters
  4. \x must be followed by one to four hexadecimal characters
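The notes on \U can be illustrated with a code point above U+FFFF. This is a sketch only; U+2A6D6 is picked as an arbitrary example and the class name is hypothetical.

```csharp
using System;

public static class SurrogateDemo
{
    // U+2A6D6 lies above U+FFFF, so the string holds two UTF-16 characters.
    public static readonly string Ideograph = "\U0002A6D6";

    public static void Main()
    {
        Console.WriteLine(Ideograph.Length);                                 // 2
        Console.WriteLine(char.IsHighSurrogate(Ideograph[0]));               // True
        Console.WriteLine(char.IsLowSurrogate(Ideograph[1]));                // True
        Console.WriteLine(char.ConvertToUtf32(Ideograph, 0).ToString("X"));  // 2A6D6
        // A \U escape for a small code point needs no surrogate pair.
        Console.WriteLine("\U00000041" == "A");                              // True
    }
}
```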

Ah, yes, since it is common knowledge to everyone, I almost forgot to provide the short character escape notation of some often used special characters like \n, etc.:

Short Notation   UTF-16 character   Description
\'               \u0027             allows entering a ' in a character literal, e.g. '\''
\"               \u0022             allows entering a " in a string literal, e.g. "this is the double quote (\") character"
\\               \u005c             allows entering a \ character in a character or string literal, e.g. '\\' or "this is the backslash (\\) character"
\0               \u0000             allows entering the character with code 0
\a               \u0007             alarm (usually the HW beep)
\b               \u0008             back-space
\f               \u000c             form-feed (next page)
\n               \u000a             line-feed (next line)
\r               \u000d             carriage-return (move to the beginning of the line)
\t               \u0009             (horizontal-) tab
\v               \u000b             vertical-tab
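The table can be cross-checked in code. The sketch below compares every short notation with its \u form; ShortEscapeDemo and AllEqual are names invented for this illustration.

```csharp
using System;

public static class ShortEscapeDemo
{
    // Returns true when every short notation equals its \u counterpart.
    public static bool AllEqual()
    {
        return '\'' == '\u0027' && '\"' == '\u0022' && '\\' == '\u005c'
            && '\0' == '\u0000' && '\a' == '\u0007' && '\b' == '\u0008'
            && '\f' == '\u000c' && '\n' == '\u000a' && '\r' == '\u000d'
            && '\t' == '\u0009' && '\v' == '\u000b';
    }

    public static void Main()
    {
        Console.WriteLine(AllEqual());   // True
    }
}
```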

Summary

  • characters are two-byte UTF-16 codes
  • UTF-16 surrogate pairs are stored in a pair of C# characters
  • the escape character \ introduces escaping
  • what follows the \ character is
    • one of the short notation characters (\\, \", \', \a, ...)
    • a Unicode character code (\u20ac, \u00a5, ...)
    • a code point that needs a surrogate pair (\U0002a6d6, ...), which can only be stored in a string but not in a single character.
    • a hex sequence of one to four hex characters (\xa5, ...)

Escaping in verbatim strings

What are verbatim strings? They are syntactic sugar to enter strings in C#.

E.g. storing a Windows file path like

    string path = "C:\\Program Files\\Microsoft Visual Studio 10.0\\";

can be considered awkward or ugly. A more convenient version is the verbatim string:

    string path = @"C:\Program Files\Microsoft Visual Studio 10.0\";

A verbatim string ( @"..." ) takes the content as-is without any interpretation of any character. Well, almost: there is exactly one character that can be escaped. An embedded " must be escaped as "". E.g.

    string xml = @"<?xml version=""1.0""?>
    <Data>
    ...
    </Data>";
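As a quick check of the single verbatim escape, here is a sketch that compares regular and verbatim literals. All names are hypothetical and the path is shortened for the example.

```csharp
using System;

public static class VerbatimDemo
{
    public static readonly string RegularPath   = "C:\\Program Files\\";
    public static readonly string VerbatimPath  = @"C:\Program Files\";
    public static readonly string RegularQuote  = "say \"hi\"";
    public static readonly string VerbatimQuote = @"say ""hi""";

    public static void Main()
    {
        Console.WriteLine(RegularPath == VerbatimPath);     // True
        Console.WriteLine(RegularQuote == VerbatimQuote);   // True
        // In a verbatim string, \t is a backslash followed by the letter t.
        Console.WriteLine(@"\t".Length);                    // 2
    }
}
```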

Note: As mentioned above, the verbatim string literal is a convenient way to enter a string literal in C#. The resulting memory image of the strings is the same. E.g. these are all identical string contents:

    string v1 = "a\r\nb";
    string v2 = "\u0061\u000d\u000a\u0062";
    // the line break inside v3 is taken from the source file (CRLF assumed here)
    string v3 = @"a
    b";
    Console.WriteLine("v1 = \"{0}\"\nv2 = \"{1}\"\nsame = {2}", v1, v2, v1 == v2);
    Console.WriteLine("v1 = \"{0}\"\nv3 = \"{1}\"\nsame = {2}", v1, v3, v1 == v3);

results in

    v1 = "a
    b"
    v2 = "a
    b"
    same = True
    v1 = "a
    b"
    v3 = "a
    b"
    same = True

Summary

  • verbatim string literals and normal string literals are 2 ways to define string content
  • verbatim strings take all given characters as-is, including new lines, etc.
  • the only escape sequence in a verbatim string literal is "" to denote an embedded " character

string.Format escaping

Format strings are interpreted at runtime (not at compile time) to replace {...} by the respective arguments. E.g.

    Console.WriteLine("User = {0}", Environment.UserName);

But what if you want to have a { or } embedded in the format string? Is it \{ or {{? Think about it!

Clearly the second. Why? Let's elaborate on that.

  1. The format string is a string like any other. You can enter it as
         string.Format("...", a, b);
         string.Format(@"...", a, b);
         string.Format(s.GetSomeFormatString(), a, b);
  2. If C# allowed entering \{ or \}, it would be stored in the string as { and } respectively.
  3. The string.Format function then reads this string to decide whether a formatting instruction is to be interpreted, e.g. {0}. Since the \{ resulted in a { character in the string, the string.Format function could not decide whether this is to be taken as a format instruction or as a literal {.
  4. The alternative was some other escaping. The established way is to double the character to escape.
  5. So, string.Format treats {{ as literal {. Analogously, }} stands for }.
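The doubling rule from the steps above can be tried out with a few format strings. This is a minimal sketch; FormatEscapeDemo and Braced are illustrative names, not part of any library.

```csharp
using System;

public static class FormatEscapeDemo
{
    // {{ and }} yield literal braces; the {0} in between is still a placeholder.
    public static string Braced(string value)
    {
        return string.Format("{{{0}}}", value);
    }

    public static void Main()
    {
        Console.WriteLine(Braced("abc"));                        // {abc}
        // Doubled braces only: no placeholder is left, the argument is ignored.
        Console.WriteLine(string.Format("{{0}}", "abc"));        // {0}
        // The backslash has no meaning to string.Format itself.
        Console.WriteLine(string.Format(@"{0}\{1}", "a", "b"));  // a\b
    }
}
```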

Summary

  • string.Format(...) escaping is interpreted at runtime only, not at compile time
  • the two characters that have a special meaning in string.Format(...) are { and }
  • the escaping for these two characters is the doubling of the specific character: {{ and }} (e.g. Console.WriteLine("{{{0}}}", "abc"); results in console output {abc})

Bonus

The following code scans the C# string format text and returns all argument ids:

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text.RegularExpressions;

    public static IEnumerable<int> GetIds(string format)
    {
        // a single { followed by digits, then anything up to the closing }
        string pattern = @"\{(\d+)[^\}]*\}";
        var ids = Regex.Matches(format, pattern, RegexOptions.Compiled)
                       .Cast<Match>()
                       .Select(m => int.Parse(m.Groups[1].Value));
        return ids;
    }

    foreach (int n in GetIds("a {0} b {1 } c {{{0,10}}} d {{e}}")) Console.WriteLine(n);

Passing "a {0} b {1 } c {{{0,10}}} d {{e}}" results in

    0
    1
    0

Escaping identifiers

Why would one escape identifiers? I guess this is not really intended for daily use. It is probably only useful for automatically generated C# code. Nonetheless, there are two mechanisms to escape identifiers. They let you:

  • define an identifier that would clash with keywords
  • define an identifier that contains characters which have no equivalent on the keyboard

Option A: prefix an identifier by @, e.g.

    int @yield = 10;

Option B: use UTF-16 escape sequences as described above for string literals, e.g.

    int \u0079ield = 10;

Notes:

  • A keyword must stay unescaped, i.e. if an identifier is written as @xxx it is always an identifier (i.e. never a keyword).
  • The same holds for identifiers that contain UTF-16 escape sequences
  • You can mix and match escaped identifiers, e.g. the following are identical:
        while (@a > 0) \u0061 = a - 1;
        while (a > 0) a = a - 1;
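Both escaping forms can be seen side by side in the following sketch. It assumes, as the notes state, that @yield and \u0079ield name the same variable; the class IdentifierDemo and its members are hypothetical.

```csharp
using System;

public static class IdentifierDemo
{
    // 'class' is a keyword, so the @ prefix is required to use it as a name.
    public static int @class = 1;

    public static int Sum()
    {
        int @yield = 10;
        \u0079ield = \u0079ield + 1;   // same variable, written with an escape
        return @yield + @class;        // 11 + 1
    }

    public static void Main()
    {
        Console.WriteLine(Sum());      // 12
        // The @ is not part of the name: reflection finds the field as "class".
        Console.WriteLine(typeof(IdentifierDemo).GetField("class") != null);  // True
    }
}
```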

Summary

  • identifier escaping is available in C#
  • identifiers can be prefixed by @ to avoid keyword clashes
  • identifier characters can be encoded by using UTF-16 character escape sequences
  • the escaped identifiers must still be from the legal character sets - you cannot define an identifier containing a dot, etc.
  • numbers, operators, and punctuation cannot be escaped (e.g. 1.0f, etc. cannot be escaped)
  • My opinion: escaping identifiers is not intended for daily use - e.g. don't ever try to prefix any identifiers with @! This is meant for automatically generated code only, i.e. no user should ever see such an identifier...

Escaping in Regular Expressions

Regex pattern strings are also interpreted at runtime, like string.Format(...). The Regex syntax contains instructions that are introduced by \. E.g. \d stands for a single character from the set 0...9. I don't go into the Regex syntax in this tip, but rather show how to conveniently put such a Regex pattern into a C# string.

Since the Regex pattern most likely contains some \, it is more convenient to write Regex patterns as verbatim strings. The result is that the \ does not need to be escaped in the pattern. E.g. the following patterns are identical for the Regex pattern \d+|\w+ (decide yourself which one is more convenient):

    var match1 = Regex.Matches(input, "\\d+|\\w+");
    var match2 = Regex.Matches(input, @"\d+|\w+");

There is a gotcha: entering double quotes looks a bit odd in a verbatim string. In the end, it's your choice how you enter the pattern, as a normal string literal or as a verbatim string literal.
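To see how the double quote gotcha looks in practice, here is a sketch that writes the same pattern both ways. The pattern and the names RegexQuoteDemo and FirstQuotedWord are chosen for illustration only.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexQuoteDemo
{
    // Both literals produce the same pattern text: "(\w+)"
    public const string Regular  = "\"(\\w+)\"";
    public const string Verbatim = @"""(\w+)""";

    // Returns the first word enclosed in double quotes, or null if none is found.
    public static string FirstQuotedWord(string input)
    {
        Match m = Regex.Match(input, Verbatim);
        return m.Success ? m.Groups[1].Value : null;
    }

    public static void Main()
    {
        Console.WriteLine(Regular == Verbatim);                    // True
        Console.WriteLine(FirstQuotedWord("say \"hello\" now"));   // hello
    }
}
```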

Summary

  • Regex patterns are conveniently entered as verbatim string @"..."

Bonus

The following code shows tokenizing C#. Try to understand the escaping ;-) :

    using System;
    using System.IO;
    using System.Linq;
    using System.Text.RegularExpressions;

    string strlit  = @"""(?:\\u[0-9a-fA-F]{4}|\\U[0-9a-fA-F]{8}|\\x[0-9a-fA-F]{1,4}|\\.|[^""])*""";
    string verlit  = @"@""(?:""""|[^""])*""";
    string charlit = @"'(?:\\u[0-9a-fA-F]{4}|\\x[0-9a-fA-F]{1,4}|\\.|[^'])'";
    string hexlit  = @"0[xX][0-9a-fA-F]+[ulUL]?";
    string number1 = @"(?:\d*\.\d+)(?:[eE][-+]?\d+)?[fdmFDM]?";
    string number2 = @"\d+(?:[ulUL]?|(?:[eE][-+]?\d+)[fdmFDM]?|[fdmFDM])";
    string ident   = @"@?(?:\\u[0-9a-fA-F]{4}|\\U[0-9a-fA-F]{8}|\w)+";
    string[] op3   = new string[] {"<<="};
    string[] op2   = new string[] {"!=","%=","&&","&=","*=","++","+=","--","-=","/=",
                                   "::","<<","<=","==","=>","??","^=","|=","||"};
    string rest    = @"\S";
    // things to skip between tokens: preprocessor lines, comments, white space
    string skip    = @"(?:" + string.Join("|", new string[]
    {
        @"[#].*?\n",
        @"//.*?\n",
        @"/[*][\s\S]*?[*]/",
        @"\s",
    }) + @")*";
    // one token per match, captured in group 1
    string pattern = skip + "(" + string.Join("|", new string[]
    {
        strlit,
        verlit,
        charlit,
        hexlit,
        number1,
        number2,
        ident,
        string.Join("|", op3.Select(t => Regex.Escape(t))),
        string.Join("|", op2.Select(t => Regex.Escape(t))),
        rest,
    }) + @")" + skip;

    string f = @"...";
    string input = File.ReadAllText(f);
    var matches = Regex.Matches(input, pattern, RegexOptions.Singleline|RegexOptions.Compiled).Cast<Match>();
    foreach (var token in from m in matches select m.Groups[1].Value)
    {
        Console.Write(" {0}", token);
        if ("{};".Contains(token)) Console.WriteLine();
    }

Have fun!

Links

The following links may provide additional information:

  • C# Reference (MSDN)
  • Standard ECMA-334: C# Language Specification
  • UTF-16
  • EBNF C# Syntax description
  • What character escape sequences are available in C#? (MSDN)
  • String.Format Method (MSDN)
  • The Regular Expression Object Model (MSDN)
  • Regular Expression Language - Quick Reference (MSDN)

History

V1.0 2012-04-23
Initial version.
V1.1 2012-04-23
Fix broken formatting.
V1.2 2012-04-25
Fix typos, add more links, fix HTML unicode literals in the text, update some summaries.
V1.3 2012-08-21
Fix \x... description. Make some tables of class ArticleTable (looks a bit nicer)

Source: https://www.codeproject.com/articles/371232/escaping-in-csharp-characters-strings-string-forma

Enregistrer un commentaire for "How to Read Groups of Characters as a String in C"