A ‘String’ is a sequence of characters.
Considering you’re most likely a programmer, there is a high probability that you already know this definition. And I think it is fair to assume that you have already used String
in at least one programming language.
Despite that, I should remind you one more time — String
in the programming world has nothing to do with the String Theory Sheldon Cooper popularizes more than any other physicist.
Introduction
In this article, I’m going to introduce you to some very important concepts and terms related to String
which you might need to use in your day-to-day programming.
Also, familiarity with these terms might help you in your job interview one day, who knows?
Though some of the terms have a complex meaning from a linguistic aspect, I’ll try to avoid difficult words and explain it from the view that programmers need to understand.
Another thing that needs to be mentioned, is that I’m not going to show any code examples or how you can do string operations in your favorite languages.
I’ll just talk about the terms and concepts and your job is to find out how to use them when you need them.
Character Encoding
The primary component of a String
is a Character
, but you probably already know that. But do you know that characters are just code representations?
I mean, the computer doesn’t know characters, letters, etc., nor does it store them as such in memory. The computer stores them as numbers (as binary, to be precise).
There are some conventions that define which number will represent which character. This character set maps some numbers or code points to textual characters.
Then, at the time of visual representation, computers read them from memory and represent them as characters based on that mapping. This mapping and character representation is called character encoding.
This article (if clicking the link doesn’t work try this link: http://kunststube.net/encoding/)does a great job explaining everything about it. So, I suggest you read that to gain a better understanding of character encoding.
Most novice programmers (especially those who started their programming with C-like languages) assume that all of the characters in the world are single byte and obviously ASCII characters.
I don’t blame them, because I was one of them too. But now I know that it’s not true and there is a vast world of Unicode characters out there and as a programmer you need to deal with them frequently.
So, the first thing you need to remember is:
All the characters in the world aren’t created equal and they are not always single-byte characters.
Most Unicode characters can be stored as a 16-bit or 2-byte data type.
But still, there are some that can’t. There are more than 136,000 code points defined in Unicode, where two bytes can store as many as 65,536 characters.
So, multi-byte data types are needed to store the remainder. You should keep in mind that:
- Some programming languages (C, C++, etc.) provide single-byte data type (
char
) specifically for ASCII characters and multi-byte data type (wchar_t
) for Unicode. - Some programming languages (Java, etc.) only have multi-byte data types for character types.
- The days of one byte == one character is long gone, and you need to keep that in mind.
Let’s give you another bit of information while we are at it.
Character encoding can be fixed length or variable length.
You might’ve heard of encoding schemes like UTF-8, UTF-16, UTF-32, etc. There are various other encodings though, but let’s just get a brief understanding of these three Unicode character encodings.
- UTF-16 represents most of the commonly used Unicode code points in a single 16-bit character type. It uses two 16-bit characters to represent the remainder. That means UTF-16 is a variable-length encoding using a minimum of 16-bits (two bytes) and a maximum of 32-bits (four bytes).
- UTF-8 uses one to four 8-bit characters to encode all Unicode code points. It’s also another variable-length encoding which represents the ASCII characters using a single byte. (That’s handy because it means ASCII text is also valid in UTF-8) and the rest of the code points by two, three, or four bytes as needed.
- UTF-32 uses four bytes for all characters. It is a fixed-length encoding.
Your knowledge about character encoding might come in handy when you work with internationalization and localization. Also, you can guess from this single discussion thread that this knowledge is very important in other aspects, such as how much space and memory your text will consume, based on its encoding.
All I’m saying is, just be careful about character encoding when you are dealing with String
or Text
.
String Immutability
Most of the languages provide String
as a basic data type. The thing you need to know about the language you are working with is whether the String
data type of your language is Mutable
or Immutable
.
For instance, Java and Python String
types are Immutable
.
If you try to treat an Immutable
String
as Mutable
it will throw an error and you might wonder what’s wrong with your string.
You might be curious to know why strings are Immutable
in some languages.
Let’s explore it a little.
You see, in languages like C, they store strings as an array of characters and act as general-purpose arrays. So, as you can easily mutate an array, you can mutate the String
too.
On the other hand, languages like Java and Python treat String
as Object
and they set aside a special area of memory called the string constant pool.
When the compiler sees a String
literal, it looks for the String
in the pool. If a match is found, the reference to the new literal is directed to the existing String
and no new String
object is created.
The existing String
simply has one more reference.
The point of making String objects immutable:
In the String
constant pool, a String
object is likely to have one or many references. If several references point to the same string without even knowing it, it would be bad if one of the references modified that String
value.
That’s why String
objects are immutable.
So, be sure about the immutability of strings in your favorite languages before you work with them.
String Terms
It’s time to learn some terms.
Substring
The substring of a string S, is a string that occurs in S.
Let’s say it more clearly.
Any sequential part of a string is its substring. So, ‘is nice’ is a substring of the string ‘This is nice’. But ‘nice is’ is not, because ‘nice is’ can’t be found as it is in the string ‘This is nice’. You must find the substring as it is in the String.
The list of all substrings of the string “apple” would be:
“apple”, “appl”, “pple”, “app”, “ppl”, “ple”, “ap”, “pp”, “pl”, “le”, “a”, “p”, “l”, “e”, “”.
Prefix
A prefix of a string S is a substring of S that occurs at the beginning of S.
This might be clearer with an example.
The list of all prefixes of the string “apple” would be:
“apple”, “appl”, “app”, “ap”, “a”
So, to be a prefix, a substring needs to start from the beginning of that string.
A proper prefix of a string is not equal to the string itself. So, “apple” won’t be a proper prefix of “apple”.
Suffix
A suffix of a string S is a substring of S that occurs at the end of S.
The list of all suffixes of the string “apple” would be:
“apple”, “pple”, “ple”, “le”, “e”
So, to be a suffix, a substring needs to end at the end of that string.
A proper suffix of a string is not equal to the string itself. So, “apple” won’t be a proper suffix of “apple”.
Subsequence
The formal definition of a subsequence is:
“A subsequence is a sequence that can be derived from another sequence by deleting some or no elements without changing the order of the remaining elements.” — Wikipedia
Let me explain it in simple words.
A string is a sequence of characters, right? So, if we pick some characters, maintaining the same sequence that appeared in the original String
, then the resulting String
would be a subsequence of the original string.
Let’s say we have a string “Hello”. Then the “eo” is a subsequence of that string.
But “ol” won’t be a subsequence because it didn’t appear in the same sequence in “Hello”.
The list of all subsequences for the word “apple” would be:
“e, l, le, p, pe, pl, ple, p, pe, pl, ple, pp, ppe, ppl, pple, a, ae, al, ale, ap, ape, apl, aple, ap, ape, apl, aple, app, appe, appl, apple”.
Did you get the difference between substring and subsequence?
Common Operations
Let’s learn some common operations of Strings
.
Concatenation
This means concatenating two or more strings together.
Let’s say there are two strings: ‘Hello’ and ‘World’. The concatenation result of these two strings results in another string ‘Hello World’.
Capitalization/Case folding
This means converting all the characters of a string to the same case, either lower case or upper case.
Capitalization has various use cases.
One use case: if you need to make a comparison between two Strings
in a case-insensitive manner, you always need to normalize both texts to either lower case or upper case.
Well, to be honest, it won’t always be that simple. You’d face some weird cases if your strings contain non-Latin characters.
Let’s say your string is resumé. To convert strings to upper case, you would generally convert the [a-z] character set to the [A-Z] character set.
So, what would be the upper case of é? In that case, you need to make other considerations like Unicode normalization and such. Learn more about it in this article.
Join
Though it seems from the name that ‘join’ is similar to ‘concatenation’, however, there is a subtle but important difference between them.
Unlike concatenation, join concatenates the strings using a given ‘separator’ or ‘glue’ between each of the string elements.
So, if a list of strings given is like: [‘Hello’, ‘world’, ‘again’], along with a separator/glue ‘-’, then the resulting string will be: ‘Hello-world-again’.
In some cases, the separator can be an empty string (‘’) where it will behave exactly like concatenation.
Split/Tokenize
Splitting or tokenizing a string means to break down a single string to several parts, based on a delimiter or set of delimiters.
Let’s say you have a String
: ‘Why so serious?’
Now, if you tokenize this string based on whitespace, you’ll end up with a list of strings or tokens: [‘Why’, ‘so’, ‘serious?’].
Notice that the delimiter won’t be included in the resulting strings.
Also, in some cases, you might need to tokenize a string based on a pattern, instead of some specific delimiters. This can be done easily by regex.
Trim/Strip
Trimming or stripping a string generally means removing leading and trailing whitespaces from a string.
Whitespace, in this context, is all the whitespace characters (space, tab, no-break space, etc.) and all the line terminator characters (LF, CR, etc.).
So, a given string: ‘ this is simple ’ would be ‘this is simple’ after trimming/stripping.
Variants of trimming/stripping include:
- Trimming/stripping only leading (left trimming) or trailing (right trimming) whitespaces.
- Any sequence of whitespace characters within the string is replaced with a single space, which is also known as ‘space normalization’.
Lexicographic Ordering
Sometimes you need to define an order for a list of strings.
The most popular ordering is ‘lexicographic order’, which is also known as ‘alphabetic order’ or ‘dictionary order’.
You’ve seen how words are ordered in a dictionary? That’s exactly what lexicographic ordering is.
So, if you order the list of words [‘ant’, ‘Zebra’, ‘apple’, ‘applecart’] in lexicographical order, it would be:
Zebra
ant
apple
applecart
Note that, in lexicographical ordering, upper case letters come before lower case letters, that’s why ‘Zebra’ comes before ‘ant’.
Also, longer strings come after the same prefixed smaller strings, that’s why ‘apple’ precedes ‘applecart’.
String Searching/Comparison
You’d often do tasks directly or indirectly associated with some sort of string searching and/or comparison.
Let’s talk about some common types of string/pattern matching.
The most common of them will be to check an exact match between two strings. Which you would assume you do with the == operator.
But I want to give you a simple tip which might save you one day from a lot of trouble.
If you store two strings in two variables and try to check their equality with the == operator, you need to know whether your programming language is doing value match or just reference match.
If your programming language treats String
as Object
, then most likely when you try to check two String
Object
with the == operator, it would check equality of the reference of the two [Object](https://stackoverflow.com/questions/513832/how-do-i-compare-strings-in-java)
, instead of their values.
On the other hand, other programming languages might just check the value equality of these two string variables. So, keep that in mind.
You’ve seen autocomplete or autosuggestion features in Google Search or your phone’s contact search, right?
If you observe this closely, you would see that they are doing some form of prefix match.
Your typed string vs the list of strings they’ve already stored.
So, how would you do it by yourself if needed? If your list of strings is not large, you can take any approach to match with all of them. But it won’t be efficient for large lists.
The efficient approach would be to use a sophisticated data structure like trie. Detailed discussion is beyond the scope of this article, so you should dive into it on your own.
Similarly, if you need to do a suffix match, you can convert the problem to a prefix-match problem by storing the reverse of the strings, along with the actual strings, or you can use the help of the suffix tree.
Sometimes, you need to do pattern matching, for example to check if a string is a substring of a given string, or if a pattern can be found in the given string.
In those cases, you need to grab help from various string searching algorithms and choose the one that fulfills your need.
In some cases, when we are running a search, we want to find relevant results, not only for the exact expression we typed in the search bar but also for the other possible forms of the words we used.
For example, it’s very likely we will want to see results containing the form “ran”, and “running” if we have typed “run” in the search bar.
This can be achieved through two possible methods: stemming and lemmatization.
The aim of both processes is the same: reducing the inflectional forms of each word into a common base or root. Though this term is most popular in NLP, it is also used in full-text search mechanisms in a database system and text search engines like Elasticsearch.
The last thing I want to mention here is ‘phonetic string matching’. Phonetic algorithms index words based on their pronunciation.
So, if you want to match two words by their pronunciation, you’d need to use phonetic algorithms. The most common uses of phonetic algorithms are spell checking, searching, etc.
There are several phonetic algorithms available to do the job, such as Soundex, Metaphone, etc. But be aware that most of the phonetic algorithms were developed for the English language. They might not work for other languages as you’d expect.
Conclusion
My primary focus in this article was to familiarize you with the buzzwords of String
and text processing. I may have given you a lot of information without explaining in detail, but if you know the terms, you can explore them in detail yourself, if necessary.
So, that’s all for today. Thanks for reading!
Originally published at: medium.com/better-programming/what-every-pr..