Regular Expressions: Charsets
Searching for Specific Strings
- Use grep 'string' <file> to search for an exact match.
- To find patterns rather than exact strings, use Regular Expressions (regex).
Charsets in Regex
- Definition: Enclosed in [ ], a charset matches any single character listed inside.
- Basic Examples:
  - [abc] → Matches any occurrence of ‘a’, ‘b’, or ‘c’.
  - [abc]zz → Matches ‘azz’, ‘bzz’, and ‘czz’.
  - [a-c]zz → Equivalent to [abc]zz.
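The basic charset examples can be checked with a minimal sketch. It assumes Python's re module as a stand-in for grep, and the sample words are invented for illustration.

```python
import re

# Sample words are invented for illustration.
words = ["azz", "bzz", "czz", "dzz", "zz"]

# [abc]zz: exactly one character from the set, then the literal "zz".
explicit = [w for w in words if re.fullmatch(r"[abc]zz", w)]

# [a-c]zz: the same set written as a range.
ranged = [w for w in words if re.fullmatch(r"[a-c]zz", w)]

print(explicit)            # ['azz', 'bzz', 'czz']
print(ranged == explicit)  # True
```

Note that ‘zz’ alone is rejected: a charset always consumes exactly one character.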
Using Ranges
Inside [ ], a hyphen between two characters defines a range:
- a-z → Matches lowercase letters.
- A-Z → Matches uppercase letters.
- 0-9 → Matches any digit.
- [a-cx-z]zz → Matches ‘azz’, ‘bzz’, ‘czz’, ‘xzz’, ‘yzz’, ‘zzz’.
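To see how two ranges combine in one charset, here is a short sketch using Python's re module (an assumption; the notes themselves use grep). The candidate strings are hypothetical.

```python
import re

# Two ranges in one charset: a to c, plus x to z.
pattern = re.compile(r"[a-cx-z]zz")

# Hypothetical candidates, including letters outside both ranges.
candidates = ["azz", "czz", "dzz", "mzz", "xzz", "zzz", "Azz", "1zz"]
matched = [c for c in candidates if pattern.fullmatch(c)]

print(matched)  # ['azz', 'czz', 'xzz', 'zzz']
```

Ranges are case-sensitive, so ‘Azz’ is rejected unless A-Z is added to the charset.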
Matching and Excluding Patterns
- [a-zA-Z] → Matches any single letter (lowercase or uppercase).
- file[1-3] → Matches ‘file1’, ‘file2’, ‘file3’.
- [^k]ing → Matches ‘ring’, ‘sing’, ‘$ing’, but NOT ‘king’.
- [^a-c]at → Matches ‘fat’, ‘hat’, but NOT ‘bat’ or ‘cat’.
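The matching and excluding patterns above can be tried out as follows. This is a sketch that assumes Python's re module; full_matches is a hypothetical helper name, and the inputs are made up.

```python
import re

def full_matches(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

# A digit range after a literal prefix.
files = full_matches(r"file[1-3]", ["file1", "file3", "file4", "fileA"])

# ^ as the first character inside [ ] negates the set.
not_k = full_matches(r"[^k]ing", ["ring", "sing", "$ing", "king", "ing"])
not_a_to_c = full_matches(r"[^a-c]at", ["fat", "hat", "bat", "cat"])

print(files)       # ['file1', 'file3']
print(not_k)       # ['ring', 'sing', '$ing']
print(not_a_to_c)  # ['fat', 'hat']
```

A negated charset still requires one character to be present, which is why the bare ‘ing’ does not match [^k]ing.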
Important Notes
- Charset vs. String Matching: [abc] matches any occurrence of ‘a’, ‘b’, or ‘c’ in a string, not necessarily “abc” in order.
- Order Matters: When specifying charsets, match the given order in the question.
- Efficiency in Regex:
  - Be specific when possible (e.g., [a-c] instead of [a-z] if only ‘a’ to ‘c’ is needed).
  - Avoid unnecessary complexity (e.g., a range such as [a-z] is preferable to listing many scattered characters one by one).
Regular Expressions: Wildcards and Optional Characters
Wildcard Matching (. Dot)
- . (dot) matches any single character (except line breaks).
- Example: a.c matches aac, abc, a0c, a!c, etc.
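A quick way to confirm the dot's behaviour, including the line-break exception, is the sketch below. It assumes Python's re module with default flags, and the sample strings are illustrative.

```python
import re

dot = re.compile(r"a.c")

# Illustrative samples: the last two should fail.
samples = ["aac", "abc", "a0c", "a!c", "ac", "a\nc"]
matched = [s for s in samples if dot.fullmatch(s)]

print(matched)  # ['aac', 'abc', 'a0c', 'a!c']
```

‘ac’ fails because the dot must consume one character, and "a\nc" fails because the dot does not match a line break by default.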
Optional Characters (? Question Mark)
- ? makes the preceding character optional.
- Example: abc? matches:
  - ab (without c)
  - abc (with c)
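The following sketch, again assuming Python's re module, shows that ? applies only to the single character before it. The test strings are invented.

```python
import re

optional_c = re.compile(r"abc?")

# Map each invented test string to whether the whole string matches.
results = {s: bool(optional_c.fullmatch(s)) for s in ["ab", "abc", "abcc", "a"]}

print(results)  # {'ab': True, 'abc': True, 'abcc': False, 'a': False}
```

Only the c is optional; the a and b are still required, and a second c is not allowed.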
Matching a Literal Dot (\.)
- . is a special character, so to match a literal dot (.), use \. (a backslash followed by a dot).
- Example: a.c matches abc, a@c, a#c, etc., while a\.c matches only a.c.
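The difference between the wildcard and the escaped dot can be sketched as below, assuming Python's re module. The use of re.escape at the end is an addition not mentioned in the notes; it builds the escaped pattern from a literal string.

```python
import re

names = ["a.c", "abc", "a@c"]

# Unescaped dot: wildcard, matches all three.
wildcard = [n for n in names if re.fullmatch(r"a.c", n)]

# Escaped dot: matches only a literal ".".
literal = [n for n in names if re.fullmatch(r"a\.c", n)]

# re.escape produces the escaped form from plain text.
escaped = re.escape("a.c")

print(wildcard)  # ['a.c', 'abc', 'a@c']
print(literal)   # ['a.c']
print(escaped)   # a\.c
```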
Regular Expressions: Line Anchors and Grouping
Line Anchors
- ^ → Matches the start of a line.
- Example: ^abc matches lines starting with “abc”.
- $ → Matches the end of a line.
- Example: xyz$ matches lines ending with “xyz”.
Important Note:
- ^ has two meanings:
- Inside [] brackets: Excludes characters ([^abc] means “not a, b, or c”).
- Outside brackets: Specifies the start of a line.
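Both anchors and the two meanings of ^ can be illustrated with a small sketch. It assumes Python's re module and searches each sample line separately, so no multi-line flag is needed. The lines are made up.

```python
import re

lines = ["abc at the start", "ends with abc", "ends with xyz", "xyz first"]

# ^ outside brackets anchors the match to the start of the line.
starts = [line for line in lines if re.search(r"^abc", line)]

# $ anchors the match to the end of the line.
ends = [line for line in lines if re.search(r"xyz$", line)]

# ^ inside brackets negates the charset instead.
not_abc = re.findall(r"[^abc]", "aXbYc")

print(starts)   # ['abc at the start']
print(ends)     # ['ends with xyz']
print(not_abc)  # ['X', 'Y']
```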
Grouping and Either/Or (|)
- Grouping with (): Used to group patterns or repeat patterns.
- Either/Or (|): Works like an “OR” condition.
- Example: during the (day|night) matches:
- “during the day”
- “during the night”
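As a rough illustration of the either/or group, the sketch below assumes Python's re module and uses invented sentences. It also reads back which alternative matched, something the notes do not cover.

```python
import re

pattern = re.compile(r"during the (day|night)")

sentences = [
    "We work during the day.",
    "Owls hunt during the night.",
    "It rained during the evening.",
]

found = []
for sentence in sentences:
    m = pattern.search(sentence)
    # group(1) is the text matched by the parenthesised alternatives.
    found.append(m.group(1) if m else None)

print(found)  # ['day', 'night', None]
```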
Repeating Groups
- (pattern){n} repeats the pattern n times.
- Example: (no){5} matches “nonononono”.
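The sketch below, assuming Python's re module, contrasts repeating a group with repeating a single character. The ungrouped pattern no{5} is an extra example added for comparison.

```python
import re

# The group "no" repeated five times.
grouped = re.fullmatch(r"(no){5}", "nonononono")

# Only four repetitions: no match.
too_short = re.fullmatch(r"(no){5}", "nononono")

# Without parentheses, {5} applies to the "o" alone.
ungrouped = re.fullmatch(r"no{5}", "nooooo")

print(grouped is not None)    # True
print(too_short is not None)  # False
print(ungrouped is not None)  # True
```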
Sometimes it’s very useful to specify that we want to search by a certain pattern at the beginning or the end of a line. We do that with these characters:
- ^ – starts with
- $ – ends with
So for example, if you want to search for a line that starts with abc, you can use ^abc.
If you want to search for a line that ends with xyz, you can use xyz$.
Note: The ^ hat symbol is used to exclude a charset when enclosed in [square brackets], but when it is not, it is used to specify the beginning of a line.
You can also define groups by enclosing a pattern in (parentheses). Groups have many uses that are beyond the scope of this tutorial. We will use them to define an either/or pattern, and also to repeat patterns. To say “or” in Regex, we use the | pipe.
For an “either/or” pattern example, the pattern during the (day|night) will match both of these sentences: during the day and during the night.
For a repetition example, the pattern (no){5} will match the sentence nonononono.
Metacharacters
There are easier ways to match bigger charsets. For example, \d is used to match any single digit. Here’s a reference:
- \d matches a digit, like 9
- \D matches a non-digit, like A or @
- \w matches an alphanumeric character, like a or 3
- \W matches a non-alphanumeric character, like ! or #
- \s matches a whitespace character (spaces, tabs, and line breaks)
- \S matches everything else (alphanumeric characters and symbols)
Note: Underscores _ are included in the \w metacharacter and not in \W. That means that \w will match every single character in test_file.
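The metacharacters, including the underscore note, can be checked with a sketch like the one below. It assumes Python's re module, where these classes are built in (plain grep may need an extended or Perl-compatible mode). The input strings are illustrative.

```python
import re

# \w includes the underscore, so every character of test_file matches.
word_chars = re.findall(r"\w", "test_file")

digits = re.findall(r"\d", "room 101")
symbols = re.findall(r"\W", "a!b#c")
spaces = re.findall(r"\s", "a b\tc\n")
non_spaces = re.findall(r"\S", "a b\tc\n")

print("".join(word_chars))  # test_file
print(digits)               # ['1', '0', '1']
print(symbols)              # ['!', '#']
print(spaces)               # [' ', '\t', '\n']
print(non_spaces)           # ['a', 'b', 'c']
```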
Often we want a pattern that matches many characters of a single type in a row, and we can do that with repetitions. For example, {2} is used to match the preceding character (or metacharacter, or charset) two times in a row. That means that z{2} will match exactly zz.
Here’s a reference for each repetition along with how many times it matches the preceding pattern:
- {12} – exactly 12 times.
- {1,5} – 1 to 5 times.
- {2,} – 2 or more times.
- * – 0 or more times.
- + – 1 or more times.
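Each repetition form can be exercised with a short sketch. It assumes Python's re module; matches is a hypothetical helper, and the patterns and inputs are chosen only for illustration.

```python
import re

def matches(pattern, text):
    """True when the pattern matches the whole text."""
    return re.fullmatch(pattern, text) is not None

checks = [
    (r"z{2}", "zz"),         # exactly 2
    (r"z{2}", "zzz"),        # one too many
    (r"\d{1,5}", "12345"),   # within 1 to 5
    (r"\d{1,5}", "123456"),  # six digits, too long
    (r"a{2,}", "a"),         # fewer than 2
    (r"a{2,}", "aaaa"),      # 2 or more
    (r"ab*", "a"),           # * allows zero b's
    (r"ab+", "a"),           # + needs at least one b
    (r"ab+", "abbb"),
]

results = [matches(p, t) for p, t in checks]
print(results)
# [True, False, True, False, False, True, True, False, True]
```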