Extended Regular Expressions, Grep, Sed, Awk

----------------------1. Extended Regular Expressions-------------------------------
Tutorial: http://www.grymoire.com/Unix/Regular.html

BRIEF
\<word\> - a whole word (\< matches the beginning of a word, \> the end of a word)
[...] - any of the set of characters in []
- ex [123] - any of the digits 1, 2, 3
[123 ] - 1, 2, 3, or space
[a-z] - any lowercase letter ([[:lower:]])
[aeiou] - any vowel
[^...] - any character except those between []
- ex [^0-9] - any character that is not a digit ([^[:digit:]])
[^,.:] - anything except , . or :
* - previous expression 0 or more times
+ - previous expression 1 or more times

? - previous expression 0 or once

. - any character, once
\ - escapes a special character (ex \. matches a literal dot)
^ - beginning of line
$ - end of line

( group-expr ) - group regular expression

{M,N} - M to N repetitions of the previous expression

| - or
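
The operators above can be tried out with grep -E on a few sample lines. This is only a minimal sketch: the sample text and the helper function count are made up for illustration and are not part of the lab material.

```shell
#!/bin/sh
# Sample input invented for this illustration.
sample='cat
concatenate
color colour
2024-01-15
abc abc'

# count REGEX: print how many sample lines match the extended regex.
count() {
    printf '%s\n' "$sample" | grep -E -c "$1"
}

count '^c'                  # ^ anchors at line start: 3 lines
count 'colou?r'             # ? makes the u optional: 1 line
count '[0-9]{4}-[0-9]{2}'   # {M} repeats exactly M times: 1 line
count '^(abc ?)+$'          # a group repeated 1 or more times: 1 line
count 'cat|abc'             # alternation: 3 lines
```

Note that 'cat|abc' also counts "concatenate", because the pattern is not limited to whole words.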

---------------------------2. Grep--------------------------------

-E: use extended regular expressions

-i: ignore the case of your search term
-v: show lines that don’t match, instead of those that do
-c: instead of returning the matching lines, return the number of lines that match
-x: match only whole lines (the pattern must match the entire line)

-m N: stop reading the file after N matching lines
-n: print the line number of where matches were found
-q: don’t output anything, but exit with status 0 if any match is found (check that status with echo $?).

-o: print only the matching part of the line

-w: match whole words only (the match must not be adjacent to a letter, digit or underscore)
--color: add color to the matched output
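
A short sketch of how several of these options behave on the same input. The file content and the variable names are invented for the example; -m is assumed to be available, as it is in GNU and BSD grep.

```shell
#!/bin/sh
# Sample input invented for this illustration.
tmp=$(mktemp)
printf '%s\n' 'Cat' 'cat and dog' 'concatenate' 'dog' > "$tmp"

n_icase=$(grep -i -c 'cat' "$tmp")    # lines containing cat in any case
n_word=$(grep -w -c 'cat' "$tmp")     # cat as a whole word, case sensitive
n_exact=$(grep -x -c 'dog' "$tmp")    # lines that are exactly "dog"
n_invert=$(grep -v -c 'cat' "$tmp")   # lines without lowercase "cat"
first=$(grep -n -m 1 'dog' "$tmp")    # first matching line, with its number
only=$(grep -o 'c[a-z]*' "$tmp" | head -n 1)   # only the matched text

echo "$n_icase $n_word $n_exact $n_invert"   # 3 1 1 2
echo "$first"                                # 2:cat and dog
echo "$only"                                 # cat

# -q prints nothing; the exit status tells whether a match exists
if grep -q 'dog' "$tmp"; then echo "found"; fi
rm -f "$tmp"
```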

ex: find all lines that start with a digit

grep -E --color "^[1234567890]" a.txt

Find all lines from a file that contain the word cat

grep -E "\<cat\>" a.txt

Find all lines from a file that start with a word ending in “ing”

grep -E "^[a-zA-Z]*ing\>" a.txt

Find all lines from a file with an odd number of characters

grep -E "^.(..)*$" a.txt

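+ vs \+ (without -E, + is an ordinary character and \+ means "1 or more" in GNU grep; with -E the roles are swapped)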
echo "1+2=" |grep "1+"

echo "1+2=" |grep "1\+"
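
Both commands above print the whole line, so the difference is easier to see with -o, which prints only the matched text. A small sketch, with variable names chosen for this example only; it sticks to the cases whose behaviour does not depend on GNU extensions.

```shell
#!/bin/sh
# -o prints only the matched text, which makes the difference visible.
bre_plain=$(echo "1+2=" | grep -o '1+')         # basic regex: + is a literal plus
ere_plain=$(echo "1+2=" | grep -E -o '1+')      # extended regex: + means 1 or more
ere_escaped=$(echo "1+2=" | grep -E -o '1\+')   # extended regex: \+ is a literal plus

echo "$bre_plain"     # 1+
echo "$ere_plain"     # 1
echo "$ere_escaped"   # 1+
```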

----------------------------------3. Sed---------------------------------

Find and replace: sed 's/find/replace/gi' file.txt

-E: use extended regular expressions

g - global replacement (every match on the line, not only the first)

i - case insensitive matching (GNU sed extension)

\1, \2 ... - in the replace section, refer to the text matched by the 1st, 2nd ... ( group ) of the find section

& - the entire text matched by the find section

Transliterate: sed y/ab/AB/ file.txt

Delete: sed /regex/d file.txt

ex: Replace s with sh

sed -E 's/s/sh/gi' a.txt

Add the prefix 'abc' to each line in a file

sed -E "s/^/abc/" a.txt

Replace any empty line in a file with the word 'empty'

sed -E "s/^$/empty/" a.txt

Rotate every triplet of characters in a file one position to the left:

ex: abc => bca

sed -E "s/(.)(.)(.)/\2\3\1/g" a.txt
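
The main sed features listed in this section (group references, &, transliteration and deletion) can be checked on one-line inputs. A minimal sketch; the input strings and variable names are made up for the example.

```shell
#!/bin/sh
# Sample inputs invented for this illustration.
swapped=$(echo 'John Smith' | sed -E 's/([A-Za-z]+) ([A-Za-z]+)/\2, \1/')   # \1, \2: groups
quoted=$(echo 'hello world' | sed -E 's/[a-z]+/<&>/g')                       # &: whole match
upper=$(echo 'abcab' | sed 'y/ab/AB/')                                       # transliterate
kept=$(printf '%s\n' 'keep' '# comment' 'also keep' | sed '/^#/d')           # delete lines

echo "$swapped"   # Smith, John
echo "$quoted"    # <hello> <world>
echo "$upper"     # ABcAB
echo "$kept"      # the two lines that do not start with #
```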

--------------------------------4. Awk-----------------------------------

AWK short intro
The essential organization of an AWK program follows the form:
pattern { action }
BEGIN { print "START" }

/search regex/ { print }

END { print "STOP" }

-Blocks, selectors (BEGIN, END)
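
To see the order in which the three blocks run, the program above can be fed a few lines from a pipe. A minimal sketch with invented input; /apple/ stands in for the search regex.

```shell
#!/bin/sh
# Sample input invented for this illustration.
result=$(printf '%s\n' 'apple pie' 'banana' 'apple juice' |
    awk 'BEGIN { print "START" }
         /apple/ { print }
         END { print "STOP" }')
echo "$result"
```

BEGIN runs once before any input is read, the pattern block runs for every matching line, and END runs once after the last line, so the output is START, the two apple lines, then STOP.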

Basic syntax

a) There are only a few commands in AWK. The list and syntax follows:

if ( conditional ) statement [ else statement ]
while ( conditional ) statement
for ( expression ; conditional ; expression ) statement
for ( variable in array ) statement
break
continue
{ [ statement ] ...}
variable=expression
print [ expression-list ] [ > expression ]
printf format [ , expression-list ] [ > expression ]
next
exit

b) Built-in variables
print $0 - refers to the entire line
print $1, $2, $3, $4, $5, $6, $7, $8 - each $N refers to the N-th field

NF - number of fields in the current record

NR - number of records read so far (the current line number)

FILENAME - the name of the file being read
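
A quick way to get a feel for these variables is to print them for a two-line file. This is just a sketch; the temporary file and the variable names report and name are invented for the example.

```shell
#!/bin/sh
# Sample input invented for this illustration.
tmp=$(mktemp)
printf '%s\n' 'a b c' 'd e' > "$tmp"

# NR is the current record number, NF the number of fields on that record
report=$(awk '{ print NR ": " NF " fields, first is " $1 }' "$tmp")
# FILENAME holds the name of the file currently being read
name=$(awk 'NR == 1 { print FILENAME }' "$tmp")

echo "$report"
echo "$name"
rm -f "$tmp"
```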

c) Changing the field separator
AWK can be used to parse many system administration files. However, many of these files do not use whitespace as a separator. For example, the password file uses colons. You can change the field separator to a colon with the "-F" command line option. The following command prints the accounts that don't have passwords:

awk -F: '{if ($2 == "") print $1 ": no password!"}' /etc/passwd

------------------------------------
#!/bin/awk -f
BEGIN {
FS=":";
}
{
if ( $2 == "" ) {
print $1 ": no password!";
}
}
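
Since the real /etc/passwd rarely contains empty password fields, the same check can be tried on a made-up file in the same layout. A sketch only; the two accounts below are invented.

```shell
#!/bin/sh
# Made-up data in /etc/passwd layout (name:password:uid:gid:info:home:shell).
tmp=$(mktemp)
printf '%s\n' \
    'alice:x:1000:1000:Alice A:/home/alice:/bin/bash' \
    'guest::1001:1001:Guest:/home/guest:/bin/sh' > "$tmp"

# Same test as above, written as pattern { action } instead of an if
no_pw=$(awk -F: '$2 == "" { print $1 ": no password!" }' "$tmp")
echo "$no_pw"     # guest: no password!
rm -f "$tmp"
```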

Useful functions

length(s)
substr(s,p,n)
index(s, t)
split(s, a , c)
sprintf(format, arg, ...)
...
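
The functions listed above can be tried on a single input line. A minimal sketch; the input string and the array name parts are chosen for the example.

```shell
#!/bin/sh
out=$(echo 'hello:world' | awk '{
    n = split($0, parts, ":")     # parts[1] = "hello", parts[2] = "world"
    print length($0)              # number of characters in the line
    print substr($0, 1, 5)        # 5 characters starting at position 1
    print index($0, "world")      # 1-based position of "world"
    print n, parts[2]             # number of pieces and the second piece
    print sprintf("%05d", 42)     # formatted string, zero padded
}')
echo "$out"
```

The five output lines are 11, hello, 7, "2 world" and 00042.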

Running awk file scripts
----------------------------------------------------
A simple shell script file named hello.sh
#!/bin/bash
echo Hello World

chmod +x hello.sh or chmod 755 hello.sh
./hello.sh
------------------------------------------------------

a) As a shell script file named ex1.sh
#!/bin/sh
# this is a comment
awk '
BEGIN { print "File\tOwner" }
{ print $8, "\t", $3}
END { print " - DONE -" }
'

chmod +x ex1.sh
ls -l | ./ex1.sh

b) As an awk script file named ex2.awk
#!/bin/awk -f
BEGIN { print "File\tOwner" }
{ print $8, "\t", $3}
END { print " - DONE -" }

awk -f ex2.awk input

Examples:
Having a file in which each line contains at least 2 numbers separated by space, calculate

a) sum of first 2 numbers on each line

awk '{print $1+$2}' num.txt

b) sum of first 2 numbers on odd lines

awk '{if (NR % 2 == 1) print $1+$2}' num.txt

c) sum of first 2 numbers on odd lines with more than 5 numbers on a line

awk '{if (NR % 2 == 1 && NF > 5) print $1+$2}' num.txt

d) sum of numbers in the first column of the file

awk 'BEGIN {n=0} {n=n+$1} END {print n}' num.txt

e) sum of all numbers in the file

awk '{for (i=1; i<=NF; i++) n=n+$i} END {print n}' num.txt
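
The commands above can be checked against a small file whose sums are easy to compute by hand. A sketch with invented numbers; the variable names are for the example only.

```shell
#!/bin/sh
# Sample num.txt content invented for this illustration.
tmp=$(mktemp)
printf '%s\n' '1 2 3' '4 5 6 7 8 9' '10 20' > "$tmp"

odd=$(awk 'NR % 2 == 1 { print $1 + $2 }' "$tmp")    # lines 1 and 3
col=$(awk '{ n += $1 } END { print n }' "$tmp")      # 1 + 4 + 10
all=$(awk '{ for (i = 1; i <= NF; i++) n += $i } END { print n }' "$tmp")

echo "$odd"    # 3 and 30, one per line
echo "$col"    # 15
echo "$all"    # 75
rm -f "$tmp"
```

An uninitialized awk variable counts as 0 in arithmetic, which is why n needs no BEGIN block here.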

Regular expressions in AWK: https://opensource.com/article/19/11/how-regular-expressions-awk

---------------------------------Solved problems:-------------------------

1. Display the lines in /etc/passwd that belong to users having three parent initials in their name, even if the initials do not have a dot after them. You will notice that the extended regular expression accepts things that are not really parent initials, but there isn’t much else that we can do ...

grep -E " [A-Z]\.?[A-Z]\.? [A-Z]\.? " /etc/passwd

2. Display the lines in /etc/passwd that belong to users having names of 12 characters or longer (this year there is one with a 13 character name)

grep -E -i "^([^:]*:){4}[^:]*[a-z]{12,}" /etc/passwd

3. Convert the content of /etc/passwd using a sort of Leet/Calculator spelling (eg Bogdan -> B09d4n)

sed "y/elaoszbg/31405289/" /etc/passwd

4. Convert the content of /etc/passwd, surrounding with parentheses any sequence of 3 or more vowels

sed -E "s/([aeiou]{3,})/(\1)/gi" /etc/passwd
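
Solutions 3 and 4 are easier to verify on short strings than on the whole /etc/passwd. A small sketch; the second input string is invented, and the i flag is left out so that only lowercase vowels are considered.

```shell
#!/bin/sh
leet=$(echo 'Bogdan' | sed 'y/elaoszbg/31405289/')
vowels=$(echo 'queue beautiful' | sed -E 's/([aeiou]{3,})/(\1)/g')

echo "$leet"     # B09d4n
echo "$vowels"   # q(ueue) b(eau)tiful
```

The uppercase B stays unchanged because y maps only the characters listed in its first set.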

5. Display the full names (but only the full names) of the students belonging to group 211

awk -F: '$6 ~ /\/gr211\// {print $5}' /etc/passwd

6. Count the numbers of male and female users in /etc/passwd, accepting as true the following incorrect assumptions:
- All users have their last name as the first name in the user-info field (5th field)
- All women have one of their first or middle names ending in the letter “a”

awk -F: -f prog.awk /etc/passwd

------------prog.awk---

BEGIN { m=0; w=0 }
# The space at the beginning of the regular expressions
# keeps them from matching the last name (the first word of the field).
# \> (end of word) is a GNU awk extension.

$5 ~ / [a-zA-Z]*[b-z]\>/ { m++ }
$5 ~ / [a-zA-Z]*a\>/ { w++ }
END { print "Men:", m; print "Women:", w }

----------------------

------------------------------------------------------------------------------------------------------------------------------------------------

-------------------------------------------------------Lab 4 Problems:----------------------------------------------------------------------

1. Use file /etc/passwd and print out how many groups contain students named Dan with an even student ID number (numar matricol).

2. Print the 3rd column for lines that do not start with a digit.

3. Create a file with the content of the manual page of the command man. Use grep/sed/awk to select the lines that start with "MAN" or with spaces followed by "MAN" and replace all occurrences of "MAN" with "*star*". Print the first and second column of these lines, separated by a dash "-", but only the lines that do not contain "WILL" or "Will" or "will" in the first two columns.

4. Write a shell command that prints out a statistic of the number of processes per user, using commands ps, awk/cut, sort and uniq.

5. Display only the last name of each user in /etc/passwd, considering the last name to be the first word in the 5th field, and accepting it only if it starts with a capital letter

awk -F: '$5 ~ /^[A-Z]/ {print $5}' /etc/passwd | cut -d' ' -f1

or, with awk instead of cut...

awk -F: '$5 ~ /^[A-Z]/ {print $5}' /etc/passwd | awk '{print $1}'

6. Extend the solution above to only show the top 10 most frequent last names, ordered descending by their popularity

... | sort | uniq -c | sort -n -r | head -n 10
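
The counting idiom works because uniq -c only merges adjacent equal lines, so the input has to be sorted first. A minimal sketch on an invented list of last names, keeping the top 2 instead of the top 10; the final awk only swaps the columns to drop the padding that uniq -c adds.

```shell
#!/bin/sh
# Sample last names invented for this illustration.
names='Pop
Ionescu
Pop
Popescu
Pop
Ionescu'

top=$(printf '%s\n' "$names" | sort | uniq -c | sort -n -r | head -n 2 |
    awk '{ print $2, $1 }')
echo "$top"
```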

7. Display all the directories under /etc that contain files with the extension .sh. Each directory should be displayed only once. Hide the permission denied errors given by find.

find /etc -name "*.sh" 2>/dev/null | sed "s/\/[^\/]*$//" | sort|uniq

or simpler by using a different sed separator...

find /etc -name "*.sh" 2>/dev/null | sed "s,/[^/]*$,," | sort|uniq
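
The sed step can be tested without find by feeding it a few paths directly. A sketch with made-up paths: the expression removes the last slash and everything after it, leaving the directory.

```shell
#!/bin/sh
# Sample paths invented for this illustration.
dirs=$(printf '%s\n' '/etc/init.d/a.sh' '/etc/init.d/b.sh' '/etc/profile.d/c.sh' |
    sed 's,/[^/]*$,,' | sort | uniq)
echo "$dirs"
```

The two files under /etc/init.d collapse into a single line once the duplicates are removed.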

8. Display in the pager, the number of processes of each username, sorting their usernames descending by their process count.

ps -ef| awk '{print $1}'|sort|uniq -c|sort -n -r | less

9. Display the processes that involve editing a C file

ps -ef| grep "\.c\>"

10. Display in the pager, the usernames with the most logins in the system.

11. Display in the pager the top of usernames by their time spent logged on in the system. The solution will be built gradually following the steps below

a. Display all the usernames and their time spent in the system, ignoring the other fields displayed by command last.

last | awk '{print $1, $10}'
b. Making the time spent in the system field uniform across the output, by adding a 0+ to all entries missing a day element. That is, (03:35) should become (0+03:35).

... |sed -E "s/\(([0-9][0-9]:)/(0+\1/"
c. Calculate the time spent in the system in minutes for each entry

... | sed -E "s/[:+()]/ /g"|awk '{print $1, ($2*24*60+$3*60+$4)}'
d. Calculate the total time spent in the system by each user

... | awk -f prog.awk

prog.awk

{
arr[$1] += $2
}

END {
for(u in arr) {
print u, arr[u]
}
}

or, directly on the command line...

... | awk '{arr[$1] += $2} END {for(u in arr) print u, arr[u]}'

e. Sort the output descending by the time spent in the system and pipe it to the pager

... | awk '{print $2, $1}' | sort -n -r | less
or, simpler...

... | sort -k2nr | less
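
Steps b to e can be rehearsed on a few invented lines in the format that step a produces, without depending on the real output of last. This is a sketch under that assumption; note that the parentheses are replaced by spaces as well, so that the day count is read as a number.

```shell
#!/bin/sh
# Made-up sample of what step a. prints: username and session length.
sessions='ana (03:35)
dan (1+02:00)
ana (00:25)'

# b. add the missing 0+ day element
# c. split into days, hours, minutes and convert to minutes
# d. add up the minutes per user
# e. sort descending by the total
totals=$(printf '%s\n' "$sessions" |
    sed -E 's/\(([0-9][0-9]:)/(0+\1/' |
    sed -E 's/[:+()]/ /g' |
    awk '{ print $1, $2*24*60 + $3*60 + $4 }' |
    awk '{ arr[$1] += $2 } END { for (u in arr) print u, arr[u] }' |
    sort -k2nr)
echo "$totals"
```

Here dan has one session of 1 day and 2 hours (1560 minutes) and ana has two sessions adding up to 240 minutes, so dan is listed first.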

Homework 1: Read more about these tools in the tutorials (others also on the course page - see bottom section with Resources):

grep : https://www.grymoire.com/Unix/Grep.html
sed: https://www.grymoire.com/Unix/Sed.html
awk : https://www.grymoire.com/Unix/Awk.html
Additional resources : http://www.cs.ubbcluj.ro/~alinacalin/SO/Labs/Lab%204%202022%20Grep%20Sed%20Awk.pdf

Homework 2: Practice the use of composed commands using grep, sed, awk, find, head, tail, cat, cut, echo, ls, sort, uniq, ps, finger, who, last, tee