LAB 8 : Regular Expression in Python

Regular expressions, also called regex, allow us to search for and match specific patterns of text.

Raw string: a regular string literal prefixed with r. The prefix tells Python not to treat backslash sequences such as '\n' or '\t' in any special way, so the backslashes reach the regex engine unchanged.
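A minimal sketch of the difference between a normal string and a raw string. The sample strings and variable names here are made up for illustration only.

```python
# A normal string turns \t into a single tab character.
normal = 'a\tb'

# A raw string keeps the backslash and the letter t as two separate characters.
raw = r'a\tb'

print(len(normal))  # 3
print(len(raw))     # 4
print(raw)          # a\tb
```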

To use regular expressions, we first need to import the re module.

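The match() method checks for the given pattern at the start of the string. If it finds a match, it returns a match object, whose printed form includes the span; otherwise it returns None. The span() method of the match object gives the start and end index of the match. Note that Python 3.7 and later print the object as re.Match, while older versions print _sre.SRE_Match as in the outputs shown in this lab.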

import re
x = 'this is my first regular expression'
pattern = re.match(r'this',x)
print(pattern)
<_sre.SRE_Match object; span=(0, 4), match='this'>
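
As a small sketch of how the match object can be used, the snippet below reads the matched text and the span, and also shows what happens when the pattern is not at the start of the string. The variable names m and no_match are illustrative only.

```python
import re

x = 'this is my first regular expression'

# 'this' is at the very start of the string, so match() succeeds.
m = re.match(r'this', x)
print(m.group())  # this
print(m.span())   # (0, 4)

# 'first' occurs later in the string, so match() returns None.
no_match = re.match(r'first', x)
print(no_match)   # None
```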

search() method: matches the pattern anywhere in the string. If a match is found, it returns the match object along with its span. search() stops at the first match, so it cannot be used to collect several matches of the same pattern.

import re
x = 'this is my first regular expression'
pattern = re.search(r'first',x)
print(pattern)
<_sre.SRE_Match object; span=(11, 16), match='first'>
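
To illustrate that search() stops at the first match, here is a short sketch on the same string. The letter 'e' occurs three times, but only the first occurrence is reported. The names first_e and missing are illustrative only.

```python
import re

x = 'this is my first regular expression'

# Only the first 'e' (inside 'regular') is returned.
first_e = re.search(r'e', x)
print(first_e.span())  # (18, 19)

# When the pattern does not occur at all, search() returns None.
missing = re.search(r'python', x)
print(missing)         # None
```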

findall() method: keeps scanning the string until no more matches are available and returns all of the matches as a list.

import re
x = 'this is my first regular expression'
pattern = re.findall(r'e',x)
print(pattern)
['e', 'e', 'e']
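
A further sketch, assuming the same sample string: findall() works with patterns longer than one character, and it returns an empty list rather than None when nothing matches.

```python
import re

x = 'this is my first regular expression'

# 'is' occurs inside 'this' and as the separate word 'is'.
found = re.findall(r'is', x)
print(found)      # ['is', 'is']

# No 'z' in the string, so the result is an empty list.
not_found = re.findall(r'z', x)
print(not_found)  # []
```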

The compile() method compiles a regular expression pattern into a regular expression object, which can then be used for matching through its own match() and search() methods. This is useful when you have to use the same pattern several times in your program.

The finditer() method returns an iterator of match objects. The string is scanned left to right, and matches are returned in the order found. The iterator is exhausted once you have looped over it. See the example below.

import re
x = 'this is my first regular expression'
pattern=re.compile('e')
p=pattern.finditer(x)
for i in p:
    print(i)
<_sre.SRE_Match object; span=(18, 19), match='e'>
<_sre.SRE_Match object; span=(25, 26), match='e'>
<_sre.SRE_Match object; span=(29, 30), match='e'>
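
The sketch below shows two points from the description: one compiled object can be reused with several methods, and the iterator from finditer() gives nothing on a second pass. The names first_pass and second_pass are illustrative only.

```python
import re

x = 'this is my first regular expression'
pattern = re.compile(r'e')

# The same compiled pattern is reused with different methods.
print(pattern.search(x).span())  # (18, 19)
print(pattern.findall(x))        # ['e', 'e', 'e']

# An iterator can be consumed only once.
it = pattern.finditer(x)
first_pass = [m.span() for m in it]
second_pass = [m.span() for m in it]
print(first_pass)   # [(18, 19), (25, 26), (29, 30)]
print(second_pass)  # []
```

If the matches are needed more than once, one option is to store them in a list, as first_pass does here.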

Special Characters in Regular expression

\d = matches a digit (0-9)

\D = matches any character that is not a digit

\w = matches a word character (a-z, A-Z, 0-9 and the underscore); it does not match other special characters or whitespace

\W = matches any character that is not a word character

\s = matches a whitespace character (space, \n, \t)

\S = matches any character that is not whitespace

\b = matches a word boundary

\B = matches a position that is not a word boundary

$ = matches the end of the string

^ = matches the start of the string

[] = matches any one of the characters or digits listed in the brackets; this is called a character class or character set

[^] = matches any one character not listed in the brackets; it is the negation of the character class
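
A quick sketch that tries several of these special characters with findall(). The sample strings s and t are made up for illustration.

```python
import re

s = 'Room 42, floor_3!'
t = 'cat concat cat.'

digits = re.findall(r'\d', s)          # every single digit
words = re.findall(r'\w+', s)          # runs of word characters (underscore included)
spaces = re.findall(r'\s', s)          # whitespace characters
whole_cat = re.findall(r'\bcat\b', t)  # 'cat' only as a whole word
inner_cat = re.findall(r'\Bcat', t)    # 'cat' that does not start a word
at_start = re.findall(r'^Room', s)     # only at the start of the string
at_end = re.findall(r'!$', s)          # only at the end of the string
char_set = re.findall(r'[0-9_]', s)    # any digit or underscore

print(digits)     # ['4', '2', '3']
print(words)      # ['Room', '42', 'floor_3']
print(spaces)     # [' ', ' ']
print(whole_cat)  # ['cat', 'cat']
print(inner_cat)  # ['cat']
print(at_start)   # ['Room']
print(at_end)     # ['!']
print(char_set)   # ['4', '2', '_', '3']
```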

Example : Program to match every word ending in 'at' except 'rat' in the given string

import re
y='cat mat pat rat bat'
pattern = re.compile(r'[^r]at')
matches = pattern.finditer(y)
for match in matches:
    print(match)
<_sre.SRE_Match object; span=(0, 3), match='cat'>
<_sre.SRE_Match object; span=(4, 7), match='mat'>
<_sre.SRE_Match object; span=(8, 11), match='pat'>
<_sre.SRE_Match object; span=(16, 19), match='bat'>

Example: Program to match all IP addresses in a given string

# Raw string, so the backslash in the metacharacter line stays a literal backslash
text_search = r'''
abcdefghijklmnopqrstuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ
123456789
Hi HiHi
Metacharacters needed to be escaped:
.^*$\+?[]|()
rstforum.net
123-456-7891
192.168.120.12
192.168.120.13
1.1.1.1
22.22.22.22
100.100.100.100
Mr. Gupta
Mr Kaushik
Ms.Swati
Mrs Bose
Mr.T
'''
import re
# Every dot is escaped so that it matches a literal '.' and not any character
pattern = re.compile(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')
matches = pattern.finditer(text_search)
for match in matches:
    print(match)
<_sre.SRE_Match object; span=(149, 163), match='192.168.120.12'>
<_sre.SRE_Match object; span=(164, 178), match='192.168.120.13'>
<_sre.SRE_Match object; span=(179, 186), match='1.1.1.1'>
<_sre.SRE_Match object; span=(187, 198), match='22.22.22.22'>
<_sre.SRE_Match object; span=(199, 214), match='100.100.100.100'>
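
The dot is a metacharacter that matches any character, so the dots in an IP address pattern should be escaped. The sketch below compares a pattern with one unescaped dot against a fully escaped one; the names loose, strict and sample are illustrative only.

```python
import re

# The last dot is NOT escaped, so it matches any character.
loose = re.compile(r'\d{1,3}\.\d{1,3}\.\d{1,3}.\d{1,3}')
# All three dots are escaped, so only a literal '.' is accepted.
strict = re.compile(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')

sample = '10.0.0x5'  # not an IP address: 'x' where a dot should be

print(loose.search(sample).group())  # 10.0.0x5
print(strict.search(sample))         # None
print(strict.search('192.168.120.12').group())  # 192.168.120.12
```

Note that \d{1,3} still accepts values above 255, so neither pattern checks that the address is valid.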

Quantifiers in regular expression

In most of the examples above, each part of the pattern matched one digit or one character at a time. We use a quantifier when a part of the pattern has to match several digits or characters. A quantifier applies to the character, character class or group just before it.

  • *  : 0 or more occurrences of the preceding item
  • + : 1 or more occurrences of the preceding item
  • ? : 0 or 1 occurrence of the preceding item
  • {3} : an exact number of occurrences
  • {3,4} : a range of occurrences (minimum, maximum)
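
The quantifiers listed above can be tried out with a short sketch like this one. The sample strings are made up for illustration.

```python
import re

letters = 'a ab abb'
numbers = '12 123 1234'

star = re.findall(r'ab*', letters)      # 'a' followed by 0 or more 'b'
plus = re.findall(r'ab+', letters)      # 'a' followed by 1 or more 'b'
optional = re.findall(r'ab?', letters)  # 'a' followed by 0 or 1 'b'
exact = re.findall(r'\d{3}', numbers)   # exactly 3 digits
ranged = re.findall(r'\d{3,4}', numbers)  # 3 to 4 digits

print(star)      # ['a', 'ab', 'abb']
print(plus)      # ['ab', 'abb']
print(optional)  # ['a', 'ab', 'ab']
print(exact)     # ['123', '123']
print(ranged)    # ['123', '1234']
```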

Example : Program to match names starting with Mr or Mrs

# Raw string, so the backslash in the metacharacter line stays a literal backslash
text_search = r'''
abcdefghijklmnopqrstuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ
123456789
Hi HiHi
Metacharacters needed to be escaped:
.^*$\+?[]|()
rstforum.net
123-456-7891
192.168.120.12
192.168.120.13
1.1.1.1
22.22.22.22
100.100.100.100
Mr. Gupta
Mr Kaushik
Ms.Swati
Mrs Bose
Mr.T
'''
import re
# s? makes the 's' optional, so both Mr and Mrs are matched
pattern = re.compile(r'Mrs?[. a-zA-Z]*')
matches = pattern.finditer(text_search)
for match in matches:
    print(match)         
<_sre.SRE_Match object; span=(215, 224), match='Mr. Gupta'>
<_sre.SRE_Match object; span=(225, 235), match='Mr Kaushik'>
<_sre.SRE_Match object; span=(245, 253), match='Mrs Bose'>
<_sre.SRE_Match object; span=(254, 258), match='Mr.T'>

Groups in regular expression

A better way of doing this is to use groups, written with parentheses, which let you match alternatives and pick out specific parts of a match. Frequently you need to obtain more information than just whether the RE matched or not. Regular expressions are often used to dissect strings by writing an RE divided into several subgroups that match different components of interest.

The above example becomes much easier to solve using groups:

text_search='''
abcdefghijklmnopqrstuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ
123456789
Hi HiHi
123-456-7891
192.168.120.12
192.168.120.13
1.1.1.1
22.22.22.22
100.100.100.100
Mr. Gupta
Mr Kaushik
Mrs Bose
Mr.T
Ms.Swati
SameerShah@gmail.com
Swati.Kore@education.net
Mahes-98-patil@abc-ty.co.in
'''
import re
pattern = re.compile(r'(Mr|Ms|Mrs)\.?\s?[A-Z]\w*')        
matches = pattern.finditer(text_search)
for match in matches:
    print(match)              
<_sre.SRE_Match object; span=(152, 161), match='Mr. Gupta'>
<_sre.SRE_Match object; span=(162, 172), match='Mr Kaushik'>
<_sre.SRE_Match object; span=(173, 181), match='Mrs Bose'>
<_sre.SRE_Match object; span=(182, 186), match='Mr.T'>
<_sre.SRE_Match object; span=(187, 195), match='Ms.Swati'>
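
Groups also make it possible to pull out the separate parts of each match. The sketch below is a variation on the pattern above with a second pair of parentheses added around the name, which is an assumption made here for illustration; group(1) then holds the title and group(2) the name. The shortened sample text and the variable names are also illustrative.

```python
import re

sample = 'Mr. Gupta\nMrs Bose\nMs.Swati'

# Group 1 captures the title, group 2 captures the name.
pattern = re.compile(r'(Mr|Ms|Mrs)\.?\s?([A-Z]\w*)')

pairs = [(m.group(1), m.group(2)) for m in pattern.finditer(sample)]
print(pairs)  # [('Mr', 'Gupta'), ('Mrs', 'Bose'), ('Ms', 'Swati')]
```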

Example : Program to search for all email IDs in the given string

text_search='''
abcdefghijklmnopqrstuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ
123456789
Hi HiHi
123-456-7891
192.168.120.12
192.168.120.13
1.1.1.1
22.22.22.22
100.100.100.100
Mr. Gupta
Mr Kaushik
Mrs Bose
Mr.T
SameerShah@gmail.com
Swati.Kore@education.net
Mahes-98-patil@abc-ty.co.in
'''
import re
# The trailing group repeats, so multi-part domains such as .co.in are matched in full
pattern = re.compile(r'[a-zA-Z0-9.-]+@[a-zA-Z-]+(\.(com|net|co|in))+')
matches = pattern.finditer(text_search)
for match in matches:
    print(match)
<_sre.SRE_Match object; span=(187, 207), match='SameerShah@gmail.com'>
<_sre.SRE_Match object; span=(208, 232), match='Swati.Kore@education.net'>
<_sre.SRE_Match object; span=(233, 260), match='Mahes-98-patil@abc-ty.co.in'>

Search and Replace in regular expression

Another common task is to find all the matches for a pattern and replace them with a different string. The sub() method takes the pattern, a replacement value, which can be either a string or a function, and the string to be processed, and it returns the new string.

import re
x='Simple text with numbers 12345'
# sub() returns a new string with every digit replaced by '6'
result = re.sub(r'\d', '6', x)
print(result)
Simple text with numbers 66666
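
Since the replacement can also be a function, here is a minimal sketch of that form. The function receives each match object and returns the replacement text. The helper name double_digit is hypothetical, and the count argument shown at the end limits how many replacements are made.

```python
import re

x = 'Simple text with numbers 12345'

def double_digit(match):
    # Called once per match; the return value replaces the matched digit.
    return str(int(match.group()) * 2)

doubled = re.sub(r'\d', double_digit, x)
print(doubled)    # Simple text with numbers 246810

# count=2 replaces only the first two digits.
first_two = re.sub(r'\d', '6', x, count=2)
print(first_two)  # Simple text with numbers 66345
```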