<Automate the Boring Stuff with Python> --- Chapter 7: Regex Examples, Greedy Matching, and Matching Phone Numbers and Email Addresses

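Chapter 7 begins by finding phone numbers with plain string operations and then contrasts that program with a regular-expression version. The regex version is clearly shorter and easier to extend.
Pattern: three digits, a hyphen, three digits, a hyphen, then four digits. For example: 415-555-4242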

 import re

 # Find the pattern without a regex:
 # 3 digits, a hyphen, 3 digits, a hyphen, 4 digits
 # e.g. 111-222-3334
 def isPhoneNo(text):
     if len(text) != 12:
         return False
     for i in range(0, 3):
         if not text[i].isdecimal():
             return False
     if text[3] != '-':
         return False
     for i in range(4, 7):
         if not text[i].isdecimal():
             return False
     if text[7] != '-':
         return False
     for i in range(8, 12):
         if not text[i].isdecimal():
             return False
     return True

 # Match the same pattern with a regular expression
 def regPhoneNo(text):
     phoneNoReg = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
     # findall scans the whole string, so no manual slicing is needed
     for phoneNo in phoneNoReg.findall(text):
         print('phone No find by reg: ' + phoneNo)

 print(isPhoneNo('123-122-9090'))
 print(isPhoneNo(''))

 msg = 'call me at 415-443-1111 tomorrow. 415-443-2222 is my office'

 # Without a regex, every 12-character slice has to be tested by hand
 for i in range(len(msg)):
     tmp = msg[i:i+12]
     if isPhoneNo(tmp):
         print('phone No find: ' + tmp)

 # With a regex, one call on the whole message is enough
 regPhoneNo(msg)

 print('msg find end')
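As a rough illustration of the "easier to extend" point, the sketch below makes the area code optional by changing only the pattern. The names `flexiblePhoneReg` and `findPhoneNos` and the sample text are assumptions made for this example; they do not come from the book.

```python
import re

# Hypothetical extension: the area code and its hyphen become optional.
flexiblePhoneReg = re.compile(r'(?:\d{3}-)?\d{3}-\d{4}')

def findPhoneNos(text):
    # The pattern has no capturing groups, so findall returns
    # every non-overlapping match as a plain string.
    return flexiblePhoneReg.findall(text)

sample = 'call 415-443-1111 or the front desk at 555-0199'
print(findPhoneNos(sample))
# ['415-443-1111', '555-0199']
```

Doing the same with isPhoneNo would need a second length check and a separate branch for the shorter form.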

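Python's regular expressions are "greedy" by default: when a pattern could match in more than one way, they match the longest string possible.

The "non-greedy" version of the curly brackets matches the shortest string possible. It is written by putting a question mark after the closing curly bracket.

Example: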

 import re

 # Shows the results of greedy and non-greedy matching in Python
 def showGreedReg():
     greedReg = re.compile(r'(ha){3,5}')
     nonGreedReg = re.compile(r'(ha){3,5}?')
     inp = 'hahahahahah'
     r1 = greedReg.search(inp)
     r2 = nonGreedReg.search(inp)
     print('greed reg res: ' + r1.group())       # hahahahaha
     print('nongreed reg res: ' + r2.group())    # hahaha

 showGreedReg()
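The `?` suffix is not limited to curly brackets; it also makes `*` and `+` non-greedy. A minimal sketch, where the function name `showDotStar` and the sample string are made up for this illustration:

```python
import re

def showDotStar(text):
    # '<.*>' runs to the last '>' it can reach (greedy);
    # '<.*?>' stops at the first '>' (non-greedy).
    greedy = re.compile(r'<.*>').search(text)
    nonGreedy = re.compile(r'<.*?>').search(text)
    return greedy.group(), nonGreedy.group()

print(showDotStar('<b>bold</b> text'))
# ('<b>bold</b>', '<b>')
```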

The Chapter 7 project extracts phone numbers and email addresses with regular expressions. The clipboard part is omitted here.

 import re   # pyperclip is not imported because the clipboard part is omitted

 phoneReg = re.compile(r'''(
     (\d{3}|\(\d{3}\))?                # area code
     (\s|-|\.)?                        # separator
     (\d{3})                           # first 3 digits
     (\s|-|\.)?                        # separator
     (\d{4})                           # last 4 digits
     (\s*(ext|x|ext\.)\s*(\d{2,5}))?   # extension (the dot in "ext." is escaped)
     )''', re.VERBOSE)

 emailReg = re.compile(r'''(
     [a-zA-Z0-9._%+-]+      # username
     @                      # @ symbol
     [a-zA-Z0-9.-]+         # domain name
     (\.[a-zA-Z]{2,4})      # dot-something
     )''', re.VERBOSE)

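The phone number starts with an optional area code, so the area code group is followed by a question mark.

Because the area code can be just three digits (\d{3}) or three digits inside parentheses (\(\d{3}\)), the two alternatives are joined with a pipe.

Adding the regex comment # Area code to this part of the multiline string helps you remember what (\d{3}|\(\d{3}\))? is supposed to match.
The separator in a phone number can be a space (\s), a hyphen (-), or a period (., escaped as \.), so these alternatives are joined with pipes as well.

The next few parts of the regular expression are straightforward: three digits, then another separator, then four digits.

The last part is an optional extension: any number of spaces,
followed by ext, x, or ext., followed by two to five digits.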
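The pieces described above can be tried out with `findall`. The following is only a sketch: the function name `extractPhones`, the sample text, and the choice to normalize every number to the `415-555-1234 x42` form are assumptions for illustration.

```python
import re

phoneReg = re.compile(r'''(
    (\d{3}|\(\d{3}\))?                # area code
    (\s|-|\.)?                        # separator
    (\d{3})                           # first 3 digits
    (\s|-|\.)?                        # separator
    (\d{4})                           # last 4 digits
    (\s*(ext|x|ext\.)\s*(\d{2,5}))?   # extension
    )''', re.VERBOSE)

def extractPhones(text):
    """Return every phone number in text, rewritten in one standard format."""
    phones = []
    # With groups in the pattern, findall yields one tuple of strings per
    # match; groups that did not take part in the match come back as ''.
    for groups in phoneReg.findall(text):
        areaCode = groups[1].strip('()')   # drop the optional parentheses
        parts = [p for p in (areaCode, groups[3], groups[5]) if p]
        phoneNo = '-'.join(parts)
        if groups[8]:                      # extension digits, if any
            phoneNo += ' x' + groups[8]
        phones.append(phoneNo)
    return phones

print(extractPhones('Office: (415) 555-1234 ext. 42, cell 415.555.9999'))
# ['415-555-1234 x42', '415-555-9999']
```

In each tuple, index 0 is the whole match, which is why the outermost parentheses wrap the entire pattern.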

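The username part of an e-mail address is one or more characters, which may be lowercase and uppercase letters, digits, a period, an underscore, a percent sign, a plus sign, or a hyphen.

All of these fit into one character class: [a-zA-Z0-9._%+-].
The domain and the username are separated by an @ symbol. The domain name allows a slightly smaller character class of only letters, digits, periods, and hyphens: [a-zA-Z0-9.-].

The last part is the "dot-com" part (technically the top-level domain), which can really be "dot-anything". In this pattern it is two to four letters long.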
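A short sketch of the e-mail pattern built from those character classes. The function name `extractEmails` and the sample addresses are hypothetical.

```python
import re

emailReg = re.compile(r'''(
    [a-zA-Z0-9._%+-]+      # username
    @                      # @ symbol
    [a-zA-Z0-9.-]+         # domain name
    (\.[a-zA-Z]{2,4})      # dot-something
    )''', re.VERBOSE)

def extractEmails(text):
    # Index 0 of each tuple is the outermost group, i.e. the whole address.
    return [groups[0] for groups in emailReg.findall(text)]

print(extractEmails('Write to alice.smith+news@example.com or info@example.org.'))
# ['alice.smith+news@example.com', 'info@example.org']
```

The sentence-ending period is left out of the second address because the final group needs letters after its dot. The {2,4} limit is a simplification: a longer top-level domain would be truncated or missed by this pattern.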

Passing re.VERBOSE makes the regex compiler ignore whitespace and comments inside the regular expression string.
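To see that the layout changes nothing about what is matched, the sketch below compiles the same pattern twice, once compactly and once with re.VERBOSE. The names `compactReg`, `verboseReg`, and `sameMatch` are made up for this example.

```python
import re

compactReg = re.compile(r'\d{3}-\d{4}')

verboseReg = re.compile(r'''
    \d{3}   # first 3 digits
    -       # separator
    \d{4}   # last 4 digits
    ''', re.VERBOSE)

def sameMatch(text):
    # Both patterns should find exactly the same substring.
    return compactReg.search(text).group() == verboseReg.search(text).group()

print(sameMatch('dial 555-0199'))
# True
```

One consequence is that a literal space in a verbose pattern has to be written as \s, as an escaped space, or inside a character class such as [ ]; otherwise it is ignored.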

That concludes Chapter 7. The practice project, strong password detection, will be covered in the next post.
