正则表达式在各种语言上都有实现,使用相同的语法可以得到相同的效果。

match和search的区别

在python中正则表达式匹配有两种:matchsearch

match() 函数从字符串的开头开始匹配,只有当开头匹配成功了,才会继续往下匹配。否则返回None。

search() 函数则是扫描整个字符串并返回第一个成功的匹配。

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
def Search():
    text = 'Does this text match the pattern?'

    pattern = 'th'
    match = re.search(pattern = pattern, string = text)

    s = match.start()
    e = match.end()

    print("Found \"{pattern}\" in \n\"{text}\" \nfrom {start} to {end} (\"{subtext}\")".format(pattern = pattern,
                                                                                               text = text,
                                                                                               start = s,
                                                                                               end = e,
                                                                                               subtext = text[s:e]))

# output
Found "th" in 
"Does this text match the pattern?" 
from 5 to 7 ("th")


def Match():
    pattern = "th"
    text = 'Does this text match the pattern?'

    match = re.match(pattern = pattern, string = text)

    s = match.start()
    e = match.end()

    print("Found \"{pattern}\" in \n\"{text}\" \nfrom {start} to {end} (\"{subtext}\")".format(pattern = pattern,
                                                                                               text = text,
                                                                                               start = s,
                                                                                               end = e,
                                                                                               subtext = text[s:e]))

#output
AttributeError: 'NoneType' object has no attribute 'start'

编译正则表达式

在一般情况下,我们使用 re.search(p, s) 的方式来使用正则表达式。但当正则表达式的数量过多时,由于每次运行都需要先编译正则表达式,会造成效率的降低。

虽然 re模块 在运行时,会自动保存编译好的正则表达式,但是,这也是有数量限制的,超过这个数量,会清空缓存,重新编译。

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
# re.python

# 存放编译好的正则表达式的字典
_cache = {}

_pattern_type = type(sre_compile.compile("", 0))

# 最大缓存数量
_MAXCACHE = 512

# 编译正则表达式
def _compile(pattern, flags):
    # internal: compile pattern
    try:
        # 如果在缓存中存在,则直接调用
        p, loc = _cache[type(pattern), pattern, flags]
        if loc is None or loc == _locale.setlocale(_locale.LC_CTYPE):
            return p
    except KeyError:
        pass
    if isinstance(pattern, _pattern_type):
        if flags:
            raise ValueError(
                "cannot process flags argument with a compiled pattern")
        return pattern
    if not sre_compile.isstring(pattern):
        raise TypeError("first argument must be string or compiled pattern")
    p = sre_compile.compile(pattern, flags)
    if not (flags & DEBUG):
        # 缓存数量大于阈值时清空缓存
        if len(_cache) >= _MAXCACHE:
            _cache.clear()
        if p.flags & LOCALE:
            if not _locale:
                return p
            loc = _locale.setlocale(_locale.LC_CTYPE)
        else:
            loc = None
        _cache[type(pattern), pattern, flags] = p, loc
    return p

在正则表达式 过多的时候,可以使用 compile() 函数,预先编译好正则表达式,然后使用。

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
def CompilingExpressions():
    regexes = [re.compile(p) for p in ["this", "index", "text"]]
    text = 'Does this text match the pattern?'

    print("Text -> {}\n".format(text))

    for regex in regexes:
        print("Seeking {} -> ".format(regex.pattern), end = "")
        result = regex.search(text)
        print("match" if result else "not match")

# output 
Text -> Does this text match the pattern?

Seeking this -> match
Seeking index -> not match
Seeking text -> match

匹配多个结果

使用 findall()finditer() 函数可以进行多个结果的匹配。

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
def MultipleMatches():
    text = "'Does this text match the pattern?'"

    pattern_1 = ".*this (.*?) the (.*?)n.*?"
    for match in re.findall(pattern = pattern_1, string = text):
        print(match)

    print()

    pattern_2 = 'th'
    for match in re.finditer(pattern = pattern_2, string = text):
        start = match.start()
        end = match.end()

        print("Found {!r} at {:d}: {:d} ({})".format(pattern_2, start, end, text[start: end]))

# output
('text match', 'patter')

Found 'th' at 6: 8 (th)
Found 'th' at 22: 24 (th)