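Regular expressions are implemented in most programming languages, and the core syntax generally produces the same results across them, although individual engines differ in some details.
The difference between match and search
Python's re module offers two commonly used matching functions: match and search.
The match() function tries to match only at the beginning of the string. If the pattern does not match there, it returns None.
The search() function scans the whole string and returns the first successful match, or None if there is no match anywhere.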
    import re

    def Search():
        text = 'Does this text match the pattern?'
        pattern = 'th'
        # search() scans the whole string, so it finds the "th" in "this"
        match = re.search(pattern=pattern, string=text)
        s = match.start()
        e = match.end()
        print('Found "{pattern}" in \n"{text}" \nfrom {start} to {end} ("{subtext}")'.format(
            pattern=pattern,
            text=text,
            start=s,
            end=e,
            subtext=text[s:e]))

    Search()
    # output
    # Found "th" in
    # "Does this text match the pattern?"
    # from 5 to 7 ("th")

    def Match():
        pattern = "th"
        text = 'Does this text match the pattern?'
        # match() only tries the start of the string; the text starts with
        # "Does", so the result is None
        match = re.match(pattern=pattern, string=text)
        s = match.start()  # raises AttributeError because match is None
        e = match.end()
        print('Found "{pattern}" in \n"{text}" \nfrom {start} to {end} ("{subtext}")'.format(
            pattern=pattern,
            text=text,
            start=s,
            end=e,
            subtext=text[s:e]))

    Match()
    # output
    # AttributeError: 'NoneType' object has no attribute 'start'
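Because match() returns None when the start of the string does not match, it is usually safer to check the result before calling start() or end(). The following is a minimal sketch of that check. The helper name describe is hypothetical and not part of the original example.

```python
import re

def describe(func, pattern, text):
    """Return a short description of what re.match or re.search found."""
    m = func(pattern, text)
    if m is None:
        # Both functions return None when nothing matches
        return "{}: no match".format(func.__name__)
    return "{}: found {!r} at {}:{}".format(
        func.__name__, m.group(), m.start(), m.end())

text = 'Does this text match the pattern?'
print(describe(re.search, 'th', text))    # search looks anywhere in the string
print(describe(re.match, 'th', text))     # match only looks at position 0
print(describe(re.match, 'Does', text))   # succeeds because the text starts with "Does"
```

With this check in place, a failed match is reported as "no match" instead of raising AttributeError.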
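Compiling regular expressions
In most cases we use a regular expression by calling re.search(p, s) directly. Each such call must first obtain a compiled pattern, so when a program uses a large number of different regular expressions, this can reduce efficiency.
The re module does cache compiled patterns at run time, but the cache has a size limit. In the re.py source excerpted below, once the limit is reached the cache is cleared and patterns have to be compiled again.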
    # re.py (excerpt from the Python standard library; sre_compile, _locale,
    # DEBUG and LOCALE are defined elsewhere in that file)

    # Dictionary that stores compiled regular expressions
    _cache = {}
    _pattern_type = type(sre_compile.compile("", 0))
    # Maximum number of cached patterns
    _MAXCACHE = 512

    # Compile a regular expression
    def _compile(pattern, flags):
        # internal: compile pattern
        try:
            # If the pattern is already in the cache, return it directly
            p, loc = _cache[type(pattern), pattern, flags]
            if loc is None or loc == _locale.setlocale(_locale.LC_CTYPE):
                return p
        except KeyError:
            pass
        if isinstance(pattern, _pattern_type):
            if flags:
                raise ValueError(
                    "cannot process flags argument with a compiled pattern")
            return pattern
        if not sre_compile.isstring(pattern):
            raise TypeError("first argument must be string or compiled pattern")
        p = sre_compile.compile(pattern, flags)
        if not (flags & DEBUG):
            # Clear the cache when it reaches the size limit
            if len(_cache) >= _MAXCACHE:
                _cache.clear()
            if p.flags & LOCALE:
                if not _locale:
                    return p
                loc = _locale.setlocale(_locale.LC_CTYPE)
            else:
                loc = None
            _cache[type(pattern), pattern, flags] = p, loc
        return p
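The effect of this cache can be observed from the outside. The sketch below relies on a CPython implementation detail (a cached pattern is returned as the same object) and uses the public re.purge() function to empty the cache. The variable names are illustrative only.

```python
import re

re.purge()                      # start from an empty cache
first = re.compile(r'\d+')
second = re.compile(r'\d+')     # same pattern and flags: served from the cache
same_before = first is second

re.purge()                      # clear the cache explicitly
third = re.compile(r'\d+')      # compiled again, so a new object is created
same_after = first is third

print(same_before, same_after)
```

On CPython this is expected to print True False: the second call reuses the cached object, while the call after re.purge() has to compile the pattern again.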
When a program uses many regular expressions, you can call compile() to compile them in advance and then reuse the compiled pattern objects.
    import re

    def CompilingExpressions():
        # Compile each pattern once, then reuse the compiled objects
        regexes = [re.compile(p) for p in ["this", "index", "text"]]
        text = 'Does this text match the pattern?'
        print("Text -> {}\n".format(text))
        for regex in regexes:
            print("Seeking {} -> ".format(regex.pattern), end="")
            result = regex.search(text)
            print("match" if result else "not match")

    CompilingExpressions()
    # output
    # Text -> Does this text match the pattern?
    #
    # Seeking this -> match
    # Seeking index -> not match
    # Seeking text -> match
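One common way to apply this is to compile the patterns once at module level and look them up by name. The following is only a sketch of that idea. The PATTERNS table and the matching_names helper are hypothetical names introduced for illustration.

```python
import re

# Hypothetical pattern table, compiled once when the module is imported
PATTERNS = {
    "word_this": re.compile(r"\bthis\b"),
    "question": re.compile(r"\?$"),
    "digits": re.compile(r"\d+"),
}

def matching_names(text):
    """Return the names of all precompiled patterns found in text."""
    return [name for name, regex in PATTERNS.items() if regex.search(text)]

print(matching_names('Does this text match the pattern?'))
print(matching_names('Room 101'))
```

Each call to matching_names reuses the same compiled objects, so no pattern is compiled or looked up in the module cache more than once.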
Matching multiple results
Use the findall() or finditer() function to find every match of a pattern instead of only the first one.
    import re

    def MultipleMatches():
        # Note that the text itself starts and ends with a single quote
        text = "'Does this text match the pattern?'"
        pattern_1 = ".*this (.*?) the (.*?)n.*?"
        # With two groups in the pattern, findall() returns a list of tuples
        for match in re.findall(pattern=pattern_1, string=text):
            print(match)
        print()
        pattern_2 = 'th'
        # finditer() yields match objects, which also carry the positions
        for match in re.finditer(pattern=pattern_2, string=text):
            start = match.start()
            end = match.end()
            print("Found {!r} at {:d}: {:d} ({})".format(pattern_2, start, end, text[start: end]))

    MultipleMatches()
    # output
    # ('text match', 'patter')
    #
    # Found 'th' at 6: 8 (th)
    # Found 'th' at 22: 24 (th)
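The two functions differ in what they return: findall() gives back strings (or tuples of strings when the pattern has several groups), while finditer() gives back match objects. The short sketch below illustrates this with a pattern chosen only for this example.

```python
import re

text = 'Does this text match the pattern?'
word = re.compile(r'\b(t)(\w+)')   # words starting with "t", split into two groups

# No groups in the pattern: findall returns a list of matched strings
print(re.findall(r'th', text))

# Two groups in the pattern: each item is a tuple with one string per group
print(word.findall(text))

# finditer yields match objects, so the full match and its span are available
for m in word.finditer(text):
    print(m.group(0), m.span())
```

When only the matched text is needed, findall() is the shorter option. When positions or individual groups are needed, finditer() avoids a second search.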
Author
Alfons
LastMod
0001-01-01
License
Creative Commons BY-NC-ND 3.0