/usr/share/doc/diveintopython-zh/html/performance_tuning/regular_expressions.html is in diveintopython-zh 5.4b-1.
This file is owned by root:root, with mode 0o644.
The actual contents of the file can be viewed below.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 | <!DOCTYPE html
PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>18.3. 优化正则表达式</title>
<link rel="stylesheet" href="../diveintopython.css" type="text/css">
<link rev="made" href="mailto:f8dy@diveintopython.org">
<meta name="generator" content="DocBook XSL Stylesheets V1.52.2">
<meta name="keywords" content="Python, Dive Into Python, tutorial, object-oriented, programming, documentation, book, free">
<meta name="description" content="Python from novice to pro">
<link rel="home" href="../toc/index.html" title="Dive Into Python">
<link rel="up" href="index.html" title="第 18 章 性能优化">
<link rel="previous" href="timeit.html" title="18.2. 使用 timeit 模块">
<link rel="next" href="dictionary_lookups.html" title="18.4. 优化字典查找">
</head>
<body>
<table id="Header" width="100%" border="0" cellpadding="0" cellspacing="0" summary="">
<tr>
<td id="breadcrumb" colspan="5" align="left" valign="top">导航:<a href="../index.html">起始页</a> > <a href="../toc/index.html">Dive Into Python</a> > <a href="index.html">性能优化</a> > <span class="thispage">优化正则表达式</span></td>
<td id="navigation" align="right" valign="top"> <a href="timeit.html" title="上一页: “使用 timeit 模块”"><<</a> <a href="dictionary_lookups.html" title="下一页: “优化字典查找”">>></a></td>
</tr>
<tr>
<td colspan="3" id="logocontainer">
<h1 id="logo"><a href="../index.html" accesskey="1">深入 Python :Dive Into Python 中文版</a></h1>
<p id="tagline">Python 从新手到专家 [Dip_5.4b_CPyUG_Release]</p>
</td>
<td colspan="3" align="right">
<form id="search" method="GET" action="http://www.google.com/custom">
<p><label for="q" accesskey="4">Find: </label><input type="text" id="q" name="q" size="20" maxlength="255" value=""> <input type="submit" value="搜索"><input type="hidden" name="domains" value="woodpecker.org.cn"><input type="hidden" name="sitesearch" value="www.woodpecker.org.cn/diveintopython"></p>
</form>
</td>
</tr>
</table>
<!--#include virtual="/inc/ads" -->
<div class="section" lang="zh_cn">
<div class="titlepage">
<div>
<div>
<h2 class="title"><a name="soundex.stage1"></a>18.3. 优化正则表达式
</h2>
</div>
</div>
<div></div>
</div>
<div class="abstract">
<p> Soundex 函数的第一件事是检查输入是否是一个空字符串。怎样做是最好的方法?</p>
</div>
<p>如果你回答 “<span class="quote">正则表达式</span>”,坐在角落里反省你糟糕的直觉。正则表达式几乎永远不是最好的答案,而且应该被尽可能避开。这不仅仅是基于性能考虑,而是因为调试和维护都很困难,当然性能也是个原因。
</p>
<p>这是 <tt class="filename">soundex/stage1/soundex1a.py</tt> 检查 <tt class="varname">source</tt> 是否全部由字母构成的一段代码,至少是一个字母 (而不是空字符串):
</p>
<div class="informalexample"><pre class="programlisting">
allChars = string.uppercase + string.lowercase
<span class='pykeyword'>if</span> <span class='pykeyword'>not</span> re.search(<span class='pystring'>'^[%s]+$'</span> % allChars, source):
<span class='pykeyword'>return</span> <span class='pystring'>"0000"</span>
</pre></div>
<p><tt class="filename">soundex1a.py</tt> 表现如何?为了方便,<tt class="literal">__main__</tt> 部分包含了一段代码:调用 <tt class="filename">timeit</tt> 模块,为三个不同名字分别建立测试,依次测试,并显示每个测试的最短耗时:
</p>
<div class="informalexample"><pre class="programlisting"><span class='pykeyword'>
if</span> __name__ == <span class='pystring'>'__main__'</span>:
<span class='pykeyword'>from</span> timeit <span class='pykeyword'>import</span> Timer
names = (<span class='pystring'>'Woo'</span>, <span class='pystring'>'Pilgrim'</span>, <span class='pystring'>'Flingjingwaller'</span>)
<span class='pykeyword'>for</span> name <span class='pykeyword'>in</span> names:
statement = <span class='pystring'>"soundex('%s')"</span> % name
t = Timer(statement, <span class='pystring'>"from __main__ import soundex"</span>)
<span class='pykeyword'>print</span> name.ljust(15), soundex(name), min(t.repeat())
</pre></div>
<p>那么,应用正则表达式的 <tt class="filename">soundex1a.py</tt> 表现如何呢?
</p>
<div class="informalexample"><pre class="screen">
<tt class="prompt">C:\samples\soundex\stage1></tt><span class="userinput">python soundex1a.py</span>
<span class="computeroutput">Woo W000 19.3356647283
Pilgrim P426 24.0772053431
Flingjingwaller F452 35.0463220884</span>
</pre></div>
<p>正如你预料,名字越长,算法耗时就越长。有几个工作可以令我们减小这个差距 (使函数对于长输入花费较短的相对时间) 但是算法的本质决定它不可能每次运行时间都相同。</p>
<p>另一点应铭记于心的是,我们测试的是有代表性的名字样本。<tt class="literal">Woo</tt> 是个被缩短到单字符并补零的小样本;<tt class="literal">Pilgrim</tt> 是个夹带着特别字符和忽略字符的平均长度的正常样本;<tt class="literal">Flingjingwaller</tt> 是一个包含连续重复字符并且特别长的样本。其它的测试可能同样有帮助,但它们已经很好地代表了不同的样本范围。
</p>
<p>那么那个正则表达式如何呢?嗯,缺乏效率。因为这个表达式测试不止一个范围的字符 (<tt class="literal">A-Z</tt> 的大写范围和 <tt class="literal">a-z</tt> 的小写字母范围),我们可以使用一个正则表达式的缩写语法。这便是 <tt class="filename">soundex/stage1/soundex1b.py</tt>:
</p>
<div class="informalexample"><pre class="programlisting">
<span class='pykeyword'>if</span> <span class='pykeyword'>not</span> re.search(<span class='pystring'>'^[A-Za-z]+$'</span>, source):
<span class='pykeyword'>return</span> <span class='pystring'>"0000"</span>
</pre></div>
<p><tt class="filename">timeit</tt> 显示 <tt class="filename">soundex1b.py</tt> 比 <tt class="filename">soundex1a.py</tt> 稍微快一些,但是没什么令人激动的变化:
</p>
<div class="informalexample"><pre class="screen">
<tt class="prompt">C:\samples\soundex\stage1></tt><span class="userinput">python soundex1b.py</span>
<span class="computeroutput">Woo W000 17.1361133887
Pilgrim P426 21.8201693232
Flingjingwaller F452 32.7262294509</span>
</pre></div>
<p>在 <a href="../refactoring/refactoring.html" title="15.3. 重构">第 15.3 节 “重构”</a> 中我们看到正则表达式可以被编译并在重用时以更快速度获得结果。因为这个正则表达式在函数中每次被调用时都不变化,我们可以编译它一次并使用被编译的版本。这便是 <tt class="filename">soundex/stage1/soundex1c.py</tt>:
</p>
<div class="informalexample"><pre class="programlisting">
isOnlyChars = re.compile(<span class='pystring'>'^[A-Za-z]+$'</span>).search
<span class='pykeyword'>def</span><span class='pyclass'> soundex</span>(source):
<span class='pykeyword'>if</span> <span class='pykeyword'>not</span> isOnlyChars(source):
<span class='pykeyword'>return</span> <span class='pystring'>"0000"</span>
</pre></div>
<p><tt class="filename">soundex1c.py</tt> 中使用被编译的正则表达式产生了显著的提速:
</p>
<div class="informalexample"><pre class="screen">
<tt class="prompt">C:\samples\soundex\stage1></tt><span class="userinput">python soundex1c.py</span>
<span class="computeroutput">Woo W000 14.5348347346
Pilgrim P426 19.2784703084
Flingjingwaller F452 30.0893873383</span>
</pre></div>
<p>但是这样的优化是正路吗?这里的逻辑很简单:输入 <tt class="varname">source</tt> 应该是非空,并且需要完全由字母构成。如果编写一个循环查看每个字符并且抛弃正则表达式,是否会更快些?
</p>
<p>这便是 <tt class="filename">soundex/stage1/soundex1d.py</tt>:
</p>
<div class="informalexample"><pre class="programlisting">
<span class='pykeyword'>if</span> <span class='pykeyword'>not</span> source:
<span class='pykeyword'>return</span> <span class='pystring'>"0000"</span>
<span class='pykeyword'>for</span> c <span class='pykeyword'>in</span> source:
<span class='pykeyword'>if</span> <span class='pykeyword'>not</span> (<span class='pystring'>'A'</span> <= c <= <span class='pystring'>'Z'</span>) <span class='pykeyword'>and</span> <span class='pykeyword'>not</span> (<span class='pystring'>'a'</span> <= c <= <span class='pystring'>'z'</span>):
<span class='pykeyword'>return</span> <span class='pystring'>"0000"</span>
</pre></div>
<p>这个技术在 <tt class="filename">soundex1d.py</tt> 中恰好<span class="emphasis"><em>不及</em></span> 编译后的正则表达式快 (尽管比使用未编译的正则表达式快<sup>[<a name="d0e39385" href="#ftn.d0e39385">14</a>]</sup>):
</p>
<div class="informalexample"><pre class="screen">
<tt class="prompt">C:\samples\soundex\stage1></tt><span class="userinput">python soundex1d.py</span>
<span class="computeroutput">Woo W000 15.4065058548
Pilgrim P426 22.2753567842
Flingjingwaller F452 37.5845122774</span>
</pre></div>
<p>为什么 <tt class="filename">soundex1d.py</tt> 没能更快?答案来自 <span class="application">Python</span> 的编译本质。正则表达式引擎以 C 语言编写,被编译后则能本能地在你的计算机上运行。另一方面,循环是以 <span class="application">Python</span> 编写,要通过 <span class="application">Python</span> 解释器。尽管循环相对简单,但没能简单到补偿花在代码解释上的时间。正则表达式永远不是正确答案……但例外还是存在的。
</p>
<p>恰巧 <span class="application">Python</span> 提供了一个晦涩的字符串方法。你有理由不了解它,因为本书未曾提到它。这个方法便是 <tt class="methodname">isalpha()</tt>,它检查一个字符串是否只包含字母。
</p>
<p>这便是 <tt class="filename">soundex/stage1/soundex1e.py</tt>:
</p>
<div class="informalexample"><pre class="programlisting">
<span class='pykeyword'>if</span> (<span class='pykeyword'>not</span> source) <span class='pykeyword'>and</span> (<span class='pykeyword'>not</span> source.isalpha()):
<span class='pykeyword'>return</span> <span class='pystring'>"0000"</span>
</pre></div>
<p>在 <tt class="filename">soundex1e.py</tt> 中应用这个特殊方法我们能得到多少好处? 很多。
</p>
<div class="informalexample"><pre class="screen">
<tt class="prompt">C:\samples\soundex\stage1></tt><span class="userinput">python soundex1e.py</span>
<span class="computeroutput">Woo W000 13.5069504644
Pilgrim P426 18.2199394057
Flingjingwaller F452 28.9975225902</span>
</pre></div>
<div class="example"><a name="d0e39452"></a><h3 class="title">例 18.3. 目前为止最好的结果:<tt class="filename">soundex/stage1/soundex1e.py</tt></h3><pre class="programlisting"><span class='pykeyword'>
import</span> string, re
charToSoundex = {<span class='pystring'>"A"</span>: <span class='pystring'>"9"</span>,
<span class='pystring'>"B"</span>: <span class='pystring'>"1"</span>,
<span class='pystring'>"C"</span>: <span class='pystring'>"2"</span>,
<span class='pystring'>"D"</span>: <span class='pystring'>"3"</span>,
<span class='pystring'>"E"</span>: <span class='pystring'>"9"</span>,
<span class='pystring'>"F"</span>: <span class='pystring'>"1"</span>,
<span class='pystring'>"G"</span>: <span class='pystring'>"2"</span>,
<span class='pystring'>"H"</span>: <span class='pystring'>"9"</span>,
<span class='pystring'>"I"</span>: <span class='pystring'>"9"</span>,
<span class='pystring'>"J"</span>: <span class='pystring'>"2"</span>,
<span class='pystring'>"K"</span>: <span class='pystring'>"2"</span>,
<span class='pystring'>"L"</span>: <span class='pystring'>"4"</span>,
<span class='pystring'>"M"</span>: <span class='pystring'>"5"</span>,
<span class='pystring'>"N"</span>: <span class='pystring'>"5"</span>,
<span class='pystring'>"O"</span>: <span class='pystring'>"9"</span>,
<span class='pystring'>"P"</span>: <span class='pystring'>"1"</span>,
<span class='pystring'>"Q"</span>: <span class='pystring'>"2"</span>,
<span class='pystring'>"R"</span>: <span class='pystring'>"6"</span>,
<span class='pystring'>"S"</span>: <span class='pystring'>"2"</span>,
<span class='pystring'>"T"</span>: <span class='pystring'>"3"</span>,
<span class='pystring'>"U"</span>: <span class='pystring'>"9"</span>,
<span class='pystring'>"V"</span>: <span class='pystring'>"1"</span>,
<span class='pystring'>"W"</span>: <span class='pystring'>"9"</span>,
<span class='pystring'>"X"</span>: <span class='pystring'>"2"</span>,
<span class='pystring'>"Y"</span>: <span class='pystring'>"9"</span>,
<span class='pystring'>"Z"</span>: <span class='pystring'>"2"</span>}
<span class='pykeyword'>def</span><span class='pyclass'> soundex</span>(source):
<span class='pykeyword'>if</span> (<span class='pykeyword'>not</span> source) <span class='pykeyword'>and</span> (<span class='pykeyword'>not</span> source.isalpha()):
<span class='pykeyword'>return</span> <span class='pystring'>"0000"</span>
source = source[0].upper() + source[1:]
digits = source[0]
<span class='pykeyword'>for</span> s <span class='pykeyword'>in</span> source[1:]:
s = s.upper()
digits += charToSoundex[s]
digits2 = digits[0]
<span class='pykeyword'>for</span> d <span class='pykeyword'>in</span> digits[1:]:
<span class='pykeyword'>if</span> digits2[-1] != d:
digits2 += d
digits3 = re.sub(<span class='pystring'>'9'</span>, <span class='pystring'>''</span>, digits2)
<span class='pykeyword'>while</span> len(digits3) < 4:
digits3 += <span class='pystring'>"0"</span>
<span class='pykeyword'>return</span> digits3[:4]
<span class='pykeyword'>if</span> __name__ == <span class='pystring'>'__main__'</span>:
<span class='pykeyword'>from</span> timeit <span class='pykeyword'>import</span> Timer
names = (<span class='pystring'>'Woo'</span>, <span class='pystring'>'Pilgrim'</span>, <span class='pystring'>'Flingjingwaller'</span>)
<span class='pykeyword'>for</span> name <span class='pykeyword'>in</span> names:
statement = <span class='pystring'>"soundex('%s')"</span> % name
t = Timer(statement, <span class='pystring'>"from __main__ import soundex"</span>)
<span class='pykeyword'>print</span> name.ljust(15), soundex(name), min(t.repeat())
</pre></div>
<div class="footnotes">
<h3 class="footnotetitle">Footnotes</h3>
<div class="footnote">
<p><sup>[<a name="ftn.d0e39385" href="#d0e39385">14</a>] </sup>注意 <tt class="filename">soundex1d.py</tt> 在后两个测试点上都比 <tt class="filename">soundex1b.py</tt> 慢,这点与作者所说的矛盾。本章另还有多处出现了正文与测试结果矛盾的地方,每个地方都会用译注加以说明。这个 bug 将在下个版本中得到修正。――译注
</p>
</div>
</div>
</div>
<table class="Footer" width="100%" border="0" cellpadding="0" cellspacing="0" summary="">
<tr>
<td width="35%" align="left"><br><a class="NavigationArrow" href="timeit.html"><< 使用 timeit 模块</a></td>
<td width="30%" align="center"><br> <span class="divider">|</span> <a href="index.html#soundex.divein" title="18.1. 概览">1</a> <span class="divider">|</span> <a href="timeit.html" title="18.2. 使用 timeit 模块">2</a> <span class="divider">|</span> <span class="thispage">3</span> <span class="divider">|</span> <a href="dictionary_lookups.html" title="18.4. 优化字典查找">4</a> <span class="divider">|</span> <a href="list_operations.html" title="18.5. 优化列表操作">5</a> <span class="divider">|</span> <a href="string_manipulation.html" title="18.6. 优化字符串操作">6</a> <span class="divider">|</span> <a href="summary.html" title="18.7. 小结">7</a> <span class="divider">|</span>
</td>
<td width="35%" align="right"><br><a class="NavigationArrow" href="dictionary_lookups.html">优化字典查找 >></a></td>
</tr>
<tr>
<td colspan="3"><br></td>
</tr>
</table>
<div class="Footer">
<p class="copyright">Copyright © 2000, 2001, 2002, 2003, 2004 <a href="mailto:mark@diveintopython.org">Mark Pilgrim</a></p>
<p class="copyright">Copyright © 2001, 2002, 2003, 2004, 2005, 2006, 2007 <a href="mailto:python-cn@googlegroups.com">CPyUG (邮件列表)</a></p>
</div>
</body>
</html>
|