This file is indexed.

/usr/share/doc/diveintopython-zh/html/performance_tuning/regular_expressions.html is in diveintopython-zh 5.4b-1.

This file is owned by root:root, with mode 0o644.

The actual contents of the file can be viewed below.

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
<!DOCTYPE html
  PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
   <head>
      <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
   
      <title>18.3.&nbsp;优化正则表达式</title>
      <link rel="stylesheet" href="../diveintopython.css" type="text/css">
      <link rev="made" href="mailto:f8dy@diveintopython.org">
      <meta name="generator" content="DocBook XSL Stylesheets V1.52.2">
      <meta name="keywords" content="Python, Dive Into Python, tutorial, object-oriented, programming, documentation, book, free">
      <meta name="description" content="Python from novice to pro">
      <link rel="home" href="../toc/index.html" title="Dive Into Python">
      <link rel="up" href="index.html" title="第&nbsp;18&nbsp;章&nbsp;性能优化">
      <link rel="previous" href="timeit.html" title="18.2.&nbsp;使用 timeit 模块">
      <link rel="next" href="dictionary_lookups.html" title="18.4.&nbsp;优化字典查找">
   </head>
   <body>
      <table id="Header" width="100%" border="0" cellpadding="0" cellspacing="0" summary="">
         <tr>
            <td id="breadcrumb" colspan="5" align="left" valign="top">导航:<a href="../index.html">起始页</a>&nbsp;&gt;&nbsp;<a href="../toc/index.html">Dive Into Python</a>&nbsp;&gt;&nbsp;<a href="index.html">性能优化</a>&nbsp;&gt;&nbsp;<span class="thispage">优化正则表达式</span></td>
            <td id="navigation" align="right" valign="top">&nbsp;&nbsp;&nbsp;<a href="timeit.html" title="上一页: “使用 timeit 模块”">&lt;&lt;</a>&nbsp;&nbsp;&nbsp;<a href="dictionary_lookups.html" title="下一页: “优化字典查找”">&gt;&gt;</a></td>
         </tr>
         <tr>
            <td colspan="3" id="logocontainer">
               <h1 id="logo"><a href="../index.html" accesskey="1">深入 Python :Dive Into Python 中文版</a></h1>
               <p id="tagline">Python 从新手到专家 [Dip_5.4b_CPyUG_Release]</p>
            </td>
            <td colspan="3" align="right">
               <form id="search" method="GET" action="http://www.google.com/custom">
                  <p><label for="q" accesskey="4">Find:&nbsp;</label><input type="text" id="q" name="q" size="20" maxlength="255" value=""> <input type="submit" value="搜索"><input type="hidden" name="domains" value="woodpecker.org.cn"><input type="hidden" name="sitesearch" value="www.woodpecker.org.cn/diveintopython"></p>
               </form>
            </td>
         </tr>
      </table>
      <!--#include virtual="/inc/ads" -->
      <div class="section" lang="zh_cn">
         <div class="titlepage">
            <div>
               <div>
                  <h2 class="title"><a name="soundex.stage1"></a>18.3.&nbsp;优化正则表达式
                  </h2>
               </div>
            </div>
            <div></div>
         </div>
         <div class="abstract">
            <p> Soundex 函数的第一件事是检查输入是否是一个空字符串。怎样做是最好的方法?</p>
         </div>
         <p>如果你回答 “<span class="quote">正则表达式</span>”,坐在角落里反省你糟糕的直觉。正则表达式几乎永远不是最好的答案,而且应该被尽可能避开。这不仅仅是基于性能考虑,而是因为调试和维护都很困难,当然性能也是个原因。
         </p>
         <p>这是 <tt class="filename">soundex/stage1/soundex1a.py</tt> 检查 <tt class="varname">source</tt> 是否全部由字母构成的一段代码,至少是一个字母 (而不是空字符串):
         </p>
         <div class="informalexample"><pre class="programlisting">
    allChars = string.uppercase + string.lowercase
    <span class='pykeyword'>if</span> <span class='pykeyword'>not</span> re.search(<span class='pystring'>'^[%s]+$'</span> % allChars, source):
        <span class='pykeyword'>return</span> <span class='pystring'>"0000"</span>
</pre></div>
         <p><tt class="filename">soundex1a.py</tt> 表现如何?为了方便,<tt class="literal">__main__</tt> 部分包含了一段代码:调用 <tt class="filename">timeit</tt> 模块,为三个不同名字分别建立测试,依次测试,并显示每个测试的最短耗时:
         </p>
         <div class="informalexample"><pre class="programlisting"><span class='pykeyword'>
if</span> __name__ == <span class='pystring'>'__main__'</span>:
    <span class='pykeyword'>from</span> timeit <span class='pykeyword'>import</span> Timer
    names = (<span class='pystring'>'Woo'</span>, <span class='pystring'>'Pilgrim'</span>, <span class='pystring'>'Flingjingwaller'</span>)
    <span class='pykeyword'>for</span> name <span class='pykeyword'>in</span> names:
        statement = <span class='pystring'>"soundex('%s')"</span> % name
        t = Timer(statement, <span class='pystring'>"from __main__ import soundex"</span>)
        <span class='pykeyword'>print</span> name.ljust(15), soundex(name), min(t.repeat())
</pre></div>
         <p>那么,应用正则表达式的 <tt class="filename">soundex1a.py</tt> 表现如何呢?
         </p>
         <div class="informalexample"><pre class="screen">
<tt class="prompt">C:\samples\soundex\stage1&gt;</tt><span class="userinput">python soundex1a.py</span>
<span class="computeroutput">Woo             W000 19.3356647283
Pilgrim         P426 24.0772053431
Flingjingwaller F452 35.0463220884</span>
</pre></div>
         <p>正如你预料,名字越长,算法耗时就越长。有几个工作可以令我们减小这个差距 (使函数对于长输入花费较短的相对时间) 但是算法的本质决定它不可能每次运行时间都相同。</p>
         <p>另一点应铭记于心的是,我们测试的是有代表性的名字样本。<tt class="literal">Woo</tt> 是个被缩短到单字符并补零的小样本;<tt class="literal">Pilgrim</tt> 是个夹带着特别字符和忽略字符的平均长度的正常样本;<tt class="literal">Flingjingwaller</tt> 是一个包含连续重复字符并且特别长的样本。其它的测试可能同样有帮助,但它们已经很好地代表了不同的样本范围。
         </p>
         <p>那么那个正则表达式如何呢?嗯,缺乏效率。因为这个表达式测试不止一个范围的字符 (<tt class="literal">A-Z</tt> 的大写范围和 <tt class="literal">a-z</tt> 的小写字母范围),我们可以使用一个正则表达式的缩写语法。这便是 <tt class="filename">soundex/stage1/soundex1b.py</tt>:
         </p>
         <div class="informalexample"><pre class="programlisting">
    <span class='pykeyword'>if</span> <span class='pykeyword'>not</span> re.search(<span class='pystring'>'^[A-Za-z]+$'</span>, source):
        <span class='pykeyword'>return</span> <span class='pystring'>"0000"</span>
</pre></div>
         <p><tt class="filename">timeit</tt> 显示 <tt class="filename">soundex1b.py</tt><tt class="filename">soundex1a.py</tt> 稍微快一些,但是没什么令人激动的变化:
         </p>
         <div class="informalexample"><pre class="screen">
<tt class="prompt">C:\samples\soundex\stage1&gt;</tt><span class="userinput">python soundex1b.py</span>
<span class="computeroutput">Woo             W000 17.1361133887
Pilgrim         P426 21.8201693232
Flingjingwaller F452 32.7262294509</span>
</pre></div>
         <p><a href="../refactoring/refactoring.html" title="15.3.&nbsp;重构">&nbsp;15.3&nbsp;节 “重构”</a> 中我们看到正则表达式可以被编译并在重用时以更快速度获得结果。因为这个正则表达式在函数中每次被调用时都不变化,我们可以编译它一次并使用被编译的版本。这便是 <tt class="filename">soundex/stage1/soundex1c.py</tt></p>
         <div class="informalexample"><pre class="programlisting">
isOnlyChars = re.compile(<span class='pystring'>'^[A-Za-z]+$'</span>).search
<span class='pykeyword'>def</span><span class='pyclass'> soundex</span>(source):
    <span class='pykeyword'>if</span> <span class='pykeyword'>not</span> isOnlyChars(source):
        <span class='pykeyword'>return</span> <span class='pystring'>"0000"</span>
</pre></div>
         <p><tt class="filename">soundex1c.py</tt> 中使用被编译的正则表达式产生了显著的提速:
         </p>
         <div class="informalexample"><pre class="screen">
<tt class="prompt">C:\samples\soundex\stage1&gt;</tt><span class="userinput">python soundex1c.py</span>
<span class="computeroutput">Woo             W000 14.5348347346
Pilgrim         P426 19.2784703084
Flingjingwaller F452 30.0893873383</span>
</pre></div>
         <p>但是这样的优化是正路吗?这里的逻辑很简单:输入 <tt class="varname">source</tt> 应该是非空,并且需要完全由字母构成。如果编写一个循环查看每个字符并且抛弃正则表达式,是否会更快些?
         </p>
         <p>这便是 <tt class="filename">soundex/stage1/soundex1d.py</tt></p>
         <div class="informalexample"><pre class="programlisting">
    <span class='pykeyword'>if</span> <span class='pykeyword'>not</span> source:
        <span class='pykeyword'>return</span> <span class='pystring'>"0000"</span>
    <span class='pykeyword'>for</span> c <span class='pykeyword'>in</span> source:
        <span class='pykeyword'>if</span> <span class='pykeyword'>not</span> (<span class='pystring'>'A'</span> &lt;= c &lt;= <span class='pystring'>'Z'</span>) <span class='pykeyword'>and</span> <span class='pykeyword'>not</span> (<span class='pystring'>'a'</span> &lt;= c &lt;= <span class='pystring'>'z'</span>):
            <span class='pykeyword'>return</span> <span class='pystring'>"0000"</span>
</pre></div>
         <p>这个技术在 <tt class="filename">soundex1d.py</tt> 中恰好<span class="emphasis"><em>不及</em></span> 编译后的正则表达式快 (尽管比使用未编译的正则表达式快<sup>[<a name="d0e39385" href="#ftn.d0e39385">14</a>]</sup>):
         </p>
         <div class="informalexample"><pre class="screen">
<tt class="prompt">C:\samples\soundex\stage1&gt;</tt><span class="userinput">python soundex1d.py</span>
<span class="computeroutput">Woo             W000 15.4065058548
Pilgrim         P426 22.2753567842
Flingjingwaller F452 37.5845122774</span>
</pre></div>
         <p>为什么 <tt class="filename">soundex1d.py</tt> 没能更快?答案来自 <span class="application">Python</span> 的编译本质。正则表达式引擎以 C 语言编写,被编译后则能本能地在你的计算机上运行。另一方面,循环是以 <span class="application">Python</span> 编写,要通过 <span class="application">Python</span> 解释器。尽管循环相对简单,但没能简单到补偿花在代码解释上的时间。正则表达式永远不是正确答案……但例外还是存在的。
         </p>
         <p>恰巧 <span class="application">Python</span> 提供了一个晦涩的字符串方法。你有理由不了解它,因为本书未曾提到它。这个方法便是 <tt class="methodname">isalpha()</tt>,它检查一个字符串是否只包含字母。
         </p>
         <p>这便是 <tt class="filename">soundex/stage1/soundex1e.py</tt></p>
         <div class="informalexample"><pre class="programlisting">
    <span class='pykeyword'>if</span> (<span class='pykeyword'>not</span> source) <span class='pykeyword'>and</span> (<span class='pykeyword'>not</span> source.isalpha()):
        <span class='pykeyword'>return</span> <span class='pystring'>"0000"</span>
</pre></div>
         <p><tt class="filename">soundex1e.py</tt> 中应用这个特殊方法我们能得到多少好处?  很多。
         </p>
         <div class="informalexample"><pre class="screen">
<tt class="prompt">C:\samples\soundex\stage1&gt;</tt><span class="userinput">python soundex1e.py</span>
<span class="computeroutput">Woo             W000 13.5069504644
Pilgrim         P426 18.2199394057
Flingjingwaller F452 28.9975225902</span>
</pre></div>
         <div class="example"><a name="d0e39452"></a><h3 class="title">&nbsp;18.3.&nbsp;目前为止最好的结果:<tt class="filename">soundex/stage1/soundex1e.py</tt></h3><pre class="programlisting"><span class='pykeyword'>
import</span> string, re

charToSoundex = {<span class='pystring'>"A"</span>: <span class='pystring'>"9"</span>,
                 <span class='pystring'>"B"</span>: <span class='pystring'>"1"</span>,
                 <span class='pystring'>"C"</span>: <span class='pystring'>"2"</span>,
                 <span class='pystring'>"D"</span>: <span class='pystring'>"3"</span>,
                 <span class='pystring'>"E"</span>: <span class='pystring'>"9"</span>,
                 <span class='pystring'>"F"</span>: <span class='pystring'>"1"</span>,
                 <span class='pystring'>"G"</span>: <span class='pystring'>"2"</span>,
                 <span class='pystring'>"H"</span>: <span class='pystring'>"9"</span>,
                 <span class='pystring'>"I"</span>: <span class='pystring'>"9"</span>,
                 <span class='pystring'>"J"</span>: <span class='pystring'>"2"</span>,
                 <span class='pystring'>"K"</span>: <span class='pystring'>"2"</span>,
                 <span class='pystring'>"L"</span>: <span class='pystring'>"4"</span>,
                 <span class='pystring'>"M"</span>: <span class='pystring'>"5"</span>,
                 <span class='pystring'>"N"</span>: <span class='pystring'>"5"</span>,
                 <span class='pystring'>"O"</span>: <span class='pystring'>"9"</span>,
                 <span class='pystring'>"P"</span>: <span class='pystring'>"1"</span>,
                 <span class='pystring'>"Q"</span>: <span class='pystring'>"2"</span>,
                 <span class='pystring'>"R"</span>: <span class='pystring'>"6"</span>,
                 <span class='pystring'>"S"</span>: <span class='pystring'>"2"</span>,
                 <span class='pystring'>"T"</span>: <span class='pystring'>"3"</span>,
                 <span class='pystring'>"U"</span>: <span class='pystring'>"9"</span>,
                 <span class='pystring'>"V"</span>: <span class='pystring'>"1"</span>,
                 <span class='pystring'>"W"</span>: <span class='pystring'>"9"</span>,
                 <span class='pystring'>"X"</span>: <span class='pystring'>"2"</span>,
                 <span class='pystring'>"Y"</span>: <span class='pystring'>"9"</span>,
                 <span class='pystring'>"Z"</span>: <span class='pystring'>"2"</span>}

<span class='pykeyword'>def</span><span class='pyclass'> soundex</span>(source):
    <span class='pykeyword'>if</span> (<span class='pykeyword'>not</span> source) <span class='pykeyword'>and</span> (<span class='pykeyword'>not</span> source.isalpha()):
        <span class='pykeyword'>return</span> <span class='pystring'>"0000"</span>
    source = source[0].upper() + source[1:]
    digits = source[0]
    <span class='pykeyword'>for</span> s <span class='pykeyword'>in</span> source[1:]:
        s = s.upper()
        digits += charToSoundex[s]
    digits2 = digits[0]
    <span class='pykeyword'>for</span> d <span class='pykeyword'>in</span> digits[1:]:
        <span class='pykeyword'>if</span> digits2[-1] != d:
            digits2 += d
    digits3 = re.sub(<span class='pystring'>'9'</span>, <span class='pystring'>''</span>, digits2)
    <span class='pykeyword'>while</span> len(digits3) &lt; 4:
        digits3 += <span class='pystring'>"0"</span>
    <span class='pykeyword'>return</span> digits3[:4]

<span class='pykeyword'>if</span> __name__ == <span class='pystring'>'__main__'</span>:
    <span class='pykeyword'>from</span> timeit <span class='pykeyword'>import</span> Timer
    names = (<span class='pystring'>'Woo'</span>, <span class='pystring'>'Pilgrim'</span>, <span class='pystring'>'Flingjingwaller'</span>)
    <span class='pykeyword'>for</span> name <span class='pykeyword'>in</span> names:
        statement = <span class='pystring'>"soundex('%s')"</span> % name
        t = Timer(statement, <span class='pystring'>"from __main__ import soundex"</span>)
        <span class='pykeyword'>print</span> name.ljust(15), soundex(name), min(t.repeat())
</pre></div>
         <div class="footnotes">
            <h3 class="footnotetitle">Footnotes</h3>
            <div class="footnote">
               <p><sup>[<a name="ftn.d0e39385" href="#d0e39385">14</a>] </sup>注意 <tt class="filename">soundex1d.py</tt> 在后两个测试点上都比 <tt class="filename">soundex1b.py</tt> 慢,这点与作者所说的矛盾。本章另还有多处出现了正文与测试结果矛盾的地方,每个地方都会用译注加以说明。这个 bug 将在下个版本中得到修正。――译注
               </p>
            </div>
         </div>
      </div>
      <table class="Footer" width="100%" border="0" cellpadding="0" cellspacing="0" summary="">
         <tr>
            <td width="35%" align="left"><br><a class="NavigationArrow" href="timeit.html">&lt;&lt;&nbsp;使用 timeit 模块</a></td>
            <td width="30%" align="center"><br>&nbsp;<span class="divider">|</span>&nbsp;<a href="index.html#soundex.divein" title="18.1.&nbsp;概览">1</a> <span class="divider">|</span> <a href="timeit.html" title="18.2.&nbsp;使用 timeit 模块">2</a> <span class="divider">|</span> <span class="thispage">3</span> <span class="divider">|</span> <a href="dictionary_lookups.html" title="18.4.&nbsp;优化字典查找">4</a> <span class="divider">|</span> <a href="list_operations.html" title="18.5.&nbsp;优化列表操作">5</a> <span class="divider">|</span> <a href="string_manipulation.html" title="18.6.&nbsp;优化字符串操作">6</a> <span class="divider">|</span> <a href="summary.html" title="18.7.&nbsp;小结">7</a>&nbsp;<span class="divider">|</span>&nbsp;
            </td>
            <td width="35%" align="right"><br><a class="NavigationArrow" href="dictionary_lookups.html">优化字典查找&nbsp;&gt;&gt;</a></td>
         </tr>
         <tr>
            <td colspan="3"><br></td>
         </tr>
      </table>
      <div class="Footer">
         <p class="copyright">Copyright © 2000, 2001, 2002, 2003, 2004 <a href="mailto:mark@diveintopython.org">Mark Pilgrim</a></p>
         <p class="copyright">Copyright © 2001, 2002, 2003, 2004, 2005, 2006, 2007 <a href="mailto:python-cn@googlegroups.com">CPyUG (邮件列表)</a></p>
      </div>
   </body>
</html>