/usr/share/doc/diveintopython-zh/html/html_processing/extracting_data.html is in diveintopython-zh 5.4b-1.

This file is owned by root:root, with mode 0o644.

<!DOCTYPE html
  PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
   <head>
      <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
   
      <title>8.3.&nbsp;Extracting data from HTML documents</title>
      <link rel="stylesheet" href="../diveintopython.css" type="text/css">
      <link rev="made" href="mailto:f8dy@diveintopython.org">
      <meta name="generator" content="DocBook XSL Stylesheets V1.52.2">
      <meta name="keywords" content="Python, Dive Into Python, tutorial, object-oriented, programming, documentation, book, free">
      <meta name="description" content="Python from novice to pro">
      <link rel="home" href="../toc/index.html" title="Dive Into Python">
      <link rel="up" href="index.html" title="Chapter&nbsp;8.&nbsp;HTML Processing">
      <link rel="previous" href="introducing_sgmllib.html" title="8.2.&nbsp;Introducing sgmllib.py">
      <link rel="next" href="basehtmlprocessor.html" title="8.4.&nbsp;Introducing BaseHTMLProcessor.py">
   </head>
   <body>
      <table id="Header" width="100%" border="0" cellpadding="0" cellspacing="0" summary="">
         <tr>
            <td colspan="3" id="logocontainer">
               <h1 id="logo"><a href="../index.html" accesskey="1">Dive Into Python: Chinese Edition</a></h1>
               <p id="tagline">Python from novice to pro [Dip_5.4b_CPyUG_Release]</p>
            </td>
         </tr>
      </table>
      <div class="section" lang="zh_cn">
         <div class="titlepage">
            <div>
               <div>
                  <h2 class="title"><a name="dialect.extract"></a>8.3.&nbsp;Extracting data from <span class="acronym">HTML</span> documents
                  </h2>
               </div>
            </div>
            <div></div>
         </div>
         <div class="abstract">
            <p>To extract data from <span class="acronym">HTML</span> documents, subclass the <tt class="classname">SGMLParser</tt> class and define a method for each tag or entity you want to capture.
            </p>
         </div>
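         <p>The first step in extracting data from an <span class="acronym">HTML</span> document is getting some <span class="acronym">HTML</span>. If you have <span class="acronym">HTML</span> files on your hard drive, you can read them with the <a href="../file_handling/file_objects.html" title="6.2.&nbsp;Working with File Objects">file-handling functions</a>, but the real fun begins when you fetch <span class="acronym">HTML</span> from live web pages.</p>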
         <div class="example"><a name="dialect.extract.urllib"></a><h3 class="title">Example&nbsp;8.5.&nbsp;Introducing <tt class="filename">urllib</tt>
            </h3><pre class="screen">
<tt class="prompt">&gt;&gt;&gt; </tt><span class="userinput"><span class='pykeyword'>import</span> urllib</span>                                       <a name="dialect.extract.1.1"></a><img src="../images/callouts/1.png" alt="1" border="0" width="12" height="12">
<tt class="prompt">&gt;&gt;&gt; </tt><span class="userinput">sock = urllib.urlopen(<span class='pystring'>"http://diveintopython.org/"</span>)</span> <a name="dialect.extract.1.2"></a><img src="../images/callouts/2.png" alt="2" border="0" width="12" height="12">
<tt class="prompt">&gt;&gt;&gt; </tt><span class="userinput">htmlSource = sock.read()</span>                            <a name="dialect.extract.1.3"></a><img src="../images/callouts/3.png" alt="3" border="0" width="12" height="12">
<tt class="prompt">&gt;&gt;&gt; </tt><span class="userinput">sock.close()</span>                                        <a name="dialect.extract.1.4"></a><img src="../images/callouts/4.png" alt="4" border="0" width="12" height="12">
<tt class="prompt">&gt;&gt;&gt; </tt><span class="userinput"><span class='pykeyword'>print</span> htmlSource</span>                                    <a name="dialect.extract.1.5"></a><img src="../images/callouts/5.png" alt="5" border="0" width="12" height="12">
<span class="computeroutput">&lt;!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"&gt;&lt;html&gt;&lt;head&gt;
      &lt;meta http-equiv='Content-Type' content='text/html; charset=ISO-8859-1'&gt;
   &lt;title&gt;Dive Into Python&lt;/title&gt;
&lt;link rel='stylesheet' href='diveintopython.css' type='text/css'&gt;
&lt;link rev='made' href='mailto:mark@diveintopython.org'&gt;
&lt;meta name='keywords' content='Python, Dive Into Python, tutorial, object-oriented, programming, documentation, book, free'&gt;
&lt;meta name='description' content='a free Python tutorial for experienced programmers'&gt;
&lt;/head&gt;
&lt;body bgcolor='white' text='black' link='#0000FF' vlink='#840084' alink='#0000FF'&gt;
&lt;table cellpadding='0' cellspacing='0' border='0' width='100%'&gt;
&lt;tr&gt;&lt;td class='header' width='1%' valign='top'&gt;diveintopython.org&lt;/td&gt;
&lt;td width='99%' align='right'&gt;&lt;hr size='1' noshade&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class='tagline' colspan='2'&gt;Python&amp;nbsp;for&amp;nbsp;experienced&amp;nbsp;programmers&lt;/td&gt;&lt;/tr&gt;</span>

[...snip...]</pre><div class="calloutlist">
               <table border="0" summary="Callout list">
                  <tr>
                     <td width="12" valign="top" align="left"><a href="#dialect.extract.1.1"><img src="../images/callouts/1.png" alt="1" border="0" width="12" height="12"></a> 
                     </td>
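                     <td valign="top" align="left">The <tt class="filename">urllib</tt> module is part of the standard <span class="application">Python</span> library. It contains functions for getting information about Internet-based <span class="acronym">URL</span>s (mainly web pages) and for actually retrieving their data.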
                     </td>
                  </tr>
                  <tr>
                     <td width="12" valign="top" align="left"><a href="#dialect.extract.1.2"><img src="../images/callouts/2.png" alt="2" border="0" width="12" height="12"></a> 
                     </td>
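                     <td valign="top" align="left">The simplest use of <tt class="filename">urllib</tt> is to retrieve the entire text of a web page with the <tt class="function">urlopen</tt> function. Opening a <span class="acronym">URL</span> is similar to <a href="../file_handling/file_objects.html" title="6.2.&nbsp;Working with File Objects">opening a file</a>. The return value of <tt class="function">urlopen</tt> is a file-like object, which has some of the same methods as a file object.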
                     </td>
                  </tr>
                  <tr>
                     <td width="12" valign="top" align="left"><a href="#dialect.extract.1.3"><img src="../images/callouts/3.png" alt="3" border="0" width="12" height="12"></a> 
                     </td>
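                     <td valign="top" align="left">The simplest thing to do with the file-like object returned by <tt class="function">urlopen</tt> is to <tt class="function">read</tt> it, which reads the entire <span class="acronym">HTML</span> of the web page into a single string. The object also supports <tt class="function">readlines</tt>, which reads the text line by line into a list.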
                     </td>
                  </tr>
                  <tr>
                     <td width="12" valign="top" align="left"><a href="#dialect.extract.1.4"><img src="../images/callouts/4.png" alt="4" border="0" width="12" height="12"></a> 
                     </td>
                     <td valign="top" align="left">When you are done with the object, make sure to <tt class="function">close</tt> it, just like a normal file object.
                     </td>
                  </tr>
                  <tr>
                     <td width="12" valign="top" align="left"><a href="#dialect.extract.1.5"><img src="../images/callouts/5.png" alt="5" border="0" width="12" height="12"></a> 
                     </td>
                     <td valign="top" align="left">You now have the complete <span class="acronym">HTML</span> of the home page of <tt class="systemitem">http://diveintopython.org/</tt> in a string, and you are ready to parse it.
                     </td>
                  </tr>
               </table>
            </div>
         </div>
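         <p>The <tt class="function">read</tt> and <tt class="function">readlines</tt> calls described above can be tried without a network connection. What follows is only a minimal sketch, assuming Python 3, where <tt class="function">urlopen</tt> lives in <tt class="filename">urllib.request</tt> rather than <tt class="filename">urllib</tt> and returns bytes rather than a string. The temporary file, its <tt class="literal">file:</tt> URL, and the helper names <tt class="function">fetch</tt> and <tt class="function">fetch_lines</tt> are made up for illustration and are not part of the book's examples.</p>

```python
import tempfile
from pathlib import Path
from urllib.request import urlopen  # Python 3 location of urlopen (assumption: Python 3)


def fetch(url):
    """Return everything behind url as one bytes object (hypothetical helper)."""
    sock = urlopen(url)
    try:
        return sock.read()       # read() pulls the whole resource at once
    finally:
        sock.close()             # close it, just like a normal file object


def fetch_lines(url):
    """Return the resource behind url as a list of lines (hypothetical helper)."""
    sock = urlopen(url)
    try:
        return sock.readlines()  # readlines() splits the data line by line
    finally:
        sock.close()


# Stand-in for a live web page: a local file addressed through a file: URL.
with tempfile.TemporaryDirectory() as tmp:
    page = Path(tmp) / "sample.html"
    page.write_bytes(b"<html>\n<body>hello</body>\n</html>\n")
    url = page.as_uri()
    html_source = fetch(url)
    lines = fetch_lines(url)

print(len(html_source), "bytes in", len(lines), "lines")
```

         <p>Pointing <tt class="function">fetch</tt> at an <tt class="literal">http:</tt> URL should work the same way, since the object returned by <tt class="function">urlopen</tt> is file-like in both cases.</p>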
         <div class="example"><a name="dialect.extract.links"></a><h3 class="title">Example&nbsp;8.6.&nbsp;Introducing <tt class="filename">urllister.py</tt>
            </h3>
            <p>If you have not already done so, you can <a href="http://www.woodpecker.org.cn/diveintopython/download/diveintopython-exampleszh-cn-5.4b.zip" title="Download example scripts">download this and other examples</a> used in this book.</p><pre class="programlisting"><span class='pykeyword'>
from</span> sgmllib <span class='pykeyword'>import</span> SGMLParser

<span class='pykeyword'>class</span><span class='pyclass'> URLLister</span>(SGMLParser):
    <span class='pykeyword'>def</span><span class='pyclass'> reset</span>(self):                              <a name="dialect.extract.2.1"></a><img src="../images/callouts/1.png" alt="1" border="0" width="12" height="12">
        SGMLParser.reset(self)
        self.urls = []

    <span class='pykeyword'>def</span><span class='pyclass'> start_a</span>(self, attrs):                     <a name="dialect.extract.2.2"></a><img src="../images/callouts/2.png" alt="2" border="0" width="12" height="12">
        href = [v <span class='pykeyword'>for</span> k, v <span class='pykeyword'>in</span> attrs <span class='pykeyword'>if</span> k==<span class='pystring'>'href'</span>] <a name="dialect.extract.2.3"></a><img src="../images/callouts/3.png" alt="3" border="0" width="12" height="12"> <a name="dialect.extract.2.4"></a><img src="../images/callouts/4.png" alt="4" border="0" width="12" height="12">
        <span class='pykeyword'>if</span> href:
            self.urls.extend(href)</pre><div class="calloutlist">
               <table border="0" summary="Callout list">
                  <tr>
                     <td width="12" valign="top" align="left"><a href="#dialect.extract.2.1"><img src="../images/callouts/1.png" alt="1" border="0" width="12" height="12"></a> 
                     </td>
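                     <td valign="top" align="left"><tt class="function">reset</tt> is called by the <tt class="function">__init__</tt> method of <tt class="classname">SGMLParser</tt>, and it can also be called manually once an instance of the parser has been created. So if you need to do any initialization, do it in <tt class="function">reset</tt>, not in <tt class="function">__init__</tt>, so that the parser is re-initialized properly when someone reuses a parser instance.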
                     </td>
                  </tr>
                  <tr>
                     <td width="12" valign="top" align="left"><a href="#dialect.extract.2.2"><img src="../images/callouts/2.png" alt="2" border="0" width="12" height="12"></a> 
                     </td>
                     <td valign="top" align="left"><tt class="function">start_a</tt> is called by <tt class="classname">SGMLParser</tt> whenever it finds an <tt class="sgmltag-element">&lt;a&gt;</tt> tag. The tag may contain an <tt class="literal">href</tt> attribute, other attributes such as <tt class="literal">name</tt> or <tt class="literal">title</tt>, or both. The <tt class="varname">attrs</tt> parameter is a list of tuples, <tt class="literal">[(<i class="replaceable">attribute</i>, <i class="replaceable">value</i>), (<i class="replaceable">attribute</i>, <i class="replaceable">value</i>), ...]</tt>. Or the tag may be just a bare <tt class="sgmltag-element">&lt;a&gt;</tt>, a valid (if useless) <span class="acronym">HTML</span> tag, in which case <tt class="varname">attrs</tt> is an empty list.
                     </td>
                  </tr>
                  <tr>
                     <td width="12" valign="top" align="left"><a href="#dialect.extract.2.3"><img src="../images/callouts/3.png" alt="3" border="0" width="12" height="12"></a> 
                     </td>
                     <td valign="top" align="left">You can find out whether this <tt class="sgmltag-element">&lt;a&gt;</tt> tag has an <tt class="literal">href</tt> attribute with a simple <a href="../native_data_types/declaring_variables.html#odbchelper.multiassign" title="3.4.2.&nbsp;Assigning Multiple Values at Once">multi-variable</a> <a href="../native_data_types/mapping_lists.html" title="3.6.&nbsp;Mapping Lists">list comprehension</a>.
                     </td>
                  </tr>
                  <tr>
                     <td width="12" valign="top" align="left"><a href="#dialect.extract.2.4"><img src="../images/callouts/4.png" alt="4" border="0" width="12" height="12"></a> 
                     </td>
                     <td valign="top" align="left">String comparisons like <tt class="literal">k=='href'</tt> are always case-sensitive, but that is safe here, because <tt class="classname">SGMLParser</tt> converts attribute names to lowercase while building <tt class="varname">attrs</tt>.
                     </td>
                  </tr>
               </table>
            </div>
         </div>
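         <p><tt class="filename">sgmllib</tt> was removed in Python 3, so the listing above only runs on Python 2. The sketch below illustrates the same ideas (state set up in <tt class="function">reset</tt>, a list of <tt class="literal">(attribute, value)</tt> tuples, lowercased attribute names) with a swapped-in parser, <tt class="classname">HTMLParser</tt> from the standard <tt class="filename">html.parser</tt> module. It is an approximation, not the book's code: <tt class="classname">HTMLParser</tt> has no per-tag <tt class="function">start_a</tt> hook, so the tag name is checked inside <tt class="function">handle_starttag</tt>, and the sample markup is invented for the demonstration.</p>

```python
from html.parser import HTMLParser  # stand-in for sgmllib.SGMLParser (assumption: Python 3)


class URLLister(HTMLParser):
    def reset(self):
        # HTMLParser.__init__ calls reset(), so state created here is rebuilt
        # both for a fresh instance and when an instance is reused.
        HTMLParser.reset(self)
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # Called for every start tag; attrs is a list of (name, value) tuples.
        if tag != "a":
            return
        # Tag and attribute names arrive lowercased, so comparing against
        # 'href' is safe. A valueless attribute has the value None; skip it.
        self.urls.extend(v for k, v in attrs if k == "href" and v is not None)


parser = URLLister()
# Invented sample markup: mixed-case names and one anchor without an href.
parser.feed('<a HREF="toc/index.html">TOC</a> <a name="top"></a> <A href="#download">get it</A>')
parser.close()
urls = list(parser.urls)

parser.reset()                  # reuse the instance: the URL list starts over
urls_after_reset = parser.urls

for url in urls:
    print(url)
```

         <p>Keeping the initialization in <tt class="function">reset</tt> rather than in <tt class="function">__init__</tt> is what lets the second half of the sketch reuse the same instance with an empty list.</p>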
         <div class="example"><a name="dialect.feed.example"></a><h3 class="title">Example&nbsp;8.7.&nbsp;Using <tt class="filename">urllister.py</tt></h3><pre class="screen">
<tt class="prompt">&gt;&gt;&gt; </tt><span class="userinput"><span class='pykeyword'>import</span> urllib, urllister</span>
<tt class="prompt">&gt;&gt;&gt; </tt><span class="userinput">usock = urllib.urlopen(<span class='pystring'>"http://diveintopython.org/"</span>)</span>
<tt class="prompt">&gt;&gt;&gt; </tt><span class="userinput">parser = urllister.URLLister()</span>
<tt class="prompt">&gt;&gt;&gt; </tt><span class="userinput">parser.feed(usock.read())</span>         <a name="dialect.extract.3.1"></a><img src="../images/callouts/1.png" alt="1" border="0" width="12" height="12">
<tt class="prompt">&gt;&gt;&gt; </tt><span class="userinput">usock.close()</span>                     <a name="dialect.extract.3.2"></a><img src="../images/callouts/2.png" alt="2" border="0" width="12" height="12">
<tt class="prompt">&gt;&gt;&gt; </tt><span class="userinput">parser.close()</span>                    <a name="dialect.extract.3.3"></a><img src="../images/callouts/3.png" alt="3" border="0" width="12" height="12">
<tt class="prompt">&gt;&gt;&gt; </tt><span class="userinput"><span class='pykeyword'>for</span> url <span class='pykeyword'>in</span> parser.urls: <span class='pykeyword'>print</span> url</span> <a name="dialect.extract.3.4"></a><img src="../images/callouts/4.png" alt="4" border="0" width="12" height="12">
<span class="computeroutput">toc/index.html
#download
#languages
toc/index.html
appendix/history.html
download/diveintopython-html-5.0.zip
download/diveintopython-pdf-5.0.zip
download/diveintopython-word-5.0.zip
download/diveintopython-text-5.0.zip
download/diveintopython-html-flat-5.0.zip
download/diveintopython-xml-5.0.zip
download/diveintopython-common-5.0.zip
</span>

[...snip...]</pre><div class="calloutlist">
               <table border="0" summary="Callout list">
                  <tr>
                     <td width="12" valign="top" align="left"><a href="#dialect.extract.3.1"><img src="../images/callouts/1.png" alt="1" border="0" width="12" height="12"></a> 
                     </td>
                     <td valign="top" align="left">Call the <tt class="function">feed</tt> method, defined in <tt class="classname">SGMLParser</tt>, to get <span class="acronym">HTML</span> into the parser.
                        <sup>[<a name="d0e20660" href="#ftn.d0e20660">4</a>]</sup>
                        It takes a string, which is what <tt class="function">usock.read()</tt> returns.
                     </td>
                  </tr>
                  <tr>
                     <td width="12" valign="top" align="left"><a href="#dialect.extract.3.2"><img src="../images/callouts/2.png" alt="2" border="0" width="12" height="12"></a> 
                     </td>
                     <td valign="top" align="left">As with files, you should <tt class="function">close</tt> your <span class="acronym">URL</span> objects as soon as you are done with them.
                     </td>
                  </tr>
                  <tr>
                     <td width="12" valign="top" align="left"><a href="#dialect.extract.3.3"><img src="../images/callouts/3.png" alt="3" border="0" width="12" height="12"></a> 
                     </td>
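                     <td valign="top" align="left">You should <tt class="function">close</tt> your parser object too, but for a different reason. The <tt class="function">feed</tt> method is not guaranteed to have processed all the <span class="acronym">HTML</span> you give it; it may buffer some of it while waiting for more. Once there is no more input, call <tt class="function">close</tt> to flush the buffer and force everything to be fully parsed.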
                     </td>
                  </tr>
                  <tr>
                     <td width="12" valign="top" align="left"><a href="#dialect.extract.3.4"><img src="../images/callouts/4.png" alt="4" border="0" width="12" height="12"></a> 
                     </td>
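                     <td valign="top" align="left">Once the parser is <tt class="function">close</tt>d, parsing is complete, and <tt class="varname">parser.urls</tt> contains a list of all the linked <span class="acronym">URL</span>s in the <span class="acronym">HTML</span> document. (If your output looks different, it is because the page's download links have since been updated to a newer version of the book.)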
                     </td>
                  </tr>
               </table>
            </div>
         </div>
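         <p>The buffering behaviour behind <tt class="function">feed</tt> and <tt class="function">close</tt> can be observed by handing the parser its input in pieces. This is a hedged sketch, again assuming Python 3 and using <tt class="classname">HTMLParser</tt> from <tt class="filename">html.parser</tt> as a stand-in for <tt class="classname">SGMLParser</tt>. The class name <tt class="classname">LinkCollector</tt> and the two chunks are hypothetical; the second anchor tag is deliberately split across the chunks.</p>

```python
from html.parser import HTMLParser  # stand-in for sgmllib.SGMLParser (assumption: Python 3)


class LinkCollector(HTMLParser):    # hypothetical name, for illustration only
    def reset(self):
        HTMLParser.reset(self)
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.urls.extend(v for k, v in attrs if k == "href" and v is not None)


# The second <a> tag is cut in half, as could happen when a page is read
# from a socket in fixed-size blocks.
chunks = ['<a href="one.html">1</a> <a hr', 'ef="two.html">2</a>']

parser = LinkCollector()
seen_after_each_feed = []
for chunk in chunks:
    parser.feed(chunk)              # an incomplete tag is held back, not parsed
    seen_after_each_feed.append(list(parser.urls))
parser.close()                      # no more input: flush whatever is still buffered

print(seen_after_each_feed)
```

         <p>After the first <tt class="function">feed</tt> only the complete tag has been reported; the half-finished one stays in the parser's buffer until the rest arrives. Calling <tt class="function">close</tt> at the end tells the parser that nothing more is coming.</p>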
         <div class="footnotes">
            <h3 class="footnotetitle">Footnotes</h3>
            <div class="footnote">
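               <p><sup>[<a name="ftn.d0e20660" href="#d0e20660">4</a>] </sup>The technical term for a parser like <tt class="classname">SGMLParser</tt> is a <span class="emphasis"><em>consumer</em></span>: it consumes <span class="acronym">HTML</span> and breaks it down. Presumably the name <tt class="function">feed</tt> was chosen to fit the whole <span class="emphasis"><em>consumer</em></span> motif. Personally, it makes me think of a zoo exhibit: a dark cage with no trees, no plants, and no sign of life of any kind. But if you stand perfectly still and look really closely, you can make out two beady eyes staring back at you from the far corner, and you tell yourself it is just your mind playing tricks on you. The only way to tell that the cage is not empty is a small, innocuous sign on the railing that reads
                  “<span class="quote">Do not feed the parser.</span>” But maybe that is just me. In any event, it is an interesting mental image.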
               </p>
            </div>
         </div>
      </div>
      <div class="Footer">
         <p class="copyright">Copyright © 2000, 2001, 2002, 2003, 2004 <a href="mailto:mark@diveintopython.org">Mark Pilgrim</a></p>
         <p class="copyright">Copyright © 2001, 2002, 2003, 2004, 2005, 2006, 2007 <a href="mailto:python-cn@googlegroups.com">CPyUG (mailing list)</a></p>
      </div>
   </body>
</html>