/usr/share/doc/diveintopython-zh/html/html_processing/extracting_data.html is in diveintopython-zh 5.4b-1.
This file is owned by root:root, with mode 0o644.
The actual contents of the file can be viewed below.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 | <!DOCTYPE html
PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>8.3. 从 HTML 文档中提取数据</title>
<link rel="stylesheet" href="../diveintopython.css" type="text/css">
<link rev="made" href="mailto:f8dy@diveintopython.org">
<meta name="generator" content="DocBook XSL Stylesheets V1.52.2">
<meta name="keywords" content="Python, Dive Into Python, tutorial, object-oriented, programming, documentation, book, free">
<meta name="description" content="Python from novice to pro">
<link rel="home" href="../toc/index.html" title="Dive Into Python">
<link rel="up" href="index.html" title="第 8 章 HTML 处理">
<link rel="previous" href="introducing_sgmllib.html" title="8.2. sgmllib.py 介绍">
<link rel="next" href="basehtmlprocessor.html" title="8.4. BaseHTMLProcessor.py 介绍">
</head>
<body>
<table id="Header" width="100%" border="0" cellpadding="0" cellspacing="0" summary="">
<tr>
<td id="breadcrumb" colspan="5" align="left" valign="top">导航:<a href="../index.html">起始页</a> > <a href="../toc/index.html">Dive Into Python</a> > <a href="index.html">HTML 处理</a> > <span class="thispage">从 HTML 文档中提取数据</span></td>
<td id="navigation" align="right" valign="top"> <a href="introducing_sgmllib.html" title="上一页: “sgmllib.py 介绍”"><<</a> <a href="basehtmlprocessor.html" title="下一页: “BaseHTMLProcessor.py 介绍”">>></a></td>
</tr>
<tr>
<td colspan="3" id="logocontainer">
<h1 id="logo"><a href="../index.html" accesskey="1">深入 Python :Dive Into Python 中文版</a></h1>
<p id="tagline">Python 从新手到专家 [Dip_5.4b_CPyUG_Release]</p>
</td>
<td colspan="3" align="right">
<form id="search" method="GET" action="http://www.google.com/custom">
<p><label for="q" accesskey="4">Find: </label><input type="text" id="q" name="q" size="20" maxlength="255" value=""> <input type="submit" value="搜索"><input type="hidden" name="domains" value="woodpecker.org.cn"><input type="hidden" name="sitesearch" value="www.woodpecker.org.cn/diveintopython"></p>
</form>
</td>
</tr>
</table>
<!--#include virtual="/inc/ads" -->
<div class="section" lang="zh_cn">
<div class="titlepage">
<div>
<div>
<h2 class="title"><a name="dialect.extract"></a>8.3. 从 <span class="acronym">HTML</span> 文档中提取数据
</h2>
</div>
</div>
<div></div>
</div>
<div class="abstract">
<p>为了从 <span class="acronym">HTML</span> 文档中提取数据,将 <tt class="classname">SGMLParser</tt> 类进行子类化,然后对想要捕捉的标记或实体定义方法。
</p>
</div>
<p>从 <span class="acronym">HTML</span> 文档中提取数据的第一步是得到某个 <span class="acronym">HTML</span> 文件。如果在您的硬盘里存放着 <span class="acronym">HTML</span> 文件,您可以使用<a href="../file_handling/file_objects.html" title="6.2. 与文件对象共事">处理文件的函数</a>将它读出来,但是真正有意思的是从实际的网页得到 <span class="acronym">HTML</span>。
</p>
<div class="example"><a name="dialect.extract.urllib"></a><h3 class="title">例 8.5. <tt class="filename">urllib</tt> 介绍
</h3><pre class="screen">
<tt class="prompt">>>> </tt><span class="userinput"><span class='pykeyword'>import</span> urllib</span> <a name="dialect.extract.1.1"></a><img src="../images/callouts/1.png" alt="1" border="0" width="12" height="12">
<tt class="prompt">>>> </tt><span class="userinput">sock = urllib.urlopen(<span class='pystring'>"http://diveintopython.org/"</span>)</span> <a name="dialect.extract.1.2"></a><img src="../images/callouts/2.png" alt="2" border="0" width="12" height="12">
<tt class="prompt">>>> </tt><span class="userinput">htmlSource = sock.read()</span> <a name="dialect.extract.1.3"></a><img src="../images/callouts/3.png" alt="3" border="0" width="12" height="12">
<tt class="prompt">>>> </tt><span class="userinput">sock.close()</span> <a name="dialect.extract.1.4"></a><img src="../images/callouts/4.png" alt="4" border="0" width="12" height="12">
<tt class="prompt">>>> </tt><span class="userinput"><span class='pykeyword'>print</span> htmlSource</span> <a name="dialect.extract.1.5"></a><img src="../images/callouts/5.png" alt="5" border="0" width="12" height="12">
<span class="computeroutput"><!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"><html><head>
<meta http-equiv='Content-Type' content='text/html; charset=ISO-8859-1'>
<title>Dive Into Python</title>
<link rel='stylesheet' href='diveintopython.css' type='text/css'>
<link rev='made' href='mailto:mark@diveintopython.org'>
<meta name='keywords' content='Python, Dive Into Python, tutorial, object-oriented, programming, documentation, book, free'>
<meta name='description' content='a free Python tutorial for experienced programmers'>
</head>
<body bgcolor='white' text='black' link='#0000FF' vlink='#840084' alink='#0000FF'>
<table cellpadding='0' cellspacing='0' border='0' width='100%'>
<tr><td class='header' width='1%' valign='top'>diveintopython.org</td>
<td width='99%' align='right'><hr size='1' noshade></td></tr>
<tr><td class='tagline' colspan='2'>Python&nbsp;for&nbsp;experienced&nbsp;programmers</td></tr></span>
[...略...]</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#dialect.extract.1.1"><img src="../images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><tt class="filename">urllib</tt> 模块是标准 <span class="application">Python</span> 库的一部分。它包含了一些函数,可以从基于互联网的 <span class="acronym">URL</span> (主要指网页) 来获取信息并且真正取回数据。
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#dialect.extract.1.2"><img src="../images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><tt class="filename">urllib</tt> 模块最简单的使用是提取用 <tt class="function">urlopen</tt> 函数取回的网页的整个文本。打开一个 <span class="acronym">URL</span> 同<a href="../file_handling/file_objects.html" title="6.2. 与文件对象共事">打开一个文件</a>相似。<tt class="function">urlopen</tt> 的返回值是像文件一样的对象,它具有一个文件对象一样的方法。
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#dialect.extract.1.3"><img src="../images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">使用由 <tt class="function">urlopen</tt> 所返回的类文件对象所能做的最简单的事情就是 <tt class="function">read</tt>,它可以将网页的整个 <span class="acronym">HTML</span> 读到一个字符串中。这个对象也支持 <tt class="function">readlines</tt> 方法,这个方法可以将文本按行放入一个列表中。
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#dialect.extract.1.4"><img src="../images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">当用完这个对象,要确保将它 <tt class="function">close</tt>,就如同一个普通的文件对象。
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#dialect.extract.1.5"><img src="../images/callouts/5.png" alt="5" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">现在我们将 <tt class="systemitem">http://diveintopython.org/</tt> 主页的完整的 <span class="acronym">HTML</span> 保存在一个字符串中了,接着我们将分析它。
</td>
</tr>
</table>
</div>
</div>
<div class="example"><a name="dialect.extract.links"></a><h3 class="title">例 8.6. <tt class="filename">urllister.py</tt> 介绍
</h3>
<p>如果您还没有下载本书附带的样例程序, 可以 <a href="http://www.woodpecker.org.cn/diveintopython/download/diveintopython-exampleszh-cn-5.4b.zip" title="Download example scripts">下载本程序和其他样例程序</a>。
</p><pre class="programlisting"><span class='pykeyword'>
from</span> sgmllib <span class='pykeyword'>import</span> SGMLParser
<span class='pykeyword'>class</span><span class='pyclass'> URLLister</span>(SGMLParser):
<span class='pykeyword'>def</span><span class='pyclass'> reset</span>(self): <a name="dialect.extract.2.1"></a><img src="../images/callouts/1.png" alt="1" border="0" width="12" height="12">
SGMLParser.reset(self)
self.urls = []
<span class='pykeyword'>def</span><span class='pyclass'> start_a</span>(self, attrs): <a name="dialect.extract.2.2"></a><img src="../images/callouts/2.png" alt="2" border="0" width="12" height="12">
href = [v <span class='pykeyword'>for</span> k, v <span class='pykeyword'>in</span> attrs <span class='pykeyword'>if</span> k==<span class='pystring'>'href'</span>] <a name="dialect.extract.2.3"></a><img src="../images/callouts/3.png" alt="3" border="0" width="12" height="12"> <a name="dialect.extract.2.4"></a><img src="../images/callouts/4.png" alt="4" border="0" width="12" height="12">
<span class='pykeyword'>if</span> href:
self.urls.extend(href)</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#dialect.extract.2.1"><img src="../images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><tt class="function">reset</tt> 由 <tt class="classname">SGMLParser</tt> 的 <tt class="function">__init__</tt> 方法来调用,也可以在创建一个分析器实例时手工来调用。所以如果您需要做初始化,在 <tt class="function">reset</tt> 中去做,而不要在 <tt class="function">__init__</tt> 中做。这样当某人重用一个分析器实例时,可以正确地重新初始化。
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#dialect.extract.2.2"><img src="../images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">只要找到一个 <tt class="sgmltag-element"><a></tt> 标记,<tt class="function">start_a</tt> 就会由 <tt class="classname">SGMLParser</tt> 进行调用。这个标记可以包含一个 <tt class="literal">href</tt> 属性,或者包含其它的属性,如 <tt class="literal">name</tt> 或 <tt class="literal">title</tt>。<tt class="varname">attrs</tt> 参数是一个 tuple 的 list,<tt class="literal">[(<i class="replaceable">attribute</i>, <i class="replaceable">value</i>), (<i class="replaceable">attribute</i>, <i class="replaceable">value</i>), ...]</tt>。或者它可以只是一个有效的 <span class="acronym">HTML</span> 标记 <tt class="sgmltag-element"><a></tt> (尽管无用),这时 <tt class="varname">attrs</tt> 将是个空 list。
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#dialect.extract.2.3"><img src="../images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">我们可以通过一个简单的<a href="../native_data_types/declaring_variables.html#odbchelper.multiassign" title="3.4.2. 一次赋多值">多变量</a> <a href="../native_data_types/mapping_lists.html" title="3.6. 映射 list">list 映射</a>来查找这个 <tt class="sgmltag-element"><a></tt> 标记是否拥有一个 <tt class="literal">href</tt> 属性。
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#dialect.extract.2.4"><img src="../images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">像 <tt class="literal">k=='href'</tt> 的字符串比较是区分大小写的,但是这里是安全的。因为 <tt class="classname">SGMLParser</tt> 会在创建 <tt class="varname">attrs</tt> 时将属性名转化为小写。
</td>
</tr>
</table>
</div>
</div>
<div class="example"><a name="dialect.feed.example"></a><h3 class="title">例 8.7. 使用 <tt class="filename">urllister.py</tt></h3><pre class="screen">
<tt class="prompt">>>> </tt><span class="userinput"><span class='pykeyword'>import</span> urllib, urllister</span>
<tt class="prompt">>>> </tt><span class="userinput">usock = urllib.urlopen(<span class='pystring'>"http://diveintopython.org/"</span>)</span>
<tt class="prompt">>>> </tt><span class="userinput">parser = urllister.URLLister()</span>
<tt class="prompt">>>> </tt><span class="userinput">parser.feed(usock.read())</span> <a name="dialect.extract.3.1"></a><img src="../images/callouts/1.png" alt="1" border="0" width="12" height="12">
<tt class="prompt">>>> </tt><span class="userinput">usock.close()</span> <a name="dialect.extract.3.2"></a><img src="../images/callouts/2.png" alt="2" border="0" width="12" height="12">
<tt class="prompt">>>> </tt><span class="userinput">parser.close()</span> <a name="dialect.extract.3.3"></a><img src="../images/callouts/3.png" alt="3" border="0" width="12" height="12">
<tt class="prompt">>>> </tt><span class="userinput"><span class='pykeyword'>for</span> url <span class='pykeyword'>in</span> parser.urls: <span class='pykeyword'>print</span> url</span> <a name="dialect.extract.3.4"></a><img src="../images/callouts/4.png" alt="4" border="0" width="12" height="12">
<span class="computeroutput">toc/index.html
#download
#languages
toc/index.html
appendix/history.html
download/diveintopython-html-5.0.zip
download/diveintopython-pdf-5.0.zip
download/diveintopython-word-5.0.zip
download/diveintopython-text-5.0.zip
download/diveintopython-html-flat-5.0.zip
download/diveintopython-xml-5.0.zip
download/diveintopython-common-5.0.zip
</span>
...略...</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href="#dialect.extract.3.1"><img src="../images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">调用定义在 <tt class="classname">SGMLParser</tt> 中的 <tt class="function">feed</tt> 方法,将 <span class="acronym">HTML</span> 内容放入分析器中。
<sup>[<a name="d0e20660" href="#ftn.d0e20660">4</a>]</sup>
这个方法接收一个字符串,这个字符串就是 <tt class="function">usock.read()</tt> 所返回的。
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#dialect.extract.3.2"><img src="../images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">像处理文件一样,一旦处理完毕,您应该 <tt class="function">close</tt> 您的 <span class="acronym">URL</span> 对象。
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#dialect.extract.3.3"><img src="../images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">您也应该 <tt class="function">close</tt> 您的分析器对象,但出于不同的原因。<tt class="function">feed</tt> 方法不保证对传给它的全部 <span class="acronym">HTML</span> 进行处理,它可能会对其进行缓冲处理,等待接收更多的内容。只要没有更多的内容,就应调用 <tt class="function">close</tt> 来刷新缓冲区,并且强制所有内容被完全处理。
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href="#dialect.extract.3.4"><img src="../images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">一旦分析器被 <tt class="function">close</tt>,分析过程也就结束了。<tt class="varname">parser.urls</tt> 中包含了在 <span class="acronym">HTML</span> 文档中所有的链接 <span class="acronym">URL</span>。(如果当您读到此处发现输出结果不一样,那是因为下载了本书的更新版本。)
</td>
</tr>
</table>
</div>
</div>
<div class="footnotes">
<h3 class="footnotetitle">Footnotes</h3>
<div class="footnote">
<p><sup>[<a name="ftn.d0e20660" href="#d0e20660">4</a>] </sup>像 <tt class="classname">SGMLParser</tt> 这样的分析器,技术术语叫做<span class="emphasis"><em>消费者 (consumer)</em></span>。它消费 <span class="acronym">HTML</span>,并且拆分它。也许因此就选择了 <tt class="function">feed</tt> 这个名字,以便同<span class="emphasis"><em>消费者 </em></span> 这个主题相适应。就个人来说,它让我想象在动物园看展览。里面有一个黑漆漆的兽穴,没有树,没有植物,没有任何生命的迹象。但只要您非常安静地站着,尽可能靠近着瞧,您会看到在远处的角落里有两只明眸在盯着您。但是您会安慰自已那不过是心理作用。唯一知道兽穴里并不是空无一物的方法,就是在栅栏上有一个不明显的标记,上面写着
“<span class="quote">禁止给分析器喂食</span>”。但也许只有我这么想,不管怎么样,这种心理想象很有意思。
</p>
</div>
</div>
</div>
<table class="Footer" width="100%" border="0" cellpadding="0" cellspacing="0" summary="">
<tr>
<td width="35%" align="left"><br><a class="NavigationArrow" href="introducing_sgmllib.html"><< sgmllib.py 介绍</a></td>
<td width="30%" align="center"><br> <span class="divider">|</span> <a href="index.html#dialect.divein" title="8.1. 概览">1</a> <span class="divider">|</span> <a href="introducing_sgmllib.html" title="8.2. sgmllib.py 介绍">2</a> <span class="divider">|</span> <span class="thispage">3</span> <span class="divider">|</span> <a href="basehtmlprocessor.html" title="8.4. BaseHTMLProcessor.py 介绍">4</a> <span class="divider">|</span> <a href="locals_and_globals.html" title="8.5. locals 和 globals">5</a> <span class="divider">|</span> <a href="dictionary_based_string_formatting.html" title="8.6. 基于 dictionary 的字符串格式化">6</a> <span class="divider">|</span> <a href="quoting_attribute_values.html" title="8.7. 给属性值加引号">7</a> <span class="divider">|</span> <a href="dialect.html" title="8.8. dialect.py 介绍">8</a> <span class="divider">|</span> <a href="all_together.html" title="8.9. 全部放在一起">9</a> <span class="divider">|</span> <a href="summary.html" title="8.10. 小结">10</a> <span class="divider">|</span>
</td>
<td width="35%" align="right"><br><a class="NavigationArrow" href="basehtmlprocessor.html">BaseHTMLProcessor.py 介绍 >></a></td>
</tr>
<tr>
<td colspan="3"><br></td>
</tr>
</table>
<div class="Footer">
<p class="copyright">Copyright © 2000, 2001, 2002, 2003, 2004 <a href="mailto:mark@diveintopython.org">Mark Pilgrim</a></p>
<p class="copyright">Copyright © 2001, 2002, 2003, 2004, 2005, 2006, 2007 <a href="mailto:python-cn@googlegroups.com">CPyUG (邮件列表)</a></p>
</div>
</body>
</html>
|