web抓取python模块_如何用python抓取网页上的数据

1. 如何用python抓取这个网页的内容

Python实现常规的静态网页抓取时，往往是用urllib2来获取整个HTML页面，然后从HTML文件中逐字查找对应的关键字。如下所示：
复制代码代码如下:

import urllib2
url="网址"
up=urllib2.urlopen(url)#打开目标页面，存入变量up
cont=up.read()#从up中读入该HTML文件
key1='<a href="http'#设置关键字1
key2="target"#设置关键字2
pa=cont.find(key1)#找出关键字1的位置
pt=cont.find(key2,pa)#找出关键字2的位置(从字1后面开始查找)
urlx=cont[pa:pt]#得到关键字1与关键字2之间的内容(即想要的数据)
print urlx

2. 如何用python抓取网页上的数据

使用内置的包来抓取，就是在模仿浏览器访问页面，再把页面的数据给解析出来，也可以看做是一次请求。

3. 求python抓网页的代码

python3.x中使用urllib.request模块来抓取网页代码，通过urllib.request.urlopen函数取网页内容，获取的为数据流，通过read()函数把数字读取出来，再把读取的二进制数据通过decode函数解码（编号可以通过查看网页源代码中<meta http-equiv="content-type" content="text/html;charset=gbk" />得知，如下例中为gbk编码。），这样就得到了网页的源代码。

如下例所示，抓取本页代码：

importurllib.request

html=urllib.request.urlopen('
).read().decode('gbk')#注意抓取后要按网页编码进行解码
print(html)

以下为urllib.request.urlopen函数说明：

urllib.request.urlopen(url,
data=None, [timeout, ]*, cafile=None, capath=None,
cadefault=False, context=None)

Open the URL url, which can be either a string or a Request object.

data must be a bytes object specifying additional data to be sent to
the server, or None
if no such data is needed. data may also be an iterable object and in
that case Content-Length value must be specified in the headers. Currently HTTP
requests are the only ones that use data; the HTTP request will be a
POST instead of a GET when the data parameter is provided.

data should be a buffer in the standard application/x-www-form-urlencoded format. The urllib.parse.urlencode() function takes a mapping or
sequence of 2-tuples and returns a string in this format. It should be encoded
to bytes before being used as the data parameter. The charset parameter
in Content-Type
header may be used to specify the encoding. If charset parameter is not sent
with the Content-Type header, the server following the HTTP 1.1 recommendation
may assume that the data is encoded in ISO-8859-1 encoding. It is advisable to
use charset parameter with encoding used in Content-Type header with the Request.

urllib.request mole uses HTTP/1.1 and includes Connection:close header
in its HTTP requests.

The optional timeout parameter specifies a timeout in seconds for
blocking operations like the connection attempt (if not specified, the global
default timeout setting will be used). This actually only works for HTTP, HTTPS
and FTP connections.

If context is specified, it must be a ssl.SSLContext instance describing the various SSL
options. See HTTPSConnection for more details.

The optional cafile and capath parameters specify a set of
trusted CA certificates for HTTPS requests. cafile should point to a
single file containing a bundle of CA certificates, whereas capath
should point to a directory of hashed certificate files. More information can be
found in ssl.SSLContext.load_verify_locations().

The cadefault parameter is ignored.

For http and https urls, this function returns a http.client.HTTPResponse object which has the
following HTTPResponse
Objects methods.

For ftp, file, and data urls and requests explicitly handled by legacy URLopener and FancyURLopener classes, this function returns a
urllib.response.addinfourl object which can work as context manager and has methods such as

geturl() — return the URL of the resource retrieved,
commonly used to determine if a redirect was followed

info() — return the meta-information of the page, such
as headers, in the form of an email.message_from_string() instance (see Quick
Reference to HTTP Headers)

getcode() – return the HTTP status code of the response.

Raises URLError on errors.

Note that None
may be returned if no handler handles the request (though the default installed
global OpenerDirector uses UnknownHandler to ensure this never happens).

In addition, if proxy settings are detected (for example, when a *_proxy environment
variable like http_proxy is set), ProxyHandler is default installed and makes sure the
requests are handled through the proxy.

The legacy urllib.urlopen function from Python 2.6 and earlier has
been discontinued; urllib.request.urlopen() corresponds to the old
urllib2.urlopen.
Proxy handling, which was done by passing a dictionary parameter to urllib.urlopen, can be
obtained by using ProxyHandler objects.

Changed in version 3.2: cafile
and capath were added.

Changed in version 3.2: HTTPS virtual
hosts are now supported if possible (that is, if ssl.HAS_SNI is true).

New in version 3.2: data can be
an iterable object.

Changed in version 3.3: cadefault
was added.

Changed in version 3.4.3: context
was added.

4. 如何用 Python 实现 Web 抓取

Web 抓取的定义
Web 抓取是抽取网络数据的过程。只要借助合适的工具，任何你能看到的数据都可以进行抽取。在本文中，我们将重点介绍自动化抽取过程的程序，帮助你在较短时间内收集大量数据。除了笔者前文提到的用例，抓取技术的用途还包括：SEO 追踪、工作追踪、新闻分析以及笔者的最爱——社交媒体的情感分析！
一点提醒
在开启 Web 抓取的探险之前，请确保自己了解相关的法律问题。许多网站在其服务条款中明确禁止对其内容进行抓取。例如，Medium 网站就写道：“遵照网站 robots.txt 文件中的规定进行的爬取操作(Crawling)是可接受的，但是我们禁止抓取(Scraping)操作。”对不允许抓取的网站进行抓取可能会使你进入他们的黑名单！与任何工具一样，Web 抓取也可能用于复制网站内容之类的不良目的。此外，由 Web 抓取引起的法律诉讼也不在少数。
设置代码
在充分了解小心行事的必要之后，让我们开始学习 Web 抓取。其实，Web 抓取可以通过任何编程语言实现，在不久之前，我们使用 Node 实现过。在本文中，考虑到其简洁性与丰富的包支持，我们将使用 Python 实现抓取程序。
Web 抓取的基本过程
当你打开网络中的某个站点时，就会下载其 HTML 代码，由你的 web 浏览器对其进行分析与展示。该 HTML 代码包含了你所看到的所有信息。因此，通过分析 HTML 代码就能得到所需信息（比如价格）。你可以使用正则表达式在数据海洋中搜索你需要的信息，也可以使用函数库来解释 HTML，同样也能得到需要数据。
在 Python 中，我们将使用一个名为靓汤（Beautiful Soup）的模块对 HTML 数据进行分析。你可以借助 pip 之类的安装程序安装之，运行如下代码即可：
pip install beautifulsoup4

或者，你也可以根据源码进行构建。在该模块的文档说明页，可以看到详细的安装步骤。
安装完成之后，我们大致会遵循以下步骤实现 web 抓取：
向 URL 发送请求
接收响应
分析响应以寻找所需数据
作为演示，我们将使用笔者的博客 http://dada.theblogbowl.in/. 作为目标 URL。
前两个步骤相对简单，可以这样完成：
from urllib import urlopen#Sending the http requestwebpage = urlopen('http://my_website.com/').read()

接下来，将响应传给之前安装的模块：
from bs4 import BeautifulSoup#making the soup! yummy ;)soup = BeautifulSoup(webpage, "html5lib")

请注意，此处我们选择了 html5lib 作为解析器。根据 BeautifulSoup 的文档，你也可以为其选择不同的解析器。
解析 HTML
在将 HTML 传给 BeautifulSoup 之后，我们可以尝试一些指令。譬如，检查 HTML 标记代码是否正确，可以验证该页面的标题（在 Python 解释器中）：
>>> soup.title<title>Transcendental Tech Talk</title>>>> soup.title.text
u'Transcendental Tech Talk'
>>>

接下来，开始抽取页面中的特定元素。譬如，我想抽取博客中文章标题的列表。为此，我需要分析 HTML 的结构，这一点可以借助 Chrome 检查器完成。其他浏览器也提供了类似的工具。

使用 Chrome 检查器检查某个页面的 HTML 结构
如你所见，所有文章标题都带有 h3 标签与两个类属性：post-title 与 entry-title 类。因此，用 post-title类搜索所有 h3 元素就能得到该页的文章标题列表。在此例中，我们使用 BeautifulSoup 提供的 find_all 函数，并通过 class_ 参数确定所需的类：
>>> titles = soup.find_all('h3', class_ = 'post-title') #Getting all titles>>> titles[0].textu'\nKolkata #BergerXP IndiBlogger meet, Marketing Insights, and some Blogging Tips\n'>>>

只通过 post-title 类进行条目搜索应该可以得到相同的结果：
>>> titles = soup.find_all(class_ = 'post-title') #Getting all items with class post-title>>> titles[0].textu'\nKolkata #BergerXP
IndiBlogger meet, Marketing Insights, and some Blogging Tips\n'>>>

5. 如何用Python爬虫抓取网页内容

爬虫流程
其实把网络爬虫抽象开来看，它无外乎包含如下几个步骤
模拟请求网页。模拟浏览器，打开目标网站。
获取数据。打开网站之后，就可以自动化的获取我们所需要的网站数据。
保存数据。拿到数据之后，需要持久化到本地文件或者数据库等存储设备中。
那么我们该如何使用 Python 来编写自己的爬虫程序呢，在这里我要重点介绍一个 Python 库：Requests。
Requests 使用
Requests 库是 Python 中发起 HTTP 请求的库，使用非常方便简单。
模拟发送 HTTP 请求
发送 GET 请求
当我们用浏览器打开豆瓣首页时，其实发送的最原始的请求就是 GET 请求
import requests
res = requests.get('http://www.douban.com')
print(res)
print(type(res))
>>>
<Response [200]>
<class 'requests.models.Response'>

6. 如何用 Python 实现 Web 抓取

给你一个例子以下代码调试通过：

#coding=utf-8
importurllib
defgetHtml(url):
page=urllib.urlopen(url)
html=page.read()
returnhtml
html=getHtml("https://.com/")
printhtml

运行效果：

7. Python爬虫是什么

网络爬虫（又被称为网页蜘蛛，网络机器人，在FOAF社区中间，更经常的称为网页追逐者），是一种按照一定的规则，自动地抓取万维网信息的程序或者脚本。另外一些不常使用的名字还有蚂蚁、自动索引、模拟程序或者蠕虫。
其实通俗的讲就是通过程序去获取web页面上自己想要的数据，也就是自动抓取数据。
爬虫可以做什么？
你可以用爬虫爬图片，爬取视频等等你想要爬取的数据，只要你能通过浏览器访问的数据都可以通过爬虫获取。
爬虫的本质是什么？
模拟浏览器打开网页，获取网页中我们想要的那部分数据
浏览器打开网页的过程：
当你在浏览器中输入地址后，经过DNS服务器找到服务器主机，向服务器发送一个请求，服务器经过解析后发送给用户浏览器结果，包括html,js,css等文件内容，浏览器解析出来最后呈现给用户在浏览器上看到的结果。
所以用户看到的浏览器的结果就是由HTML代码构成的，我们爬虫就是为了获取这些内容，通过分析和过滤html代码，从中获取我们想要资源。

8. 如何用python抓取网页特定内容

最简单可以用urllib，python2.x和python3.x的用法不同，以python2.x为例：

importurllib
html=urllib.open(url)
text=html.read()

复杂些可以用requests库，支持各种请求类型，支持cookies，header等

再复杂些的可以用selenium，支持抓取javascript产生的文本

我设计了简单的爬虫闯关网站 www.heibanke.com/lesson/crawler_ex00/

新手如果能自己把三关闯过，相信一定会有所收获。

题解在课程里有提到http://study.163.com/course/courseMain.htm?courseId=1000035

9. 如何用 Python 实现 Web 抓取

#!/usr/bin/envpython
#-*-coding:utf-8-*-
#bycarlin.wang
#请参考


importurllib
importurllib2
importtime
importos
importrandom
frombs4importBeautifulSoup
defget_Html(url):
headers={"User-Agent":"Mozilla/5.0(WindowsNT6.1;WOW64;rv:41.0)Gecko/20100101Firefox/41.0"}
req=urllib2.Request(url,headers=headers)
res=urllib2.urlopen(req)
res_html=res.read().decode('UTF-8')
returnres_html

defurlPages(page):
url="http://jandan.net/ooxx/page-"+str(page)+'#comments'
returnurl

deffind_img_url(html):
soup=BeautifulSoup(html,'html.parser',from_encoding='utf-8')
img_urls=soup.find_all(class_='view_img_link')
returnimg_urlsdefdownload_img(url):
fdir="D:/data/jiandan"
ifnotos.path.exists(fdir):
os.makedirs(fdir)
try:
#(p2)=os.path.split(url)
#(p2,f2)=os.path.split(url)
f2=''.join(map(lambdaxx:(hex(ord(xx))[2:]),os.urandom(16)))#随机字符串作为文件名字，防止名字重复
#ifos.path.exists(fdir+"/"+f2):
#print"fdirisexists"
ifurl:
imgtype=url.split('/')[4].split('.')[1]
filename,msg=urllib.urlretrieve(url,fdir+"/"+f2+'.'+imgtype)
ifos.path.getsize(filename)<100:
os.remove(filename)
exceptException,e:
return"downimageerror!"


defrun():
forpageinrange(2001,2007):
html=get_Html(urlPages(page))
urls=find_img_url(html)
forurlinurls:
s=url.get('href')
prints
download_img(s)

if__name__=='__main__':
run()

web抓取python模块

与web抓取python模块相关的内容