This article gives a detailed walkthrough of how to use Selenium inside Scrapy to crawl web pages. It is shared here as a reference; after reading it you should have a solid understanding of the relevant internals.
1. Background
2. Environment
3. Principle Analysis
3.1. Analyzing the Flow of a Request
First, take a look at Scrapy's current architecture diagram (not reproduced here). The part of the flow that matters for us:
First: the spider engine generates Requests and sends them to the Scheduler module, where they enter the waiting queue until they are scheduled.
Second: the Scheduler dequeues these Requests and sends them back to the engine.
Third: the engine passes the Requests through the downloader middlewares (there can be several, e.g. one that adds headers, one that sets a proxy, custom ones, and so on) for processing.
Fourth: once the middlewares are done, the Requests are handed to the Downloader module for the actual download.

Looking at this flow, the place to intervene is the downloader middleware stage: we can have Selenium handle the Request there directly, so it never reaches the Downloader, as sketched below.
3.2. Source Analysis of the Request and Response Middleware Handling
The relevant code lives in scrapy/core/downloader/middleware.py (the full path appears in the comment at the top of the listing below).

Annotated source:
# File: E:\Miniconda\Lib\site-packages\scrapy\core\downloader\middleware.py
"""
Downloader Middleware manager

See documentation in docs/topics/downloader-middleware.rst
"""
import six

from twisted.internet import defer

from scrapy.http import Request, Response
from scrapy.middleware import MiddlewareManager
from scrapy.utils.defer import mustbe_deferred
from scrapy.utils.conf import build_component_list


class DownloaderMiddlewareManager(MiddlewareManager):

    component_name = 'downloader middleware'

    @classmethod
    def _get_mwlist_from_settings(cls, settings):
        # Collect the custom middlewares configured in settings.py or in
        # custom_settings, e.g.:
        '''
        'DOWNLOADER_MIDDLEWARES': {
            'mySpider.middlewares.ProxiesMiddleware': 400,
            # SeleniumMiddleware
            'mySpider.middlewares.SeleniumMiddleware': 543,
            'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
        },
        '''
        return build_component_list(
            settings.getwithbase('DOWNLOADER_MIDDLEWARES'))

    # Register every custom middleware's handler methods in the
    # corresponding methods lists
    def _add_middleware(self, mw):
        if hasattr(mw, 'process_request'):
            self.methods['process_request'].append(mw.process_request)
        if hasattr(mw, 'process_response'):
            self.methods['process_response'].insert(0, mw.process_response)
        if hasattr(mw, 'process_exception'):
            self.methods['process_exception'].insert(0, mw.process_exception)

    # The whole download flow
    def download(self, download_func, request, spider):
        @defer.inlineCallbacks
        def process_request(request):
            # Run the request through each middleware's process_request method,
            # in the order they were appended to the list above
            for method in self.methods['process_request']:
                response = yield method(request=request, spider=spider)
                assert response is None or isinstance(response, (Response, Request)), \
                        'Middleware %s.process_request must return None, Response or Request, got %s' % \
                        (six.get_method_self(method).__class__.__name__, response.__class__.__name__)
                # This is the key part:
                # if some middleware's process_request produces a Response
                # object, that Response is returned right away, the loop is
                # exited, and the remaining process_request methods never run.
                # Our earlier header and proxy middlewares only set a
                # user-agent or a proxy and return nothing.
                # Also note: what is returned must be a Response object; the
                # HtmlResponse we construct later is exactly a Response subclass.
                if response:
                    defer.returnValue(response)
            # If none of the process_request methods returned a Response,
            # the processed Request is finally handed to download_func for
            # the actual download, which yields a Response object.
            # That Response then passes through each middleware's
            # process_response method in turn, as below.
            defer.returnValue((yield download_func(request=request, spider=spider)))

        @defer.inlineCallbacks
        def process_response(response):
            assert response is not None, 'Received None in process_response'
            if isinstance(response, Request):
                defer.returnValue(response)

            for method in self.methods['process_response']:
                response = yield method(request=request, response=response,
                                        spider=spider)
                assert isinstance(response, (Response, Request)), \
                    'Middleware %s.process_response must return Response or Request, got %s' % \
                    (six.get_method_self(method).__class__.__name__, type(response))
                if isinstance(response, Request):
                    defer.returnValue(response)
            defer.returnValue(response)

        @defer.inlineCallbacks
        def process_exception(_failure):
            exception = _failure.value
            for method in self.methods['process_exception']:
                response = yield method(request=request, exception=exception,
                                        spider=spider)
                assert response is None or isinstance(response, (Response, Request)), \
                    'Middleware %s.process_exception must return None, Response or Request, got %s' % \
                    (six.get_method_self(method).__class__.__name__, type(response))
                if response:
                    defer.returnValue(response)
            defer.returnValue(_failure)

        deferred = mustbe_deferred(process_request, request)
        deferred.addErrback(process_exception)
        deferred.addCallback(process_response)
        return deferred
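To connect this back to the DOWNLOADER_MIDDLEWARES example quoted in the source above, here is a sketch of how a spider might enable the Selenium middleware for itself via custom_settings. The project path mySpider.middlewares.SeleniumMiddleware comes from that comment; the spider name, target URL, and CSS selectors are illustrative placeholders:

# Sketch: enabling the middleware per spider via custom_settings
import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['http://quotes.toscrape.com/js/']  # a JavaScript-rendered demo site

    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {
            # priority 543 runs it after header/proxy middlewares
            # (lower numbers run first)
            'mySpider.middlewares.SeleniumMiddleware': 543,
        },
    }

    def parse(self, response):
        # response is the HtmlResponse built from Selenium's page_source,
        # so JavaScript-rendered content is already in the body
        for text in response.css('div.quote span.text::text').extract():
            yield {'text': text}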