最新文章专题视频专题问答1问答10问答100问答1000问答2000关键字专题1关键字专题50关键字专题500关键字专题1500TAG最新视频文章推荐1 推荐3 推荐5 推荐7 推荐9 推荐11 推荐13 推荐15 推荐17 推荐19 推荐21 推荐23 推荐25 推荐27 推荐29 推荐31 推荐33 推荐35 推荐37视频文章20视频文章30视频文章40视频文章50视频文章60 视频文章70视频文章80视频文章90视频文章100视频文章120视频文章140 视频2关键字专题关键字专题tag2tag3文章专题文章专题2文章索引1文章索引2文章索引3文章索引4文章索引5123456789101112131415文章专题3
当前位置: 首页 - 科技 - 知识百科 - 正文

讲解Python的Scrapy爬虫框架使用代理进行采集的方法

来源:动视网 责编:小采 时间:2020-11-27 14:35:22
文档

讲解Python的Scrapy爬虫框架使用代理进行采集的方法

讲解Python的Scrapy爬虫框架使用代理进行采集的方法:1.在Scrapy工程下新建middlewares.py# Importing base library because we'll need it ONLY in case if the proxy we are going to use requires authentication import base # Start y
推荐度:
导读讲解Python的Scrapy爬虫框架使用代理进行采集的方法:1.在Scrapy工程下新建middlewares.py# Importing base library because we'll need it ONLY in case if the proxy we are going to use requires authentication import base # Start y


1.在Scrapy工程下新建“middlewares.py”

# Importing base library because we'll need it ONLY in case if the proxy we are going to use requires authentication
import base

# Start your middleware class
class ProxyMiddleware(object):
 # overwrite process request
 def process_request(self, request, spider):
 # Set the location of the proxy
 request.meta['proxy'] = "http://YOUR_PROXY_IP:PORT"

 # Use the following lines if your proxy requires authentication
 proxy_user_pass = "USERNAME:PASSWORD"
 # setup basic authentication for the proxy
 encoded_user_pass = base.encodestring(proxy_user_pass)
 request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass

2.在项目配置文件里(./project_name/settings.py)添加

DOWNLOADER_MIDDLEWARES = {
 'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 110,
 'project_name.middlewares.ProxyMiddleware': 100,
}

只要两步,现在请求就是通过代理的了。测试一下^_^

from scrapy.spider import BaseSpider
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.http import Request

class TestSpider(CrawlSpider):
 name = "test"
 domain_name = "whatismyip.com"
 # The following url is subject to change, you can get the last updated one from here :
 # http://www.whatismyip.com/faq/automation.asp
 start_urls = ["http://xujian.info"]

 def parse(self, response):
 open('test.html', 'wb').write(response.body)

3.使用随机user-agent

默认情况下scrapy采集时只能使用一种user-agent,这样容易被网站屏蔽,下面的代码可以从预先定义的user- agent的列表中随机选择一个来采集不同的页面

在settings.py中添加以下代码

DOWNLOADER_MIDDLEWARES = {
 'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware' : None,
 'Crawler.comm.rotate_useragent.RotateUserAgentMiddleware' :400
 }

注意: Crawler; 是你项目的名字 ,通过它是一个目录的名称 下面是蜘蛛的代码

#!/usr/bin/python
#-*-coding:utf-8-*-

import random
from scrapy.contrib.downloadermiddleware.useragent import UserAgentMiddleware

class RotateUserAgentMiddleware(UserAgentMiddleware):
 def __init__(self, user_agent=''):
 self.user_agent = user_agent

 def process_request(self, request, spider):
 #这句话用于随机选择user-agent
 ua = random.choice(self.user_agent_list)
 if ua:
 request.headers.setdefault('User-Agent', ua)

 #the default user_agent_list composes chrome,I E,firefox,Mozilla,opera,netscape
 #for more user agent strings,you can find it in http://www.useragentstring.com/pages/useragentstring.php
 user_agent_list = [
 "Mozilla/5.0 (Windows NT 6.1; WOW) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"
 "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
 "Mozilla/5.0 (Windows NT 6.1; WOW) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
 "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
 "Mozilla/5.0 (Windows NT 6.2; WOW) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
 "Mozilla/5.0 (X11; Linux x86_) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
 "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
 "Mozilla/5.0 (Windows NT 6.1; WOW) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
 "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
 "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
 "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
 "Mozilla/5.0 (Windows NT 6.1; WOW) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
 "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
 "Mozilla/5.0 (Windows NT 6.1; WOW) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
 "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
 "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
 "Mozilla/5.0 (X11; Linux x86_) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
 "Mozilla/5.0 (Windows NT 6.2; WOW) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
 ]

文档

讲解Python的Scrapy爬虫框架使用代理进行采集的方法

讲解Python的Scrapy爬虫框架使用代理进行采集的方法:1.在Scrapy工程下新建middlewares.py# Importing base library because we'll need it ONLY in case if the proxy we are going to use requires authentication import base # Start y
推荐度:
标签: 代理 python 爬虫
  • 热门焦点

最新推荐

猜你喜欢

热门推荐

专题
Top