Using Scrapy with Proxies

When scraping content from the web, many sites are not crawler-friendly and use anti-scraping measures such as restricting a visitor's IP. In those cases, using a proxy is a very good option, and Scrapy provides HttpProxyMiddleware to support proxies.

The proxy IPs used in this article were scraped from here and saved, one per line in the format xxx.xxx.xxx.xxx:port, to scrapy_project/proxy.txt. Scrapy then picks one of them at random to use as the proxy.
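
For reference, proxy.txt is just a plain-text list with one host:port entry per line; the addresses below are placeholders, not real proxies:

1.2.3.4:8080
5.6.7.8:3128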

1. To make Scrapy send its requests through a proxy, first add the relevant settings to settings.py:

# Path to the proxy list file described above
PROXY_LIST = 'scrapy_project/proxy.txt'

# Retry when proxies fail
RETRY_TIMES = 3

# Retry on most error codes since proxies fail for different reasons
RETRY_HTTP_CODES = [500, 503, 504, 400, 403, 404, 408]

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 80,
    'scrapy_project.middlewares.ProxyMiddleware': 90,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 100,
}

2. Define a ProxyMiddleware class in middlewares.py that adds a proxy IP to request.meta:

# -*- coding: utf-8 -*-
"""
A simple proxy middleware for Scrapy.

Each proxy is used in the form `http://host:port` or `http://username:password@host:port`.
"""


import random


class ProxyMiddleware(object):
    """Custom ProxyMiddleware."""

    def __init__(self, settings):
        # Load the proxy list once, stripping whitespace from each line
        self.proxy_list = settings.get('PROXY_LIST')
        with open(self.proxy_list) as f:
            self.proxies = [ip.strip() for ip in f]

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this to construct the middleware; pass the settings through
        return cls(crawler.settings)

    def process_request(self, request, spider):
        # Pick a random proxy for every outgoing request
        request.meta['proxy'] = 'http://{}'.format(random.choice(self.proxies))
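
To check that requests really go through the proxy, a quick sanity test is to fetch a service that echoes the caller's IP, such as httpbin.org/ip. The spider below is a minimal sketch for that purpose; the spider name and URL are illustrative and not part of the project above.

import scrapy


class ProxyCheckSpider(scrapy.Spider):
    """Minimal spider to verify the proxy: the response should show the proxy's IP."""
    name = 'proxy_check'
    start_urls = ['https://httpbin.org/ip']

    def parse(self, response):
        # httpbin returns a JSON body such as {"origin": "x.x.x.x"}
        self.logger.info('Origin IP seen by the server: %s', response.text)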

Alternatively, the proxy IP can simply be set through an environment variable:

export http_proxy=http://ip:port
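
HttpProxyMiddleware picks up the standard *_proxy environment variables (it relies on urllib's proxy detection), so HTTPS traffic can be covered the same way; the address below is a placeholder:

export https_proxy=http://ip:port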