Scraping Stackoverflow with Scrapy and MongoDB

Scrapy is a very popular Python web-crawling framework. This article uses Scrapy to scrape questions from Stack Overflow, sorted so that the most frequently asked come first, and stores the results in MongoDB.

Environment setup

Install Scrapy and PyMongo (here inside a fresh virtualenv):

$ mkvirtualenv scrapy
$ pip install scrapy pymongo
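
The pipeline defined later writes to a local MongoDB instance, so make sure mongod is running. A minimal PyMongo sketch to verify the connection (the host and port mirror the settings used below and are assumptions):

import pymongo

# Connect to the assumed local MongoDB server.
client = pymongo.MongoClient('localhost', 27017)
# 'ping' is a cheap server command; it fails if the server cannot be reached.
client.admin.command('ping')
print('MongoDB is reachable')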

Create a project

$ scrapy startproject stack

A Scrapy project's directory tree typically looks like this:

├── scrapy.cfg
└── stack
    ├── __init__.py
    ├── items.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        └── __init__.py

Define the data to scrape

The items.py file defines the storage "containers" for the data to be scraped, which behave much like Python dicts. This article scrapes four attributes of each question:
title, url, tags and status. Modify items.py accordingly:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

from scrapy.item import Item, Field


class StackItem(Item):
    title = Field()
    url = Field()
    tags = Field()
    status = Field()

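Just to illustrate the dict-like behaviour, a StackItem can be populated and read with plain dict syntax (a quick sketch, not part of the project files):

from stack.items import StackItem

# Fields are assigned and read like ordinary dict keys.
item = StackItem()
item['title'] = 'How do I undo the most recent local commits?'
item['tags'] = ['git', 'version-control']
print(item)
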
Generate the spider

Generate a spider from Scrapy's built-in crawl template:

$ scrapy genspider stack_crawler stackoverflow.com -t crawl

The project's directory tree now looks like this:

├── scrapy.cfg
└── stack
    ├── __init__.py
    ├── items.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        ├── __init__.py
        └── stack_crawler.py

Modify stack_crawler.py and define the parse_item method:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class StackCrawlerSpider(CrawlSpider):
    name = 'stack_crawler'
    allowed_domains = ['stackoverflow.com']
    start_urls = [
        'http://stackoverflow.com/questions?sort=frequent'
    ]

    rules = (
        # Follow the pagination links and hand each listing page to parse_page.
        Rule(LinkExtractor(allow=r'questions\?page=[0-9]&sort=frequent'),
             callback='parse_page', follow=True),
    )

    def parse_start_url(self, response):
        # The first listing page is not matched by the rule above, so route it
        # through the same listing-page parser. Do not override parse() in a
        # CrawlSpider, otherwise the rules stop working.
        return self.parse_page(response)

    def parse_page(self, response):
        # Follow the link of every question summary on a listing page.
        for href in response.xpath('//div[@class="question-summary"]'):
            url = response.urljoin(href.xpath('div/h3/a/@href').extract()[0])
            yield scrapy.Request(url, callback=self.parse_item)

    def parse_item(self, response):
        yield {
            'title': response.xpath('//h1/a/text()').extract()[0],
            'url': response.url,
            'tags': response.xpath('//a[@class="post-tag"]/text()').extract(),
            'status': {
                'votes': response.xpath(
                    '//div[@class="vote"]/span/text()').extract()[0],
                'favorite_count': response.xpath(
                    '//div[@class="favoritecount"]/b/text()').extract()[0],
                'answers': response.xpath(
                    '//span[@itemprop="answerCount"]/text()').extract()[0],
                'views': response.xpath(
                    '//td/p[@class="label-key"]/b/text()').extract()[1][:-6],
            },
        }

The meaning of the attributes of the StackCrawlerSpider class is easy to see from their names:

  • name defines the spider's name.
  • allowed_domains is a list of the domains the spider is allowed to visit.
  • start_urls lists the URLs the spider starts crawling from.
  • rules tells the spider which further URLs to follow; here it paginates through roughly the first ten listing pages (see the sketch after this list).
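
The allow argument of LinkExtractor is an ordinary regular expression matched against each candidate URL. A quick standalone check of which pagination URLs the pattern accepts (the example URLs below are made up for illustration):

import re

# Same pattern as in the Rule above; note it only allows a single-digit page number.
pattern = re.compile(r'questions\?page=[0-9]&sort=frequent')

urls = [
    'http://stackoverflow.com/questions?page=2&sort=frequent',   # matches
    'http://stackoverflow.com/questions?page=10&sort=frequent',  # does not match
    'http://stackoverflow.com/questions?sort=votes',             # does not match
]
for url in urls:
    print(url, bool(pattern.search(url)))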

For the XPath expressions used in parse_item, see the Scrapy documentation and the XPath documentation.
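
To experiment with these expressions outside a crawl, Scrapy's Selector can be run against raw HTML. A minimal sketch using a made-up fragment shaped like the old Stack Overflow question markup that parse_item assumes:

from scrapy.selector import Selector

# Hypothetical HTML fragment mimicking the old question-page markup.
html = '''
<div>
  <h1><a href="/questions/1">How to undo commits?</a></h1>
  <a class="post-tag">git</a>
  <a class="post-tag">version-control</a>
</div>
'''

sel = Selector(text=html)
print(sel.xpath('//h1/a/text()').extract()[0])               # How to undo commits?
print(sel.xpath('//a[@class="post-tag"]/text()').extract())  # ['git', 'version-control']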

Storage settings

Scrapy uses a pipeline mechanism to post-process scraped items, for example to persist them.
The pipeline and the database options are configured in settings.py:

ITEM_PIPELINES = {'stack.pipelines.MongoDBPipeline': 300}

MONGODB_SERVER = "localhost"
MONGODB_PORT = 27017
MONGODB_DB = "stackoverflow"
MONGODB_COLLECTION = "questions"

Pipeline setup

Scrapy connects to the database through the pipeline, defined in pipelines.py:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

import logging

import pymongo
from scrapy.conf import settings
from scrapy.exceptions import DropItem


class MongoDBPipeline(object):

    def __init__(self):
        client = pymongo.MongoClient(
            settings['MONGODB_SERVER'],
            settings['MONGODB_PORT']
        )
        db = client.get_database(settings['MONGODB_DB'])
        self.collection = db.get_collection(settings['MONGODB_COLLECTION'])

    def process_item(self, item, spider):
        # Drop the item if any of its fields is missing or empty.
        for key in item:
            if not item[key]:
                raise DropItem('Missing {0}'.format(key))
        # Upsert keyed on the question URL so re-crawls do not create duplicates.
        self.collection.update({'url': item['url']},
                               dict(item), upsert=True)
        logging.log(logging.INFO, 'Question added to MongoDB database!')
        return item

Run the crawl

Start the spider to begin scraping:

$ scrapy crawl stack_crawler
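
Once the crawl finishes, the stored questions can be inspected directly with PyMongo. A minimal sketch (database and collection names follow the settings above):

import pymongo

# Connect with the same values as the MONGODB_* settings.
client = pymongo.MongoClient('localhost', 27017)
collection = client['stackoverflow']['questions']

# Print the first few stored questions.
for doc in collection.find().limit(3):
    print(doc['title'])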

The complete code is available on GitHub.