A First Look at Meteor

Posted on 2015-07-13 | Category: Meteor

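I recently came across Meteor, an interesting and powerful open-source full-stack platform built on Node.js and MongoDB for building real-time web and mobile applications. Meteor sits between the application's database and its user interface and keeps the two in sync. It uses JavaScript on both the client and the server, so code can be shared between the two. Meteor also has an active community and a rich set of third-party packages.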

Installing Meteor is very simple and takes a single command (the installer ships its own copy of Node.js, so Node.js does not have to be installed first):

$ curl https://install.meteor.com/ | sh

Once it is installed, you can start building applications. Meteor ships with a few example apps to try out and learn from:

$ meteor create --list
Available examples:
  clock
  leaderboard
  localmarket
  todos

Create a project from an example with 'meteor create --example <name>'.

Of these, todos is a to-do list application that showcases Meteor's main features. Create it as the hint above suggests, then run it:

$ meteor create --example todos
$ cd todos
$ meteor
[[[[[ ~/program/meteor/todos ]]]]]

=> Started proxy.
=> Started MongoDB.
=> Started your app.

=> App running at: http://localhost:3000/

Open http://localhost:3000/ in a browser:

[Screenshot: the Meteor todos app]

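Meteor's design follows seven principles:

Data on the Wire. Meteor does not send HTML over the network. The server sends data, templates are compiled into JavaScript and shipped along with the CSS, and the client (a browser, for example) does the rendering.
One Language. Meteor uses JavaScript on both the server and the client.
Database Everywhere. The Meteor client has its own database, Minimongo, and the client reads and writes only this local Minimongo. Meteor uses a publish/subscribe (Pub/Sub) model to control which data is synchronized between MongoDB on the server and Minimongo on the client. By default (with the autopublish package that new projects include), every server-side Meteor collection is published. Meteor moves data between client and server using DDP (Distributed Data Protocol).
Latency Compensation. On the client, Meteor prefetches data and simulates the result so that it looks as if the server had responded immediately. Minimongo uses latency compensation to reflect database changes; in essence, latency compensation is a visual manifestation of the eventual-consistency concept from large-scale data management. When data is updated on the client through a Minimongo stub, the change is reflected on the client immediately, including the reactive re-rendering, and is also propagated to the server. The propagated change can fail, however, for many reasons, including access being denied. The Pub/Sub mechanism ensures that the client eventually (usually very quickly) reflects the actual state of the server. Latency compensation makes for a very responsive UI with no waiting, a hallmark of modern Web 2.0 applications; the price is an occasional brief visual inconsistency in the data.
Full Stack Reactivity. In Meteor everything from the database to the templates is real-time, and any update is synchronized automatically.
Embrace the Ecosystem. Meteor is an open-source platform that integrates many excellent open-source tools and frameworks; for example, the commands meteor add-platform android and meteor add-platform ios add Android and iOS, respectively, as build targets.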
Simplicity Equals Productivity.

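Unlike the traditional LAMP (or LEMP) stack, Meteor uses a Pub/Sub mechanism with the Database Everywhere approach: MongoDB on the server and Minimongo on the client. A WebSocket keeps a persistent connection open between server and client. The client performs its data operations only against the local Minimongo, which makes it very fast, while data is synchronized with the server asynchronously (Ajax-style) over the WebSocket.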
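The interplay of local cache, optimistic update, and reconciliation described above can be sketched as a toy model. This is plain Python, not Meteor's API: the Server and Client classes, the validation rule, and the todo data are hypothetical stand-ins, and the network round trip is collapsed into direct method calls.

```python
class Server:
    """Stands in for the authoritative, MongoDB-backed server."""

    def __init__(self):
        self.todos = {}

    def insert(self, todo_id, text):
        # Hypothetical validation rule: reject blank text.
        if not text.strip():
            return False
        self.todos[todo_id] = text
        return True

    def publish(self):
        # What a subscription would deliver: a snapshot of the server state.
        return dict(self.todos)


class Client:
    """Stands in for the browser and its local cache (Minimongo's role)."""

    def __init__(self, server):
        self.server = server
        self.cache = {}

    def insert(self, todo_id, text):
        # 1. Apply the change locally right away, so the UI could re-render.
        self.cache[todo_id] = text
        optimistic_view = dict(self.cache)
        # 2. Send the change to the server, which may accept or reject it.
        self.server.insert(todo_id, text)
        # 3. Reconcile: the cache ends up mirroring the server's state.
        self.cache = self.server.publish()
        return optimistic_view


server = Server()
client = Client(server)

seen = client.insert(1, "buy milk")   # accepted by the server
rejected = client.insert(2, "   ")    # rejected by the server
print(seen, rejected, client.cache)
```

The second insert is visible in the optimistic view but disappears after reconciliation, which is the brief visual inconsistency mentioned above.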

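Meteor is powerful: it combines current technologies into a whole that is still quick and convenient to use, and it is well worth learning. This post is only a brief introduction; there is much more to learn and try.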

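Main learning resources:

The official documentation, which is usually the most authoritative source;
Discover Meteor, a book by Tom Coleman and Sacha Greif, which is also highly recommended;
Meteor Forums;
Using Meteor, and contributing code to it on GitHub.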

What Is a Metaclass In Python

Posted on 2015-07-11 | Category: Python

Stack Overflow has an excellent explanation of metaclasses in Python:

http://stackoverflow.com/questions/100003/what-is-a-metaclass-in-python
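As a minimal illustration of the idea (a sketch in Python 3 syntax, with the hypothetical names UpperAttrMeta and Foo), a metaclass is the class of a class and can rewrite the class body before the class object is created:

```python
class UpperAttrMeta(type):
    """Metaclass that upper-cases every non-dunder attribute name."""

    def __new__(mcs, name, bases, namespace):
        renamed = {
            (key if key.startswith('__') else key.upper()): value
            for key, value in namespace.items()
        }
        return super().__new__(mcs, name, bases, renamed)


class Foo(metaclass=UpperAttrMeta):
    bar = 'bip'


# The class itself is an instance of the metaclass.
print(type(Foo) is UpperAttrMeta)
# The attribute was renamed while the class was being built.
print(hasattr(Foo, 'bar'), hasattr(Foo, 'BAR'))
```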

Matlab Automatic Broadcasting Operation Applied

Posted on 2015-07-03 | Category: Matlab

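While doing machine learning exercises in Matlab (GNU Octave, as the prompts below show), I subtracted two matrices of different dimensions. The result was correct, but the following warning appeared: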

warning: operator -: automatic broadcasting operation applied

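The root cause of the warning is that the two matrices have different dimensions.

However, we know that in Matlab an arithmetic operation between a matrix and a scalar is applied to every element of the matrix. Can this rule be generalized? For example, subtracting a $1 \times N$ vector b from an $M \times N$ matrix A amounts to subtracting b from every row of A; the dimensions agree row by row, so the computation is well defined and the result is what we expect. For example,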

octave:3> a
a =

1   3
1   3
1   3
1   3
1   3
octave:4> a - [1 2]
warning: operator -: automatic broadcasting operation applied
ans =

0   1
0   1
0   1
0   1
0   1

How can the warning be avoided? After some searching I found the same question on stackoverflow.com. The built-in functions bsxfun and repmat solve the problem:

octave:5> a - repmat([1 2], 5, 1)
ans =

0   1
0   1
0   1
0   1
0   1

octave:6> bsxfun(@minus, a, [1 2])
ans =

0   1
0   1
0   1
0   1
0   1

See the documentation for details on bsxfun and repmat; they are not repeated here.
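To make explicit what the two approaches compute, here is a small sketch in plain Python, with nested lists standing in for matrices. The helper names repmat_row, subtract, and broadcast_subtract are made up for this illustration and are not Matlab or Octave functions.

```python
def repmat_row(row, times):
    """Stack `times` copies of `row`, like repmat(row, times, 1)."""
    return [list(row) for _ in range(times)]


def subtract(a, b):
    """Element-wise a - b for two matrices of identical shape."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError('shape mismatch')
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def broadcast_subtract(a, row):
    """Subtract a 1 x N row from every row of an M x N matrix."""
    return subtract(a, repmat_row(row, len(a)))


a = [[1, 3] for _ in range(5)]   # the 5 x 2 matrix from the session above
result = broadcast_subtract(a, [1, 2])
print(result)
```

Expanding the row first and then subtracting matrices of equal shape is the repmat route; broadcasting gives the same result without materializing the copies.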

Using Scrapy with Proxies

Posted on 2015-07-01 | Category: Python

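When scraping web pages with a crawler, many sites are not very crawler-friendly and use anti-scraping measures such as limiting requests per visitor IP. Using proxies is a good option in that case, and Scrapy provides HttpProxyMiddleware to support them.

The proxy IPs used in this post were scraped from an online proxy list and stored, in xxx.xxx.xxx.xxx:port format, in scrapy_project/proxy.txt. Scrapy then picks one of them at random as the proxy.

1. To make Scrapy crawl through a proxy, first add the corresponding settings to settings.py: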

# Retry when proxies fail
RETRY_TIMES = 3

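# Retry on most error codes since proxies fail for different reasons
RETRY_HTTP_CODES = [500, 503, 504, 400, 403, 404, 408]

# Path of the proxy list file that ProxyMiddleware reads via settings.get('PROXY_LIST')
PROXY_LIST = 'scrapy_project/proxy.txt'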

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 80,
    'scrapy_project.middlewares.ProxyMiddleware': 90,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 100,
}

2. Define a ProxyMiddleware class in middlewares.py that adds the proxy to request.meta:

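# -*- coding: utf-8 -*-
"""
This is a simple proxy middleware for Scrapy.

The proxy host format is `http://host:port` or `http://username:password@host:port`.
"""

import random


class ProxyMiddleware(object):
    """Custom ProxyMiddleware."""
    def __init__(self, settings):
        self.proxy_list = settings.get('PROXY_LIST')
        with open(self.proxy_list) as f:
            # One proxy per line; skip blank lines
            self.proxies = [ip.strip() for ip in f if ip.strip()]

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this hook to build the middleware with the crawler settings
        return cls(crawler.settings)

    def process_request(self, request, spider):
        # Downloader middleware hook: pick a random proxy for each request
        request.meta['proxy'] = 'http://{}'.format(random.choice(self.proxies))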
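The loading and selection steps can be tried outside Scrapy. The sketch below is an illustration only: load_proxies is a hypothetical helper, the addresses are made-up values from the documentation range 192.0.2.0/24, and a plain dict plays the role of request.meta.

```python
import random


def load_proxies(lines):
    """Turn 'host:port' lines into proxy URLs, skipping blank lines."""
    proxies = []
    for line in lines:
        entry = line.strip()
        if not entry:
            continue
        # Keep an explicit scheme if present, otherwise assume plain HTTP.
        if '://' not in entry:
            entry = 'http://' + entry
        proxies.append(entry)
    return proxies


# Made-up sample content of a proxy list file.
sample = ['192.0.2.10:8080\n', '\n', 'http://user:secret@192.0.2.11:3128\n']
proxies = load_proxies(sample)

meta = {}                                # plays the role of request.meta
meta['proxy'] = random.choice(proxies)   # a different proxy may be picked each time
print(proxies, meta)
```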

Alternatively, a proxy can be set simply through an environment variable:

export http_proxy=http://ip:port

Scraping Stackoverflow with Scrapy and MongoDB

Posted on 2015-06-29 | Category: Python

Scrapy is a very popular Python web-crawling framework. This post uses Scrapy to scrape questions from Stack Overflow, most frequently asked first, and stores the results in MongoDB.

Environment setup

Scrapy and PyMongo need to be installed:

$ mkvirtualenv scrapy
$ pip install scrapy pymongo

Creating a project

$ scrapy startproject stack

A Scrapy project usually has the following directory structure:

├── scrapy.cfg
└── stack
    ├── __init__.py
    ├── items.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        └── __init__.py

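Specifying the data to scrape

items.py defines the storage "containers" for the scraped data, which behave much like a Python dict. This post scrapes four attributes of each question: title, url, tags and status. So edit items.py: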

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

from scrapy.item import Item, Field


class StackItem(Item):
    title = Field()
    url = Field()
    tags = Field()
    status = Field()

Generating the spider

Generate the spider from one of Scrapy's built-in templates:

$ scrapy genspider stack_crawler stackoverflow.com -t crawl

The project's directory structure is now:

├── scrapy.cfg
└── stack
    ├── __init__.py
    ├── items.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        ├── __init__.py
        └── stack_crawler.py

Edit stack_crawler.py and define the parse_item method:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class StackCrawlerSpider(CrawlSpider):
    name = 'stack_crawler'
    allowed_domains = ['stackoverflow.com']
    start_urls = [
        'http://stackoverflow.com/questions?sort=frequent'
    ]

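    rules = (
        # Follow the pagination links and collect question links on each page
        Rule(LinkExtractor(allow=r'questions\?page=[0-9]&sort=frequent'),
             callback='parse_page', follow=True),
    )

    def parse_start_url(self, response):
        # CrawlSpider uses parse() internally to apply the rules, so parse()
        # must not be overridden; the start page is handled through this hook
        return self.parse_page(response)

    def parse_page(self, response):
        for href in response.xpath('//div[@class="question-summary"]'):
            url = response.urljoin(href.xpath('div/h3/a/@href').extract()[0])
            yield scrapy.Request(url, callback=self.parse_item)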

    def parse_item(self, response):
        yield {
            'title': response.xpath('//h1/a/text()').extract()[0],
            'url': response.url,
            'tags': response.xpath('//a[@class="post-tag"]/text()').extract(),
            'status': {
                'votes': response.xpath(
                    '//div[@class="vote"]/span/text()').extract()[0],
                'favorite_count': response.xpath(
                    '//div[@class="favoritecount"]/b/text()').extract()[0],
                'answers': response.xpath(
                    '//span[@itemprop="answerCount"]/text()').extract()[0],
                'views': response.xpath(
                    '//td/p[@class="label-key"]/b/text()').extract()[1][:-6],
            },
        }

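The meaning of each attribute of StackCrawlerSpider is easy to guess from its name:

name is the name of the spider.
allowed_domains is a list of the domains the spider is allowed to visit.
start_urls lists the URLs where the spider starts crawling.
rules defines which further URLs the spider follows; here it pages through the question listing. Note that [0-9] matches a single digit, so only pages up to 9 are followed.

For the XPath syntax used in parse_item, see the Scrapy documentation and the XPath documentation.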
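Which pagination links the pattern in rules accepts can be checked with the standard re module. This is a sketch under the assumption that LinkExtractor applies its allow patterns as a regular-expression search over the absolute URL; the sample URLs are constructed for illustration.

```python
import re

# The same pattern as in the spider's rules.
PAGE_LINK = re.compile(r'questions\?page=[0-9]&sort=frequent')

urls = [
    'http://stackoverflow.com/questions?page=2&sort=frequent',
    'http://stackoverflow.com/questions?page=9&sort=frequent',
    'http://stackoverflow.com/questions?page=10&sort=frequent',
    'http://stackoverflow.com/questions?page=2&sort=votes',
]

# Keep only the URLs the pattern matches somewhere in the string.
followed = [url for url in urls if PAGE_LINK.search(url)]
print(followed)
```

Two-digit page numbers fall outside the pattern; replacing [0-9] with [0-9]+ would accept them as well.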

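Storage settings: the pipeline

Scrapy uses a pipeline mechanism for further processing of scraped data, such as persisting it.
The pipeline and the database options are configured in settings.py:

ITEM_PIPELINES = {'stack.pipelines.MongoDBPipeline': 300}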

MONGODB_SERVER = "localhost"
MONGODB_PORT = 27017
MONGODB_DB = "stackoverflow"
MONGODB_COLLECTION = "questions"

Pipeline setup

Scrapy connects to the database through the pipeline, defined in pipelines.py:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

import logging

import pymongo
from scrapy.exceptions import DropItem
from scrapy.utils.project import get_project_settings

# scrapy.conf is deprecated; load the project settings explicitly
settings = get_project_settings()


class MongoDBPipeline(object):
    def __init__(self):
        client = pymongo.MongoClient(
            settings['MONGODB_SERVER'],
            settings['MONGODB_PORT']
        )
        db = client.get_database(settings['MONGODB_DB'])
        self.collection = db.get_collection(settings['MONGODB_COLLECTION'])

    def process_item(self, item, spider):
        # Drop the item if any of its fields is empty
        for field, value in dict(item).items():
            if not value:
                raise DropItem('Missing {0}'.format(field))
        # Insert the question, or replace the stored one with the same url
        self.collection.replace_one({'url': item['url']},
                                    dict(item), upsert=True)
        logging.log(logging.INFO, 'Question added to MongoDB database!')
        return item
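The validate-then-upsert logic of the pipeline can be illustrated without Scrapy or MongoDB. In this sketch everything is a stand-in: DropItem is a local exception instead of scrapy.exceptions.DropItem, a dict keyed by url replaces the MongoDB collection, and the items are made-up examples.

```python
class DropItem(Exception):
    """Local stand-in for scrapy.exceptions.DropItem."""


def validate(item):
    """Raise DropItem if any field of the item is empty."""
    for key, value in item.items():
        if not value:
            raise DropItem('Missing {0}'.format(key))
    return item


def upsert(collection, item):
    """Insert the item, or replace the stored one that has the same url."""
    collection[item['url']] = dict(item)


collection = {}   # a dict keyed by url stands in for the MongoDB collection
first = {'title': 'Q1', 'url': 'http://example.com/q/1', 'tags': ['python']}
updated = dict(first, title='Q1 (edited)')

upsert(collection, validate(first))
upsert(collection, validate(updated))   # same url: replaced, not duplicated

try:
    validate({'title': '', 'url': 'http://example.com/q/2', 'tags': ['x']})
    dropped = False
except DropItem:
    dropped = True

print(len(collection), collection['http://example.com/q/1']['title'], dropped)
```

Keying on url is what makes repeated crawls idempotent: scraping the same question twice updates one document instead of adding a second one.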

Running the crawl

Start the spider to begin scraping:

$ scrapy crawl stack_crawler

The complete code is available on GitHub.

Connecting Ubuntu Server 14.04 to WiFi

Posted on 2015-06-28 | Category: Linux

The server edition of Ubuntu blocks WiFi by default.

Unblock it with the rfkill command:

sudo rfkill unblock all

Then configure /etc/network/interfaces:

auto wlan0
iface wlan0 inet dhcp
    wpa-driver wext
    wpa-ssid "TP-LINK_200F82"
    wpa-key-mgmt WPA-PSK
    wpa-ap-scan 2
    wpa-psk 9b231391b504add363e12b6d19a9c4eaf52d96eaa91266662c62bfa9aac1529

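Here wpa-ssid is the SSID of the WiFi network, wpa-key-mgmt is the key management (encryption) scheme, and wpa-psk is the hashed WiFi password, generated with the following command:

wpa_passphrase SSID "password"

Restart the networking service and the WiFi connection comes up.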
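For reference, the WPA/WPA2 pre-shared key is derived from the passphrase with PBKDF2-HMAC-SHA1, using the SSID as salt, 4096 iterations, and a 256-bit output. The sketch below reproduces that derivation with Python's standard library; the function name wpa_psk and the sample SSID and passphrase are made up for illustration.

```python
import hashlib


def wpa_psk(ssid, passphrase):
    """Derive the 256-bit WPA/WPA2 pre-shared key as 64 hex characters."""
    if not 8 <= len(passphrase) <= 63:
        raise ValueError('passphrase must be 8..63 characters')
    key = hashlib.pbkdf2_hmac(
        'sha1',                      # PBKDF2 with HMAC-SHA1
        passphrase.encode('utf-8'),  # the WiFi password
        ssid.encode('utf-8'),        # the SSID is used as the salt
        4096,                        # iteration count
        32,                          # key length in bytes
    )
    return key.hex()


psk = wpa_psk('MyNetwork', 'correct horse')
print(psk)
```

Because the SSID is the salt, the same password yields a different wpa-psk value on a network with a different name.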

Leap Year

Posted on 2015-06-27 | Category: Others

A straightforward algorithm for leap years:

def is_leap(year):
    return (year % 4 == 0 and year % 100 != 0) or year % 400 == 0
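A quick way to gain confidence in the rule is to try the century edge cases and to compare against calendar.isleap from the standard library. The sample years below are chosen for illustration.

```python
import calendar


def is_leap(year):
    """Gregorian rule: divisible by 4, except centuries not divisible by 400."""
    return (year % 4 == 0 and year % 100 != 0) or year % 400 == 0


# 1900 and 2000 are the interesting cases: both are divisible by 100,
# but only 2000 is divisible by 400.
for year in (1900, 2000, 2015, 2016):
    print(year, is_leap(year))

# Cross-check against the standard library over a wide range of years.
agrees = all(is_leap(y) == calendar.isleap(y) for y in range(1, 3000))
print(agrees)
```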