
Scrapy-Distributed

Scrapy-Distributed is a series of components for developing distributed crawlers based on Scrapy in an easy way.

Scrapy-Distributed currently supports a RabbitMQ Scheduler, a Kafka Scheduler, and a RedisBloom DupeFilter. You can use any of them in your Scrapy project very easily.

Features

  • RabbitMQ Scheduler
    • Support declaring a RabbitMQ queue with custom options such as passive, durable, exclusive, auto_delete, and all other queue options.
  • RabbitMQ Pipeline
    • Support declaring a RabbitMQ queue for the spider's items, with custom options such as passive, durable, exclusive, auto_delete, and all other queue options.
  • Kafka Scheduler
    • Support declaring a Kafka topic with custom options such as num_partitions and replication_factor; more options will be supported.
  • RedisBloom DupeFilter
    • Support customizing the key, errorRate, capacity, expansion, and auto-scaling (noScale) of the Bloom filter (see the sketch after this list).
  • Custom DupeFilter Interface
    • Implement your own deduplication logic by extending BaseDupeFilter.
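
For a quick feel of what the key, errorRate, capacity, and noScale options control, here is a small sketch that talks to RedisBloom directly through redisbloom.client.Client (the client class this project wires in via REDIS_BLOOM_PARAMS). The key name and numbers are only illustrative.

from redisbloom.client import Client

# connect to the RedisBloom container started in the Usage section
rb = Client(host="localhost", port=6379)

# reserve a Bloom filter: errorRate=0.001 (0.1%), capacity=1,000,000 items;
# expansion and noScale control how the filter grows once capacity is reached
rb.bfCreate("urls:seen", 0.001, 1_000_000)

# bfAdd returns 1 on first insert, 0 if the item was (probably) seen before
print(rb.bfAdd("urls:seen", "https://example.com"))
print(rb.bfExists("urls:seen", "https://example.com"))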

Requirements

  • Python >= 3.6
  • Scrapy >= 1.8.0
  • Pika >= 1.0.0
  • RedisBloom >= 0.2.0
  • Redis >= 3.0.1
  • kafka-python >= 1.4.7

TODO

  • RabbitMQ Item Pipeline
  • Support Delayed Message in RabbitMQ Scheduler
  • Support Scheduler Serializer
  • Custom Interface for DupeFilter
  • RocketMQ Scheduler
  • RocketMQ Item Pipeline
  • SQLAlchemy Item Pipeline
  • Mongodb Item Pipeline
  • Kafka Scheduler
  • Kafka Item Pipeline

Usage

Step 0:

pip install scrapy-distributed

OR

git clone https://github.com/Insutanto/scrapy-distributed.git && cd scrapy-distributed && python setup.py install

There are example projects in examples/rabbitmq_example and examples/kafka_example. Here is the quickest way to try Scrapy-Distributed.

If you don't have the required environment for the RabbitMQ example:

# pull and run a RabbitMQ container.
docker run -d --name rabbitmq -p 0.0.0.0:15672:15672 -p 0.0.0.0:5672:5672 rabbitmq:3-management

# pull and run a RedisBloom container.
docker run -d --name redisbloom -p 6379:6379 redis/redis-stack

cd examples/rabbitmq_example
python run_simple_example.py

Or you can use docker compose:

docker compose -f ./docker-compose.dev.yaml up -d
cd examples/rabbitmq_example
python run_simple_example.py

If you don't have the required environment for the Kafka example:

# make sure you have a Kafka running on localhost:9092
# pull and run a RedisBloom container.
docker run -d --name redisbloom -p 6379:6379 redis/redis-stack

cd examples/kafka_example
python run_simple_example.py

Or you can use docker compose:

docker compose -f ./docker-compose.dev.yaml up -d
cd examples/kafka_example
python run_simple_example.py

RabbitMQ Support

If you don't have the required environment for tests:

# pull and run a RabbitMQ container.
docker run -d --name rabbitmq -p 0.0.0.0:15672:15672 -p 0.0.0.0:5672:5672 rabbitmq:3-management

# pull and run a RedisBloom container.
docker run -d --name redisbloom -p 6379:6379 redis/redis-stack

Or you can use docker compose:

docker compose -f ./docker-compose.dev.yaml up -d

Step 1:

Just by changing SCHEDULER and DUPEFILTER_CLASS and adding some configs, you can get a distributed crawler in a moment.

SCHEDULER = "scrapy_distributed.schedulers.DistributedScheduler"
SCHEDULER_QUEUE_CLASS = "scrapy_distributed.queues.amqp.RabbitQueue"
RABBITMQ_CONNECTION_PARAMETERS = "amqp://guest:guest@localhost:5672/example/?heartbeat=0"
DUPEFILTER_CLASS = "scrapy_distributed.dupefilters.redis_bloom.RedisBloomDupeFilter"
BLOOM_DUPEFILTER_REDIS_URL = "redis://:@localhost:6379/0"
BLOOM_DUPEFILTER_REDIS_HOST = "localhost"
BLOOM_DUPEFILTER_REDIS_PORT = 6379
REDIS_BLOOM_PARAMS = {
    "redis_cls": "redisbloom.client.Client"
}
BLOOM_DUPEFILTER_ERROR_RATE = 0.001
BLOOM_DUPEFILTER_CAPACITY = 100_0000

# disable the RedirectMiddleware, because the RabbitMiddleware handles redirected requests itself.
DOWNLOADER_MIDDLEWARES = {
    ...
    "scrapy.downloadermiddlewares.redirect.RedirectMiddleware": None,
    "scrapy_distributed.middlewares.amqp.RabbitMiddleware": 542
}

# add RabbitPipeline, which pushes your items to a RabbitMQ queue.
ITEM_PIPELINES = {
    ...
    "scrapy_distributed.pipelines.amqp.RabbitPipeline": 301,
}
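
With these settings, an ordinary Scrapy spider needs no special code: scheduling goes through RabbitMQ, deduplication through the Bloom filter, and items through RabbitPipeline. A minimal sketch (spider name and URLs are placeholders; see examples/rabbitmq_example for a complete project):

import scrapy


class SimpleSpider(scrapy.Spider):
    # a plain Scrapy spider: scheduling and deduplication are handled by
    # the DistributedScheduler and RedisBloomDupeFilter configured above
    name = "simple_example"
    start_urls = ["https://example.com"]

    def parse(self, response):
        # follow in-page links; already-seen URLs are dropped by the Bloom filter
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)
        # yielded items are published to a RabbitMQ queue by RabbitPipeline
        yield {"url": response.url, "title": response.css("title::text").get()}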


Step 2:

scrapy crawl <your_spider>

Kafka Support

Step 1:

SCHEDULER = "scrapy_distributed.schedulers.DistributedScheduler"
SCHEDULER_QUEUE_CLASS = "scrapy_distributed.queues.kafka.KafkaQueue"
KAFKA_CONNECTION_PARAMETERS = "localhost:9092"
DUPEFILTER_CLASS = "scrapy_distributed.dupefilters.redis_bloom.RedisBloomDupeFilter"
BLOOM_DUPEFILTER_REDIS_URL = "redis://:@localhost:6379/0"
BLOOM_DUPEFILTER_REDIS_HOST = "localhost"
BLOOM_DUPEFILTER_REDIS_PORT = 6379
REDIS_BLOOM_PARAMS = {
    "redis_cls": "redisbloom.client.Client"
}
BLOOM_DUPEFILTER_ERROR_RATE = 0.001
BLOOM_DUPEFILTER_CAPACITY = 100_0000

DOWNLOADER_MIDDLEWARES = {
    ...
    "scrapy_distributed.middlewares.kafka.KafkaMiddleware": 542
}
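
Before starting the spider, you can sanity-check the broker with kafka-python (already a dependency of this project). A small sketch, assuming a broker on localhost:9092 as configured in KAFKA_CONNECTION_PARAMETERS:

from kafka import KafkaConsumer

# connect to the broker and list the topics it currently knows about
consumer = KafkaConsumer(bootstrap_servers="localhost:9092")
print(consumer.topics())
consumer.close()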

Step 2:

scrapy crawl <your_spider>

Custom DupeFilter

You can implement your own deduplication logic by extending scrapy_distributed.dupefilters.BaseDupeFilter.

Step 1: Implement BaseDupeFilter

from scrapy_distributed.dupefilters import BaseDupeFilter


class MyDupeFilter(BaseDupeFilter):

    def __init__(self):
        # minimal in-process example: track seen URLs in a set
        # (a real distributed filter would use a shared store such as RedisBloom)
        self.seen_urls = set()

    @classmethod
    def from_settings(cls, settings):
        return cls()

    @classmethod
    def from_spider(cls, spider):
        instance = cls.from_settings(spider.settings)
        # read spider-specific config, e.g. spider.dupefilter_conf
        return instance

    def request_seen(self, request):
        # return True if the request was already seen
        if request.url in self.seen_urls:
            return True
        self.seen_urls.add(request.url)
        return False

    def open(self, spider=None):
        pass

    def close(self, reason=""):
        pass

Step 2: Register the filter in settings

SCHEDULER = "scrapy_distributed.schedulers.DistributedScheduler"
DUPEFILTER_CLASS = "myproject.dupefilters.MyDupeFilter"

The DistributedScheduler will call MyDupeFilter.from_spider(spider) during startup so that the filter can read any spider-level configuration it needs.
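
As a quick check outside of a crawl (illustrative only), the filter can be exercised directly with a Scrapy Request:

from scrapy.http import Request

dupefilter = MyDupeFilter()
print(dupefilter.request_seen(Request("https://example.com")))  # False: first visit
print(dupefilter.request_seen(Request("https://example.com")))  # True: already seen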

Reference Projects

scrapy-rabbitmq-link

scrapy-redis

Codebase Overview

The Scrapy-Distributed project enables building distributed crawlers on top of Scrapy. It supplies scheduler, queue, middleware, pipelines, and duplicate filtering components that coordinate work across RabbitMQ, Kafka and RedisBloom.

Repository Layout

  • scrapy_distributed/ – library modules
    • amqp_utils/ – RabbitMQ helpers
    • common/ – queue configuration objects
    • dupefilters/ – Redis Bloom-based duplicate filter
    • middlewares/ – downloader middlewares for ACK/requeue
    • pipelines/ – item pipelines to publish items to queues
    • queues/ – RabbitMQ and Kafka queue implementations
    • redis_utils/ – Redis connection helpers
    • schedulers/ – distributed scheduler combining queue and dupe filter
    • spiders/ – mixins and example spiders
  • examples/ – small Scrapy projects showing how to use RabbitMQ and Kafka
  • tests/ – unit tests for key components

Key Components

  • DistributedScheduler combines queue-based scheduling with a Redis Bloom duplicate filter.
  • RabbitQueue and KafkaQueue serialize Scrapy requests to publish/consume through RabbitMQ or Kafka.
  • RedisBloomDupeFilter tracks seen URLs using Redis Bloom filters.
  • RabbitMiddleware and RabbitPipeline handle acknowledgement and item publishing.
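
As a conceptual illustration of what a queue component does (not the library's actual wire format), a request can be reduced to a small payload, published to RabbitMQ with pika, and rebuilt on another node. The queue name and connection URL are placeholders:

import json

import pika
from scrapy.http import Request

# serialize the essentials of a request (a real queue also carries meta, headers, callbacks, ...)
request = Request("https://example.com")
payload = json.dumps({"url": request.url, "method": request.method})

# publish it to a RabbitMQ queue
connection = pika.BlockingConnection(pika.URLParameters("amqp://guest:guest@localhost:5672/"))
channel = connection.channel()
channel.queue_declare(queue="example", durable=True)
channel.basic_publish(exchange="", routing_key="example", body=payload)

# another crawler node consumes the message and rebuilds the request
_, _, body = channel.basic_get(queue="example", auto_ack=True)
if body is not None:
    data = json.loads(body)
    rebuilt = Request(url=data["url"], method=data["method"])
connection.close()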

Example Usage

Example projects under examples/ demonstrate how to configure the scheduler, queue, middleware and pipeline. Supporting services can be launched with the provided docker-compose.dev.yaml.

Learning Path

  1. Run the examples to see the distributed scheduler in action.
  2. Review schedulers and queues modules to understand request flow.
  3. Customize queue and Bloom filter settings via objects in common and redis_utils.
  4. Extend middlewares or pipelines to integrate with additional services.
