Is Scrapy overkill for this kind of task?

A place to discuss the implementation and style of computer programs.

Moderators: phlip, Moderators General, Prelates

jacques01
Posts: 42
Joined: Thu Oct 08, 2015 4:56 am UTC

Is Scrapy overkill for this kind of task?

Postby jacques01 » Sun Aug 21, 2016 5:46 pm UTC

Suppose there is a stream of incoming URLs. The task is to get the HTML of each of these URLs and store it in a database. There is no need for traditional crawling, as each URL is the end of the line.

What I'm thinking for my technology stack:

1) Flask for web container
2) Celery to manage tasks / queue of URLs
3) Requests library to get the HTML of each URL
4) Save HTML to a MongoDB (key is URL since it's necessarily unique)
5) Pool of proxies to avoid blacklisting.

Things I'd need to implement:

1. Controlling concurrency
2. Adding in delays / auto throttling.
3. Detecting when a proxy / IP address is no longer productive
4. Controlling speed of scraping
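Items 2–4 are mostly bookkeeping if you roll them yourself. A rough sketch using only the standard library (class names and the failure-count threshold are made up for illustration):

```python
import time

class ProxyPool:
    """Retire a proxy after repeated failures (item 3)."""
    def __init__(self, proxies, max_failures=3):
        self.failures = {p: 0 for p in proxies}
        self.max_failures = max_failures

    def get(self):
        # Prefer the healthy proxy with the fewest recorded failures.
        live = [p for p, f in self.failures.items() if f < self.max_failures]
        if not live:
            raise RuntimeError("all proxies retired")
        return min(live, key=lambda p: self.failures[p])

    def report_failure(self, proxy):
        self.failures[proxy] += 1

class Throttle:
    """Enforce a minimum delay between requests (items 2 and 4)."""
    def __init__(self, min_delay):
        self.min_delay = min_delay
        self.last = 0.0

    def wait(self):
        sleep_for = self.last + self.min_delay - time.monotonic()
        if sleep_for > 0:
            time.sleep(sleep_for)
        self.last = time.monotonic()
```

Concurrency control (item 1) would sit on top of this, e.g. via Celery worker counts or a semaphore.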

My understanding is that Scrapy could do all of those things easily except proxy management. But because I'm only interested in the HTML and not in crawling, I'm unsure whether Scrapy is necessary for this task.

User avatar
Flumble
Yes Man
Posts: 1939
Joined: Sun Aug 05, 2012 9:35 pm UTC

Re: Is Scrapy overkill for this kind of task?

Postby Flumble » Sun Aug 21, 2016 6:59 pm UTC

We may be able to help if you can provide us reasonable evidence that you're not breaching your country's law and the target's terms of service. :wink:

lorb
Posts: 404
Joined: Wed Nov 10, 2010 10:34 am UTC
Location: Austria

Re: Is Scrapy overkill for this kind of task?

Postby lorb » Tue Aug 23, 2016 10:00 am UTC

Scrapy is pretty much exactly the right tool for this. Use Scrapy Cloud and they'll also handle blacklist avoidance for you, provided what you're doing is legal/sane. If you really need to scrape so many URLs that concurrency is unavoidable and a concern, they handle that too, but you'll need one of the paid plans.
Please be gracious in judging my english. (I am not a native speaker/writer.)
http://decodedarfur.org/

jacques01
Posts: 42
Joined: Thu Oct 08, 2015 4:56 am UTC

Re: Is Scrapy overkill for this kind of task?

Postby jacques01 » Wed Aug 24, 2016 12:43 am UTC

Could you explain why Scrapy is the right tool for this task?

My understanding is that Scrapy is good if you're doing actual crawling, i.e.:

1. Start from a predetermined group of seed urls
2. Put these into your URL pool.
3. For each url in your URL pool:
a. Get the page HTML
b. Do something with the HTML
c. Find other URLs to add to the pool

I'm only doing step (a). That is, my list of URLs will never change based on what the spider discovers, because there's no discovery. It's just scraping.

What advantages does Scrapy offer for just getting HTML versus the approach I outlined above?

lorb
Posts: 404
Joined: Wed Nov 10, 2010 10:34 am UTC
Location: Austria

Re: Is Scrapy overkill for this kind of task?

Postby lorb » Wed Feb 08, 2017 2:03 pm UTC

It offers the advantage of not having to build the whole tech stack you outlined. You just run Scrapy and skip steps (b) and (c). All you need is to install Scrapy and write a Python snippet of about 10 lines, or run it from Scrapy Cloud and you don't even have to install anything. I can't imagine anything you build yourself being easier or better.

