Overriding start_requests in Scrapy is not synchronous
I'm trying to override Scrapy's start_requests method, unsuccessfully. I'm fine with iterating through pages; the problem is that I have to iterate first through cities, and then through pages.
My code looks like this:
    url = "https://example.com/%s/?page=%d"
    starting_number = 1
    number_of_pages = 3
    cities = []  # there is an array of cities
    selected_city = "..."

    def start_requests(self):
        for city in cities:
            selected_city = city
            print("####################")
            print("##### city: " + selected_city + " #####")
            for i in range(self.page_number, number_of_pages, +1):
                print("##### page: " + str(i) + " #####")
                yield scrapy.Request(url=(url % (selected_city, i)), callback=self.parse)
            print("####################")
In the console I see that when the crawler starts working, it prints the cities and pages and then starts the requests. But the result is that the crawler parses only the first city. It works asynchronously, while I need it to be synchronous.
What is the right way to iterate in this case?
Thanks for the help!
My problem was that I wrongly used the global variable selected_city in the remaining code.
I thought that on every iteration it would stop in the parse method and then continue with the next iteration. That is why I set the parameter item['city'] = selected_city in the parse method.
Now I pass the city parameter through the Request's meta parameter. Sample code:
    def start_requests(self):
        requests = []
        for city in cities:
            for i in range(self.page_number, number_of_pages, +1):
                requests.append(scrapy.Request(url=(url % (city, i)),
                                               callback=self.parse,
                                               meta={'city': city}))
        return requests
And in the parse method I retrieve it by doing:

    item['city'] = response.request.meta['city']