Overriding start_requests in Scrapy is not synchronous
I'm trying to override Scrapy's start_requests method, unsuccessfully. I'm fine with iterating through pages; the problem is that I have to iterate first through cities, and then through pages.
My code looks like this:
    url = "https://example.com/%s/?page=%d"
    starting_number = 1
    number_of_pages = 3
    cities = []  # there is an array of cities
    selected_city = "..."

    def start_requests(self):
        for city in cities:
            selected_city = city
            print("####################")
            print("##### city: " + selected_city + " #####")
            for i in range(self.page_number, number_of_pages, +1):
                print("##### page: " + str(i) + " #####")
                yield scrapy.Request(url=(url % (selected_city, i)), callback=self.parse)
            print("####################")
In the console I see that when the crawler starts working, it prints the cities and pages and then starts the requests. But the result is that the crawler parses only the first city. It works asynchronously, while I need it to be synchronous.
What is the right way to iterate in this case?
Thanks for the help!
My problem was that I wrongly used the global variable selected_city in the remaining code.
I thought that on every iteration it would stop in the parse method and then continue with the next iteration. That is why I set the parameter item['city'] = selected_city in the parse method.
Now I pass the city parameter through the Request's meta parameter. Sample code:
    def start_requests(self):
        requests = []
        for city in cities:
            for i in range(self.page_number, number_of_pages, +1):
                requests.append(scrapy.Request(url=(url % (city, i)),
                                               callback=self.parse,
                                               meta={'city': city}))
        return requests
And in the parse method I retrieve it by doing:

    item['city'] = response.request.meta['city']