Gitee repository link: Experiment 3
1. Operation ①
1.1 operation content
- Content: specify a website and crawl all the images on it, for example the China Weather Network (http://www.weather.com.cn). Use single-threaded and multi-threaded crawling respectively. (The number of images to crawl is limited by the last 4 digits of the student ID.)
- Output information: print the URL of each downloaded image on the console, store the downloaded images in the images subfolder, and provide a screenshot.
1.2 problem solving ideas
1.2.1 observe and analyze the main page: the images are in img tags, and the sub-page links are in the href attribute of a tags
1.2.2 while traversing the site, write a spider function that collects a few image URLs from each page (a full sketch of the function follows the snippet below)
soup = BeautifulSoup(data, "lxml")
images = soup.select("img")
links = soup.select("a")
# take at most 5 new image URLs from this page
if len(images) >= 5:
    for i in range(5):
        if images[i]["src"].startswith("http") and images[i]["src"] not in imgs:
            imgs.append(images[i]["src"])
# stop when enough images are collected or the depth limit is reached
if len(imgs) >= 121 or deep == 0:
    return
# recurse into the sub-pages
for link in links:
    if link["href"].startswith("http"):
        spider(link["href"], deep - 1)
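For context, a minimal self-contained sketch of how this snippet can sit inside a complete spider function; the page-fetching code, the User-Agent header, and the error handling are assumptions, not the original code:

# sketch of the full spider function, assuming urllib + BeautifulSoup
from urllib import request
from bs4 import BeautifulSoup

imgs = []  # collected image URLs (global, as in the snippet above)

def spider(url, deep):
    try:
        req = request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
        data = request.urlopen(req).read().decode("utf-8", "ignore")
    except Exception as err:
        print(err)
        return
    soup = BeautifulSoup(data, "lxml")
    images = soup.select("img")
    links = soup.select("a")
    if len(images) >= 5:
        for i in range(5):
            src = images[i].get("src", "")
            if src.startswith("http") and src not in imgs:
                imgs.append(src)
    if len(imgs) >= 121 or deep == 0:
        return
    for link in links:
        href = link.get("href", "")
        if href.startswith("http"):
            spider(href, deep - 1)

# usage: spider("http://www.weather.com.cn/", 3)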
1.2.3 downloading pictures with a single thread
# single-threaded download (uses urllib: from urllib import request)
def download():
    global imgs
    for i in range(121):
        request.urlretrieve(imgs[i], r"D:\data acquisition\demo\practice\3\3.1\images\picture" + str(i + 1) + ".jpg")
        print("downloaded ", imgs[i])
1.2.4 downloading pictures with multiple threads
# multi-threaded download: each thread saves one image
# (the target here is a per-image download(url, n) worker, not the
#  single-threaded download() above; a sketch of it is given below)
def download_threads():
    global imgs
    for i in range(121):
        T = threading.Thread(target=download, args=(imgs[i], i + 1))
        T.setDaemon(False)
        T.start()
        threads.append(T)
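The per-image worker that the threads call is not shown in the original; a minimal sketch, assuming it saves one URL to a numbered file in the same images folder:

import threading
from urllib import request

threads = []

def download(url, n):
    # save one image; the file name pattern is an assumption mirroring the single-threaded version
    try:
        request.urlretrieve(url, r"D:\data acquisition\demo\practice\3\3.1\images\picture" + str(n) + ".jpg")
        print("downloaded ", url)
    except Exception as err:
        print(err)

# after download_threads() starts all threads, wait for them to finish:
# for T in threads:
#     T.join()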
1.2.5 results
1.3 experience
- Single-threaded downloading fetches the images in order; it is simple and intuitive, but slow. Multi-threaded downloading does not necessarily finish in order, but it is much faster.
- Became familiar with traversing a website and consolidated my understanding of multi-threaded crawlers.
2. Operation ②
2.1 operation contents
Requirement: use the Scrapy framework to reproduce Operation ①. Output information: the same as Operation ①.
2.2 problem solving ideas
2.2.1 define a single field, src_url, in the item to store the image address, and write mySpider
# mySpider.py
import scrapy
from demo.items import DemoItem

class mySpider(scrapy.Spider):
    name = "mySpider"
    count = 0
    imgs = []

    def start_requests(self):
        url = "http://www.weather.com.cn/"
        yield scrapy.Request(url=url, callback=self.parse, meta={"deep": 3})

    def parse(self, response):
        try:
            deep = response.meta["deep"]
            if deep <= 0 or self.count >= 121:
                return
            data = response.body.decode()
            selector = scrapy.Selector(text=data)
            images = selector.xpath("//img/@src").extract()
            n = 0
            for i in images:
                if i.startswith("http") and i not in self.imgs:
                    item = DemoItem()
                    item["src_url"] = i
                    self.imgs.append(i)
                    self.count += 1
                    n += 1
                    if n > 5:
                        break
                    yield item
            links = selector.xpath("//a/@href").extract()
            for link in links:
                if link.startswith("http"):
                    url = response.urljoin(link)
                    yield scrapy.Request(url=url, callback=self.parse, meta={"deep": deep - 1})
        except Exception as err:
            print(err)
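The DemoItem referenced above is not reproduced in the original; a minimal sketch of items.py, assuming it only needs the src_url field described in 2.2.1:

# items.py - minimal sketch, assuming only the image URL field is needed
import scrapy

class DemoItem(scrapy.Item):
    src_url = scrapy.Field()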
2.2.2 download pictures in pipelines
# pipelines.py (uses the requests library: import requests)
class DemoPipeline:
    count = 1

    def process_item(self, item, spider):
        try:
            if self.count <= 121:
                with open("./images/picture" + str(self.count) + ".jpg", "wb") as f:
                    print(str(self.count) + ":" + item["src_url"])
                    img = requests.get(item["src_url"]).content
                    f.write(img)
                    self.count += 1
            else:
                return
        except Exception as err:
            print(err)
        return item
2.2.3 modifying settings
BOT_NAME = 'demo'
SPIDER_MODULES = ['demo.spiders']
NEWSPIDER_MODULE = 'demo.spiders'
ROBOTSTXT_OBEY = False
ITEM_PIPELINES = {
    'demo.pipelines.DemoPipeline': 300,
}
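To start the crawl, the standard "scrapy crawl mySpider" command works from the project directory; a small launcher script is a common convenience (the file name run.py is my assumption):

# run.py - equivalent to running "scrapy crawl mySpider" in the project directory
from scrapy import cmdline

cmdline.execute("scrapy crawl mySpider -s LOG_ENABLED=False".split())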
2.2.4 output results
2.3 experience
- Reviewed the Scrapy crawler framework and learned how to use XPath; once you are proficient, XPath feels more convenient than CSS selectors
- Understood how parse passes parameters between requests in Scrapy (via the meta argument)
3. Operation ③
3.1 operation contents
- Requirements: crawl Douban movie data using Scrapy and XPath, store the content in the database, and store the images in the imgs path. Store all the movie information in the database, and add a GIF recording of the browser F12 debugging and analysis process to the blog.
- Output information:
| Serial number | Movie title | Director | Performer | Brief introduction | Film rating | Film cover |
|---|---|---|---|---|---|---|
| 1 | The Shawshank Redemption | Frank Darabont | Tim Robbins | Want to set people free | 9.7 | ./imgs/xsk.jpg |
| 2 | ... | ... | ... | ... | ... | ... |
3.2 problem solving ideas
3.2.1 analyze the web page content and find the location of the required content
3.2.2 write movieItem
class MovieItem(scrapy.Item):
    rank = scrapy.Field()
    name = scrapy.Field()
    director = scrapy.Field()
    actor = scrapy.Field()
    score = scrapy.Field()
    img = scrapy.Field()
    desp = scrapy.Field()
3.2.3 writing the spider
Crawl the 10 pages of the Top 250 list
def start_requests(self):
    for i in range(10):
        url = "https://movie.douban.com/top250?start=" + str(i * 25)
        yield scrapy.Request(url=url, callback=self.parse)
parse function
def parse(self, response):
    try:
        data = response.body.decode()
        selector = scrapy.Selector(text=data)
        # get the block for each movie
        movies = selector.xpath("//li/div[@class='item']")
        for m in movies:
            rank = m.xpath("./div[@class='pic']/em/text()").extract_first()
            image = m.xpath("./div[@class='pic']/a/img/@src").extract_first()
            name = m.xpath("./div[@class='info']//span[@class='title']/text()").extract_first()
            members = m.xpath("./div[@class='info']//p[@class='']/text()").extract_first()
            desp = m.xpath("./div[@class='info']//p[@class='quote']/span/text()").extract_first()
            score = m.xpath("./div[@class='info']//span[@class='rating_num']/text()").extract_first()
            item = MovieItem()
            item['rank'] = rank
            item['name'] = name
            # the Douban page text is Chinese: "导演: ... 主演: ..."
            director = re.search(r'导演:(.*?)\s主', members).group(1)
            actor = re.search(r'主演:(.*)', members)
            item['director'] = director
            # actor may be missing because some animated films list no actors
            if actor is None:
                item['actor'] = "null"
            else:
                item["actor"] = actor.group(1)
            item['desp'] = desp
            item['score'] = score
            item['img'] = image
            yield item
    except Exception as err:
        print(err)
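For illustration, a short hedged example of the director/actor extraction; the sample string below mimics the typical Douban markup and is not taken from the original write-up:

import re

# sample of the <p class=""> text on a Douban Top 250 entry (illustrative only)
members = "导演: 弗兰克·德拉邦特 Frank Darabont   主演: 蒂姆·罗宾斯 Tim Robbins /..."

director = re.search(r'导演:(.*?)\s主', members).group(1)  # captures the director part
actor = re.search(r'主演:(.*)', members)                    # None for films with no listed actors
print(director.strip())
print(actor.group(1).strip() if actor else "null")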
3.2.4 writing the database class
import sqlite3

class MovieDB:
    def openDB(self):
        self.con = sqlite3.connect("movies.db")
        self.cursor = self.con.cursor()
        try:
            self.cursor.execute("create table movies (Rank int,Name varchar(32),Director varchar(32),"
                                "Actors varchar(64),Description varchar(64),Score varchar(8),ImgPath varchar(64))")
        except:
            # table already exists: clear the old rows instead
            self.cursor.execute("delete from movies")

    def closeDB(self):
        self.con.commit()
        self.con.close()

    def insert(self, Rank, Name, Director, Actors, Description, Score, ImgPath):
        try:
            self.cursor.execute("insert into movies (Rank,Name,Director,Actors,Description,Score,ImgPath) "
                                "values (?,?,?,?,?,?,?)",
                                (Rank, Name, Director, Actors, Description, Score, ImgPath))
        except Exception as err:
            print(err)
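A quick standalone check of the class; the values are illustrative and not part of the original write-up:

db = MovieDB()
db.openDB()
db.insert(1, "The Shawshank Redemption", "Frank Darabont", "Tim Robbins",
          "Want to set people free", "9.7", "./images/The Shawshank Redemption.jpg")
db.closeDB()

# verify the row landed in movies.db:
# python -c "import sqlite3; print(sqlite3.connect('movies.db').execute('select * from movies').fetchall())"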
3.2.5 writing the pipeline
# pipelines.py (uses: import requests, and the MovieDB class above)
class MoviespiderPipeline:
    def open_spider(self, spider):
        self.db = MovieDB()
        self.db.openDB()

    def process_item(self, item, spider):
        path = r"./images/" + item['name'] + ".jpg"
        url = item['img']
        img = requests.get(url).content
        with open(path, "wb") as f:
            f.write(img)
        print("Cover No." + item['rank'] + " downloaded successfully")
        self.db.insert(int(item['rank']), item['name'], item['director'], item['actor'],
                       item['desp'], item["score"], path)
        print("Movie No." + item['rank'] + " data inserted successfully")
        return item

    def close_spider(self, spider):
        self.db.closeDB()
        print("End crawling")
3.2.6 modifying settings
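The settings changes are not reproduced in the original write-up; a plausible sketch, assuming the same pattern as Operation ② plus a browser User-Agent (Douban rejects the default Scrapy one). The project name moviespider is inferred from the pipeline class name and is an assumption:

# settings.py - sketch, not the original file
BOT_NAME = 'moviespider'
SPIDER_MODULES = ['moviespider.spiders']
NEWSPIDER_MODULE = 'moviespider.spiders'
ROBOTSTXT_OBEY = False
USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/96.0 Safari/537.36')
ITEM_PIPELINES = {
    'moviespider.pipelines.MoviespiderPipeline': 300,
}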
3.2.7 output results
3.3 experience
- Became more proficient in the use of Scrapy and XPath