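Web Crawler Development (2), Crawler Basics: parsing HTML with cheerio and extracting img src attributes, batch-downloading images with the download library, and using encodeURI() to percent-encode URLs that contain Chinese file names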

Category: HTML and CSS Basics | Source: Internet | Editor: site editor | Updated: 2024-12-03 14:09:08 | Views: 58

Parsing the fetched HTML string with cheerio

Learning objectives:

Load HTML with cheerio
Review the jQuery API
Collect the src attribute of every img tag

Introduction to cheerio

npm page: https://www.npmjs.com/package/cheerio

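cheerio is a small, fast, and elegant implementation of the core jQuery API, designed specifically for use on the server.

In short, you can use this library on the server to parse HTML code, with the same API you already know from jQuery.

The official demo looks like this: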

const cheerio = require('cheerio')
const $ = cheerio.load('<h2 class="title">Hello world</h2>')

$('h2.title').text('Hello there!')
$('h2').addClass('welcome')

$.html()
//=> <html><head></head><body><h2 class="title welcome">Hello there!</h2></body></html>

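In the same way, the jQuery API can be used to read the attributes and content of DOM elements.

Parsing HTML with cheerio

Inspect the page to find the structure that contains all the img tags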

Use the jQuery API to get the src attribute of every img
Create a new file teacher_photos.js in the test folder
test/teacher_photos.js

const http = require('http')
const cheerio = require('cheerio')

let req = http.request('http://web.itheima.com/teacher.html', res => {
  let chunks = []
  // Collect the response body chunk by chunk
  res.on('data', chunk => {
    chunks.push(chunk)
  })
  res.on('end', () => {
    // console.log(Buffer.concat(chunks).toString('utf-8'))
    let html = Buffer.concat(chunks).toString('utf-8')
    let $ = cheerio.load(html)
    // Build an absolute URL from the src attribute of each matched img
    let imgArr = Array.prototype.map.call($('.tea_main .tea_con .li_img > img'), (item) => 'http://web.itheima.com/' + $(item).attr('src'))
    console.log(imgArr)

    // Equivalent version using .each():
    // let imgArr = []
    // $('.tea_main .tea_con .li_img > img').each((i, item) => {
    //   let imgPath = 'http://web.itheima.com/' + $(item).attr('src')
    //   imgArr.push(imgPath)
    // })
    // console.log(imgArr)
  })
})
req.end()
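The script collects every data chunk and decodes the body only once, in the end handler. The following is a minimal sketch of why that order matters, using a hand-split buffer in place of a real response (the sample markup and the split point are assumptions chosen for illustration): a multi-byte UTF-8 character can straddle two chunks, so decoding chunk by chunk may corrupt it.

```javascript
// Simulate a response body that arrives in two chunks, split in the
// middle of a multi-byte UTF-8 character (illustrative data only).
const body = Buffer.from('<img alt="老师">', 'utf-8')
const splitAt = body.indexOf(Buffer.from('老', 'utf-8')) + 1 // cut inside '老'
const chunks = [body.subarray(0, splitAt), body.subarray(splitAt)]

// Decoding each chunk separately breaks the character that was cut in half.
const decodedPerChunk = chunks.map(c => c.toString('utf-8')).join('')

// Concatenating first, then decoding once, restores the original text.
const decodedOnce = Buffer.concat(chunks).toString('utf-8')

console.log(decodedPerChunk === body.toString('utf-8')) // false
console.log(decodedOnce === body.toString('utf-8'))     // true
```

This is why the tutorial pushes raw chunks into an array and calls Buffer.concat(chunks).toString('utf-8') at the end, rather than appending strings inside the data handler.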

4. Keep using the dependencies declared in package.json

test/package.json

{
  "name": "spider-demo",
  "version": "1.0.0",
  "description": "",
  "main": "index.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "keywords": [],
  "author": "",
  "license": "ISC",
  "dependencies": {
    "cheerio": "^1.0.0-rc.3",
    "download": "^7.1.0"
  }
}

5. After installing the packages with npm i, open a terminal in the test folder and run the file:

node .\teacher_photos.js

Batch-downloading images with the download library

Replace the file teacher_photos.js in the test folder

test/teacher_photos.js

const http = require('http')
const cheerio = require('cheerio')
const download = require('download')

let req = http.request('http://web.itheima.com/teacher.html', res => {
  let chunks = []
  res.on('data', chunk => {
    chunks.push(chunk)
  })
  res.on('end', () => {
    // console.log(Buffer.concat(chunks).toString('utf-8'))
    let html = Buffer.concat(chunks).toString('utf-8')
    let $ = cheerio.load(html)
    // encodeURI() percent-encodes non-ASCII characters (such as Chinese
    // file names) so that the resulting URL can be requested
    let imgArr = Array.prototype.map.call($('.tea_main .tea_con .li_img > img'), (item) => encodeURI('http://web.itheima.com/' + $(item).attr('src')))
    // console.log(imgArr)

    // Download every image into the dist folder in parallel
    Promise.all(imgArr.map(x => download(x, 'dist'))).then(() => {
      console.log('files downloaded!');
    });
  })
})
req.end()
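The encodeURI() call is what lets URLs with Chinese file names be requested. Here is a short sketch of what it does, using a made-up image path (the file name 老师.jpg is an assumption, not taken from the site): it percent-encodes the UTF-8 bytes of non-ASCII characters and leaves URL delimiters untouched. Note that this is percent-encoding, not base64.

```javascript
// Hypothetical src value containing Chinese characters (illustration only).
const base = 'http://web.itheima.com/'
const src = 'images/老师.jpg'

// encodeURI percent-encodes non-ASCII characters but keeps ':', '/', '?', '#'.
const encoded = encodeURI(base + src)
console.log(encoded)
// http://web.itheima.com/images/%E8%80%81%E5%B8%88.jpg

// decodeURI reverses the encoding.
console.log(decodeURI(encoded) === base + src) // true

// The WHATWG URL class (built into Node.js) resolves a relative src against
// the page URL and applies the same percent-encoding to the path.
const resolved = new URL(src, 'http://web.itheima.com/teacher.html').href
console.log(resolved === encoded) // true

// Caveat: encoding an already-encoded URL escapes the '%' signs again.
console.log(encodeURI(encoded) === encoded) // false
```

Because of the last point, it seems safest to call encodeURI() exactly once, on the raw src value, as the script does.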

After installing the packages with npm i, open a terminal in the test folder and run the file:

node .\teacher_photos.js
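One thing to keep in mind when running the script: Promise.all rejects as soon as any single download fails, so one broken image URL hides the outcome of all the others. Below is a hedged sketch of an alternative based on Promise.allSettled. It uses a stand-in function, fakeDownload (hypothetical, so the example runs without network access), in place of download(url, 'dist'), and the image URLs are invented for illustration.

```javascript
// fakeDownload is a hypothetical stand-in for download(url, 'dist'):
// it rejects for URLs containing 'missing' and resolves otherwise.
const fakeDownload = url =>
  url.includes('missing')
    ? Promise.reject(new Error('404 for ' + url))
    : Promise.resolve(url)

// Invented URLs, for illustration only.
const imgArr = [
  'http://web.itheima.com/images/a.jpg',
  'http://web.itheima.com/images/missing.jpg',
  'http://web.itheima.com/images/b.jpg'
]

// Wait for every download to settle, then count successes and failures.
const summary = Promise.allSettled(imgArr.map(fakeDownload)).then(results => {
  const failed = results.filter(r => r.status === 'rejected')
  const ok = results.length - failed.length
  console.log(`${ok} downloaded, ${failed.length} failed`)
  failed.forEach(r => console.error(r.reason.message))
  return { ok, failed: failed.length }
})
```

In the real script, swapping fakeDownload for x => download(x, 'dist') should give the same kind of report; Promise.allSettled needs Node.js 12.9 or later.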


When reposting, please credit the source. Original link: https://www.xkablog.com/qdhtml/10725.html
