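Web Crawler Development (4): Crawler Basics: Environment Setup, Defining the Options Interface, Extracting Shared Code, Defining Abstract Methods, Implementing the TeacherPhotos and NewsList Classes, and Summary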
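Encapsulating a base crawler library
The code above repeats itself in many places. Encapsulating it in an object-oriented way improves code reuse. To ease development and keep the code consistent, TypeScript is recommended for the encapsulation.
The following material is extended content and assumes some familiarity with object-oriented programming and TypeScript.
Run tsc --init
to initialize the project and generate the TypeScript configuration file.
TS configuration: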
{
  "compilerOptions": {
    /* Basic Options */
    "target": "es2015",
    "module": "commonjs",
    "outDir": "./bin",
    "rootDir": "./src",
    "strict": true,
    "esModuleInterop": true
  },
  "include": [
    "src/**/*"
  ],
  "exclude": [
    "node_modules",
    "**/*.spec.ts"
  ]
}
The Spider abstract class: defines the options interface and holds the shared code
// Import the http module
const http = require('http')
import SpiderOptions from './interfaces/SpiderOptions'

export default abstract class Spider {
  options: SpiderOptions;

  constructor(options: SpiderOptions = { url: '', method: 'get' }) {
    this.options = options
    this.start()
  }

  start(): void {
    // Create the request object (the HTTP request has not been sent yet)
    let req = http.request(this.options.url, {
      headers: this.options.headers,
      method: this.options.method
    }, (res: any) => {
      // The response arrives asynchronously
      // console.log(res)
      let chunks: any[] = []
      // Listen for the data event and collect each incoming chunk
      res.on('data', (c: any) => chunks.push(c))
      // Listen for the end event, fired once all data has been received
      res.on('end', () => {
        // Concatenate all chunks and convert them to a string ==> the HTML string
        let htmlStr = Buffer.concat(chunks).toString('utf-8')
        this.onCatchHTML(htmlStr)
      })
    })
    // Send the request
    req.end()
  }

  // Implemented by each concrete crawler
  abstract onCatchHTML(result: string): any
}
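The division of labour in Spider (the base class fetches, the subclass interprets the result) is the template-method pattern. As a minimal offline sketch of that idea, assuming a hypothetical MiniSpider class and an injected fakeFetcher in place of the real http.request call (neither is part of the original code):

```typescript
// Hypothetical stand-in for the network call; the real Spider uses http.request.
type Fetcher = (url: string) => Promise<string>;

// Hypothetical reduced version of Spider: fetch once, then delegate to the subclass.
abstract class MiniSpider {
  constructor(
    protected readonly url: string,
    private readonly fetcher: Fetcher
  ) {}

  // Template method: the base class owns the flow, the subclass owns the parsing.
  async run(): Promise<unknown> {
    const body = await this.fetcher(this.url);
    return this.onCatchHTML(body);
  }

  abstract onCatchHTML(result: string): unknown;
}

// A concrete crawler only has to say what to do with the fetched text.
class TitleSpider extends MiniSpider {
  onCatchHTML(result: string): string | null {
    const match = /<title>(.*?)<\/title>/i.exec(result);
    return match ? match[1] : null;
  }
}

// Canned response so the sketch runs without network access.
const fakeFetcher: Fetcher = async () => '<html><title>Teachers</title></html>';
const titlePromise = new TitleSpider('http://example.invalid/', fakeFetcher).run();
titlePromise.then((title) => console.log(title));
```

Injecting the fetcher is a design choice of this sketch only; it makes onCatchHTML easy to exercise without touching the network.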
The SpiderOptions interface:
export default interface SpiderOptions {
  url: string,
  method?: string,
  headers?: object
}
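SpiderOptions types method as any string and headers as a bare object. One possible tightening, shown purely as a sketch, restricts method to a union type and fills in defaults in one place. StrictSpiderOptions, HttpMethod, and withDefaults are hypothetical names, not part of the original code:

```typescript
// Hypothetical: only the two methods this article actually uses.
type HttpMethod = 'get' | 'post';

// Hypothetical stricter variant of SpiderOptions.
interface StrictSpiderOptions {
  url: string;
  method?: HttpMethod;
  headers?: Record<string, string>;
}

// Fill in the optional fields so later code never has to check for undefined.
function withDefaults(options: StrictSpiderOptions): Required<StrictSpiderOptions> {
  return {
    url: options.url,
    method: options.method ?? 'get',
    headers: options.headers ?? {},
  };
}

const normalized = withDefaults({ url: 'http://web.itheima.com/teacher.html' });
console.log(normalized.method, Object.keys(normalized.headers).length);
```

With the union type, a typo such as method: 'psot' is rejected at compile time instead of being sent to the server.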
The PhotoListSpider class:
import Spider from './Spider'
const cheerio = require('cheerio')
const download = require('download')

export default class PhotoListSpider extends Spider {
  onCatchHTML(result: string) {
    // console.log(result)
    // Parse the HTML and build an absolute, URI-encoded URL for every teacher photo
    let $ = cheerio.load(result)
    let imgs = Array.prototype.map.call(
      $('.tea_main .tea_con .li_img > img'),
      (item: any) => 'http://web.itheima.com/' + encodeURI($(item).attr('src'))
    )
    // Download all images into the dist folder
    Promise.all(imgs.map(x => download(x, 'dist'))).then(() => {
      console.log('files downloaded!');
    });
  }
}
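The string concatenation above works as long as every src is a relative path without a leading slash. As an alternative sketch, the WHATWG URL class that ships with Node.js resolves a src against the page address and percent-encodes the path in one step. resolveImageUrl is a hypothetical helper name, and the sample paths are invented for illustration:

```typescript
import { URL } from 'url';

// Hypothetical helper: resolve an <img> src against the page it was found on.
function resolveImageUrl(src: string, pageUrl: string): string {
  return new URL(src, pageUrl).href;
}

const page = 'http://web.itheima.com/teacher.html';

// A relative path containing a space is resolved and percent-encoded.
console.log(resolveImageUrl('images/a b.jpg', page));
// A root-relative path does not end up with a doubled slash.
console.log(resolveImageUrl('/images/c.jpg', page));
// An already absolute URL is returned as-is.
console.log(resolveImageUrl('http://web.itheima.com/images/d.jpg', page));
```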
The NewsListSpider class:
import Spider from "./Spider";

export default class NewsListSpider extends Spider {
  onCatchHTML(result: string) {
    // The endpoint returns JSON, so parse the response body and print it
    console.log(JSON.parse(result))
  }
}
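JSON.parse throws a SyntaxError when the body is not valid JSON, for example when the server answers with an HTML error page. A defensive variant might look like the following sketch; tryParseJson and ParseResult are hypothetical names, not part of the original class:

```typescript
// Hypothetical result type: either the parsed value or an error message.
type ParseResult =
  | { ok: true; value: unknown }
  | { ok: false; error: string };

// Hypothetical helper: parse JSON without letting a bad body throw.
function tryParseJson(text: string): ParseResult {
  try {
    return { ok: true, value: JSON.parse(text) };
  } catch (err) {
    return { ok: false, error: err instanceof Error ? err.message : String(err) };
  }
}

const good = tryParseJson('{"title":"news"}');
const bad = tryParseJson('<html>error page</html>');
console.log(good.ok, bad.ok);
```

Inside onCatchHTML, a crawler could then branch on ok and log the error instead of crashing.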
The test class:
import Spider from './Spider'
import PhotoListSpider from './PhotoListSpider'
import NewsListSpider from './NewsListSpider'

let spider1: Spider = new PhotoListSpider({
  url: 'http://web.itheima.com/teacher.html'
})

let spider2: Spider = new NewsListSpider({
  url: 'http://www.itcast.cn/news/json/f1f5ccee-1158-49a6-b7c4-f0bf40d5161a.json',
  method: 'post',
  headers: {
    "Host": "www.itcast.cn",
    "Connection": "keep-alive",
    "Content-Length": "0",
    "Accept": "*/*",
    "Origin": "http://www.itcast.cn",
    "X-Requested-With": "XMLHttpRequest",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36",
    "DNT": "1",
    "Referer": "http://www.itcast.cn/newsvideo/newslist.html",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
    "Cookie": "UM_distinctid=16b8a0c1ea534c-0c311b256ffee7-e--16b8a0c1ea689c; bad_idb2f10070-624e-11e8-917f-9fb8db4dc43c=8e1dcca1-9692-11e9-97fb-e5908bcaecf8; parent_qimo_sid_b2f10070-624e-11e8-917f-9fb8db4dc43c=921b3900-9692-11e9-9a47-855e632e21e7; CNZZDATA=--null%7C; cid_litiancheng_itcast.cn=TUd3emFUWjBNV2syWVRCdU5XTTRhREZs; PHPSESSID=j3ppafq1dgh2jfg6roc8eeljg2; CNZZDATA=cnzz_eid%3D--http%253A%252F%252Fmail.itcast.cn%252F%26ntime%3D; Hm_lvt_0cb375a2eb74efffa6c71ee607=,; qimo_seosource_22bdcd10-6250-11e8-917f-9fb8db4dc43c=%E7%AB%99%E5%86%85; qimo_seokeywords_22bdcd10-6250-11e8-917f-9fb8db4dc43c=; href=http%3A%2F%2Fwww.itcast.cn%2F; bad_id22bdcd10-6250-11e8-917f-9fb8db4dc43c=f2f41b71-a7a4-11e9-93cc-9ba8cb; nice_id22bdcd10-6250-11e8-917f-9fb8db4dc43c=f2f41b72-a7a4-11e9-93cc-9ba8cb; openChat22bdcd10-6250-11e8-917f-9fb8db4dc43c=true; parent_qimo_sid_22bdcd10-6250-11e8-917f-9fb8db4dc43c=fc61e520-a7a4-11e9-94a8-01dabdc2ed41; qimo_seosource_b2f10070-624e-11e8-917f-9fb8db4dc43c=%E7%AB%99%E5%86%85; qimo_seokeywords_b2f10070-624e-11e8-917f-9fb8db4dc43c=; accessId=b2f10070-624e-11e8-917f-9fb8db4dc43c; pageViewNum=2; nice_idb2f10070-624e-11e8-917f-9fb8db4dc43c=20d2a1d1-a7a8-11e9-bc20-e71d1b8e4bb6; openChatb2f10070-624e-11e8-917f-9fb8db4dc43c=true; Hm_lpvt_0cb375a2eb74efffa6c71ee607="
  }
})
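After this encapsulation, writing a new crawler only requires extending the Spider class and implementing onCatchHTML in the concrete crawler class. To test it, create the crawler in the test class and pass in the url and headers.
Because every crawler shares the same parent class, Spider, they are also easy to manage later on.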
Example 1
Step 1: run tsc --init
to initialize the project and generate the TypeScript configuration file.
tsconfig.json
{
  "compilerOptions": {
    /* Basic Options */
    // "incremental": true, /* Enable incremental compilation */
    "target": "ES2015", /* Specify ECMAScript target version: 'ES3' (default), 'ES5', 'ES2015', 'ES2016', 'ES2017', 'ES2018', 'ES2019' or 'ESNEXT'. */
    "module": "commonjs", /* Specify module code generation: 'none', 'commonjs', 'amd', 'system', 'umd', 'es2015', or 'ESNext'. */
    // "lib": [], /* Specify library files to be included in the compilation. */
    // "allowJs": true, /* Allow javascript files to be compiled. */
    // "checkJs": true, /* Report errors in .js files. */
    // "jsx": "preserve", /* Specify JSX code generation: 'preserve', 'react-native', or 'react'. */
    // "declaration": true, /* Generates corresponding '.d.ts' file. */
    // "declarationMap": true, /* Generates a sourcemap for each corresponding '.d.ts' file. */
    // "sourceMap": true, /* Generates corresponding '.map' file. */
    // "outFile": "./", /* Concatenate and emit output to single file. */
    "outDir": "./bin", /* Redirect output structure to the directory. */
    "rootDir": "./src", /* Specify the root directory of input files. Use to control the output directory structure with --outDir. */
    // "composite": true, /* Enable project compilation */
    // "tsBuildInfoFile": "./", /* Specify file to store incremental compilation information */
    // "removeComments": true, /* Do not emit comments to output. */
    // "noEmit": true, /* Do not emit outputs. */
    // "importHelpers": true, /* Import emit helpers from 'tslib'. */
    // "downlevelIteration": true, /* Provide full support for iterables in 'for-of', spread, and destructuring when targeting 'ES5' or 'ES3'. */
    // "isolatedModules": true, /* Transpile each file as a separate module (similar to 'ts.transpileModule'). */

    /* Strict Type-Checking Options */
    "strict": true, /* Enable all strict type-checking options. */
    // "noImplicitAny": true, /* Raise error on expressions and declarations with an implied 'any' type. */
    // "strictNullChecks": true, /* Enable strict null checks. */
    // "strictFunctionTypes": true, /* Enable strict checking of function types. */
    // "strictBindCallApply": true, /* Enable strict 'bind', 'call', and 'apply' methods on functions. */
    // "strictPropertyInitialization": true, /* Enable strict checking of property initialization in classes. */
    // "noImplicitThis": true, /* Raise error on 'this' expressions with an implied 'any' type. */
    // "alwaysStrict": true, /* Parse in strict mode and emit "use strict" for each source file. */

    /* Additional Checks */
    // "noUnusedLocals": true, /* Report errors on unused locals. */
    // "noUnusedParameters": true, /* Report errors on unused parameters. */
    // "noImplicitReturns": true, /* Report error when not all code paths in function return a value. */
    // "noFallthroughCasesInSwitch": true, /* Report errors for fallthrough cases in switch statement. */

    /* Module Resolution Options */
    // "moduleResolution": "node", /* Specify module resolution strategy: 'node' (Node.js) or 'classic' (TypeScript pre-1.6). */
    // "baseUrl": "./", /* Base directory to resolve non-absolute module names. */
    // "paths": {}, /* A series of entries which re-map imports to lookup locations relative to the 'baseUrl'. */
    // "rootDirs": [], /* List of root folders whose combined content represents the structure of the project at runtime. */
    // "typeRoots": [], /* List of folders to include type definitions from. */
    // "types": [], /* Type declaration files to be included in compilation. */
    // "allowSyntheticDefaultImports": true, /* Allow default imports from modules with no default export. This does not affect code emit, just typechecking. */
    "esModuleInterop": true /* Enables emit interoperability between CommonJS and ES Modules via creation of namespace objects for all imports. Implies 'allowSyntheticDefaultImports'. */
    // "preserveSymlinks": true, /* Do not resolve the real path of symlinks. */
    // "allowUmdGlobalAccess": true, /* Allow accessing UMD globals from modules. */

    /* Source Map Options */
    // "sourceRoot": "", /* Specify the location where debugger should locate TypeScript files instead of source locations. */
    // "mapRoot": "", /* Specify the location where debugger should locate map files instead of generated locations. */
    // "inlineSourceMap": true, /* Emit a single file with source maps instead of having a separate file. */
    // "inlineSources": true, /* Emit the source alongside the sourcemaps within a single file; requires '--inlineSourceMap' or '--sourceMap' to be set. */

    /* Experimental Options */
    // "experimentalDecorators": true, /* Enables experimental support for ES7 decorators. */
    // "emitDecoratorMetadata": true, /* Enables experimental support for emitting type metadata for decorators. */
  },
  "include": [
    "src/**/*"
  ],
  "exclude": [
    "node_modules",
    "**/*.spec.ts"
  ]
}
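Step 2: create a Spider class in the src folder and define the abstract method
src/Spider.ts
// Goal: when writing a crawler in the future, create a class that inherits from this base class,
// then handle the fetched result in the subclass.
// Usage: create a crawler object and pass in a URL; crawling starts automatically.
const http = require('http')
import SpiderOptions from './interfaces/SpiderOptions'

export default abstract class Spider {
  // Member definition
  options: SpiderOptions

  // The members of options are described by the SpiderOptions interface
  constructor(options: SpiderOptions = { url: '', method: 'get' }) {
    // Initialize
    this.options = options
    this.start()
  }

  start() {
    // Create the request object
    let req = http.request(this.options.url, {
      headers: this.options.headers,
      method: this.options.method
    }, (res: any) => {
      let chunks: any[] = []
      res.on('data', (c: any) => chunks.push(c))
      res.on('end', () => {
        let result = Buffer.concat(chunks).toString('utf-8')
        // console.log(result)
        // Call the abstract method. The base class knows nothing about what its
        // descendants do; it only calls the abstract method.
        // The concrete implementation is supplied by each subclass.
        this.onCatchHTML(result)
      })
    })
    // Send the request
    req.end()
  }

  abstract onCatchHTML(result: string): any
}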
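start() above registers no 'error' handler on the request, so a DNS failure or a refused connection is emitted as an unhandled 'error' event. As a sketch of one way to surface such failures, the request can be wrapped in a Promise that rejects on error. fetchText is a hypothetical helper, not part of the original Spider class, and it assumes Node's type definitions (@types/node) are installed:

```typescript
import * as http from 'http';

// Hypothetical helper: send one request and resolve with the body as a string.
function fetchText(
  url: string,
  method: string = 'get',
  headers: http.OutgoingHttpHeaders = {}
): Promise<string> {
  return new Promise<string>((resolve, reject) => {
    const req = http.request(url, { method, headers }, (res) => {
      const chunks: Buffer[] = [];
      // Collect each chunk as it arrives
      res.on('data', (c: Buffer) => chunks.push(c));
      // Join the chunks once the response is complete
      res.on('end', () => resolve(Buffer.concat(chunks).toString('utf-8')));
      // Errors while reading the response body
      res.on('error', reject);
    });
    // Errors while connecting or sending (DNS failure, refused connection, ...)
    req.on('error', reject);
    req.end();
  });
}
```

A caller could then write fetchText(url).then(html => ...).catch(err => console.error(err.message)), so that a failed request is reported instead of terminating the process.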
Step 3: create a folder named interfaces, add the interface file SpiderOptions.ts inside it, and export the options interface
src/interfaces/SpiderOptions.ts
export default interface SpiderOptions {
  url: string,
  method?: string,
  headers?: object
}
Step 4: create the crawler file TeacherPhotos.ts in the src folder
src/TeacherPhotos.ts
// Once the encapsulation is done, writing a crawler takes only these steps:
// 1. Write a crawler class that extends Spider
// 2. Implement onCatchHTML (what the crawler should do after fetching the resource)
// 3. Usage: create the crawler object and pass in the URL
const cheerio = require('cheerio')
const download = require('download')
import Spider from './Spider'

export default class TeacherPhotos extends Spider {
  onCatchHTML(result: string) {
    // What happens after the HTML is fetched is implemented by the subclass
    // console.log(result)
    // Download the images referenced by the src attribute of the img tags
    let $ = cheerio.load(result)
    let imgs = Array.prototype.map.call(
      $('.tea_main .tea_con .li_img > img'),
      (item: any) => 'http://web.itheima.com/' + encodeURI($(item).attr('src'))
    )
    Promise.all(imgs.map(x => download(x, 'dist'))).then(() => {
      console.log('files downloaded!');
    });
  }
}
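Promise.all(imgs.map(...)) above starts every download at the same time. If the image list is long, one option is to work through it in fixed-size batches. The sketch below uses a hypothetical inBatches helper and a fake task in place of download so that it runs offline; neither name comes from the original code:

```typescript
// Hypothetical helper: run `task` over `items`, at most `size` at a time.
async function inBatches<T, R>(
  items: T[],
  size: number,
  task: (item: T) => Promise<R>
): Promise<R[]> {
  if (size < 1) throw new RangeError('size must be at least 1');
  const results: R[] = [];
  for (let i = 0; i < items.length; i += size) {
    const batch = items.slice(i, i + size);
    // Wait for the whole batch before starting the next one
    const settled = await Promise.all(batch.map((item) => task(item)));
    results.push(...settled);
  }
  return results;
}

// Fake task standing in for download(url, 'dist'); it only waits briefly.
const fakeDownload = (url: string): Promise<string> =>
  new Promise((resolve) => setTimeout(() => resolve('saved ' + url), 10));

const batchDemo = inBatches(['a.jpg', 'b.jpg', 'c.jpg'], 2, fakeDownload);
batchDemo.then((saved) => console.log(saved));
```

In TeacherPhotos, the call would become something like inBatches(imgs, 5, x => download(x, 'dist')), where the batch size of 5 is an arbitrary choice.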
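Step 5: create the test file test.ts in the src folder
src/test.ts
import TeacherPhotos from './TeacherPhotos'

new TeacherPhotos({
  url: 'http://web.itheima.com/teacher.html'
})
Compile the project to test.js (for example by running tsc in the project root), then run the file:
node .\bin\test.js
At this point the crawler has fetched the HTML of http://web.itheima.com/teacher.html and, as onCatchHTML specifies, downloads the teacher photos into the dist folder.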
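Example 2
The first four steps are the same as above, except that step 4 creates src/NewsList.ts instead of TeacherPhotos.ts. Its code is not shown here; presumably it matches the NewsListSpider class from earlier, whose onCatchHTML logs the parsed JSON.
Step 5: create the test file test.ts in the src folder
src/test.ts
import NewsList from './NewsList'

new NewsList({
  url: 'http://www.itcast.cn/news/json/f1f5ccee-1158-49a6-b7c4-f0bf40d5161a.json'
})
Request headers can also be configured:
src/test.ts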
// import Spider from './Spider'
// new Spider({
//   url: 'http://www.itcast.cn/newsvideo/newslist.html'
// })
// new Spider({
//   url: 'http://web.itheima.com/teacher.html'
// })

// import TeacherPhotos from './TeacherPhotos'
// new TeacherPhotos({
//   url: 'http://web.itheima.com/teacher.html'
// })

import NewsList from './NewsList'

new NewsList({
  url: 'http://www.itcast.cn/news/json/f1f5ccee-1158-49a6-b7c4-f0bf40d5161a.json',
  method: 'post',
  headers: {
    "Host": "www.itcast.cn",
    "Connection": "keep-alive",
    "Content-Length": "0",
    "Accept": "*/*",
    "Origin": "http://www.itcast.cn",
    "X-Requested-With": "XMLHttpRequest",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36",
    "DNT": "1",
    "Referer": "http://www.itcast.cn/newsvideo/newslist.html",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
    "Cookie": "UM_distinctid=16b8a0c1ea534c-0c311b256ffee7-e--16b8a0c1ea689c; bad_idb2f10070-624e-11e8-917f-9fb8db4dc43c=8e1dcca1-9692-11e9-97fb-e5908bcaecf8; parent_qimo_sid_b2f10070-624e-11e8-917f-9fb8db4dc43c=921b3900-9692-11e9-9a47-855e632e21e7; CNZZDATA=--null%7C; cid_litiancheng_itcast.cn=TUd3emFUWjBNV2syWVRCdU5XTTRhREZs; PHPSESSID=j3ppafq1dgh2jfg6roc8eeljg2; CNZZDATA=cnzz_eid%3D--http%253A%252F%252Fmail.itcast.cn%252F%26ntime%3D; Hm_lvt_0cb375a2eb74efffa6c71ee607=,; qimo_seosource_22bdcd10-6250-11e8-917f-9fb8db4dc43c=%E7%AB%99%E5%86%85; qimo_seokeywords_22bdcd10-6250-11e8-917f-9fb8db4dc43c=; href=http%3A%2F%2Fwww.itcast.cn%2F; bad_id22bdcd10-6250-11e8-917f-9fb8db4dc43c=f2f41b71-a7a4-11e9-93cc-9ba8cb; nice_id22bdcd10-6250-11e8-917f-9fb8db4dc43c=f2f41b72-a7a4-11e9-93cc-9ba8cb; openChat22bdcd10-6250-11e8-917f-9fb8db4dc43c=true; parent_qimo_sid_22bdcd10-6250-11e8-917f-9fb8db4dc43c=fc61e520-a7a4-11e9-94a8-01dabdc2ed41; qimo_seosource_b2f10070-624e-11e8-917f-9fb8db4dc43c=%E7%AB%99%E5%86%85; qimo_seokeywords_b2f10070-624e-11e8-917f-9fb8db4dc43c=; accessId=b2f10070-624e-11e8-917f-9fb8db4dc43c; pageViewNum=2; nice_idb2f10070-624e-11e8-917f-9fb8db4dc43c=20d2a1d1-a7a8-11e9-bc20-e71d1b8e4bb6; openChatb2f10070-624e-11e8-917f-9fb8db4dc43c=true; Hm_lpvt_0cb375a2eb74efffa6c71ee607="
  }
})
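Compile the project to test.js (for example by running tsc in the project root), then run the file:
node .\bin\test.js
At this point the crawler has fetched the data returned by the JSON endpoint.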
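The request headers above advertise Accept-Encoding: gzip, deflate, so the server is allowed to send a compressed body, in which case calling toString('utf-8') on the raw bytes would not yield readable JSON. The following sketch decodes a body according to the response's Content-Encoding header using Node's built-in zlib module. decodeBody is a hypothetical helper, not part of the original code, and the sample payload is invented for illustration:

```typescript
import * as zlib from 'zlib';

// Hypothetical helper: turn raw response bytes into text, honouring Content-Encoding.
function decodeBody(raw: Buffer, contentEncoding?: string): string {
  const encoding = (contentEncoding || '').trim().toLowerCase();
  if (encoding === 'gzip') {
    return zlib.gunzipSync(raw).toString('utf-8');
  }
  if (encoding === 'deflate') {
    // HTTP "deflate" is the zlib-wrapped format, which inflateSync expects
    return zlib.inflateSync(raw).toString('utf-8');
  }
  // No (or unknown) encoding: treat the bytes as plain UTF-8 text
  return raw.toString('utf-8');
}

// Simulate a gzip-compressed JSON response and decode it again.
const compressed = zlib.gzipSync(Buffer.from('{"news":[]}', 'utf-8'));
console.log(decodeBody(compressed, 'gzip'));
```

In the Spider class, the encoding would come from res.headers['content-encoding'] and raw from Buffer.concat(chunks). A simpler alternative is to leave Accept-Encoding out of the request headers.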