
102302145 黄加鸿 Data Collection and Fusion Technology, Assignment 2

Assignment 2


Contents
  • Assignment 2
    • Task ①
      • 1) Code and Results
      • 2) Reflections
      • 3) Gitee Link
    • Task ②
      • 1) Code and Results
      • 2) Reflections
      • 3) Gitee Link
    • Task ③
      • 1) Code and Results
        • F12 debugging analysis GIF
      • 2) Reflections
      • 3) Gitee Link


Task ①

1) Code and Results

The China Weather Network (中国气象网) site was already analyzed in an earlier task, so the analysis process is not repeated here; in the end BeautifulSoup was chosen to parse the pages.

Core code

import urllib.request
from bs4 import BeautifulSoup, UnicodeDammit

class WeatherForecast:
    def __init__(self):
        self.headers = {"User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 6.0 x64; en-US; rv:1.9pre) Gecko/2008072421 Minefield/3.0.2pre"}
        self.cityCode = {"福州": "101230101", "厦门": "101230201", "泉州": "101230501"}  # city-code table

    def forecastCity(self, city):
        if city not in self.cityCode.keys():
            print(city + " code cannot be found")
            return
        # build the url from the city code so several city pages can be crawled
        url = "http://www.weather.com.cn/weather/" + self.cityCode[city] + ".shtml"
        try:
            req = urllib.request.Request(url, headers=self.headers)
            data = urllib.request.urlopen(req)
            data = data.read()
            dammit = UnicodeDammit(data, ["utf-8", "gbk"])
            data = dammit.unicode_markup
            soup = BeautifulSoup(data, "lxml")
            # locate the forecast table
            lis = soup.select("ul[class='t clearfix'] li")
            for li in lis:
                try:
                    date = li.select('h1')[0].text
                    weather = li.select('p[class="wea"]')[0].text
                    if li.select('p[class="tem"] span'):
                        temp = li.select('p[class="tem"] span')[0].text + "/" + li.select('p[class="tem"] i')[0].text
                    else:
                        temp = li.select('p[class="tem"] i')[0].text
                    print(city, date, weather, temp)
                    self.db.insert(city, date, weather, temp)  # self.db: database helper created elsewhere in the full script
                except Exception as err:
                    print(err)
        except Exception as err:
            print(err)

Run results

Weather console

Checking what was saved in the database:

Database view 1

2) Reflections

Tasks of this kind basically all call for defining a crawler class plus a database class, which keeps the structure clear. I learned how to define and write the database class, implementing open, close, insert, and query operations.
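The open/close/insert/query pattern described above can be sketched as follows. This is a minimal illustrative helper, not the assignment's actual class: the `WeatherDB` name, the in-memory default path, and the sample row are all assumptions.

```python
import sqlite3

class WeatherDB:
    """Minimal sketch of a crawler's database helper (hypothetical schema)."""

    def open(self, path=":memory:"):
        # connect and make sure the table exists
        self.con = sqlite3.connect(path)
        self.cursor = self.con.cursor()
        self.cursor.execute(
            "create table if not exists weathers "
            "(wCity varchar(16), wDate varchar(16), wWeather varchar(64), wTemp varchar(32))"
        )

    def close(self):
        # commit pending inserts before closing
        self.con.commit()
        self.con.close()

    def insert(self, city, date, weather, temp):
        # parameterized insert avoids SQL-injection and quoting issues
        self.cursor.execute(
            "insert into weathers (wCity, wDate, wWeather, wTemp) values (?,?,?,?)",
            (city, date, weather, temp),
        )

    def select_all(self):
        self.cursor.execute("select * from weathers")
        return self.cursor.fetchall()

db = WeatherDB()
db.open()
db.insert("福州", "9日", "晴", "25/18℃")
rows = db.select_all()
print(rows)
```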

3) Gitee Link

· https://gitee.com/jh2680513769/2025_crawler_project/blob/master/%E4%BD%9C%E4%B8%9A2/1.py

Task ②

1) Code and Results

First open the site and debug with F12: after refreshing the page, the JSON response holding the stock data (its name starts with "get?") shows up in the Network log, as shown:

Page analysis

Stock page

By comparing the URL parameters that load the different stock pages, for example "pn", the page-number parameter, paged or multi-page crawling can be designed.
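The pn-based paging can be sketched like this; the URL is shortened to a few illustrative query parameters, not the full parameter list used in the real request.

```python
# Build one URL per page by varying the 'pn' (page number) parameter.
# Base URL and parameter subset are simplified for illustration.
base = "https://push2.eastmoney.com/api/qt/clist/get"
urls = [f"{base}?pn={page}&pz=20&po=1&fid=f3" for page in range(1, 6)]

print(len(urls))   # five pages
print(urls[0])     # first page's URL
```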

Also, the response body is a dict-like "{Key}: {Value}" structure, so the json library is needed first to parse it into structured data; fields such as "f2", "f12", and "f14" then yield the stock price, code, name, and so on.
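Since the API actually returns JSONP (the JSON wrapped in a jQuery callback), the wrapper has to be stripped before json.loads. A small demonstration with a made-up sample payload:

```python
import json

# Toy JSONP payload; callback name and field values are invented for illustration
jsonp = 'jQuery12345_678({"data": {"diff": [{"f12": "600000", "f14": "浦发银行", "f2": 1234}]}});'

# keep everything between the first '(' and the last ')'
start = jsonp.find('(') + 1
end = jsonp.rfind(')')
obj = json.loads(jsonp[start:end])

stock = obj["data"]["diff"][0]
print(stock["f12"], stock["f14"])
```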

Based on this analysis, a multi-page crawler that saves its data was designed; the core of the code is as follows:

Core code

import requests
import json
import sqlite3

class StockDB:
    def openDB(self):
        self.con = sqlite3.connect("./stocks.db")
        self.cursor = self.con.cursor()
        try:
            self.cursor.execute("create table stocks(sNum varchar(16), sCode varchar(16), sName varchar(32), sNewest varchar(16), sUpdown varchar(16), sUpdown_num varchar(16), sTurnover varchar(32), sAmplitude varchar(16), constraint pk_stocks primary key (sCode))")
        except:
            self.cursor.execute("delete from stocks")

    def closeDB(self):
        self.con.commit()
        self.con.close()

    def insert(self, num, code, name, newest, updown, updown_num, turnover, amplitude):
        try:
            self.cursor.execute(
                """insert into stocks (sNum, sCode, sName, sNewest, sUpdown, sUpdown_num, sTurnover, sAmplitude)
                   values (?,?,?,?,?,?,?,?)""",
                (num, code, name, newest, updown, updown_num, turnover, amplitude))
        except Exception as err:
            print(err)

# vary the api parameters: pn selects the page, giving paged crawling
urls = [
    ("https://push2.eastmoney.com/api/qt/clist/get?np=1&fltt=1&invt=2&cb=jQuery371037824690299744046_1762785231380&fs=m%3A0%2Bt%3A6%2Bf%3A!2%2Cm%3A0%2Bt%3A80%2Bf%3A!2%2Cm%3A1%2Bt%3A2%2Bf%3A!2%2Cm%3A1%2Bt%3A23%2Bf%3A!2%2Cm%3A0%2Bt%3A81%2Bs%3A262144%2Bf%3A!2&fields=f12%2Cf13%2Cf14%2Cf1%2Cf2%2Cf4%2Cf3%2Cf152%2Cf5%2Cf6%2Cf7%2Cf15%2Cf18%2Cf16%2Cf17%2Cf10%2Cf8%2Cf9%2Cf23&fid=f3&"
     f"pn={page}"
     "&pz=20&po=1&dect=1&ut=fa5fd1943c7b386f172d6893dbfba10b&wbp2u=%7C0%7C0%7C0%7Cweb&_=1762785231382")
    for page in range(1, 6)]

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36 SLBrowser/9.0.6.8151 SLBChan/111 SLBVPV/64-bit'}

def parse_jsonp(jsonp_str):
    """Parse JSONP-format data by stripping the callback wrapper."""
    try:
        # keep everything between the first '(' and the last ')'
        start = jsonp_str.find('(') + 1
        end = jsonp_str.rfind(')')
        json_str_clean = jsonp_str[start:end]
        return json.loads(json_str_clean)
    except Exception as e:
        print(f"解析JSONP失败: {e}")
        return None

# create and open the database instance
db = StockDB()
db.openDB()

print(f"{'序号':<6}{'代码':<12}{'名称':<12}{'最新价':<8}{'涨跌幅':<8}{'涨跌额':<8}{'成交量(手)':<8}{'振幅':>6}")
num = 0
for url in urls:
    resp = requests.get(url, headers=headers)
    resp.encoding = 'utf-8'
    json_str = resp.text
    structure_data = parse_jsonp(json_str)
    # extract the stock records
    if 'data' in structure_data and 'diff' in structure_data['data']:
        stocks = structure_data['data']['diff']
        for i, stock in enumerate(stocks):
            num = num + 1
            code = stock.get('f12')
            name = stock.get('f14')
            newest = f"{stock.get('f2')/100:.2f}"
            updown = f"{stock.get('f3')/100:.2f}%"
            updown_num = f"{stock.get('f4')/100:.2f}"
            turnover = f"{stock.get('f5')/10000:.2f}万"
            amplitude = f"{stock.get('f7')/100:.2f}%"
            # save the record into the database
            db.insert(str(num), code, name, newest, updown, updown_num, turnover, amplitude)
            # print only the first and last 10 rows
            if num <= 10 or num >= 90:
                print("%-6d%-12s%-12s%-10s%-10s%-10s%-15s%-10s" % (num, code, name, newest, updown, updown_num, turnover, amplitude))
print("... ...")
# close the database
db.closeDB()
print("completed")

Run results

Stock console

2) Reflections

Parsing and extracting with the json library is convenient too, but the raw values differ from what the site displays, so formatted output and unit suffixes need some patient small fixes. I learned to use F12 to inspect the network log, analyze the API's parameters, and design the crawler around those parameters.
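The formatting step mentioned above can be shown in isolation. The raw API fields are scaled integers, so display values need dividing and unit suffixes; the sample values and the meanings of f2/f3/f5 follow the analysis earlier in the post, but the numbers themselves are invented.

```python
# Toy raw record: prices are in hundredths, volume in lots (made-up values)
raw = {"f2": 1234, "f3": 256, "f5": 123456}

newest = f"{raw['f2'] / 100:.2f}"        # latest price
updown = f"{raw['f3'] / 100:.2f}%"       # change, as a percentage
turnover = f"{raw['f5'] / 10000:.2f}万"  # volume, in units of 10,000 lots

print(newest, updown, turnover)
```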

3) Gitee Link

· https://gitee.com/jh2680513769/2025_crawler_project/blob/master/%E4%BD%9C%E4%B8%9A2/2.py

Task ③

1) Code and Results

As before, start with an F12 look at the network log; a file named payload.js turns out to hold the university ranking data:

Page information

University page

Core code

import requests
import re
import sqlite3class UniversityDB:def openDB(self):self.con = sqlite3.connect("./universities.db")self.cursor = self.con.cursor()try:self.cursor.execute("create table universities(uRank varchar(16), uName varchar(64), uProvince varchar(16), uCategory varchar(16), uScore varchar(16), constraint pk_stocks primary key (uName))")except:self.cursor.execute("delete from universities")def closeDB(self):self.con.commit()self.con.close()def insertDB(self, rank, name, province, category, score):try:self.cursor.execute("""insert into universities (uRank, uName, uProvince, uCategory, uScore) values (?,?,?,?,?)""", (rank, name, province, category, score))except Exception as err:print(err)def showDB(self):self.cursor.execute("select * from universities")rows = self.cursor.fetchall()print("\n数据库中的大学排名数据:")print(f"{'排名':<6}{'学校':<20}{'省市':<8}{'类型':<8}{'总分':<8}")for row in rows:print(f"{row[0]:<6}{row[1]:<20}{row[2]:<8}{row[3]:<8}{row[4]:<8}")url = "https://www.shanghairanking.cn/_nuxt/static/1762223212/rankings/bcur/2021/payload.js"
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36 SLBrowser/9.0.6.8151 SLBChan/111 SLBVPV/64-bit','Referer':'https://www.shanghairanking.cn/rankings/bcur/2021'  #后面被反爬了因为没加Referer}
resp = requests.get(url, headers=headers)
resp.encoding = 'utf-8'
content = resp.text
#正则表达式析取信息
ranks = re.findall(r'ranking:([^,]+)', content)
names = re.findall(r'univNameCn:"([^"]+)"', content)
scores = re.findall(r'score:([^,]+)', content)
provinces = re.findall(r'province:([^,]+)', content)
categorys = re.findall(r'univCategory:([a-zA-Z])', content)
#检查提取数据长度是否一致
if len(names) != len(scores) or len(scores) != len(provinces) or len(provinces) != len(categorys):print("错误,提取信息数量不匹配!")
else:#创建数据库实例并打开db = UniversityDB()db.openDB()#映射词典创建arr1 = 'a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z, A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X, Y, Z, _, $, aa, ab, ac, ad, ae, af, ag, ah, ai, aj, ak, al, am, an, ao, ap, aq, ar, as, at, au, av, aw, ax, ay, az, aA, aB, aC, aD, aE, aF, aG, aH, aI, aJ, aK, aL, aM, aN, aO, aP, aQ, aR, aS, aT, aU, aV, aW, aX, aY, aZ, a_, a$, ba, bb, bc, bd, be, bf, bg, bh, bi, bj, bk, bl, bm, bn, bo, bp, bq, br, bs, bt, bu, bv, bw, bx, by, bz, bA, bB, bC, bD, bE, bF, bG, bH, bI, bJ, bK, bL, bM, bN, bO, bP, bQ, bR, bS, bT, bU, bV, bW, bX, bY, bZ, b_, b$, ca, cb, cc, cd, ce, cf, cg, ch, ci, cj, ck, cl, cm, cn, co, cp, cq, cr, cs, ct, cu, cv, cw, cx, cy, cz, cA, cB, cC, cD, cE, cF, cG, cH, cI, cJ, cK, cL, cM, cN, cO, cP, cQ, cR, cS, cT, cU, cV, cW, cX, cY, cZ, c_, c$, da, db, dc, dd, de, df, dg, dh, di, dj, dk, dl, dm, dn, do0, dp, dq, dr, ds, dt, du, dv, dw, dx, dy, dz, dA, dB, dC, dD, dE, dF, dG, dH, dI, dJ, dK, dL, dM, dN, dO, dP, dQ, dR, dS, dT, dU, dV, dW, dX, dY, dZ, d_, d$, ea, eb, ec, ed, ee, ef, eg, eh, ei, ej, ek, el, em, en, eo, ep, eq, er, es, et, eu, ev, ew, ex, ey, ez, eA, eB, eC, eD, eE, eF, eG, eH, eI, eJ, eK, eL, eM, eN, eO, eP, eQ, eR, eS, eT, eU, eV, eW, eX, eY, eZ, e_, e$, fa, fb, fc, fd, fe, ff, fg, fh, fi, fj, fk, fl, fm, fn, fo, fp, fq, fr, fs, ft, fu, fv, fw, fx, fy, fz, fA, fB, fC, fD, fE, fF, fG, fH, fI, fJ, fK, fL, fM, fN, fO, fP, fQ, fR, fS, fT, fU, fV, fW, fX, fY, fZ, f_, f$, ga, gb, gc, gd, ge, gf, gg, gh, gi, gj, gk, gl, gm, gn, go, gp, gq, gr, gs, gt, gu, gv, gw, gx, gy, gz, gA, gB, gC, gD, gE, gF, gG, gH, gI, gJ, gK, gL, gM, gN, gO, gP, gQ, gR, gS, gT, gU, gV, gW, gX, gY, gZ, g_, g$, ha, hb, hc, hd, he, hf, hg, hh, hi, hj, hk, hl, hm, hn, ho, hp, hq, hr, hs, ht, hu, hv, hw, hx, hy, hz, hA, hB, hC, hD, hE, hF, hG, hH, hI, hJ, hK, hL, hM, hN, hO, hP, hQ, hR, hS, hT, hU, hV, hW, hX, hY, hZ, h_, h$, ia, ib, ic, id, ie, if0, ig, ih, ii, ij, ik, 
il, im, in0, io, ip, iq, ir, is, it, iu, iv, iw, ix, iy, iz, iA, iB, iC, iD, iE, iF, iG, iH, iI, iJ, iK, iL, iM, iN, iO, iP, iQ, iR, iS, iT, iU, iV, iW, iX, iY, iZ, i_, i$, ja, jb, jc, jd, je, jf, jg, jh, ji, jj, jk, jl, jm, jn, jo, jp, jq, jr, js, jt, ju, jv, jw, jx, jy, jz, jA, jB, jC, jD, jE, jF, jG, jH, jI, jJ, jK, jL, jM, jN, jO, jP, jQ, jR, jS, jT, jU, jV, jW, jX, jY, jZ, j_, j$, ka, kb, kc, kd, ke, kf, kg, kh, ki, kj, kk, kl, km, kn, ko, kp, kq, kr, ks, kt, ku, kv, kw, kx, ky, kz, kA, kB, kC, kD, kE, kF, kG, kH, kI, kJ, kK, kL, kM, kN, kO, kP, kQ, kR, kS, kT, kU, kV, kW, kX, kY, kZ, k_, k$, la, lb, lc, ld, le, lf, lg, lh, li, lj, lk, ll, lm, ln, lo, lp, lq, lr, ls, lt, lu, lv, lw, lx, ly, lz, lA, lB, lC, lD, lE, lF, lG, lH, lI, lJ, lK, lL, lM, lN, lO, lP, lQ, lR, lS, lT, lU, lV, lW, lX, lY, lZ, l_, l$, ma, mb, mc, md, me, mf, mg, mh, mi, mj, mk, ml, mm, mn, mo, mp, mq, mr, ms, mt, mu, mv, mw, mx, my, mz, mA, mB, mC, mD, mE, mF, mG, mH, mI, mJ, mK, mL, mM, mN, mO, mP, mQ, mR, mS, mT, mU, mV, mW, mX, mY, mZ, m_, m$, na, nb, nc, nd, ne, nf, ng, nh, ni, nj, nk, nl, nm, nn, no, np, nq, nr, ns, nt, nu, nv, nw, nx, ny, nz, nA, nB, nC, nD, nE, nF, nG, nH, nI, nJ, nK, nL, nM, nN, nO, nP, nQ, nR, nS, nT, nU, nV, nW, nX, nY, nZ, n_, n$, oa, ob, oc, od, oe, of, og, oh, oi, oj, ok, ol, om, on, oo, op, oq, or, os, ot, ou, ov, ow, ox, oy, oz, oA, oB, oC, oD, oE, oF, oG, oH, oI, oJ, oK, oL, oM, oN, oO, oP, oQ, oR, oS, oT, oU, oV, oW, oX, oY, oZ, o_, o$, pa, pb, pc, pd, pe, pf, pg, ph, pi, pj, pk, pl, pm, pn, po, pp, pq, pr, ps, pt, pu, pv, pw, px, py, pz, pA, pB, pC, pD, pE'arr2 = ["", 'false', 'null', 0, "理工", "综合", 'true', "师范", "双一流", "211", "江苏", "985", "农业", "山东", "河南", "河北", "北京", "辽宁", "陕西", "四川", "广东", "湖北", "湖南", "浙江", "安徽", "江西", 1, "黑龙江", "吉林", "上海", 2, "福建", "山西", "云南", "广西", "贵州", "甘肃", "内蒙古", "重庆", "天津", "新疆", "467", "496", "2025,2024,2023,2022,2021,2020", "林业", "5.8", "533", "2023-01-05T00:00:00+08:00", "23.1", "7.3", "海南", "37.9", "28.0", "4.3", "12.1", 
"16.8", "11.7", "3.7", "4.6", "297", "397", "21.8", "32.2", "16.6", "37.6", "24.6", "13.6", "13.9", "3.3", "5.2", "8.1", "3.9", "5.1", "5.6", "5.4", "2.6", "162", 93.5, 89.4, "宁夏", "青海", "西藏", 7, "11.3", "35.2", "9.5", "35.0", "32.7", "23.7", "33.2", "9.2", "30.6", "8.5", "22.7", "26.3", "8.0", "10.9", "26.0", "3.2", "6.8", "5.7", "13.8", "6.5", "5.5", "5.0", "13.2", "13.3", "15.6", "18.3", "3.0", "21.3", "12.0", "22.8", "3.6", "3.4", "3.5", "95", "109", "117", "129", "138", "147", "159", "185", "191", "193", "196", "213", "232", "237", "240", "267", "275", "301", "309", "314", "318", "332", "334", "339", "341", "354", "365", "371", "378", "384", "388", "403", "416", "418", "420", "423", "430", "438", "444", "449", "452", "457", "461", "465", "474", "477", "485", "487", "491", "501", "508", "513", "518", "522", "528", 83.4, "538", "555", 2021, 11, 14, 10, "12.8", "42.9", "18.8", "36.6", "4.8", "40.0", "37.7", "11.9", "45.2", "31.8", "10.4", "40.3", "11.2", "30.9", "37.8", "16.1", "19.7", "11.1", "23.8", "29.1", "0.2", "24.0", "27.3", "24.9", "39.5", "20.5", "23.4", "9.0", "4.1", "25.6", "12.9", "6.4", "18.0", "24.2", "7.4", "29.7", "26.5", "22.6", "29.9", "28.6", "10.1", "16.2", "19.4", "19.5", "18.6", "27.4", "17.1", "16.0", "27.6", "7.9", "28.7", "19.3", "29.5", "38.2", "8.9", "3.8", "15.7", "13.5", "1.7", "16.9", "33.4", "132.7", "15.2", "8.7", "20.3", "5.3", "0.3", "4.0", "17.4", "2.7", "160", "161", "164", "165", "166", "167", "168", 130.6, 105.5, 2025, "学生、家长、高校管理人员、高教研究人员等", "中国大学排名(主榜)", 25, 13, 12, "全部", "1", "88.0", 5, "2", "36.1", "25.9", "3", "34.3", "4", "35.5", "21.6", "39.2", "5", "10.8", "4.9", "30.4", "6", "46.2", "7", "0.8", "42.1", "8", "32.1", "22.9", "31.3", "9", "43.0", "25.7", "10", "34.5", "10.0", "26.2", "46.5", "11", "47.0", "33.5", "35.8", "25.8", "12", "46.7", "13.7", "31.4", "33.3", "13", "34.8", "42.3", "13.4", "29.4", "14", "30.7", "15", "42.6", "26.7", "16", "12.5", "17", "12.4", "44.5", "44.8", "18", "10.3", "15.8", "19", "32.3", 
"19.2", "20", "21", "28.8", "9.6", "22", "45.0", "23", "30.8", "16.7", "16.3", "24", "25", "32.4", "26", "9.4", "27", "33.7", "18.5", "21.9", "28", "30.2", "31.0", "16.4", "29", "34.4", "41.2", "2.9", "30", "38.4", "6.6", "31", "4.4", "17.0", "32", "26.4", "33", "6.1", "34", "38.8", "17.7", "35", "36", "38.1", "11.5", "14.9", "37", "14.3", "18.9", "38", "13.0", "39", "27.8", "33.8", "3.1", "40", "41", "28.9", "42", "28.5", "38.0", "34.0", "1.5", "43", "15.1", "44", "31.2", "120.0", "14.4", "45", "149.8", "7.5", "46", "47", "38.6", "48", "49", "25.2", "50", "19.8", "51", "5.9", "6.7", "52", "4.2", "53", "1.6", "54", "55", "20.0", "56", "39.8", "18.1", "57", "35.6", "58", "10.5", "14.1", "59", "8.2", "60", "140.8", "12.6", "61", "62", "17.6", "63", "64", "1.1", "65", "20.9", "66", "67", "68", "2.1", "69", "123.9", "27.1", "70", "25.5", "37.4", "71", "72", "73", "74", "75", "76", "27.9", "7.0", "77", "78", "79", "80", "81", "82", "83", "84", "1.4", "85", "86", "87", "88", "89", "90", "91", "92", "93", "109.0", "94", 235.7, "97", "98", "99", "100", "101", "102", "103", "104", "105", "106", "107", "108", 223.8, "111", "112", "113", "114", "115", "116", 215.5, "119", "120", "121", "122", "123", "124", "125", "126", "127", "128", 206.7, "131", "132", "133", "134", "135", "136", "137", 201, "140", "141", "142", "143", "144", "145", "146", 194.6, "149", "150", "151", "152", "153", "154", "155", "156", "157", "158", 183.3, "169", "170", "171", "172", "173", "174", "175", "176", "177", "178", "179", "180", "181", "182", "183", "184", 169.6, "187", "188", "189", "190", 168.1, 167, "195", 165.5, "198", "199", "200", "201", "202", "203", "204", "205", "206", "207", "208", "209", "210", "212", 160.5, "215", "216", "217", "218", "219", "220", "221", "222", "223", "224", "225", "226", "227", "228", "229", "230", "231", 153.3, "234", "235", "236", 150.8, "239", 149.9, "242", "243", "244", "245", "246", "247", "248", "249", "250", "251", "252", "253", "254", "255", "256", "257", 
"258", "259", "260", "261", "262", "263", "264", "265", "266", 139.7, "269", "270", "271", "272", "273", "274", 137, "277", "278", "279", "280", "281", "282", "283", "284", "285", "286", "287", "288", "289", "290", "291", "292", "293", "294", "295", "296", "300", 130.2, "303", "304", "305", "306", "307", "308", 128.4, "311", "312", "313", 125.9, "316", "317", 124.9, "320", "321", "Wuyi University", "322", "323", "324", "325", "326", "327", "328", "329", "330", "331", 120.9, 120.8, "Taizhou University", "336", "337", "338", 119.9, 119.7, "343", "344", "345", "346", "347", "348", "349", "350", "351", "352", "353", 115.4, "356", "357", "358", "359", "360", "361", "362", "363", "364", 112.6, "367", "368", "369", "370", 111, "373", "374", "375", "376", "377", 109.4, "380", "381", "382", "383", 107.6, "386", "387", 107.1, "390", "391", "392", "393", "394", "395", "396", "400", "401", "402", 104.7, "405", "406", "407", "408", "409", "410", "411", "412", "413", "414", "415", 101.2, 101.1, 100.9, "422", 100.3, "425", "426", "427", "428", "429", 99, "432", "433", "434", "435", "436", "437", 97.6, "440", "441", "442", "443", 96.5, "446", "447", "448", 95.8, "451", 95.2, "454", "455", "456", 94.8, "459", "460", 94.3, "463", "464", 93.6, "472", "473", 92.3, "476", 91.7, "479", "480", "481", "482", "483", "484", 90.7, 90.6, "489", "490", 90.2, "493", "494", "495", 89.3, "503", "504", "505", "506", "507", 87.4, "510", "511", "512", 86.8, "515", "516", "517", 86.2, "520", "521", 85.8, "524", "525", "526", "527", 84.6, "530", "531", "532", "537", 82.8, "540", "541", "542", "543", "544", "545", "546", "547", "548", "549", "550", "551", "552", "553", "554", 78.1, "557", "558", "559", "560", "561", "562", "563", "564", "565", "566", "567", "568", "569", "570", "571", "572", "573", "574", "575", "576", "577", "578", "579", "580", "581", "582", 4, "2025-04-15T00:00:00+08:00", "logo\u002Fannual\u002Fbcur\u002F2025.png", 
"软科中国大学排名于2015年首次发布,多年来以专业、客观、透明的优势赢得了高等教育领域内外的广泛关注和认可,已经成为具有重要社会影响力和权威参考价值的中国大学排名领先品牌。软科中国大学排名以服务中国高等教育发展和进步为导向,采用数百项指标变量对中国大学进行全方位、分类别、监测式评价,向学生、家长和全社会提供及时、可靠、丰富的中国高校可比信息。", 2024, 2023, 2022, 15, 2020, 2019, 2018, 2017, 2016, 2015]
    arr2_str = [str(i) for i in arr2]
    dict_map = dict(zip(arr1.split(', '), arr2_str))
    print(f"爬取中国大学排名(主榜)共{len(names)}所学校")
    print(f"{'排名':<6}{'学校':^20}{'省市':>8}{'类型':>8}{'总分':>8}")
    for i in range(len(names)):
        # use dict.get so an unmapped token falls back to the raw value
        rank = dict_map.get(ranks[i], ranks[i])
        name = names[i]
        province = dict_map.get(provinces[i], provinces[i])
        category = dict_map.get(categorys[i], categorys[i])
        score = dict_map.get(scores[i], scores[i])
        # save the record into the database
        db.insertDB(rank, name, province, category, score)
        if int(rank) <= 10 or int(rank) >= 570:
            print(f"{rank:<6}{name:^20}{province:>8}{category:>8}{score:>8}")
        elif int(rank) == 11:
            print("... ...")
    db.closeDB()
    print("所有数据已保存至数据库universities.db,任务完成!")

Midway through, probably because the page was requested repeatedly while debugging, the site blocked the crawler; after adding a Referer header, access worked again. Next time it would be better to save the file locally first.

University anti-crawl response

Run results

University console

F12 debugging analysis GIF

University analysis 4

2) Reflections

The raw data file is not standard JSON, so parsing it directly with json does not work. After trying several crawling approaches, I noticed while inspecting the database that some table cells held odd tokens such as 'aB' and 'jJ'. Only after re-examining the data file's structure did I realize each token stands for some value, and that building a mapping dictionary makes the odd tokens disappear. The takeaway: don't rush; study the raw data file and understand its format and structure before writing the crawler.
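The token-mapping idea above can be shown with a toy version of the two arrays; the real arr1/arr2 in the code hold hundreds of entries, so these four are invented stand-ins.

```python
# Toy versions of the payload.js obfuscation arrays: arr1 holds the
# parameter-name tokens, arr2 the values passed in at the call site.
arr1 = "a, b, c, aB".split(", ")
arr2 = ["", "理工", "综合", "北京"]

# zip them into one lookup table (values stringified as in the real code)
dict_map = dict(zip(arr1, [str(v) for v in arr2]))

# dict.get falls back to the raw token when no mapping exists
category = dict_map.get("aB", "aB")   # mapped token
province = dict_map.get("zz", "zz")   # unknown token stays as-is

print(category, province)
```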

3) Gitee Link

· https://gitee.com/jh2680513769/2025_crawler_project/blob/master/%E4%BD%9C%E4%B8%9A2/3.py

