inside the tag, so it can be scraped out as well:
comment = selector.xpath('//*[@id="r_content"]/div[1]/div/article/div/p/text()')
Good, the trickiest part is done.
3. JSON Request URLs
Comparing the request URLs of the first few pages side by side makes the pattern obvious:
https://fe-api.zhaopin.com/c/i/sou?pageSize=60&cityId=489&workExperience=-1&education=-1&companyType=-1&employmentType=-1&jobWelfareTag=-1&kw=%E5%A4%A7%E6%95%B0%E6%8D%AE&kt=3&_v=0.14571817&x-zp-page-request-id=ce8cbb93b9ad4372b4a9e3330358fe7c-1541763191318-555474
https://fe-api.zhaopin.com/c/i/sou?start=60&pageSize=60&cityId=489&workExperience=-1&education=-1&companyType=-1&employmentType=-1&jobWelfareTag=-1&kw=%E5%A4%A7%E6%95%B0%E6%8D%AE&kt=3&_v=0.14571817&x-zp-page-request-id=ce8cbb93b9ad4372b4a9e3330358fe7c-1541763191318-555474
https://fe-api.zhaopin.com/c/i/sou?start=120&pageSize=60&cityId=489&workExperience=-1&education=-1&companyType=-1&employmentType=-1&jobWelfareTag=-1&kw=%E5%A4%A7%E6%95%B0%E6%8D%AE&kt=3&_v=0.14571817&x-zp-page-request-id=ce8cbb93b9ad4372b4a9e3330358fe7c-1541763191318-555474
https://fe-api.zhaopin.com/c/i/sou?start=180&pageSize=60&cityId=489&workExperience=-1&education=-1&companyType=-1&employmentType=-1&jobWelfareTag=-1&kw=%E5%A4%A7%E6%95%B0%E6%8D%AE&kt=3&_v=0.14571817&x-zp-page-request-id=ce8cbb93b9ad4372b4a9e3330358fe7c-1541763191318-555474
1. The URL structure of the first page is clearly different from that of the later pages.
2. From the second page on, the URLs follow an obvious pattern: a start parameter that grows by 60 (the pageSize) per page.
3. The string in 'kw=*&kt' is the UTF-8 percent-encoding of '大数据' ("big data").
So we handle the request URLs as follows:
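Observation 3 is easy to verify with the standard library: `urllib.parse.quote` produces exactly the percent-encoded bytes that appear in the `kw` parameter.

```python
from urllib.parse import quote, unquote

# UTF-8 percent-encode the search keyword, as the browser does for kw=
encoded = quote('大数据')
print(encoded)           # %E5%A4%A7%E6%95%B0%E6%8D%AE
print(unquote(encoded))  # 大数据
```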
if __name__ == '__main__':
    key = '大数据'
    url = 'https://fe-api.zhaopin.com/c/i/sou?pageSize=60&cityId=489&workExperience=-1&education=-1&companyType=-1&employmentType=-1&jobWelfareTag=-1&kw=' + key + '&kt=3&lastUrlQuery=%7B%22pageSize%22:%2260%22,%22jl%22:%22489%22,%22kw%22:%22%E5%A4%A7%E6%95%B0%E6%8D%AE%22,%22kt%22:%223%22%7D'
    infoUrl(url)
    urls = ['https://fe-api.zhaopin.com/c/i/sou?start={}&pageSize=60&cityId=489&kw='.format(i*60)+key+'&kt=3&lastUrlQuery=%7B%22p%22:{},%22pageSize%22:%2260%22,%22jl%22:%22489%22,%22kw%22:%22java%22,%22kt%22:%223%22%7D'.format(i) for i in range(1,50)]
    for url in urls:
        infoUrl(url)
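The start = 60 × page-index pattern can also be sketched with a small URL builder. This is a simplified illustration, not the crawler's code: it keeps only the parameters the pattern depends on (pageSize, cityId, kw, kt, start) and drops the -1 filter fields and lastUrlQuery.

```python
from urllib.parse import urlencode

def page_url(key, page, page_size=60, city_id=489):
    """Build a search-API URL; page 0 is the first page, which omits start."""
    params = {'pageSize': page_size, 'cityId': city_id, 'kw': key, 'kt': 3}
    if page > 0:
        params['start'] = page * page_size  # 60, 120, 180, ...
    return 'https://fe-api.zhaopin.com/c/i/sou?' + urlencode(params)

print(page_url('大数据', 0))  # no start parameter
print(page_url('大数据', 2))  # ...&start=120
```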
4. Source Code Structure
1. Fetch the JSON payload for the whole results page and extract the URL of each job posting from it.
2. Open each job's detail page and extract the mobile-site URL.
3. Open the mobile page and scrape the fields we need.
5. Source Code
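Step 1 relies on the response having a `code` field and a `data.results` list whose items carry a `positionURL`. A minimal offline sketch of that extraction, using a made-up sample payload trimmed to just the fields the crawler touches (the real response has many more):

```python
import json

# Hypothetical sample of the search-API response; only the fields used below.
sample = '''
{"code": 200,
 "data": {"results": [
     {"positionURL": "https://jobs.zhaopin.com/CC000001.htm"},
     {"positionURL": "https://jobs.zhaopin.com/CC000002.htm"}
 ]}}
'''

payload = json.loads(sample)
hrefs = []
if payload['code'] == 200:
    # Same access pattern as infoUrl() in the source below
    hrefs = [item['positionURL'] for item in payload['data']['results']]
print(hrefs)
```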
'''
Zhaopin job-listings crawler — source code — 2018.11
'''
import requests
import re
import time
from lxml import etree
import csv
import random

fp = open('智联招聘.csv', 'wt', newline='', encoding='UTF-8')
writer = csv.writer(fp)
# CSV columns: position, company, location, education, job description, salary, welfare, work experience, link
writer.writerow(('职位','公司','地区','学历','岗位','薪资','福利','工作经验','链接'))
def info(url):
    res = requests.get(url)
    # NOTE: the regex pattern was lost when this post was published; it should
    # capture the protocol-relative link ('//...') to the mobile version of the page.
    u = re.findall('', res.text)
    if len(u) > 0:
        u = u[-1]
    else:
        return
    u = 'http:' + u
    headers = {
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36'
    }
    res = requests.get(u, headers=headers)
    selector = etree.HTML(res.text)
    # job title
    title = selector.xpath('//*[@id="r_content"]/div[1]/div/div[1]/div[1]/h1/text()')
    # salary
    pay = selector.xpath('//*[@id="r_content"]/div[1]/div/div[1]/div[1]/div[1]/text()')
    # work location
    place = selector.xpath('//*[@id="r_content"]/div[1]/div/div[1]/div[3]/div[1]/span[1]/text()')
    # company name
    companyName = selector.xpath('//*[@id="r_content"]/div[1]/div/div[1]/div[2]/text()')
    # education requirement
    edu = selector.xpath('//*[@id="r_content"]/div[1]/div/div[1]/div[3]/div[1]/span[3]/text()')
    # welfare / benefits
    welfare = selector.xpath('//*[@id="r_content"]/div[1]/div/div[3]/span/text()')
    # work experience
    siteUrl = res.url
    workEx = selector.xpath('//*[@id="r_content"]/div[1]/div/div[1]/div[3]/div[1]/span[2]/text()')
    # job description
    comment = selector.xpath('//*[@id="r_content"]/div[1]/div/article/div/p/text()')
    writer.writerow((title, companyName, place, edu, comment, pay, welfare, workEx, siteUrl))
    print(title, companyName, place, edu, comment, pay, welfare, workEx, siteUrl)
def infoUrl(url):
    res = requests.get(url)
    selector = res.json()
    code = selector['code']
    if code == 200:
        data = selector['data']['results']
        for i in data:
            href = i['positionURL']
            info(href)
            time.sleep(random.randrange(1, 4))
if __name__ == '__main__':
    key = '大数据'
    url = 'https://fe-api.zhaopin.com/c/i/sou?pageSize=60&cityId=489&workExperience=-1&education=-1&companyType=-1&employmentType=-1&jobWelfareTag=-1&kw=' + key + '&kt=3&lastUrlQuery=%7B%22pageSize%22:%2260%22,%22jl%22:%22489%22,%22kw%22:%22%E5%A4%A7%E6%95%B0%E6%8D%AE%22,%22kt%22:%223%22%7D'
    infoUrl(url)
    urls = ['https://fe-api.zhaopin.com/c/i/sou?start={}&pageSize=60&cityId=489&kw='.format(i*60)+key+'&kt=3&lastUrlQuery=%7B%22p%22:{},%22pageSize%22:%2260%22,%22jl%22:%22489%22,%22kw%22:%22java%22,%22kt%22:%223%22%7D'.format(i) for i in range(1,50)]
    for url in urls:
        infoUrl(url)
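One caveat: `selector.xpath('...text()')` returns a list (usually empty or single-element), so the CSV rows above contain Python list reprs rather than plain strings. A small helper, not part of the original source, could flatten each field before writing:

```python
def first(items, default=''):
    """Return the first xpath text result, stripped, or a default if empty."""
    return items[0].strip() if items else default

print(first(['  Java开发工程师 ']))  # Java开发工程师
print(first([]))                     # empty string
```

Each field would then be written as e.g. `first(title)` instead of `title`.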
P.S. For various reasons, I plan to crawl the job listings on Zhaopin and 51job once a month; the source code and optimizations will be written up as blog posts. Feel free to follow: https://www.cnblogs.com/magicxyx/p/9937244.html
