Feature description

Every day at 10:30, fetch a web page from the site, extract the required data, and save it locally.

Network requests use urllib.request, page parsing uses BeautifulSoup, and scheduling uses the crontab command.

crontab

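cron is a daemon (a process that runs continuously in the background) that executes commands according to a schedule; crontab is the command used to view and edit that schedule.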
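Note that cron does not run jobs while a Mac is asleep: a job whose scheduled time falls during sleep is skipped, so the machine must be awake at that time.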

The entry format is:

* * * * * command    // minute hour day-of-month month day-of-week command
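
To make the five time fields concrete, here is a minimal Python sketch that splits a crontab entry into named parts. The helper name `split_cron_line` and the script name `fetch.py` are hypothetical and used only for illustration; cron does its own parsing, and this is not how cron is implemented.

```python
from typing import Dict

# Field names in the order they appear in a crontab entry.
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def split_cron_line(line: str) -> Dict[str, str]:
    """Split a crontab entry into its five time fields and the command."""
    # Split on whitespace at most five times so the command keeps its spaces.
    parts = line.split(None, 5)
    if len(parts) < 6:
        raise ValueError("expected five time fields followed by a command")
    entry = dict(zip(CRON_FIELDS, parts[:5]))
    entry["command"] = parts[5]
    return entry


# "30 10 * * *" means minute 30 of hour 10, every day of every month.
print(split_cron_line("30 10 * * * python3 fetch.py"))
```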

Basic operations:

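crontab -e  # edit the current user's crontab
crontab -l  # list the current user's cron entries
crontab -r  # remove the current user's crontab (all entries)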

Example:

$ crontab -e
0 8 * * * say hello
# Speak "hello" every day at 08:00

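Note:
cron runs jobs with a minimal environment, so the python command must be given as the full path to the interpreter. Alternatively, set PATH in the crontab.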
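
As a rough illustration of the note above, the following Python sketch prints a crontab entry for a daily 10:30 run that uses the absolute path of the interpreter currently running it. The function name `build_cron_entry` and the script name `fetch.py` are assumptions made for this example. The printed line would still have to be pasted into `crontab -e` by hand.

```python
import os
import shlex
import sys


def build_cron_entry(script_path: str, minute: int = 30, hour: int = 10) -> str:
    """Return a crontab line that runs script_path daily at hour:minute."""
    # sys.executable is the absolute path of the running Python interpreter.
    interpreter = sys.executable
    # Resolve the script to an absolute path so it does not depend on cron's
    # working directory.
    script = os.path.abspath(script_path)
    return "%d %d * * * %s %s" % (
        minute,
        hour,
        shlex.quote(interpreter),
        shlex.quote(script),
    )


# "fetch.py" is a placeholder for the scraping script.
print(build_cron_entry("fetch.py"))
```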

Code

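import urllib.request
from datetime import date

from bs4 import BeautifulSoup

url = "http://www.bjjs.gov.cn/bjjs/fwgl/fdcjy/fwjy/index.shtml"

# Download the page and close the connection when done.
with urllib.request.urlopen(url) as response:
    content = response.read()

# The "html5lib" parser is a separate package and must be installed.
soup = BeautifulSoup(content, "html5lib")
tables = soup.find_all('table')

def tablesOfHome(index):
    """Return the stripped text of every cell in the table at the given index."""
    arr = list()
    tab = tables[index]
    for tr in tab.find_all('tr'):
        for td in tr.find_all('td'):
            arr.append(td.get_text().strip())
    return arr

# The table indices (14, 15, 17) and the cell index (6) depend on the page layout.
onlineHome = tablesOfHome(14)  # residential units available for sale
newHome = tablesOfHome(15)     # newly listed residential units
dealHome = tablesOfHome(17)    # residential units signed (contracted)

# Append one line per run: date, available, newly listed, signed.
with open('temp.txt', 'a') as f:
    print('%s  %-10d  %-7d  %-8d' % (date.today(), int(onlineHome[6]), int(newHome[6]), int(dealHome[6])), file=f)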
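
One thing to watch when the script runs under cron: the job typically starts in the user's home directory, so a relative name such as 'temp.txt' ends up there rather than next to the script. The sketch below separates formatting from writing and takes an explicit output path. The helper names `format_record` and `append_record` are hypothetical, and the demo writes to a temporary directory with made-up numbers rather than real data.

```python
import os
import tempfile
from datetime import date


def format_record(day: date, online: int, new: int, deal: int) -> str:
    """Format one output line using the same layout as the script above."""
    return "%s  %-10d  %-7d  %-8d" % (day, online, new, deal)


def append_record(path: str, line: str) -> None:
    """Append a single line to the file at path, creating it if needed."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(line + "\n")


# Demo with placeholder values; a real job would pass an absolute path
# of its own choosing instead of a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    output_path = os.path.join(tmp_dir, "temp.txt")
    append_record(output_path, format_record(date(2020, 1, 2), 100, 5, 7))
    with open(output_path, encoding="utf-8") as f:
        print(f.read())
```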