The last installment on getting historical stock data was a bit complicated, so I’m going to keep it simple this time and provide the code.
How do you get data from an exchange, and how hard is it?
There are several ways to get the data:
Downloading data from an exchange
This is relatively simple. The larger exchanges, such as Binance and OKX, have a "historical data download" section where you pick the data you want; the finest granularity offered is one minute (if you need more detailed data, you can ask the exchange by email). You download the data you need and then organize it yourself. The whole process is point-and-click, simple and convenient. The only drawback is that doing it manually gets a bit tedious.
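The "organize it yourself" step usually just means loading the downloaded files into a DataFrame. A minimal sketch, assuming a Binance-style monthly kline CSV (the file name and column layout below are my assumptions about those dumps, not something from this post):

import pandas as pd

# Hypothetical file from the exchange's historical-data download page;
# Binance's kline dumps have no header row, so the column names are supplied here.
columns = ['open_time', 'open', 'high', 'low', 'close', 'volume',
           'close_time', 'quote_volume', 'trades',
           'taker_buy_base', 'taker_buy_quote', 'ignore']
df = pd.read_csv('BTCUSDT-1m-2023-01.csv', names=columns, header=None)

# Convert the millisecond timestamp to a datetime index and keep only OHLCV
df['open_time'] = pd.to_datetime(df['open_time'], unit='ms')
df = df.set_index('open_time')[['open', 'high', 'low', 'close', 'volume']]
print(df.head())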
Getting data through an API
There are three ways to do this.
Download via the ccxt library
# Install first with: pip install ccxt
import time

import ccxt

exchange = ccxt.binance()
exchange.load_markets()

delay = 2  # seconds between requests, to stay under the rate limit

# Print the current order book for every market on the exchange
for symbol in exchange.markets:
    print(exchange.fetch_order_book(symbol))
    time.sleep(delay)

# Print daily OHLCV (K-line) data for every market
if exchange.has['fetchOHLCV']:
    for symbol in exchange.markets:
        time.sleep(exchange.rateLimit / 1000)  # rateLimit is in milliseconds; time.sleep wants seconds
        print(symbol, exchange.fetch_ohlcv(symbol, '1d'))  # one-day candles
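A single fetch_ohlcv call only returns a limited number of candles, so downloading a longer history means paging with the since parameter. A minimal sketch (the symbol, start date and 1000-candle page size are my own choices for illustration):

import time

import ccxt

exchange = ccxt.binance()
symbol = 'BTC/USDT'
timeframe = '1m'
since = exchange.parse8601('2021-01-01T00:00:00Z')  # start of the history we want
all_candles = []

while True:
    # Each call returns at most `limit` candles starting from `since`
    candles = exchange.fetch_ohlcv(symbol, timeframe, since=since, limit=1000)
    if not candles:
        break
    all_candles += candles
    # Continue from just after the last candle received (timestamps are in ms)
    since = candles[-1][0] + 1
    time.sleep(exchange.rateLimit / 1000)

print(len(all_candles), 'candles downloaded')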
For live trading, you can run ccxt to fetch the K-line history and have it saved to a database automatically:

# Required imports; str_to_timestamp, get_symbol and get_data_binance_sql are
# helper functions assumed to be defined elsewhere in the original project.
import re
import time
import traceback
from datetime import datetime, timedelta

import ccxt
import pandas as pd
import pymysql
from sqlalchemy import create_engine

exchange = ccxt.binance()
exchange_str = 'binance'
time_interval: str = '1m'
db_name = 'digicc'
mysql_user = 'digiccy'
mysql_passwd = 'digiccy'
host_name = 'xxx.com'
host_port = 3306
start_time_str = '2017-02-01 00:00:00.000'
start_time = str_to_timestamp(start_time_str)
symbols = get_symbol(exchange_str)

# One table per exchange/symbol pair; 'templates' is replaced with the real table name
create_table_templates = """
create table templates(
    id int(8) AUTO_INCREMENT,
    open_time_GMT8 Datetime,
    open double,
    high double,
    low double,
    close double,
    volume double,
    PRIMARY KEY (id)
)
"""

engine_str = f'mysql+pymysql://{mysql_user}:{mysql_passwd}@{host_name}:{host_port}/{db_name}'
engine = create_engine(engine_str)
db = pymysql.connect(host=host_name, port=host_port,
                     user=mysql_user, passwd=mysql_passwd,
                     db=db_name, charset="utf8")
con = db.cursor()
con.execute('use ' + db_name)


def table_exists(con, table_name):
    sql = "show tables;"
    con.execute(sql)  # run the SQL statement
    tables = [con.fetchall()]  # fetchall returns a tuple of tuples
    table_list = re.findall('(\'.*?\')', str(tables))
    table_list = [re.sub("'", '', each) for each in table_list]
    if table_name in table_list:
        return 1
    else:
        return 0


if __name__ == "__main__":
    while True:
        sql_insert_time_templates = """ insert into templates(open_time_GMT8) values(date_sub(now(), interval 3 hour)) """
        engine_str = f'mysql+pymysql://{mysql_user}:{mysql_passwd}@{host_name}:{host_port}/{db_name}'
        engine = create_engine(engine_str)
        db = pymysql.connect(host=host_name, port=host_port,
                             user=mysql_user, passwd=mysql_passwd,
                             db=db_name, charset="utf8")
        con = db.cursor()
        con.execute('use ' + db_name)
        for symbol in symbols[:]:
            symbol_str = symbol.replace('/', '_')
            symbol_str = symbol_str.replace('-', '_')
            exchange_symbol = exchange_str + "_" + symbol_str
            sql = create_table_templates.replace('templates', exchange_symbol)
            if table_exists(con, exchange_symbol) != 1:
                # First run for this symbol: create the table and write the initial rows
                con.execute(sql)
                day_kline_data = exchange.fetch_ohlcv(symbol, timeframe='1d', limit=1000, since=start_time)
                day_kline_data_df = pd.DataFrame(day_kline_data)
                day_kline_data_df.rename(columns={0: 'MTS', 1: 'open', 2: 'high', 3: 'low', 4: 'close', 5: 'volume'},
                                         inplace=True)
                day_kline_data_df['open_time'] = pd.to_datetime(day_kline_data_df['MTS'], unit='ms')
                day_kline_data_df['open_time_GMT8'] = day_kline_data_df['open_time'] + timedelta(hours=8)
                day_kline_data_df = day_kline_data_df[['open_time_GMT8', 'open', 'high', 'low', 'close', 'volume']]
                day_kline_data_df = day_kline_data_df.iloc[1:2]
                day_kline_data_df.to_sql(name=exchange_symbol, con=engine, if_exists='append', index=False)
                print(exchange_symbol, 'initial data written successfully!')
                db.commit()
            else:
                try:
                    # Find the newest candle already stored, then fetch everything after it
                    sql_time_templates = "select * from templates order by open_time_GMT8 desc limit 0,1;"
                    sql_time = sql_time_templates.replace('templates', exchange_symbol)
                    df_sql = pd.read_sql(sql_time, db, parse_dates=['open_time_GMT8'])
                    sql_last_time = ' '.join(str(i) for i in df_sql['open_time_GMT8'])
                    if any(sql_last_time) is False:
                        # The table exists but is empty: drop it so it is rebuilt next round
                        sql_del_table_templates = "drop table templates;"
                        sql_del_table = sql_del_table_templates.replace('templates', exchange_symbol)
                        con.execute(sql_del_table)
                        continue
                    sql_last_time = sql_last_time + '.000'
                    get_data_binance_sql(symbol, exchange_str, exchange, engine, sql_last_time)
                except Exception as e:
                    traceback.print_exc()
                    print('Exception type:', e.__class__.__name__)
                    print('Exception detail:', e)
        print('Current time', datetime.now(), '- sleeping for 8 hours before the next pass')
        con.close()
        db.close()
        time.sleep(8 * 3600)
Download via the exchange API
See the API documentation for each exchange; they are all fairly similar. It is recommended to use the REST API to fetch historical data, and a WebSocket subscription for live data during trading, so that you make the best use of both the exchange's servers and your own.
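As a rough illustration of the WebSocket side, here is a minimal sketch; the Binance public kline stream URL and the third-party websockets package are my assumptions, not something specified in this post:

# pip install websockets
import asyncio
import json

import websockets

# Assumed Binance public market-data stream: 1-minute klines for BTC/USDT
STREAM_URL = 'wss://stream.binance.com:9443/ws/btcusdt@kline_1m'

async def watch_klines():
    async with websockets.connect(STREAM_URL) as ws:
        while True:
            msg = json.loads(await ws.recv())
            k = msg['k']  # the kline payload of the message
            print(k['t'], k['o'], k['h'], k['l'], k['c'], k['v'])

asyncio.run(watch_klines())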
Get data by writing your own request code
These days most people have moved to downloading data inside their own framework. The advantage is that it is highly customizable: do it however you like, save to files if you want files, put it in a database if you want a database. The drawback is the data cleaning and organization; you have to be clear about who is responsible for validation, otherwise you end up with the famous saying: garbage in, garbage out.
Below is a simple set of download code:
from urllib.request import urlopen, Request
import json

import pandas as pd

pd.set_option('expand_frame_repr', False)  # print DataFrames without wrapping

symbol = 'ltc_usdt'  # the pair we are interested in (not used in this snippet)

# Fetch the list of spot instruments from OKEx's (v3) REST API
url = 'https://www.okex.com/api/spot/v3/instruments'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
request = Request(url=url, headers=headers)
content = urlopen(request, timeout=15).read()
content = content.decode('utf-8')
print(content)

# Parse the JSON response into a DataFrame
json_data = json.loads(content)
df = pd.DataFrame(json_data, dtype='float')
print(df)
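To pull historical candles for the symbol rather than the instrument list, the request is built the same way. The candles route below is my assumption about OKEx's old v3 spot API (which has since been superseded), so check the current documentation before relying on it:

from urllib.request import urlopen, Request
import json

import pandas as pd

# Assumed v3 endpoint: /api/spot/v3/instruments/<instrument_id>/candles
# granularity is in seconds (60 = 1-minute candles)
instrument_id = 'LTC-USDT'
url = f'https://www.okex.com/api/spot/v3/instruments/{instrument_id}/candles?granularity=60'
headers = {'User-Agent': 'Mozilla/5.0'}

content = urlopen(Request(url=url, headers=headers), timeout=15).read().decode('utf-8')
candles = json.loads(content)

# Each row is assumed to be [time, open, high, low, close, volume]
df = pd.DataFrame(candles, columns=['time', 'open', 'high', 'low', 'close', 'volume'])
print(df.head())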
Buying commercial data
Paying for data is the fastest and most costly way to get accurate data: fast and timely, as long as the money is there. This method is recommended for big players who are not short of money.
Pitfalls encountered in acquiring data
Generally you will encounter the following problems:
Data storage
For storage you can keep local files (CSV, DataFrame dumps and other formats) or use a database; the usual choices are MySQL and MongoDB, and if you need more speed, a time-series database. My recommendation: for medium-frequency data, MySQL or MongoDB is more than enough. Local files are always a little inconvenient once you want to use the data from another machine.
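For example, a minimal sketch of writing OHLCV rows into MongoDB with pymongo; the database and collection names, the local mongod on the default port, and the example candle values are all my own placeholders:

# pip install pymongo
from pymongo import MongoClient

client = MongoClient('mongodb://localhost:27017/')   # assumed local MongoDB instance
collection = client['quant']['binance_BTC_USDT_1m']  # hypothetical db / collection names

# Example candles in the ccxt layout: [timestamp_ms, open, high, low, close, volume]
candles = [
    [1609459200000, 28923.63, 28961.66, 28913.12, 28961.66, 27.45],
    [1609459260000, 28961.67, 29017.50, 28961.01, 29009.91, 58.47],
]
docs = [
    {'ts': c[0], 'open': c[1], 'high': c[2], 'low': c[3], 'close': c[4], 'volume': c[5]}
    for c in candles
]
collection.insert_many(docs)

# A unique index on the timestamp prevents duplicate candles on re-runs
collection.create_index('ts', unique=True)
print(collection.count_documents({}))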
Data cleaning. This topic is a big pit that deserves its own discussion; here are just a few points:
1. Missing data, zero values, and discrepancies between sources often decide the quality of your backtest data. For example, the data you get from the web/REST API is not identical to the data you get over a WebSocket; how do you reconcile the two? Or there are clearly quotes, yet the returned data contains null values, and if you don't pay attention you have no idea how they affect the backtest. (A minimal gap and zero-value check is sketched after these points.)
2. Your server suddenly goes down and there is no backup procedure to guarantee the integrity of the data.
3. Digital currencies are relatively easy here; stocks and options also face inconsistent trading hours and the question of forward or backward price adjustment. All of this has to be considered when designing the table structure, before you start collecting data.
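A minimal sketch of the kind of sanity check meant in point 1, assuming a one-minute kline DataFrame with a datetime index (the column names and the forward-fill policy are my assumptions):

import pandas as pd

def check_minute_klines(df: pd.DataFrame) -> pd.DataFrame:
    """Flag gaps and suspicious zero values in a 1-minute OHLCV DataFrame."""
    # Reindex onto a complete minute grid so missing candles become explicit NaN rows
    full_index = pd.date_range(df.index.min(), df.index.max(), freq='1min')
    df = df.reindex(full_index)

    missing = df['close'].isna().sum()
    zero_prices = (df[['open', 'high', 'low', 'close']] == 0).any(axis=1).sum()
    print(f'missing candles: {missing}, candles with zero prices: {zero_prices}')

    # One common (but debatable) choice: forward-fill prices, treat missing volume as 0
    df[['open', 'high', 'low', 'close']] = df[['open', 'high', 'low', 'close']].ffill()
    df['volume'] = df['volume'].fillna(0)
    return df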
Summary
The most important thing about data is not that it is big and comprehensive, nor that it is easy to use, but that its quality is high. Only on a solid foundation can you trust your data and go on to develop your next ideas. Otherwise, a good idea can backtest badly purely because of poor data quality, and that is enough to make a mess of everything.