The last installment on getting historical stock data was a bit complicated, so I’m going to keep it simple this time and provide the code.

How to get data from exchanges and how hard it is to do so.

There are several ways to get the data:

Downloading data from an exchange

This is relatively simple, go to the larger exchanges, such as biance okx these will have a section called historical data download, want to have what data, the most granularity is a minute. The finest granularity is one minute. If you want to send an email to get more detailed data, you can download the data you need and then organize it yourself, the whole process is visualized, simple and convenient. The only drawback is that it’s a bit annoying if you do it manually.

Get data through api.

There are three ways to do this.

  1. Download via the library ccxt

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25

    pip install ccxt


    import ccxt.async_support as ccxt


    binance = ccxt.binance()


    binance.load_markets()

    import time
    delay = 2 # seconds
    for symbol in exchange.markets:
    print (exchange.fetch_order_book (symbol))
    time.sleep (delay) # rate limit


    import time
    if exchange.has['fetchOHLCV']:
    for symbol in exchange.markets:
    time.sleep (exchange.rateLimit / 1000) # time.sleep wants seconds
    print (symbol, exchange.fetch_ohlcv (symbol, '1d')) # one day

    Real trading can run ccxt to get the history of the K line code automatically saved to the database

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    31
    32
    33
    34
    35
    36
    37
    38
    39
    40
    41
    42
    43
    44
    45
    46
    47
    48
    49
    50
    51
    52
    53
    54
    55
    56
    57
    58
    59
    60
    61
    62
    63
    64
    65
    66
    67
    68
    69
    70
    71
    72
    73
    74
    75
    76
    77
    78
    79
    80
    81
    82
    83
    84
    85
    86
    87
    88
    89
    90
    91
    92
    93
    94
    95
    96
    97
    98
    99
    100
    101
    102
    103
    104
    105
    106
    107
    108
    109
    110
    111
    112
    113
    114
    115
    116
    117
    118
    119
    120
    121
    122
    123
    124
    125

    exchange = ccxt.binance()
    exchange_str = 'binance'

    time_interval: str = '1m'
    db_name = 'digicc'
    mysql_user = 'digiccy'
    mysql_passwd = 'digiccy'
    host_name = 'xxx.com'
    host_port = 3306
    start_time_str = '2017-02-01 00:00:00.000'

    start_time = str_to_timestamp(start_time_str)
    symbols = get_symbol(exchange_str)


    create_table_templates = """
    create table templates(
    id int(8) AUTO_INCREMENT,
    open_time_GMT8 Datetime,
    open double,
    high double,
    low double,
    close double,
    volume double,
    PRIMARY KEY (id)
    )
    """


    engine_str = f'mysql+pymysql://{mysql_user}:{mysql_passwd}@{host_name}:{host_port}/{db_name}'
    engine = create_engine(engine_str)


    db = pymysql.connect(host=host_name, port=host_post,
    user=mysql_user, passwd=mysql_passwd,
    db=db_name, charset="utf8")

    con = db.cursor()

    con.execute('use'+' ' + db_name)



    def table_exists(con, table_name):
    sql = "show tables;"
    con.execute(sql) # 执行sql语句
    tables = [con.fetchall()] # fetchall 返回值是多个元祖
    table_list = re.findall('(\'.*?\')', str(tables))
    table_list = [re.sub("'", '', each) for each in table_list]
    if table_name in table_list:
    return 1
    else:
    return 0

    if __name__ == "__main__":
    while True:
    sql_insert_time_templates = """ insert into templates(open_time_GMT8) values(date_sub(now(), interval 3 hour)) """
    engine_str = f'mysql+pymysql://{mysql_user}:{mysql_passwd}@{host_name}:{host_port}/{db_name}'
    engine = create_engine(engine_str)

    db = pymysql.connect(host=host_name, port=host_post,
    user=mysql_user, passwd=mysql_passwd,
    db=db_name, charset="utf8")

    con = db.cursor()

    con.execute('use'+' ' + db_name)

    for symbol in symbols[:]:

    symbol_str = symbol.replace('/', '_')
    symbol_str = symbol_str.replace('-', '_')
    exchange_symbol = exchange_str + "_" + symbol_str
    sql = create_table_templates.replace('templates', exchange_symbol)


    if (table_exists(con, exchange_symbol) != 1):
    con.execute(sql)

    day_kline_data = exchange.fetch_ohlcv(symbol, timeframe='1d', limit=1000, since=start_time)
    day_kline_data_df = pd.DataFrame(day_kline_data)
    day_kline_data_df.rename(columns={0: 'MTS', 1: 'open', 2: 'high', 3: 'low', 4: 'close', 5: 'volume'},
    inplace=True)
    day_kline_data_df['open_time'] = pd.to_datetime(day_kline_data_df['MTS'], unit='ms')
    day_kline_data_df['open_time_GMT8'] = day_kline_data_df['open_time'] + timedelta(hours=8)
    day_kline_data_df = day_kline_data_df[['open_time_GMT8', 'open', 'high', 'low', 'close', 'volume']]
    day_kline_data_df = day_kline_data_df.iloc[1:2]
    # print(exchange_symbol)
    day_kline_data_df.to_sql(name=exchange_symbol, con=engine, if_exists='append', index=False)
    # print(day_kline_data_df.iloc[1:2])
    print(exchange_symbol, '初始数据成功写入!')
    db.commit()
    # exit()
    else:
    try:
    sql_time_templates = "select * from templates order by open_time_GMT8 desc limit 0,1;"
    sql_time = sql_time_templates.replace('templates', exchange_symbol)


    # df_sql = pd.read_sql(sql_time, db, index_col="id", parse_dates=['open_time_GMT8'])
    df_sql = pd.read_sql(sql_time, db, parse_dates=['open_time_GMT8'])

    sql_last_time = ' '.join(str(i) for i in df_sql['open_time_GMT8'])

    if any(sql_last_time) is False:
    sql_del_table_templates = "drop table templastes;"
    sql_del_table = sql_del_table_templates.replace('templates', 'exchange_symbol')
    con.execute(sql_del_table)
    pass
    sql_last_time = sql_last_time + '.000'
    # get_data_binance_sql(symbol, exchange_str, engine, sql_last_time)
    get_data_binance_sql(symbol, exchange_str, exchange, engine, sql_last_time)

    except Exception as e:
    traceback.print_exc()
    print('错误类型是', e.__class__.__name__)
    print('错误明细是', e)
    print('现在时间', datetime.now(), '休息8小时后继续')

    con.close()

    db.close()
    time.sleep(8*3600)

  2. Download via exchange api

    1
    ## See the api documentation for each exchange, which is similar to each other. It is recommended to use rest api to get historical data. During the trading process, we use webstock subscription to get the data, to maximize the resources of the server and yourself.
  3. 通过自己编码请求获取数据

    现在大部分都转向自有框架下载数据了吧,好处是可以高度自有话,想怎么搞就怎么搞,方便随心,想存文件就存文件。想放数据库 就放数据库。缺点就是数据的清洗整理。校验的责任必须要搞明白,否则,就成了那句名言: 进垃圾,出垃圾。

    下面,我提供一套简单的下载代码

  4. 1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    from urllib.request import urlopen, Request
    import json
    import pandas as pd
    pd.set_option('expand_frame_repr', False)
    from datetime import datetime, timedelta
    symbol = 'ltc_usdt'


    url = 'https://www.okex.com/api/spot/v3/instruments'


    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
    request = Request(url=url, headers=headers)
    content = urlopen(request, timeout=15).read()
    content = content.decode('utf-8')
    print(content)

    json_data = json.loads(content)
    df = pd.DataFrame(json_data, dtype='float')
    print(df)

Buy commercialized data

Kryptonite is the fastest and most costly way to get accurate data, fast and timely, as long as the money is available, this method is recommended for big guys who don’t need to spend a lot of money.

Pitfalls encountered in acquiring data

Generally you will encounter the following problems:

  1. Data storage

    Data storage options generally have local csv, df and other data formats, databases and other options have MySQL, mongogdb faster can also be used in the time series database, I recommend that the intermediate frequency data MySQL and mongodb databases will be enough to use, local data in the use of other terminals is always a little inconvenient!

  2. data cleaning This topic is a big pit to expand slowly say a few points

    The missing data, 0 values, differences in the handling of the data is often related to the quality of your backtest data, for example, you get the data from the web page api is not the same as the data you get through the webstock data. How do you handle the difference between the two data. Obviously there are quotes but there are null values in the returned data. You don’t know how this data affects backtesting if you don’t pay attention. Suddenly your server hangs, there is no backup program to ensure the integrity of the data, etc. 3.

  3. digital currency is fine, if it is a stock or options data also face the opening time is not uniform. The issue of pre or post compounding. These are all things that need to be considered when designing the data table structure before data acquisition.

Summary

The most important thing about data is not that it is large and comprehensive, nor is it that it is easy to use, but that the quality of the data should be high Only by laying a firm foundation can you trust your data. And also to further develop your next ideas. Or else. A good idea because of the quality of your data, back to test a bad result, that is not also enough to make a mess.