文字列連結の効率の話

追記：
ベンチ追加した

ふと思ったこと。

ret = ''.join([lst])

が速いというのは正しい。
でもそれはjoin関数が速いという局所的な話。

ありがちなコード

lst = []
for i in xrange(10000):
  lst.append("aaa")
ret = ''.join(lst)

string joinを使いたいがためにlistをpythonのforで回したりして作成する。

string joinの中身

static PyObject *
string_join(PyStringObject *self, PyObject *orig)
{
    char *sep = PyString_AS_STRING(self);
    const Py_ssize_t seplen = PyString_GET_SIZE(self);
    PyObject *res = NULL;
    char *p;
    Py_ssize_t seqlen = 0;
    size_t sz = 0;
    Py_ssize_t i;
    PyObject *seq, *item;

    seq = PySequence_Fast(orig, "");
    if (seq == NULL) {
        return NULL;
    }

    seqlen = PySequence_Size(seq);
    if (seqlen == 0) {
        Py_DECREF(seq);
        return PyString_FromString("");
    }
    if (seqlen == 1) {
        item = PySequence_Fast_GET_ITEM(seq, 0);
        if (PyString_CheckExact(item) || PyUnicode_CheckExact(item)) {
            Py_INCREF(item);
            Py_DECREF(seq);
            return item;
        }
    }

    /* There are at least two things to join, or else we have a subclass
     * of the builtin types in the sequence.
     * Do a pre-pass to figure out the total amount of space we'll
     * need (sz), see whether any argument is absurd, and defer to
     * the Unicode join if appropriate.
     */
    for (i = 0; i < seqlen; i++) {
        const size_t old_sz = sz;
        item = PySequence_Fast_GET_ITEM(seq, i);
        if (!PyString_Check(item)){
#ifdef Py_USING_UNICODE
            if (PyUnicode_Check(item)) {
                /* Defer to Unicode join.
                 * CAUTION:  There's no gurantee that the
                 * original sequence can be iterated over
                 * again, so we must pass seq here.
                 */
                PyObject *result;
                result = PyUnicode_Join((PyObject *)self, seq);
                Py_DECREF(seq);
                return result;
            }
#endif
            PyErr_Format(PyExc_TypeError,
                     "sequence item %zd: expected string,"
                     " %.80s found",
                     i, Py_TYPE(item)->tp_name);
            Py_DECREF(seq);
            return NULL;
        }
        sz += PyString_GET_SIZE(item);
        if (i != 0)
            sz += seplen;
        if (sz < old_sz || sz > PY_SSIZE_T_MAX) {
            PyErr_SetString(PyExc_OverflowError,
                "join() result is too long for a Python string");
            Py_DECREF(seq);
            return NULL;
        }
    }

    /* Allocate result space. */
    res = PyString_FromStringAndSize((char*)NULL, sz);
    if (res == NULL) {
        Py_DECREF(seq);
        return NULL;
    }

    /* Catenate everything. */
    p = PyString_AS_STRING(res);
    for (i = 0; i < seqlen; ++i) {
        size_t n;
        item = PySequence_Fast_GET_ITEM(seq, i);
        n = PyString_GET_SIZE(item);
        Py_MEMCPY(p, PyString_AS_STRING(item), n);
        p += n;
        if (i < seqlen - 1) {
            Py_MEMCPY(p, sep, seplen);
            p += seplen;
        }
    }

    Py_DECREF(seq);
    return res;
}

更に2回、サイズ分ループしてますね。
こっちのforはCなので高速ですが。

わざわざlistを作るぐらいならば

from cStringIO import StringIO 
out = StrinIO()
# internal buffer size 128
for i in xrange(10000):
  out.write("aaa")
ret = out.getvalue()

でもいいんじゃないかな。string限定だけど。
listを作成する分のコストとかも下がるはず。

中身はmemcpyなのでほぼjoinと同じはずだし。
（buffer足りないとrealloc入るけど）
cStringIO->writeの中身

static int
O_cwrite(PyObject *self, const char *c, Py_ssize_t  l) {
        Py_ssize_t newl;
        Oobject *oself;
        char *newbuf;

        if (!IO__opencheck(IOOOBJECT(self))) return -1;
        oself = (Oobject *)self;

        newl = oself->pos+l;
        if (newl >= oself->buf_size) {
            oself->buf_size *= 2;
            if (oself->buf_size <= newl) {
            assert(newl + 1 < INT_MAX);
                    oself->buf_size = (int)(newl+1);
        }
            newbuf = (char*)realloc(oself->buf, oself->buf_size);
        if (!newbuf) {
                    PyErr_SetString(PyExc_MemoryError,"out of memory");
                    free(oself->buf);
                    oself->buf = 0;
                    oself->buf_size = oself->pos = 0;
                    return -1;
              }
            oself->buf = newbuf;
          }

        memcpy(oself->buf+oself->pos,c,l);

    assert(oself->pos + l < INT_MAX);
        oself->pos += (int)l;

        if (oself->string_size < oself->pos) {
            oself->string_size = oself->pos;
        }

        return (int)l;
}

また更に言えばpythonのfor文を使ってる時点で遅い。
なのでbuiltinのmap,fliter,anyなどでC側のfor文で処理させるともっと良いのかも知れない。

追記：
id:tokuhiromの指摘どおりStringIOの引数が間違えていたので修正
追記：
ベンチを取ってみた
gist:196450 · GitHub

small string_join x 1000: 0.768462
small string_io x 1000: 1.281988
small map x 1000: 0.941675
small string_join x 10000: 7.401986
small string_io x 10000: 12.536966
small map x 10000: 9.600112
big string_join x 1000: 13.868136
big string_io x 1000: 29.893469
big map x 1000: 14.321785
big string_join x 10000: 137.889503
big string_io x 10000: 292.175437
big map x 10000: 140.072617

mapは関数呼び出し分のコストがかかるせいか遅くなるんだな。
やっぱreallocおきまくりじゃあ激しく遅いですね。
cStringIOはC/API経由だと初期バッファサイズが指定できるのでCで使いたいひとは適切にバッファサイズを指定してから使うといいですよ。
追記：
id:methaneの指摘によりcStringIOの.アクセスを減らす方法に変更

small string_join x 1000: 0.754554
small string_io x 1000: 0.987898
small map x 1000: 1.006020
small string_join x 10000: 7.486376
small string_io x 10000: 9.699732
small map x 10000: 9.743566
big string_join x 1000: 13.559549
big string_io x 1000: 28.973203
big map x 1000: 14.363420
big string_join x 10000: 138.064057
big string_io x 10000: 289.232477
big map x 10000: 141.803601

若干改善されてるけど遅い

Doge log

Abby CTO 雑賀力王のオフィシャルサイトです

文字列連結の効率の話