2 split — Split string variables into parts
You can also specify 1) two or more strings that are alternative separators of “words” and 2)
strings that consist of two or more characters. Alternative strings should be separated by spaces.
Strings that include spaces should be bound by " ". Thus if parse(, " ") is specified, "1,2
3" is also split into "1", "2", and "3". Note particularly the difference between, say, parse(a
b) and parse(ab): with the first, a and b are both acceptable as separators, whereas with the
second, only the string ab is acceptable.
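As a minimal sketch of the parse() behavior just described, the following uses made-up data; the variable name s and the stubs u and v are purely illustrative.

```stata
* Hypothetical one-observation dataset containing the example string
clear
input str10 s
"1,2 3"
end

* Comma and space are both accepted as separators,
* so s1, s2, and s3 should hold "1", "2", and "3"
split s, parse(, " ")
list

* Contrast parse(a b) with parse(ab) on a second made-up string
clear
input str10 s
"1a2b3"
end

* a and b are alternative separators: pieces "1", "2", "3"
split s, parse(a b) generate(u)

* only the two-character string ab is a separator; it does not
* occur in "1a2b3", so the value should be left unsplit
split s, parse(ab) generate(v)
list
```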
limit(#) specifies an upper limit to the number of new variables to be created. Thus limit(2)
specifies that, at most, two new variables be created.
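For illustration only, a short sketch of limit() with an invented variable called name; it shows the cap on the number of new variables rather than documenting exactly what each one contains.

```stata
* Hypothetical data: a string with three space-separated words
clear
input str20 name
"Ann Mary Smith"
end

* At most two new variables (name1 and name2) are created,
* even though the string contains three words
split name, limit(2)
describe name*
list
```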
notrim specifies that the original string variable not be trimmed of leading and trailing spaces before
being parsed. notrim is not compatible with parsing on spaces, because the latter implies that
spaces in a string are to be discarded. You can either specify a parsing character or, by default,
allow a trim.
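A hedged sketch of the difference notrim makes when parsing on a character other than a space. The data and the stubs t and u are assumptions made for this example; comparing the lengths of the first pieces is just one way to see whether leading spaces were kept.

```stata
* Hypothetical value with leading spaces before the first part
clear
input str12 s
"  a,b"
end

* Default: s is trimmed before parsing on the comma
split s, parse(,) generate(t)

* notrim: leading and trailing spaces of s are left in place
split s, parse(,) generate(u) notrim

* Compare the lengths of the first pieces to see the effect
generate len_t1 = strlen(t1)
generate len_u1 = strlen(u1)
list len_t1 len_u1
```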
Destring
destring applies destring to the new string variables, replacing the variables initially created as
strings by numeric variables where possible. See [D] destring.
ignore(), force, float, percent; see [D] destring.
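The destring option might be used as in the sketch below, which assumes a made-up variable s holding space-separated numbers.

```stata
* Hypothetical data: each value holds three numbers as text
clear
input str12 s
"10 20 30"
"4 5 6"
end

* Split on spaces (the default) and convert the new
* variables to numeric where possible
split s, destring

* s1, s2, and s3 should now be numeric rather than string
describe s1 s2 s3
list
```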
Remarks and examples
split is used to split a string variable into two or more component parts, for example, “words”.
You might need to correct a mistake, or the string variable might be a genuine composite that you
wish to subdivide before doing more analysis.
The basic steps applied by split are, given one or more separators, to find those separators
within the string and then to generate one or more new string variables, each containing a part of the
original. The separators could be, for example, spaces or other punctuation symbols, but they can in
turn be strings containing several characters. The default separator is a space.
The key string functions for subdividing string variables and, indeed, strings in general, are
strpos(), which finds the position of separators, and substr(), which extracts parts of the string.
(See [D] functions.) split is based on the use of those functions.
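To make the role of these two functions concrete, here is a minimal sketch that splits a string at its first space by hand. It is not the code of split itself; the variable names s, pos, first, and rest are invented for the example.

```stata
* Hypothetical data: two words separated by a single space
clear
input str20 s
"alpha beta"
end

* strpos() returns the position of the first separator (0 if absent)
generate pos = strpos(s, " ")

* substr() extracts the text before and after that position
generate first = substr(s, 1, pos - 1) if pos > 0
generate rest  = substr(s, pos + 1, .) if pos > 0
list
```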
If your problem is not defined by splitting on separators, you will probably want to use substr()
directly. Suppose that you have a string variable, date, containing dates in the form "21011952" so
that the last four characters define a year. This string contains no separators. To extract the year, you
would use substr(date,-4,4). Again, suppose that each woman’s obstetric history over the last 12
months was recorded by a str12 variable containing values such as "nppppppppbnn", where p, b,
and n denote months of pregnancy, birth, and nonpregnancy. Once more, there are no separators, so
you would use substr() to subdivide the string.
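Both cases can be sketched as follows. The variable names year, history, and the stub m are assumptions chosen for this illustration, and converting the year with real() is an optional extra step.

```stata
* Hypothetical data: a date string with no separators
clear
input str8 date
"21011952"
end

* The last four characters define the year
generate year = real(substr(date, -4, 4))
list

* Hypothetical data: a 12-month obstetric history
clear
input str12 history
"nppppppppbnn"
end

* One str1 variable per month, each holding p, b, or n
forvalues i = 1/12 {
    generate str1 m`i' = substr(history, `i', 1)
}
list
```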
split discards the separators, because it presumes that they are irrelevant to further analysis or
that you could restore them at will. If this is not what you want, you might use substr() (and
possibly strpos()).
Finally, before we turn to examples, compare split with the egen function ends(), which
produces the head, the tail, or the last part of a string. This function, like all egen functions, produces
just one new variable as a result. In contrast, split typically produces several new variables as the
result of one command. For more details and discussion, including comments on the special problem
of recognizing personal names, see [D] egen.
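The contrast can be seen in a small sketch with an invented variable called name; each egen call yields one variable, whereas a single split call yields one variable per word.

```stata
* Hypothetical data: a three-word string
clear
input str20 name
"Ann Mary Smith"
end

* Each egen call produces exactly one new variable
egen head = ends(name), head
egen tail = ends(name), tail
egen last = ends(name), last

* One split command produces name1, name2, and name3
split name
list
```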