Summary of various string segmentation methods in Python
Today, I will introduce you to various string splitting methods available in Python. They are:
- split
- rsplit
- splitlines
- partition
- rpartition
- re.split
Below are detailed introductions to each method.
split(sep=None, maxsplit=-1)
The most common method. This method uses the symbol set by sep to split a string and returns a list of the split elements. You can also specify the maximum number of splits via the maxsplit parameter, which defaults to -1 (meaning no limit). Here are some examples:
Both parameters can be passed positionally: the first is the separator sep and the second is the maximum number of splits maxsplit, so you don't need to name them when you provide both. If you keep the default sep but set maxsplit, pass maxsplit by name.
split example 1
Without a separator, Python splits on runs of whitespace (spaces, tabs, and newline characters, including the newline styles of different systems) and ignores leading and trailing whitespace.
## Normal space splitting
normal = "this is a string"
normal.split()
#Out[1]: ['this', 'is', 'a', 'string']
## Multiple spaces, automatically merged
twospace = "string  two  space one" 
twospace.split()
#Out[2]: ['string', 'two', 'space', 'one']
## Newline splitting
line = "aother\nstring"
line.split()
#Out[5]: ['aother', 'string']
## Windows style newline
wline = "windows\r\nstring"
wline.split()
#Out[7]: ['windows', 'string']
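One point worth illustrating here is that the merging of repeated spaces only happens when no separator is given. The short sketch below reuses the two-space string from above and compares the default call with an explicit single-space separator; the variable names are chosen only for illustration.

```python
# Compare default whitespace splitting with an explicit " " separator.
twospace = "string  two  space one"

# sep=None (default): consecutive spaces count as one separator.
default_parts = twospace.split()

# sep=" ": every single space is a boundary, so empty strings appear
# wherever two spaces sit next to each other.
explicit_parts = twospace.split(" ")

print(default_parts)   # ['string', 'two', 'space', 'one']
print(explicit_parts)  # ['string', '', 'two', '', 'space', 'one']
```

If empty strings are unwanted, the default call is usually the simpler choice.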
split example 2
Custom split string
## Using common separators like commas or periods
normal = "this,is,a,string"
normal.split(",")
# Out[1]: ['this', 'is', 'a', 'string']
## You can also use words or other strings
words = u"我是春江暮客博客博主"
words.split(u"博客")
#Out[1]: ['我是春江暮客', '博主']
split example 3
Limit the number of splits with maxsplit
## maxsplit=2 performs at most two splits, so the result has at most three parts
spe_len = "this,is,a,string"
spe_len.split(",", 2)
#['this', 'is', 'a,string']
## Default first parameter, custom second parameter requires naming it
spe_len = "this is a string"
spe_len.split(maxsplit=2)
#['this', 'is', 'a string']
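A common use of maxsplit is to peel off a fixed number of leading fields and keep the rest intact. The sketch below assumes a hypothetical log line format (date, time, level, free-text message) that is not from the original article, and also shows that passing None as sep positionally behaves like naming maxsplit.

```python
# Hypothetical log line: date, time, level, then a free-text message.
log_line = "2024-01-01 12:00:00 ERROR disk is full"

# At most 3 splits: the message keeps its internal spaces.
date, time, level, message = log_line.split(maxsplit=3)

# Passing sep=None positionally gives the same result as naming maxsplit.
same_fields = log_line.split(None, 3)

print(level)    # ERROR
print(message)  # disk is full
```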
rsplit(sep=None, maxsplit=-1)
This function is basically the same as split, but the r means “right”. That is, when limiting the number of splits, it splits from the right side. Examples below:
## Normal space splitting, same as split
normal = "this is a string"
normal.rsplit()
#Out[1]: ['this', 'is', 'a', 'string']
## Specifying separator, same as split
normal = "this,is,a,string" 
normal.rsplit(",")
#Out[2]: ['this', 'is', 'a', 'string']
## Specifying split count, different from split
### split
spe_len = "this is a string"
spe_len.split(maxsplit=2)
#['this', 'is', 'a string']
### rsplit
spe_len = "this is a string"
spe_len.rsplit(maxsplit=2)
#Out[2]: ['this is', 'a', 'string']
You can see that in rsplit, the leftover string is placed in the first position 'this is', whereas in split, the leftover is in the last position 'a string'.
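A typical case where splitting from the right matters is separating a file name from its last extension. This is only a small sketch with a made-up file name; for real path handling, the standard library's os.path or pathlib modules are likely a better fit.

```python
# Made-up file name with two dots.
filename = "archive.tar.gz"

# rsplit with maxsplit=1 cuts at the last dot.
from_right = filename.rsplit(".", 1)

# split with maxsplit=1 cuts at the first dot instead.
from_left = filename.split(".", 1)

print(from_right)  # ['archive.tar', 'gz']
print(from_left)   # ['archive', 'tar.gz']
```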
splitlines(keepends=False)
Line splitting. This function splits strings by various line separators but does not merge multiple consecutive line separators like split does. The only parameter keepends is a boolean that determines whether to keep the line break characters; default is False.
The line break characters supported include:
["\n","\r","\r\n","\v","\x0b","\f","\x0c","\x1c","\x1d","\x1e","\x85","\u2028","\u2029"]
Examples:
# Line splitting
s = "我是\n春江暮客\r博客\r\n博主"
s.splitlines()
#Out[1]: ['我是', '春江暮客', '博客', '博主']
# Keeping split characters, the separators will be attached to the previous string
s = "我是\n春江暮客\r博客\r\n博主"
s.splitlines(True)
#Out[2]: ['我是\n', '春江暮客\r', '博客\r\n', '博主']
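To make the difference from split more concrete, the sketch below runs splitlines, split("\n"), and the default split on the same text. The sample text is made up for illustration and contains a blank line, a Windows-style line ending, and a trailing newline.

```python
# Sample text: blank line, Windows line ending, trailing newline.
text = "first\n\nsecond\r\nthird\n"

# splitlines keeps the blank line, handles \r\n, and adds no empty
# string for the trailing newline.
by_lines = text.splitlines()

# split("\n") leaves the \r in place and adds a trailing empty string.
by_newline = text.split("\n")

# Default split merges all whitespace, so the blank line disappears.
by_whitespace = text.split()

# keepends=True keeps each line's original terminator.
with_ends = text.splitlines(keepends=True)

print(by_lines)       # ['first', '', 'second', 'third']
print(by_newline)     # ['first', '', 'second\r', 'third', '']
print(by_whitespace)  # ['first', 'second', 'third']
print(with_ends)      # ['first\n', '\n', 'second\r\n', 'third\n']
```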
partition(sep)
Splits the string at the first occurrence of the separator, returning a tuple of three elements: the part before the separator, the separator itself, and the part after.
sep has no default value; omitting it raises a TypeError. If the separator does not appear in the string, the first element is the entire string and the other two elements are empty strings.
Examples:
# Space partitioning
s = "我是 春江暮客 博客博主"
s.partition(" ")
# Out[3]: ('我是', ' ', '春江暮客 博客博主')
# Carriage return partitioning
s2 = "我是\n春江暮客\r博客\r\n博主"
s2.partition("\r")
# Out[2]: ('我是\n春江暮客', '\r', '博客\r\n博主')
# No separator specified error
s3 = "我是\n春江暮客\r博客\r\n博主"
s3.partition()
#TypeError: partition() takes exactly one argument (0 given)
# Separator not found
s4 = "我是\n春江暮客\r博客\r\n博主"
s4.partition(",")
#Out[2]: ('我是\n春江暮客\r博客\r\n博主', '', '')
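Because partition always returns three elements, the middle one can serve as a "found" flag. The sketch below uses that to parse a key=value line; parse_setting is a hypothetical helper written for this illustration, not part of any library.

```python
def parse_setting(line):
    """Return (key, value) for a 'key=value' line, or None if '=' is missing."""
    key, sep, value = line.partition("=")
    if not sep:
        # Separator not found: partition returned (line, '', '').
        return None
    return key.strip(), value.strip()


# Only the first '=' splits, so later ones stay in the value.
found = parse_setting("name = value=with=equals")
missing = parse_setting("no separator here")

print(found)    # ('name', 'value=with=equals')
print(missing)  # None
```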
rpartition(sep)
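Similar to partition, but searches from the right side and splits at the last occurrence of the separator. If the separator is not found, the entire string is the last element and the first two elements are empty strings.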
## partition
s3 = "我是\n春江暮客\r博客\r\n博主"
s3.partition("\r")
#Out[2]: ('我是\n春江暮客', '\r', '博客\r\n博主')
## rpartition
s3 = "我是\n春江暮客\r博客\r\n博主"
s3.rpartition("\r")
#Out[3]: ('我是\n春江暮客\r博客', '\r', '\n博主')
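The two methods also differ in where the whole string ends up when the separator is missing. The sketch below shows that, along with a made-up host:port string where cutting at the last colon is the more useful choice.

```python
# Made-up address whose host part itself contains colons.
address = "[::1]:8080"

# rpartition cuts at the last colon, isolating the port.
host, sep, port = address.rpartition(":")

# partition cuts at the first colon, which is not useful here.
first_cut = address.partition(":")

# When the separator is missing, the full string lands at opposite ends.
no_sep_left = "localhost".partition(":")
no_sep_right = "localhost".rpartition(":")

print(host, port)    # [::1] 8080
print(first_cut)     # ('[', ':', ':1]:8080')
print(no_sep_left)   # ('localhost', '', '')
print(no_sep_right)  # ('', '', 'localhost')
```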
re.split(pattern, string, maxsplit=0, flags=0)
The last option is the split function provided by the re module. Its advantage is that a single regex pattern can match many different separators (for example, any whitespace character or any digit), so you don't have to list each one explicitly. The trade-off is higher overhead than the built-in string methods.
- pattern is the regex pattern to match.
- string is the string to split.
- maxsplit is the maximum number of splits, defaults to 0 (meaning no limit).
- flags can be set to re.IGNORECASE to ignore case.
Examples:
# Import re module
import re
# Split by space directly
s = "我是 春江暮客 博客 博主"
re.split(' ', s)
#Out[1]: ['我是', '春江暮客', '博客', '博主']
# Use regex \s for any whitespace character
s = "我是 春江暮客 博客 博主"
re.split(r'\s', s)
#Out[2]: ['我是', '春江暮客', '博客', '博主']
# \s also matches newlines; two adjacent separators produce an empty string
s = "我是 春江暮客\n 博客 博主"
re.split(r'\s', s)
#Out[3]: ['我是', '春江暮客', '', '博客', '博主']
# Specifying max split count
s = "我是 春江暮客 博客 博主"
re.split(r'\s', s, maxsplit=2)
#Out[2]: ['我是', '春江暮客', '博客 博主']
# Ignore case
## Case-sensitive match: only lowercase "o" splits
s = "hello World pythOn"
re.split('o', s)
#Out[6]: ['hell', ' W', 'rld pythOn']
## Case-insensitive match: uppercase "O" splits as well
s = "hello World pythOn"
re.split('o', s, flags=re.IGNORECASE)
#Out[7]: ['hell', ' W', 'rld pyth', 'n']
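Two more re.split behaviors are worth a quick sketch: a character class with + treats any run of mixed separators as a single boundary, and a capturing group in the pattern keeps the separators in the result. The sample strings below are made up for illustration.

```python
import re

# Split on any run of commas, semicolons, or whitespace.
mixed = "a, b;c  d"
fields = re.split(r"[,;\s]+", mixed)

# A capturing group keeps the matched separators in the output list.
expression = "1+2-3"
tokens = re.split(r"([+-])", expression)

print(fields)  # ['a', 'b', 'c', 'd']
print(tokens)  # ['1', '+', '2', '-', '3']
```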
Summary
This article introduced 6 methods available in Python for string splitting, the differences between them, and points to note when using them. Understanding the characteristics of different methods allows more efficient string splitting.
References
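- Original author: 春江暮客
- Original link: https://www.bobobk.com/en/852.html
- Copyright notice: This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. For non-commercial reposts, please credit the source (author, original link); for commercial reposts, please contact the author for authorization.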