Filters
Filters on an Element or Content convert the retrieved value into the format you need.
el = Element(
    xpath='//html/body/ul/li',
    filter=[
        Map(
            clean_text,
            Normalize(),
            Fetch(r'(?P<key>.+): (?P<count>\d+)'),
        ),
        lambda values: {v['key']: v['count'] for v in values},
    ],
)
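To make the chaining idea concrete, here is a minimal standard-library sketch. It assumes that the entries of a filter list are applied in order, each receiving the previous result; the source does not spell this out. The names apply_filters and fetch_pair are hypothetical helpers, not part of the library.

```python
import re


def apply_filters(value, filters):
    """Feed the value through each callable in order (assumed semantics)."""
    for f in filters:
        value = f(value)
    return value


# Same pattern as in the Element example above.
pattern = re.compile(r'(?P<key>.+): (?P<count>\d+)')


def fetch_pair(text):
    # Return the named groups of the first match as a dict, or None.
    m = pattern.search(text)
    return m.groupdict() if m else None


raw = ['  apple: 3 ', ' pear: 12']
pipeline = [
    lambda values: [v.strip() for v in values],       # per-item cleanup
    lambda values: [fetch_pair(v) for v in values],   # per-item extraction
    lambda values: {v['key']: v['count'] for v in values},  # merge into one dict
]
chained = apply_filters(raw, pipeline)
```

The last step mirrors the lambda in the Element example: it turns a list of small dicts into one dict keyed by the captured key.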
Map
Applies the filters given as arguments to each element of a list or dict.
filter = Map(clean_text, Equals('yes'))
result = filter({
    'AAA': ' no ',
    'BBB': ' yes ',
    'CCC': ' <strong> yes </strong> ',
})
assert {
    'AAA': False,
    'BBB': True,
    'CCC': True,
} == result
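The per-element behaviour can be pictured with a small sketch. It assumes that a dict keeps its keys while each value is filtered, and that a list is filtered item by item, as the example above suggests. map_sketch is a hypothetical stand-in, not the library's Map.

```python
def map_sketch(*filters):
    """Build a callable that runs every filter on each element (illustrative)."""
    def run(value):
        # Chain the filters for a single element.
        for f in filters:
            value = f(value)
        return value

    def apply(values):
        if isinstance(values, dict):
            # Keys are preserved; only the values are filtered.
            return {k: run(v) for k, v in values.items()}
        return [run(v) for v in values]

    return apply


strip_then_is_yes = map_sketch(str.strip, lambda s: s == 'yes')
from_dict = strip_then_is_yes({'AAA': ' no ', 'BBB': ' yes '})
from_list = strip_then_is_yes([' yes ', 'no'])
```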
Map can also call a method defined on the Content class when you pass the method name as a string.
from urllib.parse import urlparse

class Page(Content):
    links = Element(xpath='//a/@href', parser=All(), filter=Map('filter_link'))

    def filter_link(self, value):
        # Keep only the host part of each URL.
        url = urlparse(value)
        return url.netloc

page = Page(xpath='')
result = page.parse('''
<a href="http://google.com">Google</a>
<a href="http://twitter.com">Twitter</a>
<a href="http://facebook.com">Facebook</a>
''')
assert {
    'links': [
        'google.com',
        'twitter.com',
        'facebook.com',
    ]
} == result
Through
Returns the passed value unchanged. This is the default filter for Element and Content.
assert 10 == through(10)
TakeFirst
Returns the first element of a list, skipping any element that is None or an empty string ('').
assert 10 == take_first([None, '', 10])
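A rough equivalent of the skipping rule, written as a plain function. take_first_sketch is a hypothetical name, and returning None for a list with no usable element is an assumption the source does not confirm.

```python
def take_first_sketch(values):
    """Return the first element that is neither None nor '' (illustrative)."""
    for v in values:
        if v is not None and v != '':
            return v
    # Assumption: nothing usable found, so fall back to None.
    return None


first = take_first_sketch([None, '', 10])
zero = take_first_sketch([0, 1])
nothing = take_first_sketch([None, ''])
```

Because only None and '' are skipped in this sketch, falsy values such as 0 are still returned.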
CleanText
Cleans a string by applying the following steps:
- Remove HTML tags
- Decode HTML special characters
- Collapse runs of two or more spaces into a single space
- Strip leading and trailing whitespace
clean_text = CleanText()
assert 'aaa & bbb' == clean_text('<p> aaa &amp; bbb </p>')
You can specify how to handle empty values.
clean_text = CleanText(empty_value='empty')
assert 'empty' == clean_text('')
You can also replace line breaks with a space.
clean_text = CleanText(remove_line_breaks=True)
assert 'a b' == clean_text('a\nb')
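The cleaning steps can be approximated with the standard library as follows. This is only a sketch: clean_text_sketch is a hypothetical function, the tag-stripping regular expression is deliberately naive, and the order of the steps is an assumption.

```python
import html
import re


def clean_text_sketch(value, empty_value=None, remove_line_breaks=False):
    """Approximate the cleaning steps listed above (illustrative only)."""
    text = re.sub(r'<[^>]+>', '', value)        # remove HTML tags (naive)
    text = html.unescape(text)                  # decode entities such as &amp;
    if remove_line_breaks:
        text = re.sub(r'[\r\n]+', ' ', text)    # line breaks become spaces
    text = re.sub(r'[ \t]{2,}', ' ', text)      # collapse runs of spaces
    text = text.strip()                         # trim both ends
    return text if text else empty_value


cleaned = clean_text_sketch('<p> aaa  &amp;  bbb </p>')
fallback = clean_text_sketch('', empty_value='empty')
one_line = clean_text_sketch('a\nb', remove_line_breaks=True)
```

A regular expression is not a robust way to strip markup from arbitrary HTML; it is used here only to keep the sketch short.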
Equals
Returns True if the value equals the specified string.
equals = Equals('yes')
assert equals('yes')
Contains
Returns True if the value contains the specified string.
contains = Contains('B')
assert contains('ABC')
Fetch
Extracts values from a string using a regular expression.
fetch = Fetch(r'\d+')
assert '100' == fetch('Price: $100')
You can also get all matched values.
fetch = Fetch(r'\d+', all=True)
assert ['100', '20'] == fetch('Price: $100, Amount: 20')
When the pattern uses named groups, the result is returned as a dict.
fetch = Fetch(r'Price: \$(?P<price>\d+), Amount: (?P<amount>\d+)')
assert {'price': '100', 'amount': '20'} == fetch('Price: $100, Amount: 20')
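The three return shapes (single match, all matches, dict of named groups) can be sketched with the re module. fetch_sketch is a hypothetical helper; its find_all parameter stands in for the all=True option shown above, and returning None when nothing matches is an assumption.

```python
import re


def fetch_sketch(pattern, find_all=False):
    """Return a callable that extracts matches from a string (illustrative)."""
    regex = re.compile(pattern)

    def apply(text):
        if find_all:
            # Every non-overlapping match, as a list of strings.
            return [m.group(0) for m in regex.finditer(text)]
        m = regex.search(text)
        if m is None:
            return None  # assumption: no match yields None
        # Named groups produce a dict; otherwise return the matched text.
        return m.groupdict() if regex.groupindex else m.group(0)

    return apply


single = fetch_sketch(r'\d+')('Price: $100')
many = fetch_sketch(r'\d+', find_all=True)('Price: $100, Amount: 20')
labeled = fetch_sketch(
    r'Price: \$(?P<price>\d+), Amount: (?P<amount>\d+)'
)('Price: $100, Amount: 20')
# An unescaped "$" is an end-of-string anchor, so this pattern cannot match.
unescaped = re.search(r'Price: $(?P<price>\d+)', 'Price: $100')
```

The last line shows why a literal dollar sign in a pattern needs a backslash.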
Replace
Replaces parts of a string using a regular expression.
replace = Replace(r'A+', 'A')
assert 'ABC' == replace('AAAAABC')
Join
Joins the elements of a list into a single string using the given separator.
join = Join(',')
assert 'A,B,C' == join(['A', 'B', 'C'])
Normalize
Returns the normalized string.
normalize = Normalize()
assert '12AB&%' == normalize('１２ＡＢ＆％')
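The source does not say which normalization is applied. Assuming it is Unicode compatibility normalization, the effect on full-width characters can be reproduced with unicodedata. normalize_sketch and its default form are hypothetical choices made for this illustration.

```python
import unicodedata


def normalize_sketch(value, form='NFKC'):
    """Apply Unicode normalization (the form is an assumption)."""
    return unicodedata.normalize(form, value)


# Full-width "12AB&%" written with escapes so the input is unambiguous.
full_width = '\uff11\uff12\uff21\uff22\uff06\uff05'
normalized = normalize_sketch(full_width)
unchanged = normalize_sketch('12AB&%')
```

Plain ASCII input passes through unchanged, so a visible difference only appears for characters that have a compatibility mapping.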
RenameKey
Renames keys of a dict.
rename_key = RenameKey({'AAA': 'BBB'})
assert {'BBB': 10} == rename_key({'AAA': 10})
FilterDict
Returns a dict containing only the specified keys.
filter_dict = FilterDict(['AAA', 'BBB'])
assert {'AAA': 10, 'BBB': 20} == filter_dict({'AAA': 10, 'BBB': 20, 'CCC': 30})
With ignore=True, the specified keys are dropped and all other keys are returned.
filter_dict = FilterDict(['AAA', 'BBB'], ignore=True)
assert {'CCC': 30} == filter_dict({'AAA': 10, 'BBB': 20, 'CCC': 30})
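Both modes reduce to a single dict comprehension. The following is a sketch under the assumption that keys missing from the input are silently skipped; filter_dict_sketch is a hypothetical name.

```python
def filter_dict_sketch(keys, ignore=False):
    """Keep the listed keys, or drop them when ignore is True (illustrative)."""
    wanted = set(keys)

    def apply(data):
        # (k in wanted) != ignore keeps listed keys normally, unlisted ones when ignoring.
        return {k: v for k, v in data.items() if (k in wanted) != ignore}

    return apply


source = {'AAA': 10, 'BBB': 20, 'CCC': 30}
kept = filter_dict_sketch(['AAA', 'BBB'])(source)
dropped = filter_dict_sketch(['AAA', 'BBB'], ignore=True)(source)
```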
Partial
Calls a function with some of its arguments fixed in advance. The filtered value is passed as the argument named by arg_name.
def add(a, b, c):
    return a + b + c

result = Partial(add, kwargs={'a': 10, 'c': 30}, arg_name='b')(20)
assert 60 == result
The Partial filter returns None for empty values instead of calling the function, which makes it a convenient wrapper for functions such as int.
assert Partial(int)('') is None
assert Partial(int)('10') == 10
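The behaviour shown above can be modelled on top of functools.partial. In this sketch, partial_sketch is a hypothetical function, and treating exactly None and '' as empty is an assumption.

```python
from functools import partial


def partial_sketch(func, kwargs=None, arg_name=None):
    """Bind fixed keyword arguments and guard against empty input (illustrative)."""
    bound = partial(func, **(kwargs or {}))

    def apply(value):
        # Assumption: None and '' count as empty and short-circuit to None.
        if value is None or value == '':
            return None
        if arg_name is None:
            return bound(value)              # pass the value positionally
        return bound(**{arg_name: value})    # pass it under the given name

    return apply


def add(a, b, c):
    return a + b + c


total = partial_sketch(add, kwargs={'a': 10, 'c': 30}, arg_name='b')(20)
empty = partial_sketch(int)('')
number = partial_sketch(int)('10')
```

Calling int('') directly raises ValueError, which is the failure the guard avoids.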
DateTime
Converts a date-time string to a datetime object.
from datetime import date, datetime
from dateutil.tz import tzoffset
parse_dt = DateTime()
assert datetime(2001, 2, 3, 4, 5, 6) == parse_dt('2001-02-03 04:05:06')
Strings with a timezone offset are parsed into timezone-aware datetime objects.
parse_dt = DateTime()
result = parse_dt('2001-02-03T04:05:06+09:00')
assert datetime(2001, 2, 3, 4, 5, 6, 0, tzoffset(None, 3600 * 9)) == result
The timezone or the time part can be dropped from the result.
parse_dt = DateTime(truncate_timezone=True)
result = parse_dt('2001-02-03T04:05:06+09:00')
assert datetime(2001, 2, 3, 4, 5, 6) == result
parse_dt = DateTime(truncate_time=True)
result = parse_dt('2001-02-03T04:05:06+09:00')
assert date(2001, 2, 3) == result
You can also specify the format.
parse_dt = DateTime(format='%d %m %Y')
result = parse_dt('01 02 2003')
assert datetime(2003, 2, 1) == result
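For comparison, the same options can be imitated with the standard library alone. This sketch assumes ISO 8601 input when no format is given and needs Python 3.7 or later for datetime.fromisoformat. parse_dt_sketch and its fmt parameter are hypothetical; fmt stands in for the format option shown above.

```python
from datetime import date, datetime, timedelta, timezone


def parse_dt_sketch(value, fmt=None, truncate_timezone=False, truncate_time=False):
    """Parse a date-time string with optional truncation (illustrative)."""
    if fmt:
        dt = datetime.strptime(value, fmt)
    else:
        # Assumption: input without an explicit format is ISO 8601.
        dt = datetime.fromisoformat(value)
    if truncate_timezone:
        dt = dt.replace(tzinfo=None)   # drop the offset, keep the wall-clock time
    if truncate_time:
        return dt.date()               # keep only year, month, day
    return dt


naive = parse_dt_sketch('2001-02-03 04:05:06')
aware = parse_dt_sketch('2001-02-03T04:05:06+09:00')
no_tz = parse_dt_sketch('2001-02-03T04:05:06+09:00', truncate_timezone=True)
day_only = parse_dt_sketch('2001-02-03T04:05:06+09:00', truncate_time=True)
custom = parse_dt_sketch('01 02 2003', fmt='%d %m %Y')
```

Dropping the timezone here keeps the local wall-clock time rather than converting to UTC, which matches the truncate_timezone example above.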
Bool
Converts a string to a bool.
parse_bool_string = Bool()
assert parse_bool_string('true')
You can specify the strings to treat as True.
parse_bool_string = Bool('OK', 'ok')
assert parse_bool_string('OK')
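A membership test is enough to model this. The sketch assumes that the default set of true strings is just 'true' and that matching is case-sensitive; the source confirms neither. bool_sketch is a hypothetical name.

```python
def bool_sketch(*true_values):
    """Return a callable that reports whether a string counts as True (illustrative)."""
    # Assumption: with no arguments, only 'true' is treated as True.
    accepted = set(true_values) or {'true'}

    def apply(value):
        return value in accepted

    return apply


default_true = bool_sketch()('true')
custom_true = bool_sketch('OK', 'ok')('OK')
custom_false = bool_sketch('OK', 'ok')('true')
```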