Filters

By using various filters for Element or Content, you can set the retrieved value to your preferred format.

el = Element(
    xpath='//html/body/ul/li',
    filter=[
        Map(
            clean_text,
            Normalize(),
            Fetch(r'(?P<key>.+): (?P<count>\d+)'),
        ),
        lambda values: {v['key']: v['count'] for v in values},
    ],
)

Map

Execute the filter specified by argument for each element of list or dict.

filter = Map(clean_text, Equals('yes'))
result = filter({
    'AAA': '    no    ',
    'BBB': '    yes    ',
    'CCC': '    <strong>   yes  <strong>    ',
})

assert {
    'AAA': False,
    'BBB': True,
    'CCC': True,
} == result

Through

It returns the passed value as it is. This is the default filter for Element / Content.

assert 10 == through(10)

TakeFirst

Get the first element of list. However, if the acquired element is None or ‘’, the next element is acquired.

assert 10 == take_first([None, '', 10])

CleanText

Perform the following cleaning process on the character string.

  • Removing HTML tags
  • Decode HTML special characters
  • Make 2 spaces or more of one contiguous space
  • Remove Whitespace before and after
assert 'aaa & bbb' == clean_text('<p>  aaa  &amp;  bbb  </p>')

Equals

Returns True if the value matches the specified string.

equals = Equals('yes')
assert equals('yes')

Contains

Returns True if the specified character string is included in the character string.

contains = Contains('B')
assert contains('ABC')

Fetch

Extract values from strings using regular expressions.

fetch = Fetch(r'\d+')
assert '100' == fetch('Price: $100')

You can also get all matched values.

fetch = Fetch(r'\d+', all=True)
assert ['100', '20'] == fetch('Price: $100, Amount: 20')

It can also be returned as dict by specifying label.

fetch = Fetch(r'Price: $(?P<price>\d+), Amount: (?P<amount>\d+)')
assert {'price': '100', 'amount': '20'} == fetch('Price: $100, Amount: 20')

Replace

You can replace the string using regular expressions.

replace = Replace(r'A+', 'A')
assert 'ABC' == replace('AAAAABC')

Join

Returns a string formed by combining list with separator.

join = Join(',')
assert 'A,B,C' == join(['A', 'B', 'C'])

Normalize

Returns the normalized string.

normalize = scrapbook.filters.normalize  # == scrapbook.filters.Normalize('NFKD')
assert '12AB&%' == normalize('12AB&%')

RenameKey

Rename the dict’s key.

rename_key = RenameKey({'AAA': 'BBB'})
assert {'BBB': 10} == rename_key({'AAA': 10})

FilterDict

Returns dict with only the specified key.

filter_dict = FilterDict(['AAA', 'BBB'])
assert {'AAA': 10, 'BBB': 20} == filter_dict({'AAA': 10, 'BBB': 20, 'CCC': 30})

Other than the specified key can be returned.

filter_dict = FilterDict(['AAA', 'BBB'], ignore=True)
assert {'CCC': 30} == filter_dict({'AAA': 10, 'BBB': 20, 'CCC': 30})